From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 28 Jun 2022 18:57:53 +0100
From: Catalin Marinas
To: Peter Collingbourne
Cc: kvmarm@lists.cs.columbia.edu, Marc Zyngier, kvm@vger.kernel.org,
	Andy Lutomirski, Linux ARM, Michael Roth, Chao Peng, Will Deacon,
	Evgenii Stepanov, Steven Price
Subject: Re: [PATCH] KVM: arm64: permit MAP_SHARED mappings with MTE enabled
References: <20220623234944.141869-1-pcc@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Precedence: bulk
X-Mailing-List: kvm@vger.kernel.org

On Mon, Jun 27, 2022 at 11:16:17AM -0700, Peter Collingbourne wrote:
> On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas wrote:
> > On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote:
> > > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas wrote:
> > > > + Steven as he added the KVM and swap support for MTE.
> > > >
> > > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> > > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem) that
> > > > > depend on being able to map guest memory as MAP_SHARED. The current
> > > > > restriction on sharing MAP_SHARED pages with the guest is preventing
> > > > > the use of those features with MTE. Therefore, remove this restriction.
> > > >
> > > > We already have some corner cases where the PG_mte_tagged logic fails
> > > > even for MAP_PRIVATE (but page shared with CoW). Adding this on top for
> > > > KVM MAP_SHARED will potentially make things worse (or hard to reason
> > > > about; for example the VMM sets PROT_MTE as well). I'm more inclined to
> > > > get rid of PG_mte_tagged altogether, always zero (or restore) the tags
> > > > on user page allocation, copy them on write. For swap we can scan and if
> > > > all tags are 0 and just skip saving them.
> > >
> > > A problem with this approach is that it would conflict with any
> > > potential future changes that we might make that would require the
> > > kernel to avoid modifying the tags for non-PROT_MTE pages.
> >
> > Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the
> > vma available where it matters. We can keep PG_mte_tagged around but
> > always set it on page allocation (e.g. when zeroing or CoW) and check
> > VM_MTE_ALLOWED rather than VM_MTE.
>
> Right, but for avoiding tagging we would like that to apply to as many
> pages as possible. If we check VM_MTE_ALLOWED then the heap pages of
> those processes that are not using MTE would not be covered, which on
> a mostly non-MTE system would be a majority of pages.

By non-MTE system, I guess you mean a system that supports MTE but most
of the user apps don't use it. That's why it would be interesting to see
the effect of using DC GZVA instead of DC ZVA for page zeroing.

I suspect on Android you'd notice the fork() penalty a bit more with all
the copy-on-write having to copy tags. But we can't tell until we do
some benchmarks. If the penalty is indeed significant, we'll go back to
assessing the races here.

Another thing that won't happen for PG_mte_tagged currently is KSM page
merging. I had a patch to allow comparing the tags but eventually
dropped it (can dig it out).

> Over the weekend I thought of another policy, which would be similar
> to your original one. We can always tag pages which are mapped as
> MAP_SHARED. These pages are much less common than private pages, so
> the impact would be less. So the if statement in
> alloc_zeroed_user_highpage_movable would become:
>
> if ((vma->vm_flags & VM_MTE) || (system_supports_mte() &&
> (vma->vm_flags & VM_SHARED)))
>
> That would allow us to put basically any shared mapping in the guest
> address space without needing to deal with races in sanitise_mte_tags.

It's not just about VM_SHARED. A page can be effectively shared as a
result of a fork(). It is read-only in all processes but still shared
and one task may call mprotect(PROT_MTE).

Another case of sharing is between the VMM and the guest, though I think
an mprotect() in the VMM would trigger the unmapping of the guest
address and pages mapped into guests already have PG_mte_tagged set.

We probably need to draw a state machine of all the cases. AFAICT, we
need to take into account a few of the below (it's probably incomplete;
I've been with Steven through most of them IIRC):

1. Private mappings with mixed PROT_MTE, CoW sharing and concurrent
   mprotect(PROT_MTE). One of the things I dislike is that a late tag
   clearing via set_pte_at() can happen without breaking the CoW
   mapping. It's a bit counter-intuitive: if you treat the tags as data
   (rather than some cache), you don't expect a read-only page to have
   some (tag) updated.

2. Shared mappings with concurrent mprotect(PROT_MTE).

3. Shared mapping restoring from swap.

4. Private mapping restoring from swap into CoW mapping.

5. KVM faults.

6. Concurrent ptrace accesses (or KVM tag copying)

What currently risks failing I think is breaking a CoW mapping with
concurrent mprotect(PROT_MTE) - we set PG_mte_tagged before zeroing the
tags. A concurrent copy may read stale tags.

In sanitise_mte_tags() we do this the other way around - clear tags
first and then set the flag.

I think using another bit as a lock may solve most (all) of these but
another option is to treat the tags as data and make sure they are set
before mapping.

> We may consider going further than this and require all pages mapped
> into guests with MTE enabled to be PROT_MTE.

We discussed this when upstreaming KVM support and the idea got pushed
back. The main problem is that the VMM may use MTE for itself but can no
longer access the guest memory without the risk of taking a fault. We
don't have a match-all tag in user space and we can't teach the VMM to
use the PSTATE.TCO bit since driver emulation can be fairly generic.
And, of course, there's also the ABI change now.

> I think it would allow
> dropping sanitise_mte_tags entirely. This would not be a relaxation of
> the ABI but perhaps we can get away with it if, as Cornelia mentioned,
> QEMU does not currently support MTE, and since crosvm doesn't
> currently support it either there's no userspace to break AFAIK. This
> would also address a current weirdness in the API where it is possible
> for the underlying pages of a MAP_SHARED file mapping to become tagged
> via KVM, said tags are exposed to the guest and are discarded when the
> underlying page is paged out.

Ah, good point, shared file mappings are another reason we did not allow
MAP_SHARED and MTE for guest memory.
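
To make the flag/tag ordering concern raised earlier in this message a
bit more concrete (PG_mte_tagged being set before the tags are zeroed,
versus the clear-then-flag order attributed to sanitise_mte_tags()),
here is a rough userspace model in C. It is only a sketch and not kernel
code: struct model_page, observe(), flag_then_clear(), clear_then_flag()
and run_ordering() are all hypothetical names, a 4K page with 16-byte
tag granules is assumed, and the "concurrent" copier is simulated by
calling it between the two steps rather than from a real second thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TAGS_PER_PAGE 256 /* assumption: 4K page / 16-byte tag granule */

/* Hypothetical stand-in for a page and its allocation tags. */
struct model_page {
	atomic_bool mte_tagged;      /* models the PG_mte_tagged flag */
	uint8_t tags[TAGS_PER_PAGE]; /* models the per-granule tags */
};

/* What the simulated concurrent copier saw. */
struct observation {
	bool copied;    /* the flag was set, so the tags were trusted */
	bool saw_stale; /* at least one copied tag was non-zero */
};

/* Models a copier (CoW, ptrace, KVM tag copy) that trusts the flag. */
static void observe(struct model_page *page, struct observation *obs)
{
	uint8_t copy[TAGS_PER_PAGE];

	obs->copied = atomic_load_explicit(&page->mte_tagged,
					   memory_order_acquire);
	obs->saw_stale = false;
	if (!obs->copied)
		return;
	memcpy(copy, page->tags, sizeof(copy));
	for (size_t i = 0; i < TAGS_PER_PAGE; i++)
		if (copy[i] != 0)
			obs->saw_stale = true;
}

typedef void (*ordering_fn)(struct model_page *, struct observation *);

/* Flag first, tags zeroed afterwards: the order described as risky. */
static void flag_then_clear(struct model_page *page, struct observation *obs)
{
	atomic_store_explicit(&page->mte_tagged, true, memory_order_release);
	observe(page, obs); /* window where a concurrent copier may run */
	memset(page->tags, 0, sizeof(page->tags));
}

/* Tags zeroed first, flag set afterwards. */
static void clear_then_flag(struct model_page *page, struct observation *obs)
{
	memset(page->tags, 0, sizeof(page->tags));
	observe(page, obs); /* same window, but the flag is not set yet */
	atomic_store_explicit(&page->mte_tagged, true, memory_order_release);
}

/* Start from a page holding stale tags and run one ordering on it. */
static struct observation run_ordering(ordering_fn order)
{
	struct model_page page;
	struct observation obs = { false, false };

	assert(order != NULL);
	atomic_init(&page.mte_tagged, false);
	memset(page.tags, 0xA, sizeof(page.tags)); /* leftover tags */
	order(&page, &obs);
	return obs;
}
```

In this model run_ordering(flag_then_clear) reports that the copier
trusted the flag and copied stale tags, while
run_ordering(clear_then_flag) reports that the copier did not copy
anything. The model does not cover two writers racing to initialise the
same page, which is presumably where the extra lock bit mentioned above
would come in.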

BTW, in user_mem_abort() we should probably check for VM_MTE_ALLOWED
irrespective of whether we allow MAP_SHARED or not.

> We can perhaps accomplish it by dropping
> support for KVM_CAP_ARM_MTE in the kernel and introducing something
> like a KVM_CAP_ARM_MTE_V2 with the new restriction.

That's an option for the ABI upgrade but we still need to solve the
potential races.

-- 
Catalin