From: Peter Collingbourne
Date: Mon, 27 Jun 2022 11:16:17 -0700
Subject: Re: [PATCH] KVM: arm64: permit MAP_SHARED mappings with MTE enabled
To: Catalin Marinas
Cc: kvmarm@lists.cs.columbia.edu, Marc Zyngier, kvm@vger.kernel.org,
 Andy Lutomirski, Linux ARM, Michael Roth, Chao Peng, Will Deacon,
 Evgenii Stepanov, Steven Price
References: <20220623234944.141869-1-pcc@google.com>

On Mon, Jun 27, 2022 at 4:43 AM Catalin Marinas wrote:
>
> On Fri, Jun 24, 2022 at 02:50:53PM -0700, Peter Collingbourne wrote:
> > On Fri, Jun 24, 2022 at 10:05 AM Catalin Marinas wrote:
> > > + Steven as he added the KVM and swap support for MTE.
> > >
> > > On Thu, Jun 23, 2022 at 04:49:44PM -0700, Peter Collingbourne wrote:
> > > > Certain VMMs such as crosvm have features (e.g. sandboxing, pmem)
> > > > that depend on being able to map guest memory as MAP_SHARED. The
> > > > current restriction on sharing MAP_SHARED pages with the guest is
> > > > preventing the use of those features with MTE. Therefore, remove
> > > > this restriction.
> > >
> > > We already have some corner cases where the PG_mte_tagged logic
> > > fails even for MAP_PRIVATE (but page shared with CoW). Adding this
> > > on top for KVM MAP_SHARED will potentially make things worse (or
> > > hard to reason about; for example the VMM sets PROT_MTE as well).
> > > I'm more inclined to get rid of PG_mte_tagged altogether, always
> > > zero (or restore) the tags on user page allocation, copy them on
> > > write. For swap we can scan and, if all tags are 0, just skip
> > > saving them.
> >
> > A problem with this approach is that it would conflict with any
> > potential future changes that we might make that would require the
> > kernel to avoid modifying the tags for non-PROT_MTE pages.
>
> Not if in all those cases we check VM_MTE_ALLOWED. We seem to have the
> vma available where it matters. We can keep PG_mte_tagged around but
> always set it on page allocation (e.g. when zeroing or CoW) and check
> VM_MTE_ALLOWED rather than VM_MTE.

Right, but for avoiding tagging we would like that to apply to as many
pages as possible. If we check VM_MTE_ALLOWED then the heap pages of
those processes that are not using MTE would not be covered, which on a
mostly non-MTE system would be a majority of pages.

Over the weekend I thought of another policy, which would be similar to
your original one. We can always tag pages which are mapped as
MAP_SHARED. These pages are much less common than private pages, so the
impact would be less. So the if statement in
alloc_zeroed_user_highpage_movable would become:

  if ((vma->vm_flags & VM_MTE) ||
      (system_supports_mte() && (vma->vm_flags & VM_SHARED)))

That would allow us to put basically any shared mapping in the guest
address space without needing to deal with races in sanitise_mte_tags.
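In context, the change would look something like this (a sketch against
the arm64 helper as it stands in current trees, which already sets
__GFP_ZEROTAGS for VM_MTE; untested, for illustration only):

  /* arch/arm64/mm/fault.c -- sketch, not a tested patch */
  struct page *alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
                                                  unsigned long vaddr)
  {
          gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;

          /*
           * Zero the tags at allocation time not only for PROT_MTE
           * mappings, as today, but for any shared mapping, so that a
           * page which may later be handed to a guest is always in a
           * known-tagged state before its first set_pte_at().
           */
          if ((vma->vm_flags & VM_MTE) ||
              (system_supports_mte() && (vma->vm_flags & VM_SHARED)))
                  flags |= __GFP_ZEROTAGS;

          return alloc_page_vma(flags, vma, vaddr);
  }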
We may consider going further than this and require all pages mapped
into guests with MTE enabled to be PROT_MTE. I think that would allow
dropping sanitise_mte_tags entirely. This would not be a relaxation of
the ABI, but perhaps we can get away with it if, as Cornelia mentioned,
QEMU does not currently support MTE, and since crosvm doesn't currently
support it either there's no userspace to break AFAIK. This would also
address a current weirdness in the API whereby the underlying pages of
a MAP_SHARED file mapping can become tagged via KVM, those tags are
exposed to the guest, and the tags are then discarded when the
underlying page is paged out. We can perhaps accomplish this by
dropping support for KVM_CAP_ARM_MTE in the kernel and introducing
something like a KVM_CAP_ARM_MTE_V2 with the new restriction.
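For reference, the VMM side of the existing capability looks roughly
like this (sketch using today's UAPI; a _V2 capability would be enabled
the same way, but KVM_CAP_ARM_MTE_V2 is only the name proposed above,
with no number allocated):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Enable MTE for a VM; per the KVM API docs this must be done
   * before any vCPUs are created. Under the proposal, the VMM would
   * request the stricter _V2 capability here instead. */
  static int vm_enable_mte(int vm_fd)
  {
          struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_MTE };

          if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_MTE) <= 0)
                  return -1; /* not supported by this kernel/CPU */
          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }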
> I'm not sure how Linux can deal with pages that do not support MTE.
> Currently we only handle this at the vm_flags level. Assuming that we
> somehow manage to, we can still use PG_mte_tagged to mark the pages
> that supported tags on allocation (and they have been zeroed or
> copied). I guess if you want to move a page around, you'd need to go
> through something like try_to_unmap() (or set all mappings to
> PROT_NONE like in NUMA migration). Then you can either check whether
> the page is PROT_MTE anywhere and maybe read the tags to see whether
> all are zero after unmapping.
>
> Deferring tag zeroing/restoring to set_pte_at() can be racy without a
> lock (or your approach with another flag) but I'm not sure it's worth
> the extra logic if zeroing or copying the tags doesn't have a
> significant overhead for non-PROT_MTE pages.
>
> Another issue with set_pte_at() is that it can write tags on
> mprotect() even if the page is mapped read-only. So far I couldn't
> find a problem with this but it adds to the complexity.
>
> > Thinking about this some more, another idea that I had was to only
> > allow MAP_SHARED mappings in a guest with MTE enabled if the mapping
> > is PROT_MTE and there are no non-PROT_MTE aliases. For anonymous
> > mappings I don't think it's possible to create a non-PROT_MTE alias
> > in another mm (since you can't turn off PROT_MTE with mprotect), and
> > for memfd maybe we could introduce a flag that requires PROT_MTE on
> > all mappings. That way, we are guaranteed that either the page has
> > been tagged prior to fault or we have exclusive access to it so it
> > can be tagged on demand without racing. Let me see what effect that
> > has on crosvm.
>
> You could still have all initial shared mappings as !PROT_MTE and some
> mprotect() afterwards setting PG_mte_tagged and clearing the tags and
> this can race. AFAICT, the easiest way to avoid the race is to set
> PG_mte_tagged on allocation before it ends up in set_pte_at().

For shared pages we will be safe with my new proposal because the pages
will always be tagged. For private ones we might need to move pages to
an MTE-supported region in response to the mprotect, as you discussed
above.
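The guest memory setup this series unblocks is then just the ordinary
MAP_SHARED path (userspace sketch; the function name and memfd backing
are illustrative, and under the stricter rule above the mmap would also
pass PROT_MTE from asm/mman.h):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  /* Back a guest memslot with a MAP_SHARED mapping of a memfd, the
   * arrangement crosvm needs for its sandboxing and pmem features. */
  static void *map_guest_ram(int vm_fd, int memfd, __u64 gpa, size_t size)
  {
          void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, memfd, 0);
          if (mem == MAP_FAILED)
                  return NULL;

          struct kvm_userspace_memory_region region = {
                  .slot = 0,
                  .guest_phys_addr = gpa,
                  .memory_size = size,
                  .userspace_addr = (__u64)mem,
          };
          if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
                  munmap(mem, size);
                  return NULL;
          }
          return mem;
  }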
Peter