Date: Fri, 4 Jun 2021 12:36:59 +0100
From: Catalin Marinas
To: Steven Price
Cc: Marc Zyngier, Will Deacon, James Morse, Julien Thierry,
    Suzuki K Poulose, kvmarm@lists.cs.columbia.edu,
    linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
    Dave Martin, Mark Rutland, Thomas Gleixner, qemu-devel@nongnu.org,
    Juan Quintela, "Dr. David Alan Gilbert", Richard Henderson,
    Peter Maydell, Haibo Xu, Andrew Jones
Subject: Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature
Message-ID: <20210604113658.GD31173@arm.com>
References: <20210524104513.13258-1-steven.price@arm.com>
    <20210524104513.13258-5-steven.price@arm.com>
    <20210603160031.GE20338@arm.com>

On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote:
> On 03/06/2021 17:00, Catalin Marinas wrote:
> > On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote:
> >> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> >> index c5d1f3c87dbd..226035cf7d6c 100644
> >> --- a/arch/arm64/kvm/mmu.c
> >> +++ b/arch/arm64/kvm/mmu.c
> >> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
> >>  	return PAGE_SIZE;
> >>  }
> >>
> >> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
> >> +			     unsigned long size)
> >> +{
> >> +	if (kvm_has_mte(kvm)) {
> >> +		/*
> >> +		 * The page will be mapped in stage 2 as Normal Cacheable, so
> >> +		 * the VM will be able to see the page's tags and therefore
> >> +		 * they must be initialised first. If PG_mte_tagged is set,
> >> +		 * tags have already been initialised.
> >> +		 * pfn_to_online_page() is used to reject ZONE_DEVICE pages
> >> +		 * that may not support tags.
> >> +		 */
> >> +		unsigned long i, nr_pages = size >> PAGE_SHIFT;
> >> +		struct page *page = pfn_to_online_page(pfn);
> >> +
> >> +		if (!page)
> >> +			return -EFAULT;
> >> +
> >> +		for (i = 0; i < nr_pages; i++, page++) {
> >> +			/*
> >> +			 * There is a potential (but very unlikely) race
> >> +			 * between two VMs which are sharing a physical page
> >> +			 * entering this at the same time. However by splitting
> >> +			 * the test/set the only risk is tags being overwritten
> >> +			 * by the mte_clear_page_tags() call.
> >> +			 */
> >
> > And I think the real risk here is when the page is writable by at least
> > one of the VMs sharing the page. This excludes KSM, so it only leaves
> > the MAP_SHARED mappings.
> >
> >> +			if (!test_bit(PG_mte_tagged, &page->flags)) {
> >> +				mte_clear_page_tags(page_address(page));
> >> +				set_bit(PG_mte_tagged, &page->flags);
> >> +			}
> >> +		}
> >
> > If we want to cover this race (I'd say in a separate patch), we can call
> > mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I
> > got the arguments right). We can avoid the big lock in most cases if
> > kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.)
> > and __alloc_zeroed_user_highpage() clears the tags on allocation (as we
> > do for VM_MTE but the new flag would not affect the stage 1 VMM page
> > attributes).
>
> To be honest I'm coming round to just exporting a
> mte_prepare_page_tags() function which does the clear/set with the lock
> held. I doubt it's such a performance critical path that it will cause
> any noticeable issues. Then if we run into performance problems in the
> future we can start experimenting with extra VM flags etc as necessary.

It works for me.

> And from your later email:
> > Another idea: if VM_SHARED is found for any vma within a region in
> > kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE
> > for the guest or reject the memory slot if MTE was already enabled.
> >
> > An alternative here would be to clear VM_MTE_ALLOWED so that any
> > subsequent mprotect(PROT_MTE) in the VMM would fail in
> > arch_validate_flags(). MTE would still be allowed in the guest but not
> > in the VMM for the guest memory regions. We can probably do this
> > irrespective of VM_SHARED. Of course, the VMM can still mmap() the
> > memory initially with PROT_MTE but that's not an issue IIRC, only the
> > concurrent mprotect().
>
> This could work, but I worry that it's potentially fragile. Also the rules
> for what user space can do are not obvious and may be surprising. I'd
> also want to look into the likes of mremap() to see how easy it would be
> to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED)
> memory sneaking into a memslot.
>
> Unless you think it's worth complicating the ABI in the hope of avoiding
> the big lock overhead I think it's probably best to stick with the big
> lock at least until we have more data on the overhead.

It's up to Marc but I think for now just make it safe and once we get
our hands on hardware, we can assess the impact. For example, starting
multiple VMs simultaneously will contend on such a big lock but we have
an option to optimise it by setting PG_mte_tagged on allocation via a new
VM_* flag.

For my last suggestion above, changing the VMM ABI afterwards is a bit
tricky, so we could state now that VM_SHARED and MTE are not allowed
(though it needs a patch to enforce it). That's assuming that mprotect()
in the VMM cannot race with the user_mem_abort() on another CPU which
makes the lock necessary anyway.

> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  			  struct kvm_memory_slot *memslot, unsigned long hva,
> >>  			  unsigned long fault_status)
> >> @@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >>  	if (writable)
> >>  		prot |= KVM_PGTABLE_PROT_W;
> >>
> >> -	if (fault_status != FSC_PERM && !device)
> >> +	if (fault_status != FSC_PERM && !device) {
> >> +		ret = sanitise_mte_tags(kvm, pfn, vma_pagesize);
> >> +		if (ret)
> >> +			goto out_unlock;
> >
> > Maybe it was discussed in a previous version, why do we need this in
> > addition to kvm_set_spte_gfn()?
>
> kvm_set_spte_gfn() is only used for the MMU notifier path (e.g. if a
> memslot is changed by the VMM). For the initial access we will normally
> fault the page into stage 2 with user_mem_abort().

Right. Can we move the sanitise_mte_tags() call to
kvm_pgtable_stage2_map() instead, or do we not have all the information
needed?

-- 
Catalin
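
A rough userspace illustration of the "clear/set with the lock held"
approach discussed in this message, for readers following the race. It is
only a sketch under stated assumptions: struct fake_page, PAGE_TAG_BYTES,
tag_lock and prepare_page_tags() are hypothetical stand-ins (the thread
names mte_prepare_page_tags() but does not show its body), and a pthread
mutex stands in for whatever lock the kernel would actually use.

```c
/*
 * Userspace sketch only: models "test, clear and set under one lock".
 * Nothing here is a kernel API; all names are hypothetical.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Assumed geometry: 4 KiB page, one 4-bit tag per 16-byte granule. */
#define PAGE_TAG_BYTES 128

struct fake_page {
	bool tagged;				/* stands in for PG_mte_tagged */
	unsigned char tags[PAGE_TAG_BYTES];	/* stands in for tag storage */
};

/* One global lock, mirroring the "big lock" mentioned in the thread. */
static pthread_mutex_t tag_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Clear the tags and mark the page as tagged in a single critical
 * section, so a second caller can never clear tags after the flag
 * has been observed as set.
 *
 * Returns true if this caller initialised the tags, false if the
 * page was already marked.
 */
static bool prepare_page_tags(struct fake_page *page)
{
	bool did_init = false;

	pthread_mutex_lock(&tag_lock);
	if (!page->tagged) {
		memset(page->tags, 0, sizeof(page->tags));
		page->tagged = true;
		did_init = true;
	}
	pthread_mutex_unlock(&tag_lock);

	return did_init;
}
```

Because the test and the clear sit inside the same critical section, a
late second caller sees the flag already set and leaves the tags alone,
which is the overwrite the split test_bit()/set_bit() sequence in the
quoted patch cannot rule out. The cost is contention on one lock, the
overhead the thread proposes to measure once hardware is available.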