From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D13A4C07E94 for ; Fri, 4 Jun 2021 12:51:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A8AC261028 for ; Fri, 4 Jun 2021 12:51:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230142AbhFDMx3 (ORCPT ); Fri, 4 Jun 2021 08:53:29 -0400 Received: from foss.arm.com ([217.140.110.172]:38346 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229718AbhFDMx2 (ORCPT ); Fri, 4 Jun 2021 08:53:28 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id BB57D2B; Fri, 4 Jun 2021 05:51:41 -0700 (PDT) Received: from [192.168.1.179] (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 2AA943F774; Fri, 4 Jun 2021 05:51:39 -0700 (PDT) Subject: Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature To: Catalin Marinas Cc: Marc Zyngier , Will Deacon , James Morse , Julien Thierry , Suzuki K Poulose , kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, Dave Martin , Mark Rutland , Thomas Gleixner , qemu-devel@nongnu.org, Juan Quintela , "Dr. David Alan Gilbert" , Richard Henderson , Peter Maydell , Haibo Xu , Andrew Jones References: <20210524104513.13258-1-steven.price@arm.com> <20210524104513.13258-5-steven.price@arm.com> <20210603160031.GE20338@arm.com> <20210604113658.GD31173@arm.com> From: Steven Price Message-ID: <2265cbf6-d643-9122-79a8-90198ea16c64@arm.com> Date: Fri, 4 Jun 2021 13:51:38 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1 MIME-Version: 1.0 In-Reply-To: <20210604113658.GD31173@arm.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/06/2021 12:36, Catalin Marinas wrote: > On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote: >> On 03/06/2021 17:00, Catalin Marinas wrote: >>> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote: >>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>> index c5d1f3c87dbd..226035cf7d6c 100644 >>>> --- a/arch/arm64/kvm/mmu.c >>>> +++ b/arch/arm64/kvm/mmu.c >>>> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot, >>>> return PAGE_SIZE; >>>> } >>>> >>>> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, >>>> + unsigned long size) >>>> +{ >>>> + if (kvm_has_mte(kvm)) { >>>> + /* >>>> + * The page will be mapped in stage 2 as Normal Cacheable, so >>>> + * the VM will be able to see the page's tags and therefore >>>> + * they must be initialised first. If PG_mte_tagged is set, >>>> + * tags have already been initialised. >>>> + * pfn_to_online_page() is used to reject ZONE_DEVICE pages >>>> + * that may not support tags. >>>> + */ >>>> + unsigned long i, nr_pages = size >> PAGE_SHIFT; >>>> + struct page *page = pfn_to_online_page(pfn); >>>> + >>>> + if (!page) >>>> + return -EFAULT; >>>> + >>>> + for (i = 0; i < nr_pages; i++, page++) { >>>> + /* >>>> + * There is a potential (but very unlikely) race >>>> + * between two VMs which are sharing a physical page >>>> + * entering this at the same time. However by splitting >>>> + * the test/set the only risk is tags being overwritten >>>> + * by the mte_clear_page_tags() call. >>>> + */ >>> >>> And I think the real risk here is when the page is writable by at least >>> one of the VMs sharing the page. This excludes KSM, so it only leaves >>> the MAP_SHARED mappings. >>> >>>> + if (!test_bit(PG_mte_tagged, &page->flags)) { >>>> + mte_clear_page_tags(page_address(page)); >>>> + set_bit(PG_mte_tagged, &page->flags); >>>> + } >>>> + } >>> >>> If we want to cover this race (I'd say in a separate patch), we can call >>> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I >>> got the arguments right). We can avoid the big lock in most cases if >>> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.) >>> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we >>> do for VM_MTE but the new flag would not affect the stage 1 VMM page >>> attributes). >> >> To be honest I'm coming round to just exporting a >> mte_prepare_page_tags() function which does the clear/set with the lock >> held. I doubt it's such a performance critical path that it will cause >> any noticeable issues. Then if we run into performance problems in the >> future we can start experimenting with extra VM flags etc as necessary. > > It works for me. > >> And from your later email: >>> Another idea: if VM_SHARED is found for any vma within a region in >>> kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE >>> for the guest or reject the memory slot if MTE was already enabled. >>> >>> An alternative here would be to clear VM_MTE_ALLOWED so that any >>> subsequent mprotect(PROT_MTE) in the VMM would fail in >>> arch_validate_flags(). MTE would still be allowed in the guest but in >>> the VMM for the guest memory regions. We can probably do this >>> irrespective of VM_SHARED. Of course, the VMM can still mmap() the >>> memory initially with PROT_MTE but that's not an issue IIRC, only the >>> concurrent mprotect(). >> >> This could work, but I worry that it's potential fragile. Also the rules >> for what user space can do are not obvious and may be surprising. I'd >> also want to look into the likes of mremap() to see how easy it would be >> to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED) >> memory sneaking into a memslot. >> >> Unless you think it's worth complicating the ABI in the hope of avoiding >> the big lock overhead I think it's probably best to stick with the big >> lock at least until we have more data on the overhead. > > It's up to Marc but I think for now just make it safe and once we get > our hands on hardware, we can assess the impact. For example, starting > multiple VMs simultaneously will contend on such big lock but we have an > option to optimise it by setting PG_mte_tagged on allocation via a new > VM_* flag. > > For my last suggestion above, changing the VMM ABI afterwards is a bit > tricky, so we could state now that VM_SHARED and MTE are not allowed > (though it needs a patch to enforce it). That's assuming that mprotect() > in the VMM cannot race with the user_mem_abort() on another CPU which > makes the lock necessary anyway. > >>>> + } >>>> + >>>> + return 0; >>>> +} >>>> + >>>> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> struct kvm_memory_slot *memslot, unsigned long hva, >>>> unsigned long fault_status) >>>> @@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> if (writable) >>>> prot |= KVM_PGTABLE_PROT_W; >>>> >>>> - if (fault_status != FSC_PERM && !device) >>>> + if (fault_status != FSC_PERM && !device) { >>>> + ret = sanitise_mte_tags(kvm, pfn, vma_pagesize); >>>> + if (ret) >>>> + goto out_unlock; >>> >>> Maybe it was discussed in a previous version, why do we need this in >>> addition to kvm_set_spte_gfn()? >> >> kvm_set_spte_gfn() is only used for the MMU notifier path (e.g. if a >> memslot is changed by the VMM). For the initial access we will normally >> fault the page into stage 2 with user_mem_abort(). > > Right. Can we move the sanitise_mte_tags() call to > kvm_pgtable_stage2_map() instead or we don't have the all the > information needed? I tried that before: kvm_pgtable_stage2_map() is shared with the hypervisor so sadly we can't go poking around in the host as this breaks on nVHE. I mentioned it in the v12 cover letter but it was in a wall of text: * Move the code to sanitise tags out of user_mem_abort() into its own function. Also call this new function from kvm_set_spte_gfn() as that path was missing the sanitising. Originally I was going to move the code all the way down to kvm_pgtable_stage2_map(). Sadly as that also part of the EL2 hypervisor this breaks nVHE as the code needs to perform actions in the host. The only other option I could see would be to provide a wrapper for kvm_pgtable_stage2_map() in mmu.c which could do the sanitising as necessary. But considering we know the call site in kvm_phys_addr_ioremap() doesn't need handling (PROT_DEVICE is always specified) and there's only two more, it seemed easier just to add the two calls necessary to the new sanitise_mte_tags(). We also have a direct pointer to 'kvm' this way which is much nicer than pointer chasing it out of the kvm_pgtable structure. Steve From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E4688C07E94 for ; Fri, 4 Jun 2021 12:53:08 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 4959F61090 for ; Fri, 4 Jun 2021 12:53:08 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4959F61090 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=arm.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:35864 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1lp9Jz-0000H7-6c for qemu-devel@archiver.kernel.org; Fri, 04 Jun 2021 08:53:07 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:54188) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1lp9Ih-0007Ct-MU for qemu-devel@nongnu.org; Fri, 04 Jun 2021 08:51:47 -0400 Received: from foss.arm.com ([217.140.110.172]:59236) by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1lp9Id-0003a6-Ke for qemu-devel@nongnu.org; Fri, 04 Jun 2021 08:51:46 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id BB57D2B; Fri, 4 Jun 2021 05:51:41 -0700 (PDT) Received: from [192.168.1.179] (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 2AA943F774; Fri, 4 Jun 2021 05:51:39 -0700 (PDT) Subject: Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature To: Catalin Marinas References: <20210524104513.13258-1-steven.price@arm.com> <20210524104513.13258-5-steven.price@arm.com> <20210603160031.GE20338@arm.com> <20210604113658.GD31173@arm.com> From: Steven Price Message-ID: <2265cbf6-d643-9122-79a8-90198ea16c64@arm.com> Date: Fri, 4 Jun 2021 13:51:38 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1 MIME-Version: 1.0 In-Reply-To: <20210604113658.GD31173@arm.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 7bit Received-SPF: pass client-ip=217.140.110.172; envelope-from=steven.price@arm.com; helo=foss.arm.com X-Spam_score_int: -47 X-Spam_score: -4.8 X-Spam_bar: ---- X-Spam_report: (-4.8 / 5.0 requ) BAYES_00=-1.9, NICE_REPLY_A=-0.59, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mark Rutland , Peter Maydell , "Dr. David Alan Gilbert" , Andrew Jones , Haibo Xu , Suzuki K Poulose , qemu-devel@nongnu.org, Marc Zyngier , Juan Quintela , Richard Henderson , linux-kernel@vger.kernel.org, Dave Martin , James Morse , linux-arm-kernel@lists.infradead.org, Thomas Gleixner , Will Deacon , kvmarm@lists.cs.columbia.edu, Julien Thierry Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" On 04/06/2021 12:36, Catalin Marinas wrote: > On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote: >> On 03/06/2021 17:00, Catalin Marinas wrote: >>> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote: >>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>> index c5d1f3c87dbd..226035cf7d6c 100644 >>>> --- a/arch/arm64/kvm/mmu.c >>>> +++ b/arch/arm64/kvm/mmu.c >>>> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot, >>>> return PAGE_SIZE; >>>> } >>>> >>>> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, >>>> + unsigned long size) >>>> +{ >>>> + if (kvm_has_mte(kvm)) { >>>> + /* >>>> + * The page will be mapped in stage 2 as Normal Cacheable, so >>>> + * the VM will be able to see the page's tags and therefore >>>> + * they must be initialised first. If PG_mte_tagged is set, >>>> + * tags have already been initialised. >>>> + * pfn_to_online_page() is used to reject ZONE_DEVICE pages >>>> + * that may not support tags. >>>> + */ >>>> + unsigned long i, nr_pages = size >> PAGE_SHIFT; >>>> + struct page *page = pfn_to_online_page(pfn); >>>> + >>>> + if (!page) >>>> + return -EFAULT; >>>> + >>>> + for (i = 0; i < nr_pages; i++, page++) { >>>> + /* >>>> + * There is a potential (but very unlikely) race >>>> + * between two VMs which are sharing a physical page >>>> + * entering this at the same time. However by splitting >>>> + * the test/set the only risk is tags being overwritten >>>> + * by the mte_clear_page_tags() call. >>>> + */ >>> >>> And I think the real risk here is when the page is writable by at least >>> one of the VMs sharing the page. This excludes KSM, so it only leaves >>> the MAP_SHARED mappings. >>> >>>> + if (!test_bit(PG_mte_tagged, &page->flags)) { >>>> + mte_clear_page_tags(page_address(page)); >>>> + set_bit(PG_mte_tagged, &page->flags); >>>> + } >>>> + } >>> >>> If we want to cover this race (I'd say in a separate patch), we can call >>> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I >>> got the arguments right). We can avoid the big lock in most cases if >>> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.) >>> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we >>> do for VM_MTE but the new flag would not affect the stage 1 VMM page >>> attributes). >> >> To be honest I'm coming round to just exporting a >> mte_prepare_page_tags() function which does the clear/set with the lock >> held. I doubt it's such a performance critical path that it will cause >> any noticeable issues. Then if we run into performance problems in the >> future we can start experimenting with extra VM flags etc as necessary. > > It works for me. > >> And from your later email: >>> Another idea: if VM_SHARED is found for any vma within a region in >>> kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE >>> for the guest or reject the memory slot if MTE was already enabled. >>> >>> An alternative here would be to clear VM_MTE_ALLOWED so that any >>> subsequent mprotect(PROT_MTE) in the VMM would fail in >>> arch_validate_flags(). MTE would still be allowed in the guest but in >>> the VMM for the guest memory regions. We can probably do this >>> irrespective of VM_SHARED. Of course, the VMM can still mmap() the >>> memory initially with PROT_MTE but that's not an issue IIRC, only the >>> concurrent mprotect(). >> >> This could work, but I worry that it's potential fragile. Also the rules >> for what user space can do are not obvious and may be surprising. I'd >> also want to look into the likes of mremap() to see how easy it would be >> to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED) >> memory sneaking into a memslot. >> >> Unless you think it's worth complicating the ABI in the hope of avoiding >> the big lock overhead I think it's probably best to stick with the big >> lock at least until we have more data on the overhead. > > It's up to Marc but I think for now just make it safe and once we get > our hands on hardware, we can assess the impact. For example, starting > multiple VMs simultaneously will contend on such big lock but we have an > option to optimise it by setting PG_mte_tagged on allocation via a new > VM_* flag. > > For my last suggestion above, changing the VMM ABI afterwards is a bit > tricky, so we could state now that VM_SHARED and MTE are not allowed > (though it needs a patch to enforce it). That's assuming that mprotect() > in the VMM cannot race with the user_mem_abort() on another CPU which > makes the lock necessary anyway. > >>>> + } >>>> + >>>> + return 0; >>>> +} >>>> + >>>> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> struct kvm_memory_slot *memslot, unsigned long hva, >>>> unsigned long fault_status) >>>> @@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> if (writable) >>>> prot |= KVM_PGTABLE_PROT_W; >>>> >>>> - if (fault_status != FSC_PERM && !device) >>>> + if (fault_status != FSC_PERM && !device) { >>>> + ret = sanitise_mte_tags(kvm, pfn, vma_pagesize); >>>> + if (ret) >>>> + goto out_unlock; >>> >>> Maybe it was discussed in a previous version, why do we need this in >>> addition to kvm_set_spte_gfn()? >> >> kvm_set_spte_gfn() is only used for the MMU notifier path (e.g. if a >> memslot is changed by the VMM). For the initial access we will normally >> fault the page into stage 2 with user_mem_abort(). > > Right. Can we move the sanitise_mte_tags() call to > kvm_pgtable_stage2_map() instead or we don't have the all the > information needed? I tried that before: kvm_pgtable_stage2_map() is shared with the hypervisor so sadly we can't go poking around in the host as this breaks on nVHE. I mentioned it in the v12 cover letter but it was in a wall of text: * Move the code to sanitise tags out of user_mem_abort() into its own function. Also call this new function from kvm_set_spte_gfn() as that path was missing the sanitising. Originally I was going to move the code all the way down to kvm_pgtable_stage2_map(). Sadly as that also part of the EL2 hypervisor this breaks nVHE as the code needs to perform actions in the host. The only other option I could see would be to provide a wrapper for kvm_pgtable_stage2_map() in mmu.c which could do the sanitising as necessary. But considering we know the call site in kvm_phys_addr_ioremap() doesn't need handling (PROT_DEVICE is always specified) and there's only two more, it seemed easier just to add the two calls necessary to the new sanitise_mte_tags(). We also have a direct pointer to 'kvm' this way which is much nicer than pointer chasing it out of the kvm_pgtable structure. Steve From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0157BC07E94 for ; Fri, 4 Jun 2021 12:51:51 +0000 (UTC) Received: from mm01.cs.columbia.edu (mm01.cs.columbia.edu [128.59.11.253]) by mail.kernel.org (Postfix) with ESMTP id 520606108D for ; Fri, 4 Jun 2021 12:51:50 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 520606108D Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=arm.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvmarm-bounces@lists.cs.columbia.edu Received: from localhost (localhost [127.0.0.1]) by mm01.cs.columbia.edu (Postfix) with ESMTP id 996D84B0F6; Fri, 4 Jun 2021 08:51:49 -0400 (EDT) X-Virus-Scanned: at lists.cs.columbia.edu Received: from mm01.cs.columbia.edu ([127.0.0.1]) by localhost (mm01.cs.columbia.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id YQQ5Clqwvurs; Fri, 4 Jun 2021 08:51:45 -0400 (EDT) Received: from mm01.cs.columbia.edu (localhost [127.0.0.1]) by mm01.cs.columbia.edu (Postfix) with ESMTP id B082340821; Fri, 4 Jun 2021 08:51:45 -0400 (EDT) Received: from localhost (localhost [127.0.0.1]) by mm01.cs.columbia.edu (Postfix) with ESMTP id 293DD40821 for ; Fri, 4 Jun 2021 08:51:44 -0400 (EDT) X-Virus-Scanned: at lists.cs.columbia.edu Received: from mm01.cs.columbia.edu ([127.0.0.1]) by localhost (mm01.cs.columbia.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id ttkM3YcxqtSm for ; Fri, 4 Jun 2021 08:51:42 -0400 (EDT) Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by mm01.cs.columbia.edu (Postfix) with ESMTP id 92891406DD for ; Fri, 4 Jun 2021 08:51:42 -0400 (EDT) Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id BB57D2B; Fri, 4 Jun 2021 05:51:41 -0700 (PDT) Received: from [192.168.1.179] (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 2AA943F774; Fri, 4 Jun 2021 05:51:39 -0700 (PDT) Subject: Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature To: Catalin Marinas References: <20210524104513.13258-1-steven.price@arm.com> <20210524104513.13258-5-steven.price@arm.com> <20210603160031.GE20338@arm.com> <20210604113658.GD31173@arm.com> From: Steven Price Message-ID: <2265cbf6-d643-9122-79a8-90198ea16c64@arm.com> Date: Fri, 4 Jun 2021 13:51:38 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1 MIME-Version: 1.0 In-Reply-To: <20210604113658.GD31173@arm.com> Content-Language: en-GB Cc: "Dr. David Alan Gilbert" , qemu-devel@nongnu.org, Marc Zyngier , Juan Quintela , Richard Henderson , linux-kernel@vger.kernel.org, Dave Martin , linux-arm-kernel@lists.infradead.org, Thomas Gleixner , Will Deacon , kvmarm@lists.cs.columbia.edu X-BeenThere: kvmarm@lists.cs.columbia.edu X-Mailman-Version: 2.1.14 Precedence: list List-Id: Where KVM/ARM decisions are made List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: kvmarm-bounces@lists.cs.columbia.edu Sender: kvmarm-bounces@lists.cs.columbia.edu On 04/06/2021 12:36, Catalin Marinas wrote: > On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote: >> On 03/06/2021 17:00, Catalin Marinas wrote: >>> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote: >>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>> index c5d1f3c87dbd..226035cf7d6c 100644 >>>> --- a/arch/arm64/kvm/mmu.c >>>> +++ b/arch/arm64/kvm/mmu.c >>>> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot, >>>> return PAGE_SIZE; >>>> } >>>> >>>> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, >>>> + unsigned long size) >>>> +{ >>>> + if (kvm_has_mte(kvm)) { >>>> + /* >>>> + * The page will be mapped in stage 2 as Normal Cacheable, so >>>> + * the VM will be able to see the page's tags and therefore >>>> + * they must be initialised first. If PG_mte_tagged is set, >>>> + * tags have already been initialised. >>>> + * pfn_to_online_page() is used to reject ZONE_DEVICE pages >>>> + * that may not support tags. >>>> + */ >>>> + unsigned long i, nr_pages = size >> PAGE_SHIFT; >>>> + struct page *page = pfn_to_online_page(pfn); >>>> + >>>> + if (!page) >>>> + return -EFAULT; >>>> + >>>> + for (i = 0; i < nr_pages; i++, page++) { >>>> + /* >>>> + * There is a potential (but very unlikely) race >>>> + * between two VMs which are sharing a physical page >>>> + * entering this at the same time. However by splitting >>>> + * the test/set the only risk is tags being overwritten >>>> + * by the mte_clear_page_tags() call. >>>> + */ >>> >>> And I think the real risk here is when the page is writable by at least >>> one of the VMs sharing the page. This excludes KSM, so it only leaves >>> the MAP_SHARED mappings. >>> >>>> + if (!test_bit(PG_mte_tagged, &page->flags)) { >>>> + mte_clear_page_tags(page_address(page)); >>>> + set_bit(PG_mte_tagged, &page->flags); >>>> + } >>>> + } >>> >>> If we want to cover this race (I'd say in a separate patch), we can call >>> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I >>> got the arguments right). We can avoid the big lock in most cases if >>> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.) >>> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we >>> do for VM_MTE but the new flag would not affect the stage 1 VMM page >>> attributes). >> >> To be honest I'm coming round to just exporting a >> mte_prepare_page_tags() function which does the clear/set with the lock >> held. I doubt it's such a performance critical path that it will cause >> any noticeable issues. Then if we run into performance problems in the >> future we can start experimenting with extra VM flags etc as necessary. > > It works for me. > >> And from your later email: >>> Another idea: if VM_SHARED is found for any vma within a region in >>> kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE >>> for the guest or reject the memory slot if MTE was already enabled. >>> >>> An alternative here would be to clear VM_MTE_ALLOWED so that any >>> subsequent mprotect(PROT_MTE) in the VMM would fail in >>> arch_validate_flags(). MTE would still be allowed in the guest but in >>> the VMM for the guest memory regions. We can probably do this >>> irrespective of VM_SHARED. Of course, the VMM can still mmap() the >>> memory initially with PROT_MTE but that's not an issue IIRC, only the >>> concurrent mprotect(). >> >> This could work, but I worry that it's potential fragile. Also the rules >> for what user space can do are not obvious and may be surprising. I'd >> also want to look into the likes of mremap() to see how easy it would be >> to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED) >> memory sneaking into a memslot. >> >> Unless you think it's worth complicating the ABI in the hope of avoiding >> the big lock overhead I think it's probably best to stick with the big >> lock at least until we have more data on the overhead. > > It's up to Marc but I think for now just make it safe and once we get > our hands on hardware, we can assess the impact. For example, starting > multiple VMs simultaneously will contend on such big lock but we have an > option to optimise it by setting PG_mte_tagged on allocation via a new > VM_* flag. > > For my last suggestion above, changing the VMM ABI afterwards is a bit > tricky, so we could state now that VM_SHARED and MTE are not allowed > (though it needs a patch to enforce it). That's assuming that mprotect() > in the VMM cannot race with the user_mem_abort() on another CPU which > makes the lock necessary anyway. > >>>> + } >>>> + >>>> + return 0; >>>> +} >>>> + >>>> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> struct kvm_memory_slot *memslot, unsigned long hva, >>>> unsigned long fault_status) >>>> @@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> if (writable) >>>> prot |= KVM_PGTABLE_PROT_W; >>>> >>>> - if (fault_status != FSC_PERM && !device) >>>> + if (fault_status != FSC_PERM && !device) { >>>> + ret = sanitise_mte_tags(kvm, pfn, vma_pagesize); >>>> + if (ret) >>>> + goto out_unlock; >>> >>> Maybe it was discussed in a previous version, why do we need this in >>> addition to kvm_set_spte_gfn()? >> >> kvm_set_spte_gfn() is only used for the MMU notifier path (e.g. if a >> memslot is changed by the VMM). For the initial access we will normally >> fault the page into stage 2 with user_mem_abort(). > > Right. Can we move the sanitise_mte_tags() call to > kvm_pgtable_stage2_map() instead or we don't have the all the > information needed? I tried that before: kvm_pgtable_stage2_map() is shared with the hypervisor so sadly we can't go poking around in the host as this breaks on nVHE. I mentioned it in the v12 cover letter but it was in a wall of text: * Move the code to sanitise tags out of user_mem_abort() into its own function. Also call this new function from kvm_set_spte_gfn() as that path was missing the sanitising. Originally I was going to move the code all the way down to kvm_pgtable_stage2_map(). Sadly as that also part of the EL2 hypervisor this breaks nVHE as the code needs to perform actions in the host. The only other option I could see would be to provide a wrapper for kvm_pgtable_stage2_map() in mmu.c which could do the sanitising as necessary. But considering we know the call site in kvm_phys_addr_ioremap() doesn't need handling (PROT_DEVICE is always specified) and there's only two more, it seemed easier just to add the two calls necessary to the new sanitise_mte_tags(). We also have a direct pointer to 'kvm' this way which is much nicer than pointer chasing it out of the kvm_pgtable structure. Steve _______________________________________________ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3A1E6C07E94 for ; Fri, 4 Jun 2021 12:53:56 +0000 (UTC) Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 081E661090 for ; Fri, 4 Jun 2021 12:53:56 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 081E661090 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=arm.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:Date: Message-ID:From:References:Cc:To:Subject:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Owner; bh=wIqrtn78+b9TyB4t3KX6TGsTsHycGBkMKM6MAJscUvE=; b=jAc75466jAqBaIuFemjVqflzNY WinWjxc7/A+V933C2N4qg/OSLNtjKMo90m7YvPqgA6WyYYzSpwaWouGqQIa0o6J0pyTP888PYxmbG ziZWs6yTH/Xqaw/QmLE6SrW9YYFVMN/9UW0enqS6Mt3AG9dSfnUQzB0YMDviLy/7A4124Z/YBzsLZ UIsZtVWXunp3i+1Fo/Dvn3f1bX0RiTLR6c3PBdaT+Bmy0pFpkX8OFwcTxE7DMBNGhaFsFcFLTIINc tRAAad+BAq+aogZpn/JUlWwwtP6iWm4SyY8OtG+kGm3l81FTmTNQt4i1fy7RlSoWDBuybIP6yBpLB Ox+macdQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1lp9Im-00DZVf-AH; Fri, 04 Jun 2021 12:51:52 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1lp9Ih-00DZV5-Jg for linux-arm-kernel@lists.infradead.org; Fri, 04 Jun 2021 12:51:50 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id BB57D2B; Fri, 4 Jun 2021 05:51:41 -0700 (PDT) Received: from [192.168.1.179] (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 2AA943F774; Fri, 4 Jun 2021 05:51:39 -0700 (PDT) Subject: Re: [PATCH v13 4/8] KVM: arm64: Introduce MTE VM feature To: Catalin Marinas Cc: Marc Zyngier , Will Deacon , James Morse , Julien Thierry , Suzuki K Poulose , kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, Dave Martin , Mark Rutland , Thomas Gleixner , qemu-devel@nongnu.org, Juan Quintela , "Dr. David Alan Gilbert" , Richard Henderson , Peter Maydell , Haibo Xu , Andrew Jones References: <20210524104513.13258-1-steven.price@arm.com> <20210524104513.13258-5-steven.price@arm.com> <20210603160031.GE20338@arm.com> <20210604113658.GD31173@arm.com> From: Steven Price Message-ID: <2265cbf6-d643-9122-79a8-90198ea16c64@arm.com> Date: Fri, 4 Jun 2021 13:51:38 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.8.1 MIME-Version: 1.0 In-Reply-To: <20210604113658.GD31173@arm.com> Content-Language: en-GB X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20210604_055147_787494_35CF368D X-CRM114-Status: GOOD ( 51.13 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On 04/06/2021 12:36, Catalin Marinas wrote: > On Fri, Jun 04, 2021 at 11:42:11AM +0100, Steven Price wrote: >> On 03/06/2021 17:00, Catalin Marinas wrote: >>> On Mon, May 24, 2021 at 11:45:09AM +0100, Steven Price wrote: >>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c >>>> index c5d1f3c87dbd..226035cf7d6c 100644 >>>> --- a/arch/arm64/kvm/mmu.c >>>> +++ b/arch/arm64/kvm/mmu.c >>>> @@ -822,6 +822,42 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot, >>>> return PAGE_SIZE; >>>> } >>>> >>>> +static int sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn, >>>> + unsigned long size) >>>> +{ >>>> + if (kvm_has_mte(kvm)) { >>>> + /* >>>> + * The page will be mapped in stage 2 as Normal Cacheable, so >>>> + * the VM will be able to see the page's tags and therefore >>>> + * they must be initialised first. If PG_mte_tagged is set, >>>> + * tags have already been initialised. >>>> + * pfn_to_online_page() is used to reject ZONE_DEVICE pages >>>> + * that may not support tags. >>>> + */ >>>> + unsigned long i, nr_pages = size >> PAGE_SHIFT; >>>> + struct page *page = pfn_to_online_page(pfn); >>>> + >>>> + if (!page) >>>> + return -EFAULT; >>>> + >>>> + for (i = 0; i < nr_pages; i++, page++) { >>>> + /* >>>> + * There is a potential (but very unlikely) race >>>> + * between two VMs which are sharing a physical page >>>> + * entering this at the same time. However by splitting >>>> + * the test/set the only risk is tags being overwritten >>>> + * by the mte_clear_page_tags() call. >>>> + */ >>> >>> And I think the real risk here is when the page is writable by at least >>> one of the VMs sharing the page. This excludes KSM, so it only leaves >>> the MAP_SHARED mappings. >>> >>>> + if (!test_bit(PG_mte_tagged, &page->flags)) { >>>> + mte_clear_page_tags(page_address(page)); >>>> + set_bit(PG_mte_tagged, &page->flags); >>>> + } >>>> + } >>> >>> If we want to cover this race (I'd say in a separate patch), we can call >>> mte_sync_page_tags(page, __pte(0), false, true) directly (hopefully I >>> got the arguments right). We can avoid the big lock in most cases if >>> kvm_arch_prepare_memory_region() sets a VM_MTE_RESET (tag clear etc.) >>> and __alloc_zeroed_user_highpage() clears the tags on allocation (as we >>> do for VM_MTE but the new flag would not affect the stage 1 VMM page >>> attributes). >> >> To be honest I'm coming round to just exporting a >> mte_prepare_page_tags() function which does the clear/set with the lock >> held. I doubt it's such a performance critical path that it will cause >> any noticeable issues. Then if we run into performance problems in the >> future we can start experimenting with extra VM flags etc as necessary. > > It works for me. > >> And from your later email: >>> Another idea: if VM_SHARED is found for any vma within a region in >>> kvm_arch_prepare_memory_region(), we either prevent the enabling of MTE >>> for the guest or reject the memory slot if MTE was already enabled. >>> >>> An alternative here would be to clear VM_MTE_ALLOWED so that any >>> subsequent mprotect(PROT_MTE) in the VMM would fail in >>> arch_validate_flags(). MTE would still be allowed in the guest but in >>> the VMM for the guest memory regions. We can probably do this >>> irrespective of VM_SHARED. Of course, the VMM can still mmap() the >>> memory initially with PROT_MTE but that's not an issue IIRC, only the >>> concurrent mprotect(). >> >> This could work, but I worry that it's potential fragile. Also the rules >> for what user space can do are not obvious and may be surprising. I'd >> also want to look into the likes of mremap() to see how easy it would be >> to ensure that we couldn't end up with VM_SHARED (or VM_MTE_ALLOWED) >> memory sneaking into a memslot. >> >> Unless you think it's worth complicating the ABI in the hope of avoiding >> the big lock overhead I think it's probably best to stick with the big >> lock at least until we have more data on the overhead. > > It's up to Marc but I think for now just make it safe and once we get > our hands on hardware, we can assess the impact. For example, starting > multiple VMs simultaneously will contend on such big lock but we have an > option to optimise it by setting PG_mte_tagged on allocation via a new > VM_* flag. > > For my last suggestion above, changing the VMM ABI afterwards is a bit > tricky, so we could state now that VM_SHARED and MTE are not allowed > (though it needs a patch to enforce it). That's assuming that mprotect() > in the VMM cannot race with the user_mem_abort() on another CPU which > makes the lock necessary anyway. > >>>> + } >>>> + >>>> + return 0; >>>> +} >>>> + >>>> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> struct kvm_memory_slot *memslot, unsigned long hva, >>>> unsigned long fault_status) >>>> @@ -971,8 +1007,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, >>>> if (writable) >>>> prot |= KVM_PGTABLE_PROT_W; >>>> >>>> - if (fault_status != FSC_PERM && !device) >>>> + if (fault_status != FSC_PERM && !device) { >>>> + ret = sanitise_mte_tags(kvm, pfn, vma_pagesize); >>>> + if (ret) >>>> + goto out_unlock; >>> >>> Maybe it was discussed in a previous version, why do we need this in >>> addition to kvm_set_spte_gfn()? >> >> kvm_set_spte_gfn() is only used for the MMU notifier path (e.g. if a >> memslot is changed by the VMM). For the initial access we will normally >> fault the page into stage 2 with user_mem_abort(). > > Right. Can we move the sanitise_mte_tags() call to > kvm_pgtable_stage2_map() instead or we don't have the all the > information needed? I tried that before: kvm_pgtable_stage2_map() is shared with the hypervisor so sadly we can't go poking around in the host as this breaks on nVHE. I mentioned it in the v12 cover letter but it was in a wall of text: * Move the code to sanitise tags out of user_mem_abort() into its own function. Also call this new function from kvm_set_spte_gfn() as that path was missing the sanitising. Originally I was going to move the code all the way down to kvm_pgtable_stage2_map(). Sadly as that also part of the EL2 hypervisor this breaks nVHE as the code needs to perform actions in the host. The only other option I could see would be to provide a wrapper for kvm_pgtable_stage2_map() in mmu.c which could do the sanitising as necessary. But considering we know the call site in kvm_phys_addr_ioremap() doesn't need handling (PROT_DEVICE is always specified) and there's only two more, it seemed easier just to add the two calls necessary to the new sanitise_mte_tags(). We also have a direct pointer to 'kvm' this way which is much nicer than pointer chasing it out of the kvm_pgtable structure. Steve _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel