From: Steven Price
To: Catalin Marinas
Cc: David Hildenbrand, Mark Rutland, Peter Maydell, "Dr. David Alan Gilbert", Andrew Jones, Haibo Xu, Suzuki K Poulose, qemu-devel@nongnu.org, Marc Zyngier, Juan Quintela, Richard Henderson, linux-kernel@vger.kernel.org, Dave Martin, James Morse, linux-arm-kernel@lists.infradead.org, Thomas Gleixner, Will Deacon, kvmarm@lists.cs.columbia.edu, Julien Thierry
Subject: Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
Date: Wed, 7 Apr 2021 16:52:54 +0100
Message-ID: <5e5bf772-1e4d-ca59-a9d8-058a72dfad4f@arm.com>
In-Reply-To: <20210407151458.GC21451@arm.com>
References: <20210327152324.GA28167@arm.com> <20210328122131.GB17535@arm.com>
 <20210330103013.GD18075@arm.com> <8977120b-841d-4882-2472-6e403bc9c797@redhat.com>
 <20210331092109.GA21921@arm.com> <86a968c8-7a0e-44a4-28c3-bac62c2b7d65@arm.com>
 <20210331184311.GA10737@arm.com> <20210407151458.GC21451@arm.com>

On 07/04/2021 16:14, Catalin Marinas wrote:
> On Wed, Apr 07, 2021 at 11:20:18AM +0100, Steven Price wrote:
>> On 31/03/2021 19:43, Catalin Marinas wrote:
>>> When a slot is added by the VMM, if it asked for MTE in guest (I guess
>>> that's an opt-in by the VMM, haven't checked the other patches), can we
>>> reject it if it's going to be mapped as Normal Cacheable but it is a
>>> ZONE_DEVICE (i.e. !kvm_is_device_pfn() + one of David's suggestions to
>>> check for ZONE_DEVICE)? This way we don't need to do more expensive
>>> checks in set_pte_at().
>>
>> The problem is that KVM allows the VMM to change the memory backing a slot
>> while the guest is running.
>> This is obviously useful for the likes of
>> migration, but ultimately means that even if you were to do checks at the
>> time of slot creation, you would need to repeat the checks at set_pte_at()
>> time to ensure a mischievous VMM didn't swap the page for a problematic one.
>
> Does changing the slot require some KVM API call? Can we intercept it
> and do the checks there?

As David has already replied - KVM uses MMU notifiers, so there's not
really a good place to intercept this before the fault.

> Maybe a better alternative for the time being is to add a new
> kvm_is_zone_device_pfn() and force KVM_PGTABLE_PROT_DEVICE if it returns
> true _and_ the VMM asked for MTE in guest. We can then only set
> PG_mte_tagged if !device.

KVM already has a kvm_is_device_pfn(), and yes, I agree that restricting
the MTE checks to only !kvm_is_device_pfn() makes sense (I have the fix in
my branch locally).

> We can later relax this further to Normal Non-cacheable for ZONE_DEVICE
> memory (via a new KVM_PGTABLE_PROT_NORMAL_NC) or even Normal Cacheable
> if we manage to change the behaviour of the architecture.

Indeed, it'll be interesting to see whether people want to build MTE
capable systems with significant quantities of non-MTE capable memory. But
for a first stage, let's stick with either all guest memory (except
devices) being MTE capable, or MTE being disabled for the guest.

>>> We could add another PROT_TAGGED or something which means PG_mte_tagged
>>> set but still mapped as Normal Untagged. It's just that we are short of
>>> pte bits for another flag.
>>
>> That could help here - although it's slightly odd as you're asking the
>> kernel to track the tags, but not allowing user space (direct) access to
>> them. Like you say, using up the precious bits for this seems like it might
>> be short-sighted.
>
> Yeah, let's scrap this idea. We set PG_mte_tagged in user_mem_abort(),
> so we already know it's a page potentially containing tags.
> On restoring from swap, we need to check for MTE metadata irrespective of
> whether the user pte is tagged or not, as you already did in patch 1.
> I'll get back to that and look at the potential races.
>
> BTW, after a page is restored from swap, how long do we keep the
> metadata around? I think we can delete it as soon as it was restored and
> PG_mte_tagged was set. Currently it looks like we only do this when the
> actual page was freed or swapoff. I haven't convinced myself that it's
> safe to do this for swapoff unless it guarantees that all the ptes
> sharing a page have been restored.
>

My initial thought was to free the metadata immediately. However, it turns
out that the following sequence can happen:

 1. Swap out a page
 2. Swap the page in *read only*
 3. Discard the page
 4. Swap the page in again

So the swap data isn't rewritten before the page is discarded in (3). This
works nicely with a swap device because after writing a page it stays there
forever, so if you know it hasn't been modified it's pointless rewriting
it. Sadly it's not quite so ideal with the MTE tags, which are currently
kept in RAM. Arguably it would make sense to modify the on-disk swap format
to include the tags - but that would open a whole new can of worms!

swapoff needs to ensure that all the PTEs have been restored because after
the swapoff has completed the PTEs will be pointing at a swap entry which
is no longer valid (and could even have been reallocated to point to a new
swap device). When you issue a swapoff, Linux will scan the mmlist and the
page tables of every process to search for swap entry PTEs relating to the
swap which is being removed (see try_to_unuse()).
Steve