linux-kernel.vger.kernel.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Steven Price <steven.price@arm.com>,
	Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>,
	Peter Maydell <peter.maydell@linaro.org>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Andrew Jones <drjones@redhat.com>, Haibo Xu <Haibo.Xu@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	qemu-devel@nongnu.org, Marc Zyngier <maz@kernel.org>,
	Juan Quintela <quintela@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>,
	linux-kernel@vger.kernel.org, Dave Martin <Dave.Martin@arm.com>,
	James Morse <james.morse@arm.com>,
	linux-arm-kernel@lists.infradead.org,
	Thomas Gleixner <tglx@linutronix.de>,
	Will Deacon <will@kernel.org>,
	kvmarm@lists.cs.columbia.edu,
	Julien Thierry <julien.thierry.kdev@gmail.com>
Subject: Re: [PATCH v10 2/6] arm64: kvm: Introduce MTE VM feature
Date: Wed, 31 Mar 2021 16:14:12 +0200
Message-ID: <eebe69ad-1a2f-91ee-75b1-f295d7a3389c@redhat.com>
In-Reply-To: <86a968c8-7a0e-44a4-28c3-bac62c2b7d65@arm.com>

On 31.03.21 12:41, Steven Price wrote:
> On 31/03/2021 10:32, David Hildenbrand wrote:
>> On 31.03.21 11:21, Catalin Marinas wrote:
>>> On Wed, Mar 31, 2021 at 09:34:44AM +0200, David Hildenbrand wrote:
>>>> On 30.03.21 12:30, Catalin Marinas wrote:
>>>>> On Mon, Mar 29, 2021 at 05:06:51PM +0100, Steven Price wrote:
>>>>>> On 28/03/2021 13:21, Catalin Marinas wrote:
>>>>>>> On Sat, Mar 27, 2021 at 03:23:24PM +0000, Catalin Marinas wrote:
>>>>>>>> On Fri, Mar 12, 2021 at 03:18:58PM +0000, Steven Price wrote:
>>>>>>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>>>>>>> index 77cb2d28f2a4..b31b7a821f90 100644
>>>>>>>>> --- a/arch/arm64/kvm/mmu.c
>>>>>>>>> +++ b/arch/arm64/kvm/mmu.c
>>>>>>>>> @@ -879,6 +879,22 @@ static int user_mem_abort(struct kvm_vcpu
>>>>>>>>> *vcpu, phys_addr_t fault_ipa,
>>>>>>>>>          if (vma_pagesize == PAGE_SIZE && !force_pte)
>>>>>>>>>              vma_pagesize = transparent_hugepage_adjust(memslot,
>>>>>>>>> hva,
>>>>>>>>>                                     &pfn, &fault_ipa);
>>>>>>>>> +
>>>>>>>>> +    if (fault_status != FSC_PERM && kvm_has_mte(kvm) &&
>>>>>>>>> pfn_valid(pfn)) {
>>>>>>>>> +        /*
>>>>>>>>> +         * VM will be able to see the page's tags, so we must
>>>>>>>>> ensure
>>>>>>>>> +         * they have been initialised. if PG_mte_tagged is set,
>>>>>>>>> tags
>>>>>>>>> +         * have already been initialised.
>>>>>>>>> +         */
>>>>>>>>> +        struct page *page = pfn_to_page(pfn);
>>>>>>>>> +        unsigned long i, nr_pages = vma_pagesize >> PAGE_SHIFT;
>>>>>>>>> +
>>>>>>>>> +        for (i = 0; i < nr_pages; i++, page++) {
>>>>>>>>> +            if (!test_and_set_bit(PG_mte_tagged, &page->flags))
>>>>>>>>> +                mte_clear_page_tags(page_address(page));
>>>>>>>>> +        }
>>>>>>>>> +    }
>>>>>>>>
>>>>>>>> This pfn_valid() check may be problematic. Following commit
>>>>>>>> eeb0753ba27b
>>>>>>>> ("arm64/mm: Fix pfn_valid() for ZONE_DEVICE based memory"), it
>>>>>>>> returns
>>>>>>>> true for ZONE_DEVICE memory but such memory is allowed not to
>>>>>>>> support
>>>>>>>> MTE.
>>>>>>>
>>>>>>> Some more thinking, this should be safe as any ZONE_DEVICE would be
>>>>>>> mapped as untagged memory in the kernel linear map. It could be
>>>>>>> slightly
>>>>>>> inefficient if it unnecessarily tries to clear tags in ZONE_DEVICE,
>>>>>>> untagged memory. Another overhead is pfn_valid() which will likely
>>>>>>> end
>>>>>>> up calling memblock_is_map_memory().
>>>>>>>
>>>>>>> However, the bigger issue is that Stage 2 cannot disable tagging for
>>>>>>> Stage 1 unless the memory is Non-cacheable or Device at S2. Is
>>>>>>> there a
>>>>>>> way to detect what gets mapped in the guest as Normal Cacheable
>>>>>>> memory
>>>>>>> and make sure it's only early memory or hotplug but no ZONE_DEVICE
>>>>>>> (or
>>>>>>> something else like on-chip memory)?  If we can't guarantee that all
>>>>>>> Cacheable memory given to a guest supports tags, we should disable
>>>>>>> the
>>>>>>> feature altogether.
>>>>>>
>>>>>> In stage 2 I believe we only have two types of mapping - 'normal' or
>>>>>> DEVICE_nGnRE (see stage2_map_set_prot_attr()). Filtering out the
>>>>>> latter is a
>>>>>> case of checking the 'device' variable, and makes sense to avoid the
>>>>>> overhead you describe.
>>>>>>
>>>>>> This should also guarantee that all stage-2 cacheable memory
>>>>>> supports tags,
>>>>>> as kvm_is_device_pfn() is simply !pfn_valid(), and pfn_valid()
>>>>>> should only
>>>>>> be true for memory that Linux considers "normal".
>>>>
>>>> If you think "normal" == "normal System RAM", that's wrong; see below.
>>>
>>> By "normal" I think both Steven and I meant the Normal Cacheable memory
>>> attribute (another being the Device memory attribute).
> 
> Sadly there's no good standardised terminology here. AArch64 provides
> the "normal (cacheable)" definition. Memory which is mapped as "Normal
> Cacheable" is implicitly MTE capable when shared with a guest (because
> the stage 2 mappings don't allow restricting MTE other than mapping it
> as Device memory).
> 
> So MTE also forces us to have a definition of memory which is "bog
> standard memory"[1] separate from the mapping attributes. This is the
> main memory which fully supports MTE.
> 
> Separate from the "bog standard" we have the "special"[1] memory, e.g.
> ZONE_DEVICE memory may be mapped as "Normal Cacheable" at stage 1 but
> that memory may not support MTE tags. This memory can only be safely
> shared with a guest in the following situations:
> 
>    1. MTE is completely disabled for the guest
> 
>    2. The stage 2 mappings are 'device' (e.g. DEVICE_nGnRE)
> 
>    3. We have some guarantee that guest MTE accesses are in some way safe.
> 
> (1) is the situation today (without this patch series). But it prevents
> the guest from using MTE in any form.
> 
> (2) is pretty terrible for general memory, but is the get-out clause for
> mapping devices into the guest.
> 
> (3) isn't something we have any architectural way of discovering. We'd
> need to know what the device did with the MTE accesses (and any caches
> between the CPU and the device) to ensure there aren't any side-channels
> or h/w lockup issues. We'd also need some way of describing this memory
> to the guest.
> 
> So at least for the time being the approach is to avoid letting a guest
> with MTE enabled have access to this sort of memory.
> 
> [1] Neither "bog standard" nor "special" are real terms - like I said
> there's a lack of standardised terminology.
> 
>>>>> That's the problem. With Anshuman's commit I mentioned above,
>>>>> pfn_valid() returns true for ZONE_DEVICE mappings (e.g. persistent
>>>>> memory, not talking about some I/O mapping that requires Device_nGnRE).
>>>>> So kvm_is_device_pfn() is false for such memory and it may be mapped as
>>>>> Normal but it is not guaranteed to support tagging.
>>>>
>>>> pfn_valid() means "there is a struct page"; if you do pfn_to_page() and
>>>> touch the page, you won't fault. So Anshuman's commit is correct.
>>>
>>> I agree.
>>>
>>>> pfn_to_online_page() means, "there is a struct page and it's system RAM
>>>> that's in use; the memmap has a sane content"
>>>
>>> Does pfn_to_online_page() return a valid struct page pointer for
>>> ZONE_DEVICE pages? IIUC, these are not guaranteed to be system RAM, for
>>> some definition of system RAM (I assume NVDIMM != system RAM). For
>>> example, pmem_attach_disk() calls devm_memremap_pages() and this would
>>> use the Normal Cacheable memory attribute without necessarily being
>>> system RAM.
>>
>> No, not for ZONE_DEVICE.
>>
>> However, if you expose PMEM via dax/kmem as System RAM to the system (->
>> add_memory_driver_managed()), then PMEM (managed via ZONE_NORMAL or
>> ZONE_MOVABLE) would work with pfn_to_online_page() -- as the system
>> thinks it's "ordinary system RAM" and the memory is managed by the buddy.
> 
> So if I'm understanding this correctly, for KVM we need to use
> pfn_to_online_page() and reject if NULL is returned? In the case of
> dax/kmem there already needs to be validation that the memory supports
> MTE (otherwise we break user space) before it's allowed into the
> "ordinary system RAM" bucket.

That should work.
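
As a rough illustration of that check, here is a userspace model (not kernel
code). Every model_* name, the three-field page struct, and the four-entry
memmap are assumptions made up for this sketch; only the idea "reject the PFN
when the online-page lookup returns NULL" comes from the discussion above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel concepts; not the kernel's definitions. */
enum model_zone { MODEL_ZONE_NORMAL, MODEL_ZONE_MOVABLE, MODEL_ZONE_DEVICE };

struct model_page {
	bool present;		/* a struct page exists (what pfn_valid() reports) */
	bool online;		/* online system RAM managed by the buddy */
	enum model_zone zone;
};

#define MODEL_NR_PFNS 4

static struct model_page model_memmap[MODEL_NR_PFNS] = {
	{ true,  true,  MODEL_ZONE_NORMAL  },	/* pfn 0: boot memory */
	{ true,  true,  MODEL_ZONE_MOVABLE },	/* pfn 1: PMEM added via dax/kmem */
	{ true,  false, MODEL_ZONE_DEVICE  },	/* pfn 2: ZONE_DEVICE PMEM */
	{ false, false, MODEL_ZONE_NORMAL  },	/* pfn 3: hole, no struct page */
};

/* Models pfn_to_online_page(): NULL unless the page is online system RAM. */
static struct model_page *model_pfn_to_online_page(unsigned long pfn)
{
	if (pfn >= MODEL_NR_PFNS)
		return NULL;
	if (!model_memmap[pfn].present || !model_memmap[pfn].online)
		return NULL;
	return &model_memmap[pfn];
}

/* The proposed KVM-side rule: only online pages may be given to an MTE guest. */
static bool model_pfn_may_be_tagged(unsigned long pfn)
{
	return model_pfn_to_online_page(pfn) != NULL;
}
```

In this model pfn 0 and pfn 1 are accepted, while the ZONE_DEVICE page, the
hole, and any out-of-range PFN are rejected.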

1. One alternative is

if (!pfn_valid(pfn))
	return false;	/* no struct page for this PFN */
#ifdef CONFIG_ZONE_DEVICE
page = pfn_to_page(pfn);
if (page_zonenum(page) == ZONE_DEVICE)
	return false;	/* ZONE_DEVICE memory may not support MTE */
#endif
return true;


Note that when you are dealing with random PFNs, this approach is in 
general not safe; the memmap could be uninitialized and contain garbage. 
You can have false positives for ZONE_DEVICE.
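
To make the false-positive case concrete, here is a small userspace model (not
kernel code). The two-bit zone field in the low bits of flags is invented for
this sketch and does not match the kernel's real page->flags layout; all
model_* names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Invented encoding: zone number lives in the low two bits of flags. */
enum { MODEL_ZONE_NORMAL = 0, MODEL_ZONE_MOVABLE = 1, MODEL_ZONE_DEVICE = 2 };
#define MODEL_ZONE_MASK 0x3UL

struct model_page {
	unsigned long flags;
};

/* Models page_zonenum(): decodes whatever happens to be stored in flags. */
static int model_page_zonenum(struct model_page page)
{
	return (int)(page.flags & MODEL_ZONE_MASK);
}

/* Models alternative 1 after the pfn_valid() step: reject ZONE_DEVICE. */
static bool model_alt1_allows(struct model_page page)
{
	return model_page_zonenum(page) != MODEL_ZONE_DEVICE;
}

/* A memmap entry nobody initialised: every byte holds the same filler. */
static struct model_page model_garbage_page(unsigned char fill)
{
	struct model_page page;

	memset(&page, fill, sizeof(page));
	return page;
}
```

With a filler byte of 0xAA the low two bits decode as MODEL_ZONE_DEVICE, so
the page is rejected although nothing ever said it was device memory; with a
filler of 0x00 the same garbage is accepted. Either way the answer depends on
leftover bits rather than on what the memory is.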


2. Yet another (slower?) variant to detect (some?) ZONE_DEVICE is

pgmap = get_dev_pagemap(pfn, NULL);
put_dev_pagemap(pgmap);

if (pgmap)
	return false;
return true;



I know that /dev/mem mappings can be problematic ... because the memmap 
could be in any state and actually we shouldn't even touch/rely on any 
"struct pages" at all, as we have a pure PFN mapping ...

-- 
Thanks,

David / dhildenb


