From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev,
	will@kernel.org, aniketa@nvidia.com, cjia@nvidia.com,
	kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com,
	acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com,
	danw@nvidia.com, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
Date: Thu, 12 Oct 2023 17:16:17 +0100	[thread overview]
Message-ID: <ZSgb0WBSsXHHYJT0@arm.com> (raw)
In-Reply-To: <20231011183839.GC3952@nvidia.com>

On Wed, Oct 11, 2023 at 03:38:39PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote:
> > But for ZONE_DEVICE ranges, these are not guaranteed to support all the
> > characteristics of the main RAM. I think that's what memremap_pages()
> > gives us. I'm not too familiar with this part of the kernel but IIUC
> > that falls under the HMM category, so not interchangeable with the
> > normal RAM (hotplugged or not).
> 
> DAX pages use ZONE_DEVICE and they are cacheable, and not "HMM".
> 
> They are not fully interchangeable, but they get into the page cache,
> they can back .data segments, they can be subject to atomics, etc. So
> they should be fully functional like DDR.

Unfortunately the Arm architecture draws a distinction between
"cacheable" and "cacheable tagged". We currently have no way of
describing this in firmware tables, so we rely on the hardware or
firmware not advertising MTE if such memory can end up as general
purpose RAM.

That said, DAX mappings are safe since the vma would not have
VM_MTE_ALLOWED set, so no mmap(PROT_MTE) is possible.
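
(For context, the thing that blocks mmap(PROT_MTE) on such vmas is the
arm64 arch_validate_flags() hook; from memory it is roughly the below,
though the exact form may have changed:)

	/* arch/arm64/include/asm/mman.h, simplified sketch */
	static inline bool arch_validate_flags(unsigned long vm_flags)
	{
		if (!system_supports_mte())
			return true;

		/* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */
		return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
	}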

> > I don't see the mm code doing this but I haven't looked deep enough.
> > At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
> > allocating ZONE_DEVICE pages and passing them to the user.
> 
> Not ZONE_DEVICE. One popular coherent GPU approach is to use
> ZONE_MOVABLE pages.

OK, so presumably the driver could tell which architecture it is
running on and plug in the memory appropriately (or refuse to). That's
a bit more arch knowledge in a (generic) driver than I'd like, but we
don't have a way to describe or probe this yet. Maybe a firmware
configuration would just turn MTE off in this case (SCR_EL3.ATA=0 and
some other chicken bit or tie-off so that the ID registers don't
advertise MTE).

> > If a VMM wants to mmap() such GPU memory and give it to the guest as
> > general purpose RAM, it should make sure it has all the characteristics
> > as advertised by the CPU or disable certain features (if it can).
> 
> This is the VFIO flow we are talking about here, I think: PFNMAP
> memory that goes into a VM and is cacheable.
> 
> > Currently we don't have a way to tell what such memory supports (neither
> > ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is
> > that it doesn't.
> 
> Indeed, but my GPU driver hot-plugged it as ZONE_MOVABLE and my VFIO
> driver turned it into PFNMAP. So these things seem incompatible.

So can we end up with the same pfn mapped in two different vmas, one
backed by struct page and another with VM_PFNMAP (implying no struct
page)? I don't know how this is supposed to work. Even if the memory
supports MTE, we rely on the struct page to track the status of the tags
(the PG_mte_tagged flag). We may get away with this if user_mem_abort()
-> sanitise_mte_tags() -> pfn_to_page() finds a page structure but I'd
actually prevent this path altogether if VM_PFNMAP (well, we may
actually have a bug if we don't already do this).
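
(Sketch from memory of the relevant user_mem_abort() hunk, so the exact
conditions may differ in current mainline:)

	if (fault_status != ESR_ELx_FSC_PERM && !device && kvm_has_mte(kvm)) {
		/* Check the VMM hasn't introduced a new disallowed VMA */
		if (mte_allowed) {
			sanitise_mte_tags(kvm, pfn, vma_pagesize);
		} else {
			ret = -EFAULT;
			goto out_unlock;
		}
	}

So sanitise_mte_tags() is only reached when the vma had VM_MTE_ALLOWED;
the open question is whether a VM_PFNMAP vma can get here with 'device'
false at all.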

> > From the earlier discussions, we can probably ignore VM_IO
> > since we won't have a cacheable mapping with this flag. Not sure about
> > VM_PFNMAP.
> 
> PFNMAP is the interesting one for VFIO, at least. Can we use the same
> reasoning that it will be !VM_MTE_ALLOWED and close the MTE
> discussion?
> 
> Currently no VFIO driver doing cacheable mappings has memory that is
> different from DDR memory. So this is a somewhat theoretical
> discussion about future cacheable HW that uses VFIO and does have
> such non-uniformity.
> 
> Maybe that HW should set VM_IO on its VFIO PFN map and obviously not
> use ZONE_MOVABLE?

I think we should keep VM_IO only for memory-mapped I/O with side
effects. Other ranges can be VM_PFNMAP if not backed by struct page.
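
(To illustrate the distinction, a purely hypothetical driver mmap()
could set up a cacheable, PFN-mapped range without VM_IO, something
like:)

	/* hypothetical example: PFN map, cacheable, no VM_IO side effects */
	static int example_mmap(struct file *file, struct vm_area_struct *vma)
	{
		vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
		/* PFNs inserted later via vmf_insert_pfn() from the fault handler */
		return 0;
	}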

> Where does that leave us for this patch? We check VM_MTE_ALLOWED and
> check for ZONE_MOVABLE struct pages as one of the conditions for
> NORMAL?

I think we should keep it as simple as possible and, looking at it
again, maybe even ignore vm_page_prot. Two questions though:

1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
   MT_NORMAL_TAGGED?

2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and
   which end up in user_mem_abort()) imply VM_IO?

If yes to both, I think something like below would do:

	mte_allowed = kvm_vma_mte_allowed(vma);
	noncacheable = false;				// or 'device' as in user_mem_abort()
	...
	if (vma->vm_flags & VM_IO)			// replaces !pfn_is_map_memory()
		noncacheable = true;
	else if (!mte_allowed && kvm_has_mte())
		noncacheable = true;
	...
	if (noncacheable)
		prot |= KVM_PGTABLE_PROT_DEVICE;	// or the new KVM_PGTABLE_PROT_NC

mte_allowed would cover DAX mappings (and, who knows, some future DAX
mapping may allow MTE, with the driver explicitly setting the flag).
Anything else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED
set or MTE disabled altogether.
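
(For reference, kvm_vma_mte_allowed() is just a flag check, something
like:)

	static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
	{
		return vma->vm_flags & VM_MTE_ALLOWED;
	}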

-- 
Catalin
