From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev,
	will@kernel.org, aniketa@nvidia.com, cjia@nvidia.com,
	kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com,
	acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com,
	danw@nvidia.com, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
Date: Thu, 12 Oct 2023 17:16:17 +0100	[thread overview]
Message-ID: <ZSgb0WBSsXHHYJT0@arm.com> (raw)
In-Reply-To: <20231011183839.GC3952@nvidia.com>

On Wed, Oct 11, 2023 at 03:38:39PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote:
> > But for ZONE_DEVICE ranges, these are not guaranteed to support all the
> > characteristics of the main RAM. I think that's what memremap_pages()
> > gives us. I'm not too familiar with this part of the kernel but IIUC
> > that falls under the HMM category, so not interchangeable with the
> > normal RAM (hotplugged or not).
> 
> DAX pages use ZONE_DEVICE and they are cacheable, and not "HMM".
> 
> They are not fully interchangeable, but they get into the page cache,
> they can back .data segments, they can be subject to atomics, etc. So
> they should be fully functional like DDR.

Unfortunately the Arm architecture draws a distinction between
"cacheable" and "cacheable tagged". We currently have no way of
describing this in firmware tables, so we rely on the hardware or
firmware not advertising MTE if such memory can end up as general
purpose RAM.

That said, DAX mappings are safe since the vma would not have
VM_MTE_ALLOWED set, so no mmap(PROT_MTE) is possible.
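
(For context, the thing that blocks mmap(PROT_MTE) on such vmas is the
arm64 arch_validate_flags() hook; from memory it is roughly the below,
though the exact form may have changed:)

	/* arch/arm64/include/asm/mman.h, simplified sketch */
	static inline bool arch_validate_flags(unsigned long vm_flags)
	{
		if (!system_supports_mte())
			return true;

		/* only allow VM_MTE if VM_MTE_ALLOWED has been set previously */
		return !(vm_flags & VM_MTE) || (vm_flags & VM_MTE_ALLOWED);
	}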

> > I don't see the mm code doing this but I haven't looked deep enough.
> > At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
> > allocating ZONE_DEVICE pages and passing them to the user.
> 
> Not ZONE_DEVICE. One popular coherent GPU approach is to use
> ZONE_MOVABLE pages.

OK, so presumably the driver could tell which architecture it is
running on and plug in the memory appropriately (or refuse to). That's
a bit more arch knowledge in a (generic) driver than I'd like, but we
don't have a way to describe or probe this yet. Maybe a firmware
configuration would just turn MTE off in this case (SCR_EL3.ATA=0 and
some other chicken bit or tie-off so that the ID registers don't
advertise MTE).

> > If a VMM wants to mmap() such GPU memory and give it to the guest as
> > general purpose RAM, it should make sure it has all the characteristics
> > as advertised by the CPU or disable certain features (if it can).
> 
> This is the VFIO flow we are talking about here, I think: PFNMAP
> memory that goes into a VM and is cacheable.
> 
> > Currently we don't have a way to tell what such memory supports (neither
> > ACPI tables nor any hardware probing). The same assumption w.r.t. MTE is
> > that it doesn't.
> 
> Indeed, but my GPU driver hot-plugged it as ZONE_MOVABLE and my VFIO
> driver turned it into PFNMAP. So these things seem incompatible.

So can we end up with the same pfn mapped in two different vmas, one
backed by struct page and another with VM_PFNMAP (implying no struct
page)? I don't know how this is supposed to work. Even if the memory
supports MTE, we rely on the struct page to track the status of the tags
(the PG_mte_tagged flag). We may get away with this if user_mem_abort()
-> sanitise_mte_tags() -> pfn_to_page() finds a page structure but I'd
actually prevent this path altogether if VM_PFNMAP (well, we may
actually have a bug if we don't already do this).
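
(Sketch from memory of the relevant user_mem_abort() hunk, so the exact
conditions may differ in current mainline:)

	if (fault_status != ESR_ELx_FSC_PERM && !device && kvm_has_mte(kvm)) {
		/* Check the VMM hasn't introduced a new disallowed VMA */
		if (mte_allowed) {
			sanitise_mte_tags(kvm, pfn, vma_pagesize);
		} else {
			ret = -EFAULT;
			goto out_unlock;
		}
	}

So sanitise_mte_tags() is only reached when the vma had VM_MTE_ALLOWED;
the open question is whether a VM_PFNMAP vma can get here with 'device'
false at all.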

> > From the earlier discussions, we can probably ignore VM_IO
> > since we won't have a cacheable mapping with this flag. Not sure about
> > VM_PFNMAP.
> 
> PFNMAP is the interesting one for VFIO, at least. Can we use the same
> reasoning that it will be !VM_MTE_ALLOWED and close the MTE
> discussion?
> 
> Currently no VFIO driver doing cacheable mappings has memory that is
> different from DDR memory. So this is a somewhat theoretical
> discussion about future cacheable HW that uses VFIO and does have
> such non-uniformity.
> 
> Maybe that HW should set VM_IO on its VFIO PFN map and obviously not
> use ZONE_MOVABLE?

I think we should keep VM_IO only for memory-mapped I/O with side
effects. Other ranges can be VM_PFNMAP if not backed by struct page.
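
(To illustrate the distinction, a purely hypothetical driver mmap()
could set up a cacheable, PFN-mapped range without VM_IO, something
like:)

	/* hypothetical example: PFN map, cacheable, no VM_IO side effects */
	static int example_mmap(struct file *file, struct vm_area_struct *vma)
	{
		vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
		/* PFNs inserted later via vmf_insert_pfn() from the fault handler */
		return 0;
	}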

> Where does that leave us for this patch? We check VM_MTE_ALLOWED and
> check for ZONE_MOVABLE struct pages as one of the conditions for
> NORMAL?

I think we should keep it as simple as possible and, looking at it
again, maybe even ignore vm_page_prot. Two questions though:

1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
   MT_NORMAL_TAGGED?

2. Do all I/O ranges (side-effects, non-RAM) mapped into a guest (and
   which end up in user_mem_abort()) imply VM_IO?

If yes to both, I think something like below would do:

	mte_allowed = kvm_vma_mte_allowed(vma);
	noncacheable = false;				// or 'device' as in user_mem_abort()
	...
	if (vma->vm_flags & VM_IO)			// replaces !pfn_is_map_memory()
		noncacheable = true;
	else if (!mte_allowed && kvm_has_mte())
		noncacheable = true;
	...
	if (noncacheable)
		prot |= KVM_PGTABLE_PROT_DEVICE;	// or the new KVM_PGTABLE_PROT_NC

mte_allowed would cover DAX mappings (and, who knows, some future DAX
mapping may allow MTE, with the driver explicitly setting the flag).
Anything else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED
set or MTE disabled altogether.
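
(For reference, kvm_vma_mte_allowed() is just a flag check, something
like:)

	static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
	{
		return vma->vm_flags & VM_MTE_ALLOWED;
	}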

-- 
Catalin
