From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev,
    will@kernel.org, aniketa@nvidia.com, cjia@nvidia.com,
    kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com,
    acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com,
    danw@nvidia.com, linux-arm-kernel@lists.infradead.org,
    kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
Date: Thu, 12 Oct 2023 17:16:17 +0100
Message-ID: <ZSgb0WBSsXHHYJT0@arm.com>
In-Reply-To: <20231011183839.GC3952@nvidia.com>

On Wed, Oct 11, 2023 at 03:38:39PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 11, 2023 at 06:45:52PM +0100, Catalin Marinas wrote:
> > But for ZONE_DEVICE ranges, these are not guaranteed to support all
> > the characteristics of the main RAM. I think that's what
> > memremap_pages() gives us. I'm not too familiar with this part of the
> > kernel but IIUC that falls under the HMM category, so not
> > interchangeable with the normal RAM (hotplugged or not).
>
> DAX pages use ZONE_DEVICE and they are cacheable, and not "HMM".
>
> They are not fully interchangeable, but they get into the page cache,
> they can back .data segments, they could be subject to atomics/etc. So
> they should be fully functional like DDR.

Unfortunately the Arm architecture makes the distinction between
"cacheable" and "cacheable tagged". We don't currently have any way of
describing this in the firmware tables, so we rely on the hardware or
firmware not advertising MTE if such memory ends up as general purpose.
That said, DAX mappings are safe since the vma would not have
VM_MTE_ALLOWED set, so no mmap(PROT_MTE) is possible.

> > I don't see the mm code doing this but I haven't looked deep enough.
> > At least not in the way of doing an mmap(MAP_ANONYMOUS) and the
> > kernel allocating ZONE_DEVICE pages and passing them to the user.
>
> Not ZONE_DEVICE.
> One popular coherent GPU approach is to use ZONE_MOVABLE pages.

OK, so presumably the driver could tell on which architecture it is
running and plug in the memory appropriately (or refuse to). That's a
bit more arch knowledge in a (generic) driver than I'd like, but we
don't have a way to describe or probe this yet. Maybe a firmware config
would just turn MTE off in this case (SCTLR_EL3.ATA=0 and some other
chicken bit or tie-off for the ID registers not to advertise MTE).

> > If a VMM wants to mmap() such GPU memory and give it to the guest as
> > general purpose RAM, it should make sure it has all the
> > characteristics as advertised by the CPU or disable certain features
> > (if it can).
>
> This is the VFIO flow we are talking about here, I think: PFNMAP
> memory that goes into a VM that is cacheable.
>
> > Currently we don't have a way to tell what such memory supports
> > (neither ACPI tables nor any hardware probing). The same assumption
> > w.r.t. MTE is that it doesn't.
>
> Indeed, but my GPU driver hot-plugged it as ZONE_MOVABLE and my VFIO
> driver turned it into PFNMAP. So these things seem incompatible.

So can we end up with the same pfn mapped in two different vmas, one
backed by struct page and another with VM_PFNMAP (implying no struct
page)? I don't know how this is supposed to work. Even if the memory
supports MTE, we rely on the struct page to track the status of the
tags (the PG_mte_tagged flag). We may get away with this if
user_mem_abort() -> sanitise_mte_tags() -> pfn_to_page() finds a page
structure, but I'd actually prevent this path altogether if VM_PFNMAP
(well, we may actually have a bug if we don't already do this).

> > From the earlier discussions, we can probably ignore VM_IO since we
> > won't have a cacheable mapping with this flag. Not sure about
> > VM_PFNMAP.
>
> PFNMAP is the interesting one for VFIO, at least. Can we use the same
> reasoning that it will be !VM_MTE_ALLOWED and we can close the MTE
> discussion?
> Currently no VFIO driver is doing cacheable with memory that is
> different from DDR memory. So this is sort of a theoretical discussion
> about future cacheable HW that does use VFIO and does have some
> non-uniformity.
>
> Maybe that HW should set VM_IO on its VFIO PFN map and obviously not
> use ZONE_MOVABLE?

I think we should only keep VM_IO for memory-mapped I/O with side
effects. Other ranges can be VM_PFNMAP if not backed by struct page.

> Where does that leave us for this patch? We check the VM_MTE_ALLOWED
> and check for ZONE_MOVABLE struct pages as one of the conditions for
> NORMAL?

I think we should keep it as simple as possible and, looking at it
again, maybe even ignore vm_page_prot. Two questions though:

1. Does VM_IO imply vm_page_prot never having MT_NORMAL or
   MT_NORMAL_TAGGED?

2. Do all I/O ranges (side effects, non-RAM) mapped into a guest (and
   which end up in user_mem_abort()) imply VM_IO?

If yes to both, I think something like below would do:

	mte_allowed = kvm_vma_mte_allowed(vma);
	noncacheable = false;			/* or 'device' as in user_mem_abort() */
	...
	if (vma->vm_flags & VM_IO)		/* replaces !pfn_is_map_memory() */
		noncacheable = true;
	else if (!mte_allowed && kvm_has_mte())
		noncacheable = true;
	...
	if (noncacheable)
		prot |= KVM_PGTABLE_PROT_DEVICE; /* or the new KVM_PGTABLE_PROT_NC */

mte_allowed would cover DAX mappings (and, who knows, some future DAX
mapping may allow MTE and the driver may explicitly set the flag).
Anything else hot-plugged into ZONE_MOVABLE should have VM_MTE_ALLOWED
set or MTE disabled altogether.

--
Catalin
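[Editorial note: the branch logic in the sketch above can be modeled as a small standalone C function. The name `decide_noncacheable` and the plain boolean parameters are illustrative stand-ins for the real inputs (vma->vm_flags & VM_IO, the VM_MTE_ALLOWED check and kvm_has_mte()), which only exist inside the kernel; this is a sketch of the proposed decision, not kernel code.]

```c
#include <stdbool.h>

/*
 * Illustrative model of the stage-2 memory-type decision sketched in
 * the mail above. In the kernel the inputs would come from the VMA
 * flags and kvm_has_mte(); here they are plain booleans so the branch
 * structure can be exercised in isolation.
 */
static bool decide_noncacheable(bool vm_io, bool mte_allowed, bool has_mte)
{
	if (vm_io)			/* replaces !pfn_is_map_memory() */
		return true;
	if (!mte_allowed && has_mte)	/* no struct page to track MTE tags */
		return true;
	return false;			/* safe to map as Normal cacheable */
}
```

Under this scheme a VM_IO mapping is always forced non-cacheable regardless of MTE, and a PFNMAP-style mapping without VM_MTE_ALLOWED on an MTE-capable host is also forced non-cacheable; only a mapping that either allows MTE or runs on a host without MTE can stay cacheable.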