From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev,
	will@kernel.org, aniketa@nvidia.com, cjia@nvidia.com,
	kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com,
	acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com,
	danw@nvidia.com, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
Date: Wed, 11 Oct 2023 18:45:52 +0100	[thread overview]
Message-ID: <ZSbfUNLwDkaYL4ts@arm.com> (raw)
In-Reply-To: <20231010182328.GS3952@nvidia.com>

On Tue, Oct 10, 2023 at 03:23:28PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2023 at 06:19:14PM +0100, Catalin Marinas wrote:
> > This has been fixed in newer architecture versions but we haven't
> > added Linux support for it yet (and there's no hardware available
> > either). AFAIK, there's no MTE support for CXL-attached memory at
> > the moment in the current interconnects, so better not pretend it's
> > general purpose memory that supports all the features.
>
> I don't know much about MTE, but the use case imagined for CXL memory
> allows the MM to exchange any system page with a CXL page. So there
> cannot be a behavioral difference.
>
> Can userspace continue to use tagged pointers even if the mm has moved
> the pages to CXL that doesn't support it?

That would lead to external aborts in the worst case, or to losing
tags in the best. Neither of them is ideal.

> The main purpose for this is to tier VM memory, so having CXL
> behaviorally different in a VM seems fatal to me.

Yes, that's why, if you can't tell that the memory supports MTE, it
should either not be given to a guest with KVM_CAP_ARM_MTE enabled or
MTE should be disabled in the guest.
> Linux drivers need a way to understand this, like we can't have a CXL
> memory pool driver or GPU driver calling memremap_pages() and getting
> a somewhat broken system because of MTE incompatibilities. So maybe
> ARM really should block memremap_pages() in case of MTE until
> everything is resolved?

I need to understand this part of the kernel a bit more, but see below
(and MTE as it currently stands doesn't play well with server-like
systems, memory hotplug etc.)

> From the mm perspective we can't have two kinds of cacheable struct
> pages running around that are functionally different.

From a Linux+MTE perspective, what goes into ZONE_MOVABLE should work
fine, all pages interchangeable (if they aren't, the hardware is
broken). These are added via the add_memory_resource() hotplug path.
If a platform is known not to support this, it had better not
advertise MTE as a feature (the CPUs usually have some tie-off signal
when the rest of the SoC cannot handle MTE). We could claim it's a
hardware erratum if it does.

ZONE_DEVICE ranges, however, are not guaranteed to support all the
characteristics of the main RAM. I think that's what memremap_pages()
gives us. I'm not too familiar with this part of the kernel but IIUC
that falls under the HMM category, so it's not interchangeable with
the normal RAM (hotplugged or not).

> > Other than preventing malicious guest behaviour, it depends what the VM
> > needs cacheable access for: some GPU memory that's only used for sharing
> > data and we don't need all features or general purpose memory that a VM
> > can use to run applications from etc. The former may not need all the
> > features (e.g. can skip exclusives) but the latter does.
>
> Like CXL memory pooling, GPU memory is used interchangeably with every
> kind of DDR memory in every context. It must be completely transparent
> and interchangeable via the mm's migration machinery.

I don't see the mm code doing this, but I haven't looked deep enough.
At least not in the way of doing an mmap(MAP_ANONYMOUS) and the kernel
allocating ZONE_DEVICE pages and passing them to the user.

> > > > I've seen something similar in the past with
> > > > LSE atomics (or was it exclusives?) not being propagated. These don't
> > > > make the memory safe for a guest to use as general purpose RAM.
> > >
> > > At least from a mm perspective, I think it is important that cacheable
> > > struct pages are all the same and all interchangeable. If the arch
> > > cannot provide this it should not allow the pgmap/memremap to succeed
> > > at all. Otherwise drivers using these new APIs are never going to work
> > > fully right..
> >
> > Yes, for struct page backed memory, the current assumption is that all
> > are the same and support all CPU features. It's the PFN-based memory
> > where we don't have such guarantees.
>
> I see it got a bit confused, I am talking about memremap_pages() (ie
> include/linux/memremap.h), not memremap (include/linux/io.h) for
> getting non-struct page memory. It is confusing :|
>
> memremap_pages() is one of the entry points of the struct page hotplug
> machinery. Things like CXL drivers assume they can hot plug in new
> memory through these APIs and get new cacheable struct pages that are
> functionally identical to boot time cacheable struct pages.

We have two mechanisms, one in memremap.c and another in
memory_hotplug.c. So far my assumption has been that only the memory
added by the latter ends up in ZONE_MOVABLE and can be used by the
kernel like any of the ZONE_NORMAL RAM, transparently to the user.

ZONE_DEVICE allocations, on the other hand, would have to be
explicitly mmap()'ed via a device fd. If a VMM wants to mmap() such
GPU memory and give it to the guest as general purpose RAM, it should
make sure it has all the characteristics advertised by the CPU, or
disable certain features (if it can). Currently we don't have a way to
tell what such memory supports (neither ACPI tables nor any hardware
probing).
The same assumption w.r.t. MTE is that it doesn't.

> > We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
> > by struct page. We could probe that in KVM and either fall back to
> > non-cacheable or allow cacheable if MTE is disabled at the vCPU level.
>
> I'm not sure what this does, it is only set by shmem_mmap? That is
> much stricter than "mappings backed by struct page"

This flag is similar to VM_MAYWRITE etc. On an mmap(), the vma gets
the VM_MTE_ALLOWED flag if the mapping is MAP_ANONYMOUS (see
arch_calc_vm_flag_bits()) or the (driver) mmap function knows that the
memory supports MTE and sets the flag explicitly. Currently that's
only done in shmem_mmap() as we know where this memory is coming from.
When the user wants an mmap(PROT_MTE), the arch code checks whether
VM_MTE_ALLOWED is set on the vma before allowing tag accesses.

Memory mapped from ZONE_DEVICE won't have such a flag set, so
mmap(PROT_MTE) will fail. But for KVM guests, there's no such mmap()
call into the hypervisor. A guest can simply enable MTE at stage 1
without the hypervisor being able to tell.

> Still, I'm not sure how to proceed here - we veered into this MTE
> stuff I don't know we have experience with yet.

We veered mostly because on arm64 such GPU memory is not guaranteed to
have all the characteristics of the generic RAM. I think only MTE is
the dangerous one that needs extra care, but I wouldn't be surprised
if we noticed atomics failing as well.

It looks like memremap_pages() also takes a memory type and I suspect
it's only safe to map MEMORY_DEVICE_COHERENT into a guest (as generic
RAM). Is there any sanity check on the host kernel side to allow VMM
cacheable mappings only for such memory and not the other
MEMORY_DEVICE_* types?

Going back to KVM, we can relax to a cacheable mapping at Stage 2 if
the vma is cacheable and either VM_MTE_ALLOWED is set or
KVM_CAP_ARM_MTE is disabled.
From the earlier discussions, we can probably ignore VM_IO since we
won't have a cacheable mapping with this flag. Not sure about
VM_PFNMAP.

-- 
Catalin