From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev, will@kernel.org, aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 1/2] KVM: arm64: determine memory type from VMA
Date: Tue, 10 Oct 2023 18:19:14 +0100
Message-ID: <ZSWHkvhutlnvVLUZ@arm.com> (raw)
In-Reply-To: <20231010150502.GM3952@nvidia.com>

On Tue, Oct 10, 2023 at 12:05:02PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 10, 2023 at 03:25:22PM +0100, Catalin Marinas wrote:
> > On Thu, Oct 05, 2023 at 01:54:58PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 05, 2023 at 05:15:37PM +0100, Catalin Marinas wrote:
> > > > On Thu, Sep 07, 2023 at 11:14:58AM -0700, ankita@nvidia.com wrote:
> > > > > From: Ankit Agrawal <ankita@nvidia.com>
> > > > > Currently KVM determines if a VMA is pointing at IO memory by checking
> > > > > pfn_is_map_memory(). However, the MM already gives us a way to tell what
> > > > > kind of memory it is by inspecting the VMA.
> > > >
> > > > Well, it doesn't. It tells us what attributes the user mapped that
> > > > memory with, not whether it's I/O memory or standard RAM.
> > >
> > > There is VM_IO which is intended to be used for address space with
> > > side effects.
> > >
> > > And there is VM_PFNMAP which is intended to be used for address space
> > > without struct page (IO or not)
> > >
> > > And finally we have the pgprot bit which define the cachability.
> > >
> > > Do you have a definition of IO memory that those three things don't
> > > cover?
> > >
> > > I would propose that, for KVM's purpose, IO memory is marked with
> > > VM_IO or a non-cachable pgprot
> > >
> > > And "standard RAM" is defined by a cachable pgprot. Linux never makes
> > > something that is VM_IO cachable.
> >
> > I think we can safely set a stage 2 Normal NC for a vma with pgprot
> > other than MT_NORMAL or MT_NORMAL_TAGGED. But the other way around is
> > not that simple. Just because the VMM was allowed to map it as cacheable
> > does not mean that it supports all the CPU features. One example is MTE
> > where we can only guarantee that the RAM given to the OS at boot
> > supports tagged accesses.
>
> Is there a use case to supply the VMM with cachable memory that is not
> full featured? At least the vfio cases I am aware of do not actually
> want to do this and would probably like the arch to prevent these
> corner cases upfront.

The MTE case is the problematic one here. On a data access, the
interconnect shifts (right) the physical address and adds an offset. The
resulting address is used to access tags. Such shift+offset is
configured by firmware at boot and normally only covers the default
memory. If there's some memory on PCIe, it's very unlikely to be covered
and we can't tell whether it simply drops such tag accesses or makes up
some random address that may or may not hit an existing memory or
device. We don't currently have a way to describe this in ACPI tables
(there were talks about describing special purpose memory, I lost track
of the progress) and the way MTE was first designed doesn't allow a
hypervisor to prevent the guest from generating a tagged access (other
than mapping the memory as non-cacheable at Stage 2). This has been
fixed in newer architecture versions but we haven't added Linux support
for it yet (and there's no hardware available either). AFAIK, there's no
MTE support for CXL-attached memory at the moment in the current
interconnects, so better not pretend it's general purpose memory that
supports all the features.
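[The classification proposed earlier in the thread can be sketched as a
small userspace C model. This is only an illustration of the rule being
discussed, not kernel code: the VM_IO/VM_PFNMAP values and the MT_*
enum below are hypothetical stand-ins, not the kernel's actual flag or
arm64 memory-type encodings.]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in values; the real VM_IO/VM_PFNMAP flags and
 * arm64 MT_* encodings live in the kernel headers and differ. */
#define VM_IO		0x1UL
#define VM_PFNMAP	0x2UL

enum mt { MT_DEVICE_nGnRE, MT_NORMAL_NC, MT_NORMAL, MT_NORMAL_TAGGED };

/* Jason's proposal: IO memory is anything marked VM_IO or mapped with
 * a non-cachable pgprot; "standard RAM" is a cachable pgprot. */
static bool vma_is_io(unsigned long flags, enum mt pgprot_mt)
{
	return (flags & VM_IO) ||
	       (pgprot_mt != MT_NORMAL && pgprot_mt != MT_NORMAL_TAGGED);
}

/* Catalin's point: the implication is safe in only one direction. A
 * non-cachable pgprot can safely get Stage 2 Normal NC, but a cachable
 * pgprot does not prove the memory supports all CPU features (MTE). */
static bool stage2_normal_nc_is_safe(enum mt pgprot_mt)
{
	return pgprot_mt != MT_NORMAL && pgprot_mt != MT_NORMAL_TAGGED;
}
```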
Other than preventing malicious guest behaviour, it depends on what the
VM needs cacheable access for: some GPU memory that's only used for
sharing data, or general purpose memory that a VM can run applications
from. The former may not need all the features (e.g. it can skip
exclusives) but the latter does.

We can probably work around the MTE case by only allowing cacheable
Stage 2 if MTE is disabled for the guest or FEAT_MTE_PERM is
implemented/supported (TBD when we'll add this). For the other cases, it
would be up to the VMM how it presents the mapping to the guest (device
pass-through or general purpose memory).

> > I've seen something similar in the past with
> > LSE atomics (or was it exclusives?) not being propagated. These don't
> > make the memory safe for a guest to use as general purpose RAM.
>
> At least from a mm perspective, I think it is important that cachable
> struct pages are all the same and all interchangable. If the arch
> cannot provide this it should not allow the pgmap/memremap to succeed
> at all. Otherwise drivers using these new APIs are never going to work
> fully right..

Yes, for struct page backed memory, the current assumption is that all
are the same and support all CPU features. It's the PFN-based memory
where we don't have such guarantees.

> That leaves open the question of what to do with a cachable VM_PFNMAP,
> but if the arch can deal with the memremap issue then it seems like it
> can use the same logic when inspecting the VMA contents?

In the MTE case, memremap() never returns a Normal Tagged mapping and it
would not map it in user-space as PROT_MTE either. But if a page is not
mmap(PROT_MTE) (or VM_MTE in vma->flags) in the VMM, it doesn't mean the
guest should not be allowed to use MTE. Qemu for example doesn't map the
guest memory with mmap(PROT_MTE) since it doesn't have a need for it,
but the guest can enable MTE independently (well, if enabled at the vCPU
level).
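[The MTE workaround mentioned above (cacheable Stage 2 only if MTE is
disabled for the guest or FEAT_MTE_PERM is available) can be sketched as
a small decision helper. This is a userspace model of the policy being
discussed, not KVM code; the names `guest_mte_enabled` and
`feat_mte_perm` are illustrative, not actual kernel symbols.]

```c
#include <assert.h>
#include <stdbool.h>

enum s2_attr { S2_NORMAL_NC, S2_NORMAL_WB };

/*
 * Model of the proposed workaround: a cacheable (write-back) Stage 2
 * mapping is only allowed when the guest cannot issue unchecked tagged
 * accesses to it, i.e. MTE is disabled for the guest or FEAT_MTE_PERM
 * lets the hypervisor deny tag accesses to the range. Otherwise fall
 * back to Normal NC at Stage 2.
 */
static enum s2_attr stage2_cacheable_policy(bool guest_mte_enabled,
					    bool feat_mte_perm)
{
	if (!guest_mte_enabled || feat_mte_perm)
		return S2_NORMAL_WB;
	return S2_NORMAL_NC;	/* non-cacheable fallback */
}
```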
We have an additional flag, VM_MTE_ALLOWED, only set for mappings backed
by struct page. We could probe that in KVM and either fall back to
non-cacheable or allow cacheable if MTE is disabled at the vCPU level.

--
Catalin