From: Catalin Marinas <catalin.marinas@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Will Deacon <will@kernel.org>,
	Lorenzo Pieralisi <lpieralisi@kernel.org>,
	ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
Date: Thu, 19 Oct 2023 12:07:42 +0100
Message-ID: <ZTEN_oe97VRWbnHb@arm.com>
In-Reply-To: <20231013134541.GP3952@nvidia.com>

On Fri, Oct 13, 2023 at 10:45:41AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:
> > On Fri, Oct 13, 2023 at 10:29:35AM +0100, Will Deacon wrote:
> > > On Thu, Oct 12, 2023 at 06:26:01PM +0100, Catalin Marinas wrote:
> > > > On Thu, Oct 12, 2023 at 03:48:08PM +0100, Will Deacon wrote:
> > > > > Claiming back the device also seems strange if the guest has been using
> > > > > non-cacheable accesses since I think you could get write merging and
> > > > > reordering with subsequent device accesses trying to reset the device.
> > > >
> > > > True. Not sure we have a good story here (maybe reinvent the DWB barrier ;)).
> > >
> > > We do have a good story for this part: use Device-nGnRE!
> >
> > Don't we actually need Device-nGnRnE for this, coupled with a DSB for
> > endpoint completion?
> >
> > Device-nGnRE may be sufficient as a read from that device would ensure
> > that the previous write is observable (potentially with a DMB if
> > accessing separate device regions) but I don't think we do this now
> > either. Even this, isn't it device-specific? I don't know enough about
> > PCIe, posted writes, reordering, maybe others can shed some light.
> >
> > For Normal NC, if the access doesn't have side-effects (or rather the
> > endpoint is memory-like), I think we are fine. The Stage 2 unmapping +
> > TLBI + DSB (DVM + DVMSync) should ensure that a pending write by the CPU
> > was pushed sufficiently far as not to affect subsequent writes by other
> > CPUs.
> >
> > For I/O accesses that change some state of the device, I'm not sure the
> > TLBI+DSB is sufficient. But I don't think Device nGnRE is either, only
> > nE + DSB as long as the PCIe device plays along nicely.
>
> Can someone explain this concern a little more simply please?
>
> Let's try something simpler. I have no KVM. My kernel driver
> creates a VMA with pgprot_writecombine (NormalNC). Userpsace does a
> write to the NormalNC and immediately unmaps the VMA
>
> What is the issue?

The issue is when the device is reclaimed to be given to another
user-space process, do we know the previous transaction reached the
device? With the TLBI + DSB in unmap, we can only tell that a subsequent
map + write (in a new process) is ordered after the write in the old
process. Maybe that's sufficient in most cases.

> And then how does making KVM the thing that creates the NormalNC
> change this?

They are similar.

> Not knowing the whole details, here is my story about how it should work:
>
> Unmapping the VMA's must already have some NormalNC friendly ordering
> barrier across all CPUs or we have a bigger problem. This barrier
> definately must close write combining.

I think what we have is TLBI + DSB generating the DVM/DVMSync messages
should ensure that the Normal NC writes on other CPUs reach some
serialisation point (not necessarily the device, AFAICT we can't
guarantee end-point completion here).

> VFIO issues a config space write to reset the PCI function. Config
> space writes MUST NOT write combine with anything. This is already
> impossible for PCIe since they are different TLP types at the PCIe
> level.

Yes, config space writes are fine, vfio-pci even maps them as
Device_nGnRnE. But AFAIK a guest won't have direct access to the config
space.

> By the PCIe rules, config space write must order strictly after all
> other CPU's accesses. Once the reset non-posted write returns back to
> VFIO we know that:
>
> 1) There is no reference in any CPU page table to the MMIO PFN
> 2) No CPU has pending data in any write buffer
> 3) The interconnect and PCIe fabric have no inflight operations
> 4) The device is in a clean post-reset state

I think from the CPU perspective, we can guarantee that a Normal_NC
write on CPU0 for example reaches a serialisation point before a config
space (Device_nGnRnE) write on CPU1 by the host as long as CPU1 issued a
TLBI+DSB. Now, what I'm not sure is what this serialisation point is. If
it is the PCIe root complex, we are probably fine, we hope it deals with
any ordering between the Normal_NC write and the config space one.

Talking to Will earlier, I think we can deem the PCIe scenario
(somewhat) safe but not as a generic mechanism for other non-PCIe
devices (e.g. platform). With this concern, can we make this Stage 2
relaxation in KVM only for vfio-pci mappings? I don't have an example of
non-PCIe device assignment to figure out how this should work though.

> > knows all the details. The safest is for the VMM to keep it as Device (I
> > think vfio-pci goes for the strongest nGnRnE).
>
> We are probably going to allow VFIO to let userspace pick if it should
> be pgprot_device or pgprot_writecombine.

I guess that's for the direct use by an application rather than VMM+VM.
IIUC people work around this currently by mapping PCIe BARs as
pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
mappings is probably a good idea, though it doesn't currently help with
the KVM case as we can't force the VMM to know the specifics of the
device it is giving to a guest.

> The alias issue could be resolved by teaching KVM how to insert a
> physical PFN based on some VFIO FD/dmabuf rather than a VMA so that
> the PFNs are never mapped in the hypervisor side.

This should work as well and solves the aliasing problem, though it
requires changes to the VMM as well, not just KVM, which currently
relies on vfio-pci mmap().

--
Catalin