From: Jason Gunthorpe <jgg@nvidia.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>, Lorenzo Pieralisi <lpieralisi@kernel.org>, ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev, aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
Date: Thu, 19 Oct 2023 08:51:42 -0300
Message-ID: <20231019115142.GQ3952@nvidia.com>
In-Reply-To: <ZTEN_oe97VRWbnHb@arm.com>

On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:
> The issue is when the device is reclaimed to be given to another
> user-space process, do we know the previous transaction reached the
> device? With the TLBI + DSB in unmap, we can only tell that a subsequent
> map + write (in a new process) is ordered after the write in the old
> process. Maybe that's sufficient in most cases.

I can't think of a case where ordering is insufficient. As long as the
device perceives a strict separation in order between writes preceding
the 'reset' writel() and writes following the reset writel(), the
absolute time of completion cannot matter.

At least for VFIO PCI the reset sequence uses non-posted config TLPs
followed by config read TLPs. For something like mlx5, when it reclaims
a WC VMA it does an RPC which is a dma_wmb(), a writel(), a device DMA
read, a device DMA write and a CPU read of the DMA'd data. Both of
these are pretty "firm" ordering barriers.

> > Unmapping the VMA's must already have some NormalNC friendly ordering
> > barrier across all CPUs or we have a bigger problem. This barrier
> > definitely must close write combining.
> I think what we have is TLBI + DSB generating the DVM/DVMSync messages
> should ensure that the Normal NC writes on other CPUs reach some
> serialisation point (not necessarily the device, AFAICT we can't
> guarantee end-point completion here).

This is what our internal architects thought as well.

> Talking to Will earlier, I think we can deem the PCIe scenario
> (somewhat) safe but not as a generic mechanism for other non-PCIe
> devices (e.g. platform). With this concern, can we make this Stage 2
> relaxation in KVM only for vfio-pci mappings? I don't have an example of
> non-PCIe device assignment to figure out how this should work though.

It is not a KVM problem. As I implied above, it is VFIO's
responsibility to reliably reset the device, not KVM's. If for some
reason VFIO cannot do that on some SOC then VFIO devices should not
exist there. It is not KVM's job to second-guess VFIO's own security
properties.

Specifically about platform: the generic VFIO platform driver is the
ACPI-based one. If the FW provides an ACPI method for device reset that
is not properly serializing, that is a FW bug. We can quirk it in VFIO
and block using those devices if they actually exist.

I expect the non-generic VFIO platform drivers to take care of this
issue internally with barriers, reads from devices, whatever is
required to make their SOCs order properly. Just as I would expect a
normal Linux platform driver to directly manage whatever
implementation-specific ordering quirks the SOC may have.

Generic PCI drivers are the ones that rely on the implicit ordering at
the PCIe TLP level from TLBI + DSB. I would say this is part of the
undocumented arch API around pgprot_writecombine().

> > > knows all the details. The safest is for the VMM to keep it as Device (I
> > > think vfio-pci goes for the strongest nGnRnE).
> >
> > We are probably going to allow VFIO to let userspace pick if it should
> > be pgprot_device or pgprot_writecombine.
> I guess that's for the direct use by an application rather than
> VMM+VM.

Yes, there have been requests for this.

> IIUC people work around this currently by mapping PCIe BARs as
> pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
> mappings is probably a good idea, though it doesn't currently help with
> the KVM case as we can't force the VMM to know the specifics of the
> device it is giving to a guest.

Yes to all points.

Thanks,
Jason