From: Jason Gunthorpe <jgg@nvidia.com>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>, Lorenzo Pieralisi <lpieralisi@kernel.org>, ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev, aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com, targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com, apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
Date: Thu, 19 Oct 2023 08:51:42 -0300
Message-ID: <20231019115142.GQ3952@nvidia.com>
In-Reply-To: <ZTEN_oe97VRWbnHb@arm.com>

On Thu, Oct 19, 2023 at 12:07:42PM +0100, Catalin Marinas wrote:
> The issue is when the device is reclaimed to be given to another
> user-space process, do we know the previous transaction reached the
> device? With the TLBI + DSB in unmap, we can only tell that a subsequent
> map + write (in a new process) is ordered after the write in the old
> process. Maybe that's sufficient in most cases.

I can't think of a case where ordering is insufficient. As long as the
device perceives a strict separation in order between writes preceding
the 'reset' writel() and writes following the reset writel(), the
absolute time of completion cannot matter.

At least for VFIO PCI the reset sequence uses non-posted config TLPs
followed by config read TLPs. For something like mlx5, when it reclaims
a WC VMA it does an RPC which is a dma_wmb(), a writel(), a device DMA
read, a device DMA write and a CPU read of the DMA'd data. Both of
these are pretty "firm" ordering barriers.

> > Unmapping the VMA's must already have some NormalNC friendly ordering
> > barrier across all CPUs or we have a bigger problem. This barrier
> > definitely must close write combining.
> I think what we have is TLBI + DSB generating the DVM/DVMSync messages
> should ensure that the Normal NC writes on other CPUs reach some
> serialisation point (not necessarily the device, AFAICT we can't
> guarantee end-point completion here).

This is what our internal architects thought as well.

> Talking to Will earlier, I think we can deem the PCIe scenario
> (somewhat) safe but not as a generic mechanism for other non-PCIe
> devices (e.g. platform). With this concern, can we make this Stage 2
> relaxation in KVM only for vfio-pci mappings? I don't have an example of
> non-PCIe device assignment to figure out how this should work though.

It is not a KVM problem. As I implied above, it is VFIO's
responsibility to reliably reset the device, not KVM's. If for some
reason VFIO cannot do that on some SOC then VFIO devices should not
exist there. It is not KVM's job to second-guess VFIO's own security
properties.

Specifically about platform: the generic VFIO platform driver is the
ACPI-based one. If the FW provides an ACPI method for device reset that
is not properly serializing, that is a FW bug. We can quirk it in VFIO
and block using those devices if they actually exist.

I expect the non-generic VFIO platform drivers to take care of this
issue internally with barriers, reads from devices, whatever is
required to make their SOCs order properly. Just as I would expect a
normal Linux platform driver to directly manage whatever
implementation-specific ordering quirks the SOC may have.

Generic PCI drivers are the ones that rely on the implicit ordering at
the PCIe TLP level from TLBI + DSB. I would say this is part of the
undocumented arch API around pgprot_writecombine().

> > > knows all the details. The safest is for the VMM to keep it as Device (I
> > > think vfio-pci goes for the strongest nGnRnE).
> >
> > We are probably going to allow VFIO to let userspace pick if it should
> > be pgprot_device or pgprot_writecombine.
> I guess that's for the direct use by an application rather than
> VMM+VM.

Yes, there have been requests for this.

> IIUC people work around this currently by mapping PCIe BARs as
> pgprot_writecombine() via sysfs. Getting vfio-pci to allow different
> mappings is probably a good idea, though it doesn't currently help with
> the KVM case as we can't force the VMM to know the specifics of the
> device it is giving to a guest.

Yes to all points.

Thanks,
Jason