From: Alex Williamson <1869006@bugs.launchpad.net>
To: qemu-devel@nongnu.org
Subject: [Bug 1869006] Re: PCIe cards passthrough to TCG guest works on 2GB of guest memory but fails on 4GB (vfio_dma_map invalid arg)
Date: Tue, 07 Jul 2020 17:27:06 -0000
Message-ID: <159414282604.31801.690802489009465603.malone@gac.canonical.com>
In-Reply-To: 158514404728.11288.8869885318197124821.malonedeb@soybean.canonical.com

> When you say "qemu has no support", do you actually mean "qemu people
> are unable to help you if you break things by bypassing the in-place
> restrictions", or "qemu is designed to not work when restrictions are
> bypassed"?

The former.  There are two aspects to this.  The first is that the
device has address space restrictions, which in turn impose address
space restrictions on the VM.  That makes things like hotplug difficult
or impossible to support.  That much could be considered a feature
that QEMU has not yet implemented.
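
The kernel exports those per-group restrictions, so you can see exactly
which IOVA ranges a VM's RAM would have to avoid.  A minimal sketch,
assuming a Linux host with sysfs and borrowing the 0000:29:00.0 BDF from
this report:

  #!/usr/bin/env python3
  # Sketch: list the IOVA ranges the host IOMMU reserves for a device's
  # group -- the ranges that guest RAM mappings must not overlap.
  from pathlib import Path

  bdf = "0000:29:00.0"  # example device, taken from this bug report
  path = Path(f"/sys/bus/pci/devices/{bdf}/iommu_group/reserved_regions")
  for line in path.read_text().splitlines():
      start, end, rtype = line.split()  # "0x... 0x... msi|direct|reserved"
      print(f"{rtype:10s} {start} - {end}")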

The more significant aspect when RMRRs are involved in this restriction
is that an RMRR is essentially the platform firmware dictating that the
host OS must maintain an identity map between the device and a range of
physical address space.  We don't know the purpose of that mapping, but
we can assume that it allows the device to provide ongoing data for
platform firmware to consume.  This data might include health or sensor
information that's vital to the operation of the system.  It's therefore
not simply a matter of QEMU needing to avoid RMRR ranges; we'd need to
maintain the required identity maps while also manipulating the VM
address space.  Worse, that identity-map requirement implies that a user
owns a device with DMA access to a range of host memory previously
defined as vital to the operation of the platform, and therefore likely
exploitable by the user.
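
For the curious, the RMRRs themselves come from the ACPI DMAR table, so
they can be inspected directly.  A minimal sketch (Intel hosts only;
assumes root access to read the raw table):

  #!/usr/bin/env python3
  # Sketch: print the RMRR ranges firmware declares in the ACPI DMAR
  # table.  Offsets follow the VT-d spec: a 36-byte ACPI header plus a
  # 12-byte DMAR preamble, then a list of remapping structures.
  import struct

  data = open("/sys/firmware/acpi/tables/DMAR", "rb").read()
  off = 48                            # first remapping structure
  while off + 4 <= len(data):
      rtype, length = struct.unpack_from("<HH", data, off)
      if length == 0:                 # guard against a malformed table
          break
      if rtype == 1:                  # type 1 = RMRR
          base, limit = struct.unpack_from("<QQ", data, off + 8)
          print(f"RMRR: 0x{base:x} - 0x{limit:x}")
      off += length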

The configuration you've achieved appears to have disabled the host
kernel restrictions preventing RMRR-encumbered devices from
participating in the IOMMU API, but left in place the VM address space
implications of those RMRRs.  This means that once the device is opened
by the user, the firmware-mandated identity mapping is removed, and
whatever health or sensor data the device was reporting to that range is
no longer available to the host firmware, which might adversely affect
the behavior of the system.  Upstream put this restriction in place as
the safe way to honor the firmware mapping requirement, and you've
circumvented it, so you are your own support.

> Do I understand correctly that the BIOS can modify portions of the
> system usable RAM, so the vendor specific software tools can read
> those addresses, and if yes, does this mean there is a risk of
> data corruption if the RMRR restrictions are bypassed?

RMRRs used for devices other than IGD or USB are often associated with
reserved memory regions to prevent the host OS from making use of those
ranges.  It is possible that privileged utilities might interact with
these ranges, but AIUI the main use case is for the device to write into
the range and firmware to consume the result.  If you were to ignore the
RMRR mapping altogether, there is a risk that the device will continue
to write whatever health or sensor data it's programmed to report to
that IOVA range, which could now be mapped to guest memory, causing data
corruption.
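
Those firmware-owned ranges are usually visible in /proc/iomem as well.
A trivial sketch (run as root, otherwise the kernel zeroes the
addresses):

  #!/usr/bin/env python3
  # Sketch: show host physical ranges marked reserved, which is where
  # RMRR-backed firmware buffers typically live.
  for line in open("/proc/iomem"):
      if "reserved" in line.lower():
          print(line.rstrip())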

> Is there another place in the kernel 5.4 source that must be modified
> to bring back the v5.3 kernel behaviour? (i.e. I have a stable home
> Windows VM with the GPU passthrough despite all this)

I think the scenario is that the RMRR patch previously worked because
the vfio IOMMU backend was not imposing the IOMMU reserved region
mapping restrictions, meaning it was sufficient to simply allow the
device to participate in the IOMMU API while the remaining restrictions
were ignored.  Now the vfio IOMMU backend recognizes the address space
mapping restrictions and disallows creating the mappings that I
describe above as a potential source of data corruption.  Sorry, you are
your own support for this.  The system is not fit for this use case due
to the BIOS-imposed restrictions.
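
To make the failure mode concrete: QEMU asks VFIO to map all of guest
RAM for DMA, and the map now fails if the requested IOVA window overlaps
a reserved region.  A rough userspace analogue of that overlap check,
using the values from the log in this bug (same sysfs assumption as
above):

  #!/usr/bin/env python3
  # Sketch: approximate the check that vfio_dma_map() now enforces.
  # IOVA 0x40000000 / size 0x100000000 are the values from the failing
  # log in this bug; the device BDF is likewise taken from the report.
  from pathlib import Path

  bdf = "0000:29:00.0"
  iova, size = 0x40000000, 0x100000000        # mach-virt.ram window
  path = Path(f"/sys/bus/pci/devices/{bdf}/iommu_group/reserved_regions")
  for line in path.read_text().splitlines():
      start, end, rtype = line.split()
      start, end = int(start, 16), int(end, 16)
      if start < iova + size and iova <= end:  # [iova, iova+size) overlaps
          print(f"{rtype} region 0x{start:x}-0x{end:x} overlaps guest RAM"
                " -> VFIO_IOMMU_MAP_DMA returns -EINVAL")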

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1869006

Title:
  PCIe cards passthrough to TCG guest works on 2GB of guest memory but
  fails on 4GB (vfio_dma_map invalid arg)

Status in QEMU:
  New

Bug description:
  During a meeting a coworker asked "has anyone tried to pass a PCIe
  card through to a guest of another architecture?" and I decided to
  check it.

  I plugged SATA and USB3 controllers into spare slots on the mainboard
  and started playing. On a 1GB VM instance it worked (both cold- and
  hot-plugged). On a 4GB one it did not:

  Error starting domain: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
  2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

  Traceback (most recent call last):
    File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper
      callback(asyncjob, *args, **kwargs)
    File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb
      callback(*args, **kwargs)
    File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 66, in newfn
      ret = fn(self, *args, **kwargs)
    File "/usr/share/virt-manager/virtManager/object/domain.py", line 1279, in startup
      self._backend.create()
    File "/usr/lib64/python3.8/site-packages/libvirt.py", line 1234, in create
      if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
  libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
  2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

  
  I played with the memory size; 3054 MB is the maximum value with which
  the VM boots with cold-plugged host PCIe cards.
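
  (For reference, that ceiling is suspiciously exact: mach-virt puts RAM
  at 0x40000000 per the log above, and 1 GiB + 3054 MiB lands precisely
  at 0xfee00000, the start of the MSI window that x86 IOMMUs commonly
  report as a reserved region -- presumably the range the host refuses
  to map.  A quick arithmetic check:)

    # Check the 3054 MB ceiling against the usual x86 MSI reserved
    # window (an assumption about this particular host's IOMMU).
    ram_base = 0x40000000            # mach-virt RAM base, per the log
    ram_size = 3054 * 2**20          # reported maximum that still boots
    print(hex(ram_base + ram_size))  # 0xfee00000 -- MSI window start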

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1869006/+subscriptions

