From: Joao Martins <joao.m.martins@oracle.com>
To: qemu-devel@nongnu.org
Cc: Eduardo Habkost <ehabkost@redhat.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Edmondson <david.edmondson@oracle.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Igor Mammedov <imammedo@redhat.com>,
	Joao Martins <joao.m.martins@oracle.com>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Subject: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
Date: Tue, 22 Jun 2021 16:48:59 +0100
Message-ID: <20210622154905.30858-1-joao.m.martins@oracle.com>

Hey,

This series lets QEMU properly create i386 guests with >= 1TB of RAM when using
VFIO, particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA passed to the DMA_MAP ioctl is
valid and returns -EINVAL when it is not. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with an IOMMU have a hole at
the 1TB boundary which is *reserved* for HyperTransport I/O addresses, located
at FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3, for what those addresses mean.
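
For illustration, that window can be expressed as follows (a minimal sketch;
the constant and helper names here are illustrative, not taken from the
series):

    /* HyperTransport reserved window just below the 1TB boundary,
     * per Table 3 of the AMD IOMMU spec [1]. */
    #define AMD_HT_START 0xFD00000000ULL   /* FD_0000_0000h */
    #define AMD_HT_END   0xFFFFFFFFFFULL   /* FF_FFFF_FFFFh */

    /* True if the GPA range [start, start+size) overlaps the HT hole. */
    static bool range_overlaps_ht_hole(uint64_t start, uint64_t size)
    {
        return start <= AMD_HT_END && start + size - 1 >= AMD_HT_START;
    }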

VFIO DMA_MAP calls in this IOVA address range fall through this check and hence
return -EINVAL, consequently failing the creation of guests bigger than 1010G.
Example of the failure:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)

Prior to v5.4 we could map using these IOVAs, *but* that's still not the right
thing to do, and it could trigger certain IOMMU events (e.g.
INVALID_DEVICE_REQUEST) or spurious guest VF failures from the resultant IOMMU
target abort (see Errata 1155 [2]), as documented in the links below.

This series tries to address that by dealing with this AMD-specific 1TB hole,
similarly to how we deal with the 4G hole today on x86 in general. It is split
as follows:

* patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
           which is also used in other parts of pc/acpi besides MR creation.
           The allowed IOVA *only* changes if it's an AMD host, so there is
           no change for Intel. We walk the allowed ranges for memory above
           4G, and add an E820_RESERVED entry every time we find a hole
           (which is at the 1TB boundary). A rough sketch of this logic
           follows the patch list below.

           NOTE: For the purposes of this RFC, I rely on cpuid in
           hw/i386/pc.c, but I understand that it doesn't cover the non-x86
           host case running TCG.

           Additionally, as an alternative to the hardcoded ranges we use
           today, VFIO could advertise the platform's valid IOVA ranges
           without requiring a PCI device to be added to the vfio container;
           we would then fetch the valid IOVA ranges from VFIO rather than
           hardcoding them. But sadly, that wouldn't work for older
           hypervisors.

* patch 2 - 5: cover the remaining parts surrounding the memory map,
               particularly ACPI SRAT ranges, CMOS, hotplug, as well as the
               PCI 64-bit hole.

* patch 6: Add a machine property which is disabled for older machine types (<=6.0)
	   to keep things as is.
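
As referenced in the patch 1 description, here's a rough sketch of that logic
(illustrative only, not the actual diff; host_cpuid() and e820_add_entry() are
existing QEMU helpers, while add_usable_range() and the AMD_HT_* constants
stand in for the series' own bookkeeping):

    static bool host_is_amd(void)
    {
        uint32_t eax, ebx, ecx, edx;

        host_cpuid(0x0, 0, &eax, &ebx, &ecx, &edx);
        /* "Auth"              "enti"              "cAMD" */
        return ebx == 0x68747541 && edx == 0x69746e65 && ecx == 0x444d4163;
    }

    static void init_usable_iova_ranges(void)
    {
        if (!host_is_amd()) {
            return; /* Intel: single contiguous range above 4G */
        }
        /* Split the above-4G space around the HT hole... */
        add_usable_range(0x100000000ULL, AMD_HT_START - 1);
        add_usable_range(AMD_HT_END + 1, UINT64_MAX);
        /* ...and mark the hole itself reserved in the e820 map. */
        e820_add_entry(AMD_HT_START, AMD_HT_END - AMD_HT_START + 1,
                       E820_RESERVED);
    }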

The 'consequence' of this approach is that we may need more than the default
phys-bits, e.g. a guest with 1024G will have ~13G placed after the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40.
I can't figure out a reasonable way to establish the required phys-bits for
the memory map dynamically, especially considering that today there's already
a precedent of depending on the user to pick the right value of phys-bits
(regardless of this series).
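
For the 1024G example, the back-of-the-envelope arithmetic is roughly (the
exact split depends on the machine type's low-memory layout, so treat the
numbers as approximate):

    below_4g ~  3G      (RAM packed under the 4G hole)
    top      =  4G + (1024G - 3G) = 1025G   (without the HT hole)
    spill    =  1025G - 1012G     = 13G     (overlaps FD_0000_0000h onwards)
    new top  =  1024G + 13G       = 1037G > 2^40  =>  41 phys-bits needed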

Additionally, the reserved region is always added regardless of whether we have
VFIO devices, in order to cover the VFIO device hotplug case.

Other options considered:

a) Consider the reserved range part of RAM, and just mark it as
E820_RESERVED without any SPA allocated for it. A -m 1024G guest would
then only allocate 1010G of RAM and the remainder would be marked reserved.
This is not what we do today for the 4G hole, i.e. the RAM actually
allocated is the value specified by the user and thus fully available
to the guest (IIUC).

b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a), but
done at a later stage, when the RAM MRs are already allocated at the invalid
GPAs. That alone, though, wouldn't fix the way the MRs are laid out, which is
where the problem fundamentally lies.
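
For reference, option b) would amount to something like the following in the
VFIO memory listener (purely illustrative of the rejected alternative, not
actual code from the series):

    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        hwaddr iova = section->offset_within_address_space;
        hwaddr end  = iova + int128_get64(section->size) - 1;

        if (iova <= AMD_HT_END && end >= AMD_HT_START) {
            return; /* skip DMA_MAP for anything touching the HT hole */
        }
        /* ... the normal vfio_dma_map() path ... */
    }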

The approach proposed in this series works regardless of the kernel version,
and is relevant for both old and new QEMU.

Open to alternatives/comments/suggestions that I should pursue instead.

	Joao

[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Joao Martins (6):
  i386/pc: Account IOVA reserved ranges above 4G boundary
  i386/pc: Round up the hotpluggable memory within valid IOVA ranges
  pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
  i386/pc: Keep PCI 64-bit hole within usable IOVA space
  i386/acpi: Fix SRAT ranges in accordance to usable IOVA
  i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs

 hw/i386/acpi-build.c  |  22 ++++-
 hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
 hw/i386/pc_piix.c     |   2 +
 hw/i386/pc_q35.c      |   2 +
 hw/pci-host/i440fx.c  |   4 +-
 hw/pci-host/q35.c     |   4 +-
 include/hw/i386/pc.h  |  62 ++++++++++++-
 include/hw/i386/x86.h |   4 +
 target/i386/cpu.h     |   3 +
 9 files changed, 288 insertions(+), 21 deletions(-)

-- 
2.17.1



