* [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
@ 2021-06-22 15:48 Joao Martins
  2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-22 15:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Paolo Bonzini, Igor Mammedov,
	Joao Martins, Suravee Suthikulpanit

Hey,

This series lets Qemu properly spawn i386 guests with >= 1Tb of RAM with VFIO,
particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA in a DMA_MAP ioctl is valid and
returns -EINVAL when it is not. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with an IOMMU have a hole at
the 1TB boundary which is *reserved* for HyperTransport I/O addresses, located
at FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3, for what those addresses mean.

VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
-EINVAL, consequently failing the creation of guests bigger than 1010G. Example
of the failure:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)

Prior to v5.4 we could map using these IOVAs, *but* that's still not the right thing
to do, and it could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST) or
spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155 [2]),
as documented in the links below.

This series tries to address that by dealing with this AMD-specific 1Tb hole,
similarly to how we deal with the 4G hole today in x86 in general. It is split
as follows:

* patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
           which is also used in other parts of pc/acpi besides MR creation. The
	   allowed IOVA *only* changes if it's an AMD host, so there is no
	   change for Intel. We walk the allowed ranges for memory above 4G,
	   and add an E820_RESERVED entry every time we find a hole (which is
	   at the 1TB boundary).

	   NOTE: For the purposes of this RFC, I rely on cpuid in hw/i386/pc.c,
	   but I understand that it doesn't cover the non-x86 host case running TCG.

	   Additionally, as an alternative to the hardcoded ranges we use
	   today, VFIO could advertise the platform's valid IOVA ranges
	   without necessarily requiring a PCI device to be added to the
	   vfio container. We would then fetch the valid IOVA ranges from
	   VFIO rather than hardcoding them. But sadly, that wouldn't work
	   for older hypervisors.

* patch 2 - 5: cover the remaining parts of the memory map, particularly
	       ACPI SRAT ranges, CMOS, hotplug, as well as the PCI 64-bit hole.

* patch 6: Add a machine property which is disabled for older machine types (<=6.0)
	   to keep things as is.

The 'consequence' of this approach is that we may need more than the
default phys-bits, e.g. a guest with 1024G will have ~13G placed after the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40.
I can't figure out a reasonable way to establish the required phys-bits
for the memory map in a dynamic way, especially considering that
today there's already a precedent of depending on the user to pick the right
value of phys-bits (regardless of this series).

Additionally, the reserved region is always added regardless of whether we have
VFIO devices, in order to cover the VFIO device hotplug case.

Other options considered:

a) Consider the reserved range part of RAM, and just mark it as
E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
only allocate 1010G of RAM and the remaining would be marked reserved.
This is not what we do today for the 4G hole, i.e. the RAM
actually allocated is the value specified by the user and is thus all
available to the guest (IIUC).

b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a), but done at a
later stage when RAM MRs are already allocated at the invalid GPAs. Albeit that
alone wouldn't fix the way MRs are laid out, which is where the problem
fundamentally is.

The proposed approach in this series works regardless of the kernel, and is
relevant for both old and new Qemu.

Open to alternatives/comments/suggestions that I should pursue instead.

	Joao

[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Joao Martins (6):
  i386/pc: Account IOVA reserved ranges above 4G boundary
  i386/pc: Round up the hotpluggable memory within valid IOVA ranges
  pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
  i386/pc: Keep PCI 64-bit hole within usable IOVA space
  i386/acpi: Fix SRAT ranges in accordance to usable IOVA
  i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs

 hw/i386/acpi-build.c  |  22 ++++-
 hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
 hw/i386/pc_piix.c     |   2 +
 hw/i386/pc_q35.c      |   2 +
 hw/pci-host/i440fx.c  |   4 +-
 hw/pci-host/q35.c     |   4 +-
 include/hw/i386/pc.h  |  62 ++++++++++++-
 include/hw/i386/x86.h |   4 +
 target/i386/cpu.h     |   3 +
 9 files changed, 288 insertions(+), 21 deletions(-)

-- 
2.17.1




* [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
@ 2021-06-22 15:49 ` Joao Martins
  2021-06-23  7:11   ` Igor Mammedov
  2021-06-23  9:03   ` Igor Mammedov
  2021-06-22 15:49 ` [PATCH RFC 2/6] i386/pc: Round up the hotpluggable memory within valid IOVA ranges Joao Martins
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-22 15:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Paolo Bonzini, Igor Mammedov,
	Joao Martins, Suravee Suthikulpanit

It is assumed that the whole GPA space is available to be
DMA addressable, within a given address space limit. Since
v5.4 that is no longer true, and VFIO will validate whether
the selected IOVA is indeed valid, i.e. not reserved by the
IOMMU on behalf of some specific devices or the platform.

AMD systems with an IOMMU are examples of such platforms and
particularly may export only these ranges as allowed:

	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)

We already account for the 4G hole, albeit if the guest is
big enough (>1010G) we will fail to allocate it, given the
~12G hole at the 1Tb boundary reserved for HyperTransport.

When creating the region above 4G, take into account which
IOVAs are allowed by defining the known allowed ranges
and searching for the next free IOVA range. When finding an
invalid IOVA range, mark it as reserved and proceed to the
next allowed IOVA region.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000100000000-000000fcffffffff (prio 0, i/o):
	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
0000010000000000-000001037fffffff (prio 0, i/o):
	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff

Co-developed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c         | 103 +++++++++++++++++++++++++++++++++++++++----
 include/hw/i386/pc.h |  57 ++++++++++++++++++++++++
 target/i386/cpu.h    |   3 ++
 3 files changed, 154 insertions(+), 9 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index c6d8d0d84d91..52a5473ba846 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -91,6 +91,7 @@
 #include "qapi/qmp/qerror.h"
 #include "e820_memory_layout.h"
 #include "fw_cfg.h"
+#include "target/i386/cpu.h"
 #include "trace.h"
 #include CONFIG_DEVICES
 
@@ -860,6 +861,93 @@ void xen_load_linux(PCMachineState *pcms)
     x86ms->fw_cfg = fw_cfg;
 }
 
+struct GPARange usable_iova_ranges[] = {
+    { .start = 4 * GiB, .end = UINT64_MAX, .name = "ram-above-4g" },
+
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb boundary, with special GPAs that cannot be DMA mapped. Depending
+ * on the kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. The range is:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define DEFAULT_NR_USABLE_IOVAS 1
+#define AMD_MAX_PHYSADDR_BELOW_1TB  0xfcffffffff
+    { .start = 1 * TiB, .end = UINT64_MAX, .name = "ram-above-1t" },
+};
+
+uint32_t nb_iova_ranges = DEFAULT_NR_USABLE_IOVAS;
+
+static void init_usable_iova_ranges(void)
+{
+    uint32_t eax, vendor[3];
+
+    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
+    if (IS_AMD_VENDOR(vendor)) {
+        usable_iova_ranges[0].end = AMD_MAX_PHYSADDR_BELOW_1TB;
+        nb_iova_ranges++;
+    }
+}
+
+static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
+                                hwaddr base, hwaddr size, hwaddr offset)
+{
+    hwaddr start, region_size, resv_start, resv_end;
+    struct GPARange *range;
+    MemoryRegion *region;
+    uint32_t index;
+
+    for_each_usable_range(index, base, size, range, start, region_size) {
+        region = g_malloc(sizeof(*region));
+        memory_region_init_alias(region, NULL, range->name, ram,
+                                 offset, region_size);
+        memory_region_add_subregion(system_memory, start, region);
+        e820_add_entry(start, region_size, E820_RAM);
+
+        assert(size >= region_size);
+        if (size == region_size) {
+            return;
+        }
+
+        /*
+         * There's memory left to create a region for, so there should be
+         * another valid IOVA range left.  Creating the reserved region
+         * would also be pointless.
+         */
+        if (index + 1 == nb_iova_ranges) {
+            return;
+        }
+
+        resv_start = start + region_size;
+        resv_end = usable_iova_ranges[index + 1].start;
+
+        /* Create a reserved region in the IOVA hole. */
+        e820_add_entry(resv_start, resv_end - resv_start, E820_RESERVED);
+
+        offset += region_size;
+    }
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -867,7 +955,7 @@ void pc_memory_init(PCMachineState *pcms,
 {
     int linux_boot, i;
     MemoryRegion *option_rom_mr;
-    MemoryRegion *ram_below_4g, *ram_above_4g;
+    MemoryRegion *ram_below_4g;
     FWCfgState *fw_cfg;
     MachineState *machine = MACHINE(pcms);
     MachineClass *mc = MACHINE_GET_CLASS(machine);
@@ -877,6 +965,8 @@ void pc_memory_init(PCMachineState *pcms,
     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
 
+    init_usable_iova_ranges();
+
     linux_boot = (machine->kernel_filename != NULL);
 
     /*
@@ -888,16 +978,11 @@ void pc_memory_init(PCMachineState *pcms,
     memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
                              0, x86ms->below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
+
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
     if (x86ms->above_4g_mem_size > 0) {
-        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
-        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
-                                 machine->ram,
-                                 x86ms->below_4g_mem_size,
-                                 x86ms->above_4g_mem_size);
-        memory_region_add_subregion(system_memory, 0x100000000ULL,
-                                    ram_above_4g);
-        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
+        add_memory_region(system_memory, machine->ram, 4 * GiB,
+                          x86ms->above_4g_mem_size, x86ms->below_4g_mem_size);
     }
 
     if (!pcmc->has_reserved_memory &&
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 1522a3359a93..73b8e2900c72 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -151,6 +151,63 @@ void pc_guest_info_init(PCMachineState *pcms);
 #define PCI_HOST_BELOW_4G_MEM_SIZE     "below-4g-mem-size"
 #define PCI_HOST_ABOVE_4G_MEM_SIZE     "above-4g-mem-size"
 
+struct GPARange {
+    uint64_t start;
+    uint64_t end;
+#define range_len(r) ((r).end - (r).start + 1)
+
+    const char *name;
+};
+
+extern uint32_t nb_iova_ranges;
+extern struct GPARange usable_iova_ranges[];
+
+static inline void next_iova_range_index(uint32_t i, hwaddr base, hwaddr size,
+                                         hwaddr *start, hwaddr *region_size,
+                                         uint32_t *index, struct GPARange **r)
+{
+    struct GPARange *range;
+
+    while (i < nb_iova_ranges) {
+        range = &usable_iova_ranges[i];
+        if (range->end >= base) {
+            break;
+        }
+        i++;
+    }
+
+    *index = i;
+    /* index is out of bounds or no region left to process */
+    if (i >= nb_iova_ranges || !size) {
+        return;
+    }
+
+    *start = MAX(range->start, base);
+    *region_size = MIN(range->end - *start + 1, size);
+    *r = range;
+}
+
+/*
+ * for_each_usable_range() - iterates over usable IOVA regions
+ *
+ * @i:      range index within usable_iova_ranges
+ * @base:   requested address we want to use
+ * @size:   total size of the region with @base
+ * @r:      range indexed by @i within usable_iova_ranges
+ * @s:      calculated usable iova address base
+ * @i_size: calculated usable iova region size starting @s
+ *
+ * Use this iterator to walk through usable GPA ranges. Platforms
+ * such as AMD with IOMMU capable hardware restrict certain address
+ * ranges for HyperTransport. This iterator helper lets the user avoid
+ * using those reserved ranges.
+ */
+#define for_each_usable_range(i, base, size, r, s, i_size) \
+    for (s = 0, i_size = 0, r = NULL, \
+         next_iova_range_index(0, base, size, &s, &i_size, &i, &r); \
+         i < nb_iova_ranges && size != 0; \
+         size -= i_size, \
+         next_iova_range_index(i + 1, base, size, &s, &i_size, &i, &r))
 
 void pc_pci_as_mapping_init(Object *owner, MemoryRegion *system_memory,
                             MemoryRegion *pci_address_space);
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 1e11071d817b..f8f15a4523c6 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -869,6 +869,9 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
 #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
                          (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
                          (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
+#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
+                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
+                         (vendor[2]) == CPUID_VENDOR_AMD_3)
 
 #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
-- 
2.17.1




* [PATCH RFC 2/6] i386/pc: Round up the hotpluggable memory within valid IOVA ranges
  2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
  2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
@ 2021-06-22 15:49 ` Joao Martins
  2021-06-22 15:49 ` [PATCH RFC 3/6] pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary Joao Martins
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-22 15:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Paolo Bonzini, Igor Mammedov,
	Joao Martins, Suravee Suthikulpanit

When accounting for the allowed IOVA above the 4G hole, we also need
to ensure that the hotplugged memory sits within the allowed ranges.

Failing to do such validation means that when we hotplug memory
and DMA map it, the DMA_MAP ioctl() fails given the invalid IOVA
use, but it also leads to a catastrophic failure, exiting Qemu.

Similar to the region above 4G, we also need to create a
region for the [ram .. maxram] GPA range, and select one which
is within the allowed IOVA ranges, preventing any such failures
in the future.

Co-developed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 6 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 52a5473ba846..94497f22b908 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -909,7 +909,35 @@ static void init_usable_iova_ranges(void)
     }
 }
 
-static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
+static hwaddr allowed_round_up(hwaddr base, hwaddr size)
+{
+    hwaddr base_aligned = ROUND_UP(base, 1 * GiB), addr;
+    uint32_t index;
+
+    for (index = 0; index < nb_iova_ranges; index++) {
+        hwaddr min_iova, max_iova;
+
+        min_iova = usable_iova_ranges[index].start;
+        max_iova = usable_iova_ranges[index].end;
+
+        if (max_iova < base_aligned) {
+            continue;
+        }
+
+        addr = MAX(ROUND_UP(min_iova, 1 * GiB), base_aligned);
+        if (addr > max_iova) {
+            continue;
+        }
+
+        if (max_iova - addr >= size) {
+            return addr;
+        }
+    }
+
+    return 0;
+}
+
+static hwaddr add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
                                 hwaddr base, hwaddr size, hwaddr offset)
 {
     hwaddr start, region_size, resv_start, resv_end;
@@ -926,7 +954,7 @@ static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
 
         assert(size >= region_size);
         if (size == region_size) {
-            return;
+            return start + region_size;
         }
 
         /*
@@ -935,7 +963,7 @@ static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
          * would also be pointless.
          */
         if (index + 1 == nb_iova_ranges) {
-            return;
+            break;
         }
 
         resv_start = start + region_size;
@@ -946,6 +974,8 @@ static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
 
         offset += region_size;
     }
+
+    return 0;
 }
 
 void pc_memory_init(PCMachineState *pcms,
@@ -961,6 +991,7 @@ void pc_memory_init(PCMachineState *pcms,
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr maxram_start = 4 * GiB + x86ms->above_4g_mem_size;
 
     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
@@ -981,8 +1012,13 @@ void pc_memory_init(PCMachineState *pcms,
 
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
     if (x86ms->above_4g_mem_size > 0) {
-        add_memory_region(system_memory, machine->ram, 4 * GiB,
+        maxram_start = add_memory_region(system_memory, machine->ram, 4 * GiB,
                           x86ms->above_4g_mem_size, x86ms->below_4g_mem_size);
+        if (!maxram_start) {
+            error_report("unsupported amount of memory: %"PRIu64,
+                         x86ms->above_4g_mem_size);
+            exit(EXIT_FAILURE);
+        }
     }
 
     if (!pcmc->has_reserved_memory &&
@@ -1001,6 +1037,7 @@ void pc_memory_init(PCMachineState *pcms,
     if (pcmc->has_reserved_memory &&
         (machine->ram_size < machine->maxram_size)) {
         ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
+        hwaddr device_mem_base;
 
         if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
             error_report("unsupported amount of memory slots: %"PRIu64,
@@ -1015,8 +1052,14 @@ void pc_memory_init(PCMachineState *pcms,
             exit(EXIT_FAILURE);
         }
 
-        machine->device_memory->base =
-            ROUND_UP(0x100000000ULL + x86ms->above_4g_mem_size, 1 * GiB);
+        device_mem_base = allowed_round_up(maxram_start, device_mem_size);
+        if (!device_mem_base) {
+            error_report("unable to find device memory base for %"PRIu64
+                         " - %"PRIu64, maxram_start, device_mem_size);
+            exit(EXIT_FAILURE);
+        }
+
+        machine->device_memory->base = device_mem_base;
 
         if (pcmc->enforce_aligned_dimm) {
             /* size device region assuming 1G page max alignment per slot */
-- 
2.17.1




* [PATCH RFC 3/6] pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
  2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
  2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
  2021-06-22 15:49 ` [PATCH RFC 2/6] i386/pc: Round up the hotpluggable memory within valid IOVA ranges Joao Martins
@ 2021-06-22 15:49 ` Joao Martins
  2021-06-22 15:49 ` [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space Joao Martins
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-22 15:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Paolo Bonzini, Igor Mammedov,
	Joao Martins, Suravee Suthikulpanit

CMOS, unlike E820, doesn't have the notion of reserved spaces, so
limit the amount of memory above 4G to not account for the memory
above 1Tb.

Suggested-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c          | 14 ++++++++++++--
 include/hw/i386/x86.h |  4 ++++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 94497f22b908..2e2ea82a4661 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -644,8 +644,12 @@ void pc_cmos_init(PCMachineState *pcms,
         val = 65535;
     rtc_set_memory(s, 0x34, val);
     rtc_set_memory(s, 0x35, val >> 8);
-    /* memory above 4GiB */
-    val = x86ms->above_4g_mem_size / 65536;
+    /* memory above 4GiB but below 1TiB (where applicable) */
+    if (!x86ms->above_1t_mem_size) {
+        val = x86ms->above_4g_mem_size / 65536;
+    } else {
+        val = (x86ms->above_4g_mem_size - x86ms->above_1t_mem_size) / 65536;
+    }
     rtc_set_memory(s, 0x5b, val);
     rtc_set_memory(s, 0x5c, val >> 8);
     rtc_set_memory(s, 0x5d, val >> 16);
@@ -1019,6 +1023,12 @@ void pc_memory_init(PCMachineState *pcms,
                          x86ms->above_4g_mem_size);
             exit(EXIT_FAILURE);
         }
+
+        if (nb_iova_ranges != DEFAULT_NR_USABLE_IOVAS) {
+            x86ms->above_1t_maxram_start = maxram_start;
+            if (maxram_start > AMD_MAX_PHYSADDR_BELOW_1TB)
+                x86ms->above_1t_mem_size = maxram_start - 1 * TiB;
+        }
     }
 
     if (!pcmc->has_reserved_memory &&
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index 25a1f16f0121..cc22e30bd08c 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -57,6 +57,10 @@ struct X86MachineState {
     /* RAM information (sizes, addresses, configuration): */
     ram_addr_t below_4g_mem_size, above_4g_mem_size;
 
+    /* RAM information when there's a hole in 1Tb */
+    ram_addr_t above_1t_mem_size;
+    uint64_t above_1t_maxram_start;
+
     /* CPU and apic information: */
     bool apic_xrupt_override;
     unsigned pci_irq_mask;
-- 
2.17.1




* [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space
  2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
                   ` (2 preceding siblings ...)
  2021-06-22 15:49 ` [PATCH RFC 3/6] pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary Joao Martins
@ 2021-06-22 15:49 ` Joao Martins
  2021-06-23 12:30   ` Igor Mammedov
  2021-06-22 15:49 ` [PATCH RFC 5/6] i386/acpi: Fix SRAT ranges in accordance to usable IOVA Joao Martins
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Joao Martins @ 2021-06-22 15:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Paolo Bonzini, Igor Mammedov,
	Joao Martins, Suravee Suthikulpanit

pci_memory initialized by q35 and i440fx is set to a range
of 0 .. UINT64_MAX, and as a consequence, when ACPI and pci-host
pick the hole64_start they do not account for the allowed IOVA ranges.

Rather than blindly returning, round up the hole64_start
value to the allowable IOVA range, such that it accounts for
the 1Tb hole *on AMD*. On Intel it returns the input value
for hole64 start.

Suggested-by: David Edmondson <david.edmondson@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c         | 17 +++++++++++++++--
 hw/pci-host/i440fx.c |  4 +++-
 hw/pci-host/q35.c    |  4 +++-
 include/hw/i386/pc.h |  3 ++-
 4 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 2e2ea82a4661..65885cc16037 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1141,7 +1141,7 @@ void pc_memory_init(PCMachineState *pcms,
  * The 64bit pci hole starts after "above 4G RAM" and
  * potentially the space reserved for memory hotplug.
  */
-uint64_t pc_pci_hole64_start(void)
+uint64_t pc_pci_hole64_start(uint64_t size)
 {
     PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
@@ -1155,12 +1155,25 @@ uint64_t pc_pci_hole64_start(void)
             hole64_start += memory_region_size(&ms->device_memory->mr);
         }
     } else {
-        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
+        if (!x86ms->above_1t_mem_size) {
+            hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
+        } else {
+            hole64_start = x86ms->above_1t_maxram_start;
+        }
     }
+    hole64_start = allowed_round_up(hole64_start, size);
 
     return ROUND_UP(hole64_start, 1 * GiB);
 }
 
+uint64_t pc_pci_hole64_start_aligned(uint64_t start, uint64_t size)
+{
+    if (nb_iova_ranges == DEFAULT_NR_USABLE_IOVAS) {
+        return start;
+    }
+    return allowed_round_up(start, size);
+}
+
 DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus)
 {
     DeviceState *dev = NULL;
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index 28c9bae89944..e8eaebfe1034 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -163,8 +163,10 @@ static uint64_t i440fx_pcihost_get_pci_hole64_start_value(Object *obj)
     pci_bus_get_w64_range(h->bus, &w64);
     value = range_is_empty(&w64) ? 0 : range_lob(&w64);
     if (!value && s->pci_hole64_fix) {
-        value = pc_pci_hole64_start();
+        value = pc_pci_hole64_start(s->pci_hole64_size);
     }
+    /* This returns @value when not on AMD */
+    value = pc_pci_hole64_start_aligned(value, s->pci_hole64_size);
     return value;
 }
 
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 2eb729dff585..d556eb965ddb 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -126,8 +126,10 @@ static uint64_t q35_host_get_pci_hole64_start_value(Object *obj)
     pci_bus_get_w64_range(h->bus, &w64);
     value = range_is_empty(&w64) ? 0 : range_lob(&w64);
     if (!value && s->pci_hole64_fix) {
-        value = pc_pci_hole64_start();
+        value = pc_pci_hole64_start(s->mch.pci_hole64_size);
     }
+    /* This returns @value when not on AMD */
+    value = pc_pci_hole64_start_aligned(value, s->mch.pci_hole64_size);
     return value;
 }
 
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 73b8e2900c72..b924aef3a218 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -217,7 +217,8 @@ void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
                     MemoryRegion **ram_memory);
-uint64_t pc_pci_hole64_start(void);
+uint64_t pc_pci_hole64_start(uint64_t size);
+uint64_t pc_pci_hole64_start_aligned(uint64_t value, uint64_t size);
 DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
 void pc_basic_device_init(struct PCMachineState *pcms,
                           ISABus *isa_bus, qemu_irq *gsi,
-- 
2.17.1




* [PATCH RFC 5/6] i386/acpi: Fix SRAT ranges in accordance to usable IOVA
  2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
                   ` (3 preceding siblings ...)
  2021-06-22 15:49 ` [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space Joao Martins
@ 2021-06-22 15:49 ` Joao Martins
  2021-06-22 15:49 ` [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs Joao Martins
  2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
  6 siblings, 0 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-22 15:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Paolo Bonzini, Igor Mammedov,
	Joao Martins, Suravee Suthikulpanit

On configurations that lead to the creation of an SRAT with PXM entries
(-numa ...), Linux tends to ignore the ranges from SRAT because E820 and
SRAT do not match, thus breaking NUMA topology in the guest.

When we start adding the ranges after the 4G hole, use the newly added
iterator in add_srat_memory() to create the SRAT PXM entries for the
usable GPA regions.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/acpi-build.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 796ffc6f5c40..bb0918025296 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -57,6 +57,7 @@
 #include "hw/acpi/pcihp.h"
 #include "hw/i386/fw_cfg.h"
 #include "hw/i386/ich9.h"
+#include "hw/i386/pc.h"
 #include "hw/pci/pci_bus.h"
 #include "hw/pci-host/q35.h"
 #include "hw/i386/x86-iommu.h"
@@ -1872,6 +1873,23 @@ build_tpm_tcpa(GArray *table_data, BIOSLinker *linker, GArray *tcpalog,
 #define HOLE_640K_START  (640 * KiB)
 #define HOLE_640K_END   (1 * MiB)
 
+static hwaddr add_srat_memory(hwaddr base, hwaddr size, GArray *table_data,
+                              int pxm)
+{
+    AcpiSratMemoryAffinity *numamem;
+    hwaddr start, region_size;
+    struct GPARange *range;
+    uint32_t index;
+
+    for_each_usable_range(index, base, size, range, start, region_size) {
+        numamem = acpi_data_push(table_data, sizeof *numamem);
+        build_srat_memory(numamem, start, region_size, pxm,
+                          MEM_AFFINITY_ENABLED);
+    }
+
+    return start + region_size;
+}
+
 static void
 build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 {
@@ -1967,9 +1985,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
         }
 
         if (mem_len > 0) {
-            numamem = acpi_data_push(table_data, sizeof *numamem);
-            build_srat_memory(numamem, mem_base, mem_len, i - 1,
-                              MEM_AFFINITY_ENABLED);
+            next_base = add_srat_memory(mem_base, mem_len, table_data, i - 1);
         }
     }
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
  2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
                   ` (4 preceding siblings ...)
  2021-06-22 15:49 ` [PATCH RFC 5/6] i386/acpi: Fix SRAT ranges in accordance to usable IOVA Joao Martins
@ 2021-06-22 15:49 ` Joao Martins
  2021-06-23  9:18   ` Igor Mammedov
  2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
  6 siblings, 1 reply; 38+ messages in thread
From: Joao Martins @ 2021-06-22 15:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Paolo Bonzini, Igor Mammedov,
	Joao Martins, Suravee Suthikulpanit

The added enforcement is only relevant in the case of AMD, where the
range right before 1TB is restricted and cannot be DMA mapped by the
kernel, consequently leading to IOMMU INVALID_DEVICE_REQUEST or possibly
other kinds of IOMMU events in the AMD IOMMU.

However, there is a case where it may make sense to disable the IOVA
relocation/validation: when migrating from a non-valid-IOVA-aware QEMU
to one that supports it.

Relocating RAM regions to after the 1Tb hole has consequences for guest
ABI because we are changing the memory mapping, so it may make sense to
allow the admin to disable the validation (e.g. upon migration) in order
to either 1) fail early when the VFIO DMA_MAP ioctl fails, thus
preventing the migration from happening 'gracefully', or 2) allow
booting a guest unchanged from the source host, without risking the PCI
mmio hole64 or other things we consider part of the valid IOVA range
changing underneath the guest.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c         | 29 +++++++++++++++++++++++++++--
 hw/i386/pc_piix.c    |  2 ++
 hw/i386/pc_q35.c     |  2 ++
 include/hw/i386/pc.h |  2 ++
 4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 65885cc16037..eb08a6d1a2b9 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -902,10 +902,14 @@ struct GPARange usable_iova_ranges[] = {
 
  uint32_t nb_iova_ranges = DEFAULT_NR_USABLE_IOVAS;
 
-static void init_usable_iova_ranges(void)
+static void init_usable_iova_ranges(PCMachineClass *pcmc)
 {
     uint32_t eax, vendor[3];
 
+    if (!pcmc->enforce_valid_iova) {
+        return;
+    }
+
     host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
     if (IS_AMD_VENDOR(vendor)) {
         usable_iova_ranges[0].end = AMD_MAX_PHYSADDR_BELOW_1TB;
@@ -1000,7 +1004,7 @@ void pc_memory_init(PCMachineState *pcms,
     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
 
-    init_usable_iova_ranges();
+    init_usable_iova_ranges(pcmc);
 
     linux_boot = (machine->kernel_filename != NULL);
 
@@ -1685,6 +1689,23 @@ static void pc_machine_set_hpet(Object *obj, bool value, Error **errp)
     pcms->hpet_enabled = value;
 }
 
+static bool pc_machine_get_enforce_valid_iova(Object *obj, Error **errp)
+{
+    PCMachineState *pcms = PC_MACHINE(obj);
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+
+    return pcmc->enforce_valid_iova;
+}
+
+static void pc_machine_set_enforce_valid_iova(Object *obj, bool value,
+                                                  Error **errp)
+{
+    PCMachineState *pcms = PC_MACHINE(obj);
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+
+    pcmc->enforce_valid_iova = value;
+}
+
 static void pc_machine_get_max_ram_below_4g(Object *obj, Visitor *v,
                                             const char *name, void *opaque,
                                             Error **errp)
@@ -1851,6 +1872,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
     pcmc->has_reserved_memory = true;
     pcmc->kvmclock_enabled = true;
     pcmc->enforce_aligned_dimm = true;
+    pcmc->enforce_valid_iova = true;
     /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
      * to be used at the moment, 32K should be enough for a while.  */
     pcmc->acpi_data_size = 0x20000 + 0x8000;
@@ -1913,6 +1935,9 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
         NULL, NULL);
     object_class_property_set_description(oc, PC_MACHINE_MAX_FW_SIZE,
         "Maximum combined firmware size");
+
+    object_class_property_add_bool(oc, PC_MACHINE_ENFORCE_VALID_IOVA,
+        pc_machine_get_enforce_valid_iova, pc_machine_set_enforce_valid_iova);
 }
 
 static const TypeInfo pc_machine_info = {
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 30b8bd6ea92d..21a08e2f6a4c 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -427,11 +427,13 @@ DEFINE_I440FX_MACHINE(v6_1, "pc-i440fx-6.1", NULL,
 
 static void pc_i440fx_6_0_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_i440fx_6_1_machine_options(m);
     m->alias = NULL;
     m->is_default = false;
     compat_props_add(m->compat_props, hw_compat_6_0, hw_compat_6_0_len);
     compat_props_add(m->compat_props, pc_compat_6_0, pc_compat_6_0_len);
+    pcmc->enforce_valid_iova = false;
 }
 
 DEFINE_I440FX_MACHINE(v6_0, "pc-i440fx-6.0", NULL,
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 46a0f196f413..80bb89a9bae1 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -357,10 +357,12 @@ DEFINE_Q35_MACHINE(v6_1, "pc-q35-6.1", NULL,
 
 static void pc_q35_6_0_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_q35_6_1_machine_options(m);
     m->alias = NULL;
     compat_props_add(m->compat_props, hw_compat_6_0, hw_compat_6_0_len);
     compat_props_add(m->compat_props, pc_compat_6_0, pc_compat_6_0_len);
+    pcmc->enforce_valid_iova = false;
 }
 
 DEFINE_Q35_MACHINE(v6_0, "pc-q35-6.0", NULL,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index b924aef3a218..7337f6f2d014 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -63,6 +63,7 @@ typedef struct PCMachineState {
 #define PC_MACHINE_SATA             "sata"
 #define PC_MACHINE_PIT              "pit"
 #define PC_MACHINE_MAX_FW_SIZE      "max-fw-size"
+#define PC_MACHINE_ENFORCE_VALID_IOVA  "enforce-valid-iova"
 /**
  * PCMachineClass:
  *
@@ -113,6 +114,7 @@ struct PCMachineClass {
     bool has_reserved_memory;
     bool enforce_aligned_dimm;
     bool broken_reserved_end;
+    bool enforce_valid_iova;
 
     /* generate legacy CPU hotplug AML */
     bool legacy_cpu_hotplug;
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
                   ` (5 preceding siblings ...)
  2021-06-22 15:49 ` [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs Joao Martins
@ 2021-06-22 21:16 ` Alex Williamson
  2021-06-23  7:40   ` David Edmondson
  2021-06-23  9:30   ` Joao Martins
  6 siblings, 2 replies; 38+ messages in thread
From: Alex Williamson @ 2021-06-22 21:16 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Auger Eric,
	Suravee Suthikulpanit, Igor Mammedov, Paolo Bonzini

On Tue, 22 Jun 2021 16:48:59 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Hey,
> 
> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
> when running on AMD systems with an IOMMU.
> 
> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> affected by this extra validation. But AMD systems with IOMMU have a hole in
> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> 
> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> of the failure:
> 
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> 	failed to setup container for group 258: memory listener initialization failed:
> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> 
> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> as documented on the links down below.
> 
> This series tries to address that by dealing with this AMD-specific 1Tb hole,
> similarly to how we deal with the 4G hole today in x86 in general. It is split
> up as follows:
> 
> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>            which gets used too in other parts of pc/acpi besides MR creation. The
> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
> 	   Intel. We walk the allowed ranges for memory above 4G, and
> 	   add an E820_RESERVED type every time we find a hole (which is at
> 	   the 1TB boundary).
> 	   
> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
> 	   understand that it doesn't cover the non-x86 host case running TCG.
> 
> 	   Additionally, as an alternative to the hardcoded ranges we use today,
> 	   VFIO could advertise the platform's valid IOVA ranges without
> 	   necessarily requiring a PCI device to be added to the vfio container.
> 	   That would mean fetching the valid IOVA ranges from VFIO, rather than
> 	   hardcoding them as we do today. But sadly, that wouldn't work for
> 	   older hypervisors.


$ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
0x00000000fee00000 0x00000000feefffff msi
0x000000fd00000000 0x000000ffffffffff reserved

Ideally we might take that into account on all hosts, but of course
then we run into massive compatibility issues when we consider
migration.  We run into similar problems when people try to assign
devices to non-x86 TCG hosts, where the arch doesn't have a natural
memory hole overlapping the msi range.

The issue here is similar to trying to find a set of supported CPU
flags across hosts, QEMU only has visibility to the host where it runs,
an upper level tool needs to be able to pass through information about
compatibility to all possible migration targets.  Towards that end, we
should probably have command line options that allow specifying either
specific usable or reserved GPA address ranges.  For example, something
like:
	--reserved-mem-ranges=host

Or explicitly:

	--reserved-mem-ranges=13G@1010G,1M@4078M

> * patch 2 - 5: cover the remaining parts of the surrounding the mem map, particularly
> 	       ACPI SRAT ranges, CMOS, hotplug as well as the PCI 64-bit hole.
> 
> * patch 6: Add a machine property which is disabled for older machine types (<=6.0)
> 	   to keep things as is.
> 
> The 'consequence' of this approach is that we may need more than the default
> phys-bits, e.g. a guest with 1024G will have ~13G placed after the 1TB
> address, consequently needing 41 phys-bits as opposed to the default of 40.
> I can't figure out a reasonable way to establish the required phys-bits we
> need for the memory map in a dynamic way, especially considering that
> today there's already a precedent to depend on the user to pick the right value
> of phys-bits (regardless of this series).
> 
> Additionally, the reserved region is always added, regardless of whether we
> have VFIO devices, in order to cover the VFIO device hotplug case.

Various migration issues as you note later in the series.

> Other options considered:
> 
> a) Consider the reserved range part of RAM, and just marking it as
> E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
> only allocate 1010G of RAM and the remaining would be marked reserved.
> This is not what we do today for the 4G hole, i.e. the RAM
> actually allocated is the value specified by the user and is thus the RAM
> available to the guest (IIUC).
> 
> b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but done at a
> later stage when RAM MRs are already allocated at the invalid GPAs. That
> alone, though, wouldn't fix the way MRs are laid out, which is where the
> problem fundamentally lies.

Data corruption with b) should the guest ever use memory within this
range as a DMA target.  Thanks,

Alex
 
> The proposed approach in this series works regardless of the kernel, and
> relevant for old and new Qemu.
> 
> Open to alternatives/comments/suggestions that I should pursue instead.
> 
> 	Joao
> 
> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> 
> Joao Martins (6):
>   i386/pc: Account IOVA reserved ranges above 4G boundary
>   i386/pc: Round up the hotpluggable memory within valid IOVA ranges
>   pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
>   i386/pc: Keep PCI 64-bit hole within usable IOVA space
>   i386/acpi: Fix SRAT ranges in accordance to usable IOVA
>   i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
> 
>  hw/i386/acpi-build.c  |  22 ++++-
>  hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
>  hw/i386/pc_piix.c     |   2 +
>  hw/i386/pc_q35.c      |   2 +
>  hw/pci-host/i440fx.c  |   4 +-
>  hw/pci-host/q35.c     |   4 +-
>  include/hw/i386/pc.h  |  62 ++++++++++++-
>  include/hw/i386/x86.h |   4 +
>  target/i386/cpu.h     |   3 +
>  9 files changed, 288 insertions(+), 21 deletions(-)
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
@ 2021-06-23  7:11   ` Igor Mammedov
  2021-06-23  9:37     ` Joao Martins
  2021-06-23  9:03   ` Igor Mammedov
  1 sibling, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2021-06-23  7:11 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On Tue, 22 Jun 2021 16:49:00 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> It is assumed that the whole GPA space is available to be
> DMA addressable, within a given address space limit. Since
> v5.4-based kernels that is no longer true, and VFIO will validate
> whether the selected IOVA is indeed valid, i.e. not reserved by the
> IOMMU on behalf of some specific devices or by the platform.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may export only these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
> 
> We already account for the 4G hole, but if the guest is big
> enough we will fail to allocate a >1010G guest, given the ~12G
> hole at the 1Tb boundary reserved for HyperTransport.
> 
> When creating the region above 4G, take into account which
> IOVAs are allowed by defining the known allowed ranges
> and searching for the next free IOVA ranges. When we find an
> invalid IOVA we mark it as reserved and proceed to the
> next allowed IOVA region.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000100000000-000000fcffffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> 0000010000000000-000001037fffffff (prio 0, i/o):
> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff

why not push the whole ram-above-4g region above the 1Tb mark
when RAM is sufficiently large (regardless of the host used),
instead of creating yet another hole and all the complexity it brings along?


> Co-developed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c         | 103 +++++++++++++++++++++++++++++++++++++++----
>  include/hw/i386/pc.h |  57 ++++++++++++++++++++++++
>  target/i386/cpu.h    |   3 ++
>  3 files changed, 154 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index c6d8d0d84d91..52a5473ba846 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -91,6 +91,7 @@
>  #include "qapi/qmp/qerror.h"
>  #include "e820_memory_layout.h"
>  #include "fw_cfg.h"
> +#include "target/i386/cpu.h"
>  #include "trace.h"
>  #include CONFIG_DEVICES
>  
> @@ -860,6 +861,93 @@ void xen_load_linux(PCMachineState *pcms)
>      x86ms->fw_cfg = fw_cfg;
>  }
>  
> +struct GPARange usable_iova_ranges[] = {
> +    { .start = 4 * GiB, .end = UINT64_MAX, .name = "ram-above-4g" },
> +
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. The range is:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define DEFAULT_NR_USABLE_IOVAS 1
> +#define AMD_MAX_PHYSADDR_BELOW_1TB  0xfcffffffff
> +    { .start = 1 * TiB, .end = UINT64_MAX, .name = "ram-above-1t" },
> +};
> +
> + uint32_t nb_iova_ranges = DEFAULT_NR_USABLE_IOVAS;
> +
> +static void init_usable_iova_ranges(void)
> +{
> +    uint32_t eax, vendor[3];
> +
> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> +    if (IS_AMD_VENDOR(vendor)) {
> +        usable_iova_ranges[0].end = AMD_MAX_PHYSADDR_BELOW_1TB;
> +        nb_iova_ranges++;
> +    }
> +}
> +
> +static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
> +                                hwaddr base, hwaddr size, hwaddr offset)
> +{
> +    hwaddr start, region_size, resv_start, resv_end;
> +    struct GPARange *range;
> +    MemoryRegion *region;
> +    uint32_t index;
> +
> +    for_each_usable_range(index, base, size, range, start, region_size) {
> +        region = g_malloc(sizeof(*region));
> +        memory_region_init_alias(region, NULL, range->name, ram,
> +                                 offset, region_size);
> +        memory_region_add_subregion(system_memory, start, region);
> +        e820_add_entry(start, region_size, E820_RAM);
> +
> +        assert(size >= region_size);
> +        if (size == region_size) {
> +            return;
> +        }
> +
> +        /*
> +         * There's memory left to create a region for, so there should be
> +         * another valid IOVA range left.  Creating the reserved region
> +         * would also be pointless.
> +         */
> +        if (index + 1 == nb_iova_ranges) {
> +            return;
> +        }
> +
> +        resv_start = start + region_size;
> +        resv_end = usable_iova_ranges[index + 1].start;
> +
> +        /* Create a reserved region in the IOVA hole. */
> +        e820_add_entry(resv_start, resv_end - resv_start, E820_RESERVED);
> +
> +        offset += region_size;
> +    }
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -867,7 +955,7 @@ void pc_memory_init(PCMachineState *pcms,
>  {
>      int linux_boot, i;
>      MemoryRegion *option_rom_mr;
> -    MemoryRegion *ram_below_4g, *ram_above_4g;
> +    MemoryRegion *ram_below_4g;
>      FWCfgState *fw_cfg;
>      MachineState *machine = MACHINE(pcms);
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
> @@ -877,6 +965,8 @@ void pc_memory_init(PCMachineState *pcms,
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
>  
> +    init_usable_iova_ranges();
> +
>      linux_boot = (machine->kernel_filename != NULL);
>  
>      /*
> @@ -888,16 +978,11 @@ void pc_memory_init(PCMachineState *pcms,
>      memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> +
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>      if (x86ms->above_4g_mem_size > 0) {
> -        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> -        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> -                                 machine->ram,
> -                                 x86ms->below_4g_mem_size,
> -                                 x86ms->above_4g_mem_size);
> -        memory_region_add_subregion(system_memory, 0x100000000ULL,
> -                                    ram_above_4g);
> -        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
> +        add_memory_region(system_memory, machine->ram, 4 * GiB,
> +                          x86ms->above_4g_mem_size, x86ms->below_4g_mem_size);
>      }
>  
>      if (!pcmc->has_reserved_memory &&
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 1522a3359a93..73b8e2900c72 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -151,6 +151,63 @@ void pc_guest_info_init(PCMachineState *pcms);
>  #define PCI_HOST_BELOW_4G_MEM_SIZE     "below-4g-mem-size"
>  #define PCI_HOST_ABOVE_4G_MEM_SIZE     "above-4g-mem-size"
>  
> +struct GPARange {
> +    uint64_t start;
> +    uint64_t end;
> +#define range_len(r) ((r).end - (r).start + 1)
> +
> +    const char *name;
> +};
> +
> +extern uint32_t nb_iova_ranges;
> +extern struct GPARange usable_iova_ranges[];
> +
> +static inline void next_iova_range_index(uint32_t i, hwaddr base, hwaddr size,
> +                                         hwaddr *start, hwaddr *region_size,
> +                                         uint32_t *index, struct GPARange **r)
> +{
> +    struct GPARange *range;
> +
> +    while (i < nb_iova_ranges) {
> +        range = &usable_iova_ranges[i];
> +        if (range->end >= base) {
> +            break;
> +        }
> +        i++;
> +    }
> +
> +    *index = i;
> +    /* index is out of bounds or no region left to process */
> +    if (i >= nb_iova_ranges || !size) {
> +        return;
> +    }
> +
> +    *start = MAX(range->start, base);
> +    *region_size = MIN(range->end - *start + 1, size);
> +    *r = range;
> +}
> +
> +/*
> + * for_each_usable_range() - iterates over usable IOVA regions
> + *
> + * @i:      range index within usable_iova_ranges
> + * @base:   requested address we want to use
> + * @size:   total size of the region with @base
> + * @r:      range indexed by @i within usable_iova_ranges
> + * @s:      calculated usable iova address base
> + * @i_size: calculated usable iova region size starting @s
> + *
> + * Use this iterator to walk through usable GPA ranges. Platforms
> + * such as AMD with IOMMU capable hardware restrict certain address
> + * ranges for Hyper Transport. This iterator helper lets user avoid
> + * using those said reserved ranges.
> + */
> +#define for_each_usable_range(i, base, size, r, s, i_size) \
> +    for (s = 0, i_size = 0, r = NULL, \
> +         next_iova_range_index(0, base, size, &s, &i_size, &i, &r); \
> +         i < nb_iova_ranges && size != 0; \
> +         size -= i_size, \
> +         next_iova_range_index(i + 1, base, size, &s, &i_size, &i, &r))
>  
>  void pc_pci_as_mapping_init(Object *owner, MemoryRegion *system_memory,
>                              MemoryRegion *pci_address_space);
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 1e11071d817b..f8f15a4523c6 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -869,6 +869,9 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
>  
>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
@ 2021-06-23  7:40   ` David Edmondson
  2021-06-23 19:13     ` Alex Williamson
  2021-06-23  9:30   ` Joao Martins
  1 sibling, 1 reply; 38+ messages in thread
From: David Edmondson @ 2021-06-23  7:40 UTC (permalink / raw)
  To: Alex Williamson, Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, Auger Eric, Suravee Suthikulpanit,
	Igor Mammedov, Paolo Bonzini

On Tuesday, 2021-06-22 at 15:16:29 -06, Alex Williamson wrote:

>> 	   Additionally, as an alternative to the hardcoded ranges we use today,
>> 	   VFIO could advertise the platform's valid IOVA ranges without
>> 	   necessarily requiring a PCI device to be added to the vfio container.
>> 	   That would mean fetching the valid IOVA ranges from VFIO, rather than
>> 	   hardcoding them as we do today. But sadly, that wouldn't work for
>> 	   older hypervisors.
>
>
> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> 0x00000000fee00000 0x00000000feefffff msi
> 0x000000fd00000000 0x000000ffffffffff reserved
>
> Ideally we might take that into account on all hosts, but of course
> then we run into massive compatibility issues when we consider
> migration.  We run into similar problems when people try to assign
> devices to non-x86 TCG hosts, where the arch doesn't have a natural
> memory hole overlapping the msi range.
>
> The issue here is similar to trying to find a set of supported CPU
> flags across hosts, QEMU only has visibility to the host where it runs,
> an upper level tool needs to be able to pass through information about
> compatibility to all possible migration targets.  Towards that end, we
> should probably have command line options that either allow to specify
> specific usable or reserved GPA address ranges.  For example something
> like:
> 	--reserved-mem-ranges=host
>
> Or explicitly:
>
> 	--reserved-mem-ranges=13G@1010G,1M@4078M

Would this not naturally be a property of a machine model?

dme.
-- 
Seems I'm not alone at being alone.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
  2021-06-23  7:11   ` Igor Mammedov
@ 2021-06-23  9:03   ` Igor Mammedov
  2021-06-23  9:51     ` Joao Martins
  2021-06-24  9:32     ` Dr. David Alan Gilbert
  1 sibling, 2 replies; 38+ messages in thread
From: Igor Mammedov @ 2021-06-23  9:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On Tue, 22 Jun 2021 16:49:00 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> It is assumed that the whole GPA space is available to be
> DMA addressable, within a given address space limit. Since
> v5.4-based kernels that is no longer true, and VFIO will validate
> whether the selected IOVA is indeed valid, i.e. not reserved by the
> IOMMU on behalf of some specific devices or by the platform.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may export only these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
> 
> We already account for the 4G hole, but if the guest is big
> enough we will fail to allocate a >1010G guest, given the ~12G
> hole at the 1Tb boundary reserved for HyperTransport.
> 
> When creating the region above 4G, take into account which
> IOVAs are allowed by defining the known allowed ranges
> and searching for the next free IOVA ranges. When we find an
> invalid IOVA we mark it as reserved and proceed to the
> next allowed IOVA region.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000100000000-000000fcffffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> 0000010000000000-000001037fffffff (prio 0, i/o):
> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff

You are talking here about GPAs, which are a guest-specific thing,
and then somehow they become tied to the host. For bystanders it's
not clear from the above commit message how both are related.
I'd add here an explicit explanation of how the AMD host relates to
GPAs, and clarify where you are talking about the guest side vs. the
host side.

Also, what about these use cases:
 * start QEMU with an Intel CPU model on an AMD host with Intel's IOMMU
 * start QEMU with an AMD CPU model and AMD's IOMMU on an Intel host
 * start QEMU in TCG mode on an AMD host (mostly from a qtest point of view)

> Co-developed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c         | 103 +++++++++++++++++++++++++++++++++++++++----
>  include/hw/i386/pc.h |  57 ++++++++++++++++++++++++
>  target/i386/cpu.h    |   3 ++
>  3 files changed, 154 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index c6d8d0d84d91..52a5473ba846 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -91,6 +91,7 @@
>  #include "qapi/qmp/qerror.h"
>  #include "e820_memory_layout.h"
>  #include "fw_cfg.h"
> +#include "target/i386/cpu.h"
>  #include "trace.h"
>  #include CONFIG_DEVICES
>  
> @@ -860,6 +861,93 @@ void xen_load_linux(PCMachineState *pcms)
>      x86ms->fw_cfg = fw_cfg;
>  }
>  
> +struct GPARange usable_iova_ranges[] = {
> +    { .start = 4 * GiB, .end = UINT64_MAX, .name = "ram-above-4g" },
> +
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. The range is:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define DEFAULT_NR_USABLE_IOVAS 1
> +#define AMD_MAX_PHYSADDR_BELOW_1TB  0xfcffffffff
> +    { .start = 1 * TiB, .end = UINT64_MAX, .name = "ram-above-1t" },
> +};
> +
> + uint32_t nb_iova_ranges = DEFAULT_NR_USABLE_IOVAS;
> +
> +static void init_usable_iova_ranges(void)
> +{
> +    uint32_t eax, vendor[3];
> +
> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> +    if (IS_AMD_VENDOR(vendor)) {
> +        usable_iova_ranges[0].end = AMD_MAX_PHYSADDR_BELOW_1TB;
> +        nb_iova_ranges++;
> +    }
> +}
> +
> +static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
> +                                hwaddr base, hwaddr size, hwaddr offset)
> +{
> +    hwaddr start, region_size, resv_start, resv_end;
> +    struct GPARange *range;
> +    MemoryRegion *region;
> +    uint32_t index;
> +
> +    for_each_usable_range(index, base, size, range, start, region_size) {
> +        region = g_malloc(sizeof(*region));
> +        memory_region_init_alias(region, NULL, range->name, ram,
> +                                 offset, region_size);
> +        memory_region_add_subregion(system_memory, start, region);
> +        e820_add_entry(start, region_size, E820_RAM);
> +
> +        assert(size >= region_size);
> +        if (size == region_size) {
> +            return;
> +        }
> +
> +        /*
> +         * There's memory left to create a region for, so there should be
> +         * another valid IOVA range left.  Creating the reserved region
> +         * would also be pointless.
> +         */
> +        if (index + 1 == nb_iova_ranges) {
> +            return;
> +        }
> +
> +        resv_start = start + region_size;
> +        resv_end = usable_iova_ranges[index + 1].start;
> +
> +        /* Create a reserved region in the IOVA hole. */
> +        e820_add_entry(resv_start, resv_end - resv_start, E820_RESERVED);
> +
> +        offset += region_size;
> +    }
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -867,7 +955,7 @@ void pc_memory_init(PCMachineState *pcms,
>  {
>      int linux_boot, i;
>      MemoryRegion *option_rom_mr;
> -    MemoryRegion *ram_below_4g, *ram_above_4g;
> +    MemoryRegion *ram_below_4g;
>      FWCfgState *fw_cfg;
>      MachineState *machine = MACHINE(pcms);
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
> @@ -877,6 +965,8 @@ void pc_memory_init(PCMachineState *pcms,
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
>  
> +    init_usable_iova_ranges();
> +
>      linux_boot = (machine->kernel_filename != NULL);
>  
>      /*
> @@ -888,16 +978,11 @@ void pc_memory_init(PCMachineState *pcms,
>      memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> +
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>      if (x86ms->above_4g_mem_size > 0) {
> -        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> -        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> -                                 machine->ram,
> -                                 x86ms->below_4g_mem_size,
> -                                 x86ms->above_4g_mem_size);
> -        memory_region_add_subregion(system_memory, 0x100000000ULL,
> -                                    ram_above_4g);
> -        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
> +        add_memory_region(system_memory, machine->ram, 4 * GiB,
> +                          x86ms->above_4g_mem_size, x86ms->below_4g_mem_size);
>      }
>  
>      if (!pcmc->has_reserved_memory &&
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 1522a3359a93..73b8e2900c72 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -151,6 +151,63 @@ void pc_guest_info_init(PCMachineState *pcms);
>  #define PCI_HOST_BELOW_4G_MEM_SIZE     "below-4g-mem-size"
>  #define PCI_HOST_ABOVE_4G_MEM_SIZE     "above-4g-mem-size"
>  
> +struct GPARange {
> +    uint64_t start;
> +    uint64_t end;
> +#define range_len(r) ((r).end - (r).start + 1)
> +
> +    const char *name;
> +};
> +
> +extern uint32_t nb_iova_ranges;
> +extern struct GPARange usable_iova_ranges[];
> +
> +static inline void next_iova_range_index(uint32_t i, hwaddr base, hwaddr size,
> +                                         hwaddr *start, hwaddr *region_size,
> +                                         uint32_t *index, struct GPARange **r)
> +{
> +    struct GPARange *range;
> +
> +    while (i < nb_iova_ranges) {
> +        range = &usable_iova_ranges[i];
> +        if (range->end >= base) {
> +            break;
> +        }
> +        i++;
> +    }
> +
> +    *index = i;
> +    /* index is out of bounds or no region left to process */
> +    if (i >= nb_iova_ranges || !size) {
> +        return;
> +    }
> +
> +    *start = MAX(range->start, base);
> +    *region_size = MIN(range->end - *start + 1, size);
> +    *r = range;
> +}
> +
> +/*
> + * for_each_usable_range() - iterates over usable IOVA regions
> + *
> + * @i:      range index within usable_iova_ranges
> + * @base:   requested address we want to use
> + * @size:   total size of the region with @base
> + * @r:      range indexed by @i within usable_iova_ranges
> + * @s:      calculated usable iova address base
> + * @i_size: calculated usable iova region size starting @s
> + *
> + * Use this iterator to walk through usable GPA ranges. Platforms
> + * such as AMD with IOMMU capable hardware restrict certain address
> + * ranges for Hyper Transport. This iterator helper lets user avoid
> + * using those said reserved ranges.
> + */
> +#define for_each_usable_range(i, base, size, r, s, i_size) \
> +    for (s = 0, i_size = 0, r = NULL, \
> +         next_iova_range_index(0, base, size, &s, &i_size, &i, &r); \
> +         i < nb_iova_ranges && size != 0; \
> +         size -= i_size, \
> +         next_iova_range_index(i + 1, base, size, &s, &i_size, &i, &r))
>  
>  void pc_pci_as_mapping_init(Object *owner, MemoryRegion *system_memory,
>                              MemoryRegion *pci_address_space);
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 1e11071d817b..f8f15a4523c6 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -869,6 +869,9 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
>  
>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
  2021-06-22 15:49 ` [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs Joao Martins
@ 2021-06-23  9:18   ` Igor Mammedov
  2021-06-23  9:59     ` Joao Martins
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2021-06-23  9:18 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On Tue, 22 Jun 2021 16:49:05 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> The added enforcement is only relevant in the case of AMD, where the range
> right before 1TB is restricted and cannot be DMA mapped by the
> kernel, consequently leading to IOMMU INVALID_DEVICE_REQUEST or possibly
> other kinds of IOMMU events in the AMD IOMMU.
> 
> However, there's a case where it may make sense to disable the IOVA
> relocation/validation when migrating from a non-valid-IOVA-aware qemu to
> one that supports it.

we can keep old machines broken, that's what you are doing in
 *_machine_options(), but for new machines it can be unconditionally
enforced.
So I miss the point of having a user-visible knob to disable that
(it's not like one would be able to find a new machine type that
does not support the new memory layout).
I'd drop the user-visible property and keep only the hunks needed for
*_machine_options().
 
> Relocating RAM regions to after the 1Tb hole has consequences for guest
> ABI because we are changing the memory mapping, and thus it may make
> sense to allow an admin to disable the validation (e.g. upon migration) to
> either 1) fail early when the VFIO DMA_MAP ioctl fails, thus preventing
> the migration from happening 'gracefully', or 2) allow booting a guest
> unchanged from the source host without risking the PCI mmio hole64
> or other things we consider in the valid IOVA range changing underneath
> the guest.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c         | 29 +++++++++++++++++++++++++++--
>  hw/i386/pc_piix.c    |  2 ++
>  hw/i386/pc_q35.c     |  2 ++
>  include/hw/i386/pc.h |  2 ++
>  4 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 65885cc16037..eb08a6d1a2b9 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -902,10 +902,14 @@ struct GPARange usable_iova_ranges[] = {
>  
>   uint32_t nb_iova_ranges = DEFAULT_NR_USABLE_IOVAS;
>  
> -static void init_usable_iova_ranges(void)
> +static void init_usable_iova_ranges(PCMachineClass *pcmc)
>  {
>      uint32_t eax, vendor[3];
>  
> +    if (!pcmc->enforce_valid_iova) {
> +        return;
> +    }
> +
>      host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>      if (IS_AMD_VENDOR(vendor)) {
>          usable_iova_ranges[0].end = AMD_MAX_PHYSADDR_BELOW_1TB;
> @@ -1000,7 +1004,7 @@ void pc_memory_init(PCMachineState *pcms,
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
>  
> -    init_usable_iova_ranges();
> +    init_usable_iova_ranges(pcmc);
>  
>      linux_boot = (machine->kernel_filename != NULL);
>  
> @@ -1685,6 +1689,23 @@ static void pc_machine_set_hpet(Object *obj, bool value, Error **errp)
>      pcms->hpet_enabled = value;
>  }
>  
> +static bool pc_machine_get_enforce_valid_iova(Object *obj, Error **errp)
> +{
> +    PCMachineState *pcms = PC_MACHINE(obj);
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +
> +    return pcmc->enforce_valid_iova;
> +}
> +
> +static void pc_machine_set_enforce_valid_iova(Object *obj, bool value,
> +                                                  Error **errp)
> +{
> +    PCMachineState *pcms = PC_MACHINE(obj);
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +
> +    pcmc->enforce_valid_iova = value;
> +}
> +
>  static void pc_machine_get_max_ram_below_4g(Object *obj, Visitor *v,
>                                              const char *name, void *opaque,
>                                              Error **errp)
> @@ -1851,6 +1872,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>      pcmc->has_reserved_memory = true;
>      pcmc->kvmclock_enabled = true;
>      pcmc->enforce_aligned_dimm = true;
> +    pcmc->enforce_valid_iova = true;
>      /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
>       * to be used at the moment, 32K should be enough for a while.  */
>      pcmc->acpi_data_size = 0x20000 + 0x8000;
> @@ -1913,6 +1935,9 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>          NULL, NULL);
>      object_class_property_set_description(oc, PC_MACHINE_MAX_FW_SIZE,
>          "Maximum combined firmware size");
> +
> +    object_class_property_add_bool(oc, PC_MACHINE_ENFORCE_VALID_IOVA,
> +        pc_machine_get_enforce_valid_iova, pc_machine_set_enforce_valid_iova);
>  }
>  
>  static const TypeInfo pc_machine_info = {
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index 30b8bd6ea92d..21a08e2f6a4c 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -427,11 +427,13 @@ DEFINE_I440FX_MACHINE(v6_1, "pc-i440fx-6.1", NULL,
>  
>  static void pc_i440fx_6_0_machine_options(MachineClass *m)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>      pc_i440fx_6_1_machine_options(m);
>      m->alias = NULL;
>      m->is_default = false;
>      compat_props_add(m->compat_props, hw_compat_6_0, hw_compat_6_0_len);
>      compat_props_add(m->compat_props, pc_compat_6_0, pc_compat_6_0_len);
> +    pcmc->enforce_valid_iova = false;
>  }
>  
>  DEFINE_I440FX_MACHINE(v6_0, "pc-i440fx-6.0", NULL,
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 46a0f196f413..80bb89a9bae1 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -357,10 +357,12 @@ DEFINE_Q35_MACHINE(v6_1, "pc-q35-6.1", NULL,
>  
>  static void pc_q35_6_0_machine_options(MachineClass *m)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>      pc_q35_6_1_machine_options(m);
>      m->alias = NULL;
>      compat_props_add(m->compat_props, hw_compat_6_0, hw_compat_6_0_len);
>      compat_props_add(m->compat_props, pc_compat_6_0, pc_compat_6_0_len);
> +    pcmc->enforce_valid_iova = false;
>  }
>  
>  DEFINE_Q35_MACHINE(v6_0, "pc-q35-6.0", NULL,
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index b924aef3a218..7337f6f2d014 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -63,6 +63,7 @@ typedef struct PCMachineState {
>  #define PC_MACHINE_SATA             "sata"
>  #define PC_MACHINE_PIT              "pit"
>  #define PC_MACHINE_MAX_FW_SIZE      "max-fw-size"
> +#define PC_MACHINE_ENFORCE_VALID_IOVA  "enforce-valid-iova"
>  /**
>   * PCMachineClass:
>   *
> @@ -113,6 +114,7 @@ struct PCMachineClass {
>      bool has_reserved_memory;
>      bool enforce_aligned_dimm;
>      bool broken_reserved_end;
> +    bool enforce_valid_iova;
>  
>      /* generate legacy CPU hotplug AML */
>      bool legacy_cpu_hotplug;




* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
  2021-06-23  7:40   ` David Edmondson
@ 2021-06-23  9:30   ` Joao Martins
  2021-06-23 11:58     ` Igor Mammedov
  2021-06-23 19:27     ` Alex Williamson
  1 sibling, 2 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-23  9:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Auger Eric,
	Suravee Suthikulpanit, Igor Mammedov, Paolo Bonzini

On 6/22/21 10:16 PM, Alex Williamson wrote:
> On Tue, 22 Jun 2021 16:48:59 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> Hey,
>>
>> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
>> when running on AMD systems with an IOMMU.
>>
>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>
>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>  -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
>> of the failure:
>>
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>> 	failed to setup container for group 258: memory listener initialization failed:
>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>
>> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>> as documented on the links down below.
>>
>> This series tries to address that by dealing with this AMD-specific 1Tb hole,
>> similarly to how we deal with the 4G hole today on x86 in general. It is split
>> as follows:
>>
>> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>>            which gets used too in other parts of pc/acpi besides MR creation. The
>> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
>> 	   Intel. We walk the allowed ranges for memory above 4G, and
>> 	   add an E820_RESERVED type every time we find a hole (which is at the
>> 	   1TB boundary).
>> 	   
>> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
>> 	   understand that it doesn't cover the non-x86 host case running TCG.
>>
>> 	   Additionally, as an alternative to the hardcoded ranges we use today,
>> 	   VFIO could advertise the platform's valid IOVA ranges without necessarily
>> 	   requiring a PCI device to be added to the VFIO container. That would mean
>> 	   fetching the valid IOVA ranges from VFIO, rather than hardcoding IOVA
>> 	   ranges as we do today. But sadly, that wouldn't work for older hypervisors.
> 
> 
> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> 0x00000000fee00000 0x00000000feefffff msi
> 0x000000fd00000000 0x000000ffffffffff reserved
> 
Yeap, I am aware.

The VFIO advertising extension was suggested just because we already advertise
the above info, albeit behind a non-empty VFIO container; we seem to use that,
for example, in collect_usable_iova_ranges().

> Ideally we might take that into account on all hosts, but of course
> then we run into massive compatibility issues when we consider
> migration.  We run into similar problems when people try to assign
> devices to non-x86 TCG hosts, where the arch doesn't have a natural
> memory hole overlapping the msi range.
> 
> The issue here is similar to trying to find a set of supported CPU
> flags across hosts, QEMU only has visibility to the host where it runs,
> an upper level tool needs to be able to pass through information about
> compatibility to all possible migration targets.

I agree with your generic sentiment (and idea), but are we sure this is really
something as dynamic, needing a common denominator, like CPU features? The memory
map looks to be deeply embedded in the devices (ARM) or machine model (x86) that
we pass in and doesn't change very often. pc/q35 is one very good example, because
this hasn't changed since its inception [a decade?] (and this limitation is there
only for any multi-socket AMD machine with an IOMMU and more than 1Tb). Additionally,
there might be architectural impositions, like on x86 where CMOS seems to tie in
with memory above certain boundaries. Unless by migration targets you mean to also
cover migrating between Intel and AMD hosts (which may need to keep the reserved
range in the common denominator nonetheless).

> Towards that end, we
> should probably have command line options that either allow to specify
> specific usable or reserved GPA address ranges.  For example something
> like:
> 	--reserved-mem-ranges=host
> 
> Or explicitly:
> 
> 	--reserved-mem-ranges=13G@1010G,1M@4078M
> 
I like the added option, particularly because it lets everyone work around
similar issues. I remember a series before that had similar issues on ARM
(but I can't remember now what it was).

>> * patch 2 - 5: cover the remaining parts of the surrounding the mem map, particularly
>> 	       ACPI SRAT ranges, CMOS, hotplug as well as the PCI 64-bit hole.
>>
>> * patch 6: Add a machine property which is disabled for older machine types (<=6.0)
>> 	   to keep things as is.
>>
>> The 'consequence' of this approach is that we may need more than the default
>> phys-bits, e.g. a guest with 1024G will have ~13G placed after the 1TB
>> address, consequently needing 41 phys-bits as opposed to the default of 40.
>> I can't figure out a reasonable way to establish the required phys-bits we
>> need for the memory map in a dynamic way, especially considering that
>> today there's already a precedent to depend on the user to pick the right value
>> of phys-bits (regardless of this series).
>>
>> Additionally, the reserved region is always added regardless of whether we have
>> VFIO devices to cover the VFIO device hotplug case.
> 
> Various migration issues as you note later in the series.
> 
/me nods

>> Other options considered:
>>
>> a) Consider the reserved range part of RAM, and just marking it as
>> E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
>> only allocate 1010G of RAM and the remaining would be marked reserved.
>> This is not what we do today for the 4G hole, i.e. the RAM
>> actually allocated is the value specified by the user and thus RAM available
>> to the guest (IIUC).
>>
>> b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but done at a
>> later stage when RAM MRs are already allocated at the invalid GPAs. But that
>> alone wouldn't fix the way MRs are laid out, which is where the problem
>> fundamentally is.
> 
> Data corruption with b) should the guest ever use memory within this
> range as a DMA target.  Thanks,
> 
Yeap.

> Alex
>  
>> The proposed approach in this series works regardless of the kernel, and is
>> relevant for old and new QEMU.
>>
>> Open to alternatives/comments/suggestions that I should pursue instead.
>>
>> 	Joao
>>
>> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
>> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
>>
>> Joao Martins (6):
>>   i386/pc: Account IOVA reserved ranges above 4G boundary
>>   i386/pc: Round up the hotpluggable memory within valid IOVA ranges
>>   pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
>>   i386/pc: Keep PCI 64-bit hole within usable IOVA space
>>   i386/acpi: Fix SRAT ranges in accordance to usable IOVA
>>   i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
>>
>>  hw/i386/acpi-build.c  |  22 ++++-
>>  hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
>>  hw/i386/pc_piix.c     |   2 +
>>  hw/i386/pc_q35.c      |   2 +
>>  hw/pci-host/i440fx.c  |   4 +-
>>  hw/pci-host/q35.c     |   4 +-
>>  include/hw/i386/pc.h  |  62 ++++++++++++-
>>  include/hw/i386/x86.h |   4 +
>>  target/i386/cpu.h     |   3 +
>>  9 files changed, 288 insertions(+), 21 deletions(-)
>>
> 



* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23  7:11   ` Igor Mammedov
@ 2021-06-23  9:37     ` Joao Martins
  2021-06-23 11:39       ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Joao Martins @ 2021-06-23  9:37 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On 6/23/21 8:11 AM, Igor Mammedov wrote:
> On Tue, 22 Jun 2021 16:49:00 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> It is assumed that the whole GPA space is available to be
>> DMA addressable, within a given address space limit. Since
>> v5.4 that is no longer true, and VFIO will validate whether
>> the selected IOVA is indeed valid, i.e. not reserved by the
>> IOMMU on behalf of some specific devices or the platform.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may export only these ranges as allowed:
>>
>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
>>
>> We already account for the 4G hole, but if the guest is big
>> enough we will fail to create a >1010G guest, given the ~12G
>> hole at the 1Tb boundary reserved for HyperTransport.
>>
>> When creating the region above 4G, take into account which
>> IOVAs are allowed by defining the known allowed ranges
>> and searching for the next free IOVA range. When finding an
>> invalid IOVA, we mark it as reserved and proceed to the
>> next allowed IOVA region.
>>
>> After accounting for the 1Tb hole on AMD hosts, mtree should
>> look like:
>>
>> 0000000100000000-000000fcffffffff (prio 0, i/o):
>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
>> 0000010000000000-000001037fffffff (prio 0, i/o):
>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff
> 
> why not push the whole ram-above-4g region above the 1Tb mark
> when RAM is sufficiently large (regardless of the host used),
> instead of creating yet another hole and all the complexity it brings along?
> 

There's the problem with CMOS, which describes memory above 4G; that's part of
the reason I cap it to 1TB minus the reserved range, i.e. for AMD, CMOS would
only describe up to 1T.

But should we not care about that, then it's an option, I suppose.

We would waste 1Tb of address space because of 12G, and btw the logic here
is not so different from the 4G hole; in fact, we could probably share this
logic with it.



* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23  9:03   ` Igor Mammedov
@ 2021-06-23  9:51     ` Joao Martins
  2021-06-23 12:09       ` Igor Mammedov
  2021-06-24  9:32     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 38+ messages in thread
From: Joao Martins @ 2021-06-23  9:51 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On 6/23/21 10:03 AM, Igor Mammedov wrote:
> On Tue, 22 Jun 2021 16:49:00 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> It is assumed that the whole GPA space is available to be
>> DMA addressable, within a given address space limit. Since
>> v5.4 that is no longer true, and VFIO will validate whether
>> the selected IOVA is indeed valid, i.e. not reserved by the
>> IOMMU on behalf of some specific devices or the platform.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may export only these ranges as allowed:
>>
>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
>>
>> We already account for the 4G hole, but if the guest is big
>> enough we will fail to create a >1010G guest, given the ~12G
>> hole at the 1Tb boundary reserved for HyperTransport.
>>
>> When creating the region above 4G, take into account which
>> IOVAs are allowed by defining the known allowed ranges
>> and searching for the next free IOVA range. When finding an
>> invalid IOVA, we mark it as reserved and proceed to the
>> next allowed IOVA region.
>>
>> After accounting for the 1Tb hole on AMD hosts, mtree should
>> look like:
>>
>> 0000000100000000-000000fcffffffff (prio 0, i/o):
>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
>> 0000010000000000-000001037fffffff (prio 0, i/o):
>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff
> 
> You are talking here about GPAs, which are a guest-specific thing,
> and then somehow they become tied to the host. For bystanders it's
> not clear from the above commit message how the two are related.
> I'd add here an explicit explanation of how the AMD host relates to
> GPAs and clarify where you are talking about the guest vs. host side.
> 
OK, makes sense.

Perhaps using IOVA makes it easier to understand. I said GPA because
there's a 1:1 mapping between GPA and IOVA (if you're not using a vIOMMU).

> also what about these use cases:
>  * start QEMU with an Intel CPU model on an AMD host with Intel's IOMMU

In principle it would be less likely to occur. But you would still need
to mark the same range as reserved. The limitation is on DMA occurring
on IOVAs (host or guest) coinciding with that range, so you would
want to inform the guest that at least those should be avoided.

>  * start QEMU with an AMD CPU model and AMD's IOMMU on an Intel host

Here you would probably only mark the range, solely for honoring how hardware
is usually represented. But really, on Intel, nothing stops you from exposing the
aforementioned range as RAM.

>  * start QEMU in TCG mode on an AMD host (mostly from a qtest point of view)
> 
This one is tricky. Because you can hotplug a VFIO device later on,
I opted for always marking the reserved range. If you don't use VFIO you're
good, but otherwise you would still need the reserved range. I am not sure
how qtest is used today for testing huge guests, though.



* Re: [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
  2021-06-23  9:18   ` Igor Mammedov
@ 2021-06-23  9:59     ` Joao Martins
  0 siblings, 0 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-23  9:59 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini



On 6/23/21 10:18 AM, Igor Mammedov wrote:
> On Tue, 22 Jun 2021 16:49:05 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> The added enforcing is only relevant in the case of AMD where the range
>> right before the 1TB is restricted and cannot be DMA mapped by the
>> kernel consequently leading to IOMMU INVALID_DEVICE_REQUEST or possibly
>> other kinds of IOMMU events in the AMD IOMMU.
>>
>> Although, there's a case where it may make sense to disable the IOVA
>> relocation/validation when migrating from a non-valid-IOVA-aware qemu to
>> one that supports it.
> 
> we can keep old machines broken, that's what you are doing in
>  *_machine_options() but for new machine it can be unconditionally
> enforced.

Yeap.

> So I miss the point of having user visible knob to disable that.
> (it's not like one would be able to find a new machine type that
> do not support new memory layout)
> I'd drop user visible property and keep only hunks need for
> *_machine_options().

OK, I'll remove it.

This was aimed at just not changing guest ABI for older machine types, with
migration in mind.

>> Relocating RAM regions to after the 1Tb hole has consequences for guest
>> ABI because we are changing the memory mapping, and thus it may make
>> sense to allow admin to disable the validation (e.g. upon migration) to
>> either 1) Fail early when the VFIO DMA_MAP ioctl fails thus preventing
>> the migration to happen 'gracefully' or 2) allow booting a guest
>> unchanged from source host without risking changing the PCI mmio hole64
>> or other things we consider in the valid IOVA range changing underneath
>> the guest.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c         | 29 +++++++++++++++++++++++++++--
>>  hw/i386/pc_piix.c    |  2 ++
>>  hw/i386/pc_q35.c     |  2 ++
>>  include/hw/i386/pc.h |  2 ++
>>  4 files changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 65885cc16037..eb08a6d1a2b9 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -902,10 +902,14 @@ struct GPARange usable_iova_ranges[] = {
>>  
>>   uint32_t nb_iova_ranges = DEFAULT_NR_USABLE_IOVAS;
>>  
>> -static void init_usable_iova_ranges(void)
>> +static void init_usable_iova_ranges(PCMachineClass *pcmc)
>>  {
>>      uint32_t eax, vendor[3];
>>  
>> +    if (!pcmc->enforce_valid_iova) {
>> +        return;
>> +    }
>> +
>>      host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>      if (IS_AMD_VENDOR(vendor)) {
>>          usable_iova_ranges[0].end = AMD_MAX_PHYSADDR_BELOW_1TB;
>> @@ -1000,7 +1004,7 @@ void pc_memory_init(PCMachineState *pcms,
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>>  
>> -    init_usable_iova_ranges();
>> +    init_usable_iova_ranges(pcmc);
>>  
>>      linux_boot = (machine->kernel_filename != NULL);
>>  
>> @@ -1685,6 +1689,23 @@ static void pc_machine_set_hpet(Object *obj, bool value, Error **errp)
>>      pcms->hpet_enabled = value;
>>  }
>>  
>> +static bool pc_machine_get_enforce_valid_iova(Object *obj, Error **errp)
>> +{
>> +    PCMachineState *pcms = PC_MACHINE(obj);
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +
>> +    return pcmc->enforce_valid_iova;
>> +}
>> +
>> +static void pc_machine_set_enforce_valid_iova(Object *obj, bool value,
>> +                                                  Error **errp)
>> +{
>> +    PCMachineState *pcms = PC_MACHINE(obj);
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +
>> +    pcmc->enforce_valid_iova = value;
>> +}
>> +
>>  static void pc_machine_get_max_ram_below_4g(Object *obj, Visitor *v,
>>                                              const char *name, void *opaque,
>>                                              Error **errp)
>> @@ -1851,6 +1872,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>>      pcmc->has_reserved_memory = true;
>>      pcmc->kvmclock_enabled = true;
>>      pcmc->enforce_aligned_dimm = true;
>> +    pcmc->enforce_valid_iova = true;
>>      /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
>>       * to be used at the moment, 32K should be enough for a while.  */
>>      pcmc->acpi_data_size = 0x20000 + 0x8000;
>> @@ -1913,6 +1935,9 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>>          NULL, NULL);
>>      object_class_property_set_description(oc, PC_MACHINE_MAX_FW_SIZE,
>>          "Maximum combined firmware size");
>> +
>> +    object_class_property_add_bool(oc, PC_MACHINE_ENFORCE_VALID_IOVA,
>> +        pc_machine_get_enforce_valid_iova, pc_machine_set_enforce_valid_iova);
>>  }
>>  
>>  static const TypeInfo pc_machine_info = {
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index 30b8bd6ea92d..21a08e2f6a4c 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -427,11 +427,13 @@ DEFINE_I440FX_MACHINE(v6_1, "pc-i440fx-6.1", NULL,
>>  
>>  static void pc_i440fx_6_0_machine_options(MachineClass *m)
>>  {
>> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>>      pc_i440fx_6_1_machine_options(m);
>>      m->alias = NULL;
>>      m->is_default = false;
>>      compat_props_add(m->compat_props, hw_compat_6_0, hw_compat_6_0_len);
>>      compat_props_add(m->compat_props, pc_compat_6_0, pc_compat_6_0_len);
>> +    pcmc->enforce_valid_iova = false;
>>  }
>>  
>>  DEFINE_I440FX_MACHINE(v6_0, "pc-i440fx-6.0", NULL,
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 46a0f196f413..80bb89a9bae1 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -357,10 +357,12 @@ DEFINE_Q35_MACHINE(v6_1, "pc-q35-6.1", NULL,
>>  
>>  static void pc_q35_6_0_machine_options(MachineClass *m)
>>  {
>> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>>      pc_q35_6_1_machine_options(m);
>>      m->alias = NULL;
>>      compat_props_add(m->compat_props, hw_compat_6_0, hw_compat_6_0_len);
>>      compat_props_add(m->compat_props, pc_compat_6_0, pc_compat_6_0_len);
>> +    pcmc->enforce_valid_iova = false;
>>  }
>>  
>>  DEFINE_Q35_MACHINE(v6_0, "pc-q35-6.0", NULL,
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index b924aef3a218..7337f6f2d014 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -63,6 +63,7 @@ typedef struct PCMachineState {
>>  #define PC_MACHINE_SATA             "sata"
>>  #define PC_MACHINE_PIT              "pit"
>>  #define PC_MACHINE_MAX_FW_SIZE      "max-fw-size"
>> +#define PC_MACHINE_ENFORCE_VALID_IOVA  "enforce-valid-iova"
>>  /**
>>   * PCMachineClass:
>>   *
>> @@ -113,6 +114,7 @@ struct PCMachineClass {
>>      bool has_reserved_memory;
>>      bool enforce_aligned_dimm;
>>      bool broken_reserved_end;
>> +    bool enforce_valid_iova;
>>  
>>      /* generate legacy CPU hotplug AML */
>>      bool legacy_cpu_hotplug;
> 



* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23  9:37     ` Joao Martins
@ 2021-06-23 11:39       ` Igor Mammedov
  2021-06-23 13:04         ` Joao Martins
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2021-06-23 11:39 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On Wed, 23 Jun 2021 10:37:38 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/23/21 8:11 AM, Igor Mammedov wrote:
> > On Tue, 22 Jun 2021 16:49:00 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> It is assumed that the whole GPA space is available to be
> >> DMA addressable, within a given address space limit. Since
> >> v5.4 based that is not true, and VFIO will validate whether
> >> the selected IOVA is indeed valid i.e. not reserved by IOMMU
> >> on behalf of some specific devices or platform-defined.
> >>
> >> AMD systems with an IOMMU are examples of such platforms and
> >> particularly may export only these ranges as allowed:
> >>
> >> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
> >>
> >> We already know of accounting for the 4G hole, albeit if the
> >> guest is big enough we will fail to allocate a >1010G given
> >> the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
> >>
> >> When creating the region above 4G, take into account what
> >> IOVAs are allowed by defining the known allowed ranges
>> and search for the next free IOVA ranges. When finding an
>> invalid IOVA we mark it as reserved and proceed to the
> >> next allowed IOVA region.
> >>
> >> After accounting for the 1Tb hole on AMD hosts, mtree should
> >> look like:
> >>
> >> 0000000100000000-000000fcffffffff (prio 0, i/o):
> >> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> >> 0000010000000000-000001037fffffff (prio 0, i/o):
> >> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff  
> > 
> > why not push whole ram-above-4g above 1Tb mark
> > when RAM is sufficiently large (regardless of used host),
> > instead of creating yet another hole and all complexity it brings along?
> >   
> 
> There's the problem with CMOS which describes memory above 4G, part of the
> reason I cap it to the 1TB minus the reserved range i.e. for AMD, CMOS would
> only describe up to 1T.
> 
> But should we not care about that then it's an option, I suppose.
we probably do not care about CMOS with such large RAM,
as long as QEMU generates a correct E820 (CMOS mattered only with old SeaBIOS,
which used it for generating the memory map)

> We would waste 1Tb of address space because of 12G, and btw the logic here
> is not so different than the 4G hole, in fact could probably share this
> with it.
the main reason I'm looking for an alternative is the complexity
that making a hole brings in. At this point, we can't do anything
about the 4G hole as it's already there, but we can try to avoid that
for high RAM and keep the rules there as simple as they are now.

Also, partitioning/splitting main RAM is one of the things that
gets in the way of converting it to the PC-DIMM model.

Losing 1Tb of address space might be acceptable on a host
that can handle such amounts of RAM





* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-23  9:30   ` Joao Martins
@ 2021-06-23 11:58     ` Igor Mammedov
  2021-06-23 13:15       ` Joao Martins
  2021-06-23 19:27     ` Alex Williamson
  1 sibling, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2021-06-23 11:58 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Auger Eric,
	Alex Williamson, Suravee Suthikulpanit, Paolo Bonzini

On Wed, 23 Jun 2021 10:30:29 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/22/21 10:16 PM, Alex Williamson wrote:
> > On Tue, 22 Jun 2021 16:48:59 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> Hey,
> >>
> >> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
> >> when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> >> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> >> affected by this extra validation. But AMD systems with IOMMU have a hole in
> >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> >> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> >> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
> >>  -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
> >> of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> >> 	failed to setup container for group 258: memory listener initialization failed:
> >> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
> >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> >> as documented on the links down below.
> >>
> >> This series tries to address that by dealing with this AMD-specific 1Tb hole,
> >> similarly to how we deal with the 4G hole today in x86 in general. It is splitted
> >> as following:
> >>
> >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> >>            which gets used too in other parts of pc/acpi besides MR creation. The
> >> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
> >> 	   Intel. We walk the allowed ranges for memory above 4G, and
> >> 	   add a E820_RESERVED type everytime we find a hole (which is at the
> >> 	   1TB boundary).
> >> 	   
> >> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
> >> 	   understand that it doesn't cover the non-x86 host case running TCG.
> >>
> >> 	   Additionally, an alternative to hardcoded ranges as we do today,
> >> 	   VFIO could advertise the platform valid IOVA ranges without necessarily
> >> 	   requiring to have a PCI device added in the vfio container. That would
> >> 	   mean fetching the valid IOVA ranges from VFIO, rather than hardcoded IOVA
> >> 	   ranges as we do today. But sadly, wouldn't work for older hypervisors.  
> > 
> > 
> > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > 0x00000000fee00000 0x00000000feefffff msi
> > 0x000000fd00000000 0x000000ffffffffff reserved
> >   
> Yeap, I am aware.
> 
> The VFIO advertising extension was suggested because we already advertise the above
> info, albeit behind a non-empty vfio container, e.g. in collect_usable_iova_ranges().
> 
> > Ideally we might take that into account on all hosts, but of course
> > then we run into massive compatibility issues when we consider
> > migration.  We run into similar problems when people try to assign
> > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > memory hole overlapping the msi range.
> > 
> > The issue here is similar to trying to find a set of supported CPU
> > flags across hosts, QEMU only has visibility to the host where it runs,
> > an upper level tool needs to be able to pass through information about
> > compatibility to all possible migration targets.  
> 
> I agree with your general sentiment (and idea), but are we sure this is really something
> as dynamic, and as in need of a common denominator, as CPU features? The memory map looks
> to be deeply embedded in the devices (ARM) or machine model (x86) that we pass in, and
> doesn't change very often. pc/q35 is one very good example, because it hasn't changed
> since its inception [a decade?] (and this limitation exists only for multi-socket AMD
> machines with an IOMMU and more than 1Tb). Additionally, there might be architectural
> impositions, like on x86 where CMOS seems to tie in with memory above certain boundaries.
> Unless by migration targets you mean to also cover migrating between Intel and AMD hosts
> (which may need to keep the reserved range in the common denominator nonetheless).
> 
> > Towards that end, we
> > should probably have command line options that either allow to specify
> > specific usable or reserved GPA address ranges.  For example something
> > like:
> > 	--reserved-mem-ranges=host
> > 
> > Or explicitly:
> > 
> > 	--reserved-mem-ranges=13G@1010G,1M@4078M

if we can do without adding any option at all, even better,
since users/management won't need to care about it either.

> >   
> I like the added option, particularly because it lets everyone work around similar issues.
> I remember a series before that had similar issues on ARM (but can't remember now what it
> was).
> 
> >> * patch 2 - 5: cover the remaining parts of the surrounding the mem map, particularly
> >> 	       ACPI SRAT ranges, CMOS, hotplug as well as the PCI 64-bit hole.
> >>
> >> * patch 6: Add a machine property which is disabled for older machine types (<=6.0)
> >> 	   to keep things as is.
> >>
> >> The 'consequence' of this approach is that we may need more than the default
> >> phys-bits e.g. a guest with 1024G, will have ~13G be put after the 1TB
> >> address, consequently needing 41 phys-bits as opposed to the default of 40.
> >> I can't figure a reasonable way to establish the required phys-bits we
> >> need for the memory map in a dynamic way, especially considering that
> >> today there's already a precedent to depend on the user to pick the right value
> >> of phys-bits (regardless of this series).
> >>
> >> Additionally, the reserved region is always added regardless of whether we have
> >> VFIO devices to cover the VFIO device hotplug case.  
> > 
> > Various migration issues as you note later in the series.
> >   
> /me nods
> 
> >> Other options considered:
> >>
> >> a) Consider the reserved range part of RAM, and just marking it as
> >> E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
> >> only allocate 1010G of RAM and the remaining would be marked reserved.
> >> This is not what we do today for the 4G hole, i.e. the RAM
> >> actually allocated is the value specified by the user and thus RAM available
> >> to the guest (IIUC).

it's partially true; we don't care about small MMIO regions that
overlay on top of low memory. But concealing RAM behind a large PCI
hole would be a significant waste (especially when we are speaking
about the PCI hole below 4GB)

I wonder how it works on real hardware,
i.e. does the memory controller remap physical RAM at the 1T hole region, just hide it,
or simply not place any DIMMs there?


> >> b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but done at a
> >> later stage when RAM mrs are already allocated at the invalid GPAs. Albeit that
> >> alone wouldn't fix the way MRs are laid out which is where fundamentally the
> >> problem is.  
> > 
> > Data corruption with b) should the guest ever use memory within this
> > range as a DMA target.  Thanks,
> >   
> Yeap.
> 
> > Alex
> >    
> >> The proposed approach in this series works regardless of the kernel, and
> >> relevant for old and new Qemu.
> >>
> >> Open to alternatives/comments/suggestions that I should pursue instead.
> >>
> >> 	Joao
> >>
> >> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> >> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> >>
> >> Joao Martins (6):
> >>   i386/pc: Account IOVA reserved ranges above 4G boundary
> >>   i386/pc: Round up the hotpluggable memory within valid IOVA ranges
> >>   pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
> >>   i386/pc: Keep PCI 64-bit hole within usable IOVA space
> >>   i386/acpi: Fix SRAT ranges in accordance to usable IOVA
> >>   i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
> >>
> >>  hw/i386/acpi-build.c  |  22 ++++-
> >>  hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
> >>  hw/i386/pc_piix.c     |   2 +
> >>  hw/i386/pc_q35.c      |   2 +
> >>  hw/pci-host/i440fx.c  |   4 +-
> >>  hw/pci-host/q35.c     |   4 +-
> >>  include/hw/i386/pc.h  |  62 ++++++++++++-
> >>  include/hw/i386/x86.h |   4 +
> >>  target/i386/cpu.h     |   3 +
> >>  9 files changed, 288 insertions(+), 21 deletions(-)
> >>  
> >   
> 




* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23  9:51     ` Joao Martins
@ 2021-06-23 12:09       ` Igor Mammedov
  2021-06-23 13:07         ` Joao Martins
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2021-06-23 12:09 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On Wed, 23 Jun 2021 10:51:59 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/23/21 10:03 AM, Igor Mammedov wrote:
> > On Tue, 22 Jun 2021 16:49:00 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> It is assumed that the whole GPA space is available to be
> >> DMA addressable, within a given address space limit. Since
> >> v5.4 based that is not true, and VFIO will validate whether
> >> the selected IOVA is indeed valid i.e. not reserved by IOMMU
> >> on behalf of some specific devices or platform-defined.
> >>
> >> AMD systems with an IOMMU are examples of such platforms and
> >> particularly may export only these ranges as allowed:
> >>
> >> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
> >>
> >> We already know of accounting for the 4G hole, albeit if the
> >> guest is big enough we will fail to allocate a >1010G given
> >> the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
> >>
> >> When creating the region above 4G, take into account what
> >> IOVAs are allowed by defining the known allowed ranges
> >> and search for the next free IOVA ranges. When finding an
> >> invalid IOVA we mark it as reserved and proceed to the
> >> next allowed IOVA region.
> >>
> >> After accounting for the 1Tb hole on AMD hosts, mtree should
> >> look like:
> >>
> >> 0000000100000000-000000fcffffffff (prio 0, i/o):
> >> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> >> 0000010000000000-000001037fffffff (prio 0, i/o):
> >> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff  
> > 
> > You are talking here about GPA which is guest specific thing
> > and then somehow it becomes tied to host. For bystanders it's
> > not clear from above commit message how both are related.
> > I'd add here an explicit explanation how AMD host is related GPAs
> > and clarify where you are talking about guest/host side.
> >   
> OK, makes sense.
> 
> Perhaps using IOVA makes it easier to understand. I said GPA because
> there's an 1:1 mapping between GPA and IOVA (if you're not using vIOMMU).

IOVA may be too broad a term; maybe explain it in terms of GPA and HPA
and why it matters on each side (host/guest)


> > also what about usecases:
> >  * start QEMU with Intel cpu model on AMD host with intel's iommu  
> 
> In principle it would be less likely to occur. But you would still need
> to mark the same range as reserved. The limitation is on DMA occuring
> on those IOVAs (host or guest) coinciding with that range, so you would
> want to inform the guest that at least those should be avoided.
> 
> >  * start QEMU with AMD cpu model and AMD's iommu on Intel host  
> 
> Here you would probably only mark the range, solely for honoring how hardware
> is usually represented. But really, on Intel, nothing stops you from exposing the
> aforementioned range as RAM.
> 
> >  * start QEMU in TCG mode on AMD host (mostly from a qtest point of view)
> >   
> This one is tricky. Because you can hotplug a VFIO device later on,
> I opted for always marking the reserved range. If you don't use VFIO you're good, but
> otherwise you would still need reserved. But I am not sure how qtest is used
> today for testing huge guests.
I do not know if there are VFIO tests in qtest (probably not, since that
could require a host configured for it), but we can add a test
for this memory quirk (assuming phys-bits won't get in the way)




* Re: [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space
  2021-06-22 15:49 ` [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space Joao Martins
@ 2021-06-23 12:30   ` Igor Mammedov
  2021-06-23 13:22     ` Joao Martins
  2021-06-23 16:33     ` Laszlo Ersek
  0 siblings, 2 replies; 38+ messages in thread
From: Igor Mammedov @ 2021-06-23 12:30 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini, lersek

On Tue, 22 Jun 2021 16:49:03 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> pci_memory initialized by q35 and i440fx is set to a range
> of 0 .. UINT64_MAX, and as a consequence when ACPI and pci-host
> pick the hole64_start it does not account for allowed IOVA ranges.
> 
> Rather than blindly returning, round up the hole64_start
> value to the allowable IOVA range, such that it accounts for
> the 1Tb hole *on AMD*. On Intel it returns the input value
> for hole64 start.

wouldn't that work only in the case where guest firmware hadn't
mapped any PCI BARs above 4Gb (possibly in a disallowed region)?

Looking at SeaBIOS, it uses reserved_memory_end as the start
of the PCI64 window. Not sure about OVMF;
CCing Laszlo.

> Suggested-by: David Edmondson <david.edmondson@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c         | 17 +++++++++++++++--
>  hw/pci-host/i440fx.c |  4 +++-
>  hw/pci-host/q35.c    |  4 +++-
>  include/hw/i386/pc.h |  3 ++-
>  4 files changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 2e2ea82a4661..65885cc16037 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1141,7 +1141,7 @@ void pc_memory_init(PCMachineState *pcms,
>   * The 64bit pci hole starts after "above 4G RAM" and
>   * potentially the space reserved for memory hotplug.
>   */
> -uint64_t pc_pci_hole64_start(void)
> +uint64_t pc_pci_hole64_start(uint64_t size)
>  {
>      PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> @@ -1155,12 +1155,25 @@ uint64_t pc_pci_hole64_start(void)
>              hole64_start += memory_region_size(&ms->device_memory->mr);
>          }
>      } else {
> -        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
> +        if (!x86ms->above_1t_mem_size) {
> +            hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
> +        } else {
> +            hole64_start = x86ms->above_1t_maxram_start;
> +        }
>      }

> +    hole64_start = allowed_round_up(hole64_start, size);

I'm not sure, but might it cause problems if there were BARs placed
by firmware in the region below the rounded-up value?
(i.e. the ACPI description, which uses the PCI_HOST_PROP_PCI_HOLE_START property,
won't match whatever firmware programmed, due to the rounding pushing the hole
start up)

>      return ROUND_UP(hole64_start, 1 * GiB);
>  }
>  
> +uint64_t pc_pci_hole64_start_aligned(uint64_t start, uint64_t size)
> +{
> +    if (nb_iova_ranges == DEFAULT_NR_USABLE_IOVAS) {
> +        return start;
> +    }
> +    return allowed_round_up(start, size);
> +}
> +
>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus)
>  {
>      DeviceState *dev = NULL;
> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
> index 28c9bae89944..e8eaebfe1034 100644
> --- a/hw/pci-host/i440fx.c
> +++ b/hw/pci-host/i440fx.c
> @@ -163,8 +163,10 @@ static uint64_t i440fx_pcihost_get_pci_hole64_start_value(Object *obj)
>      pci_bus_get_w64_range(h->bus, &w64);
>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
>      if (!value && s->pci_hole64_fix) {
> -        value = pc_pci_hole64_start();
> +        value = pc_pci_hole64_start(s->pci_hole64_size);
>      }
> +    /* This returns @value when not on AMD */
> +    value = pc_pci_hole64_start_aligned(value, s->pci_hole64_size);
>      return value;
>  }
>  
> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
> index 2eb729dff585..d556eb965ddb 100644
> --- a/hw/pci-host/q35.c
> +++ b/hw/pci-host/q35.c
> @@ -126,8 +126,10 @@ static uint64_t q35_host_get_pci_hole64_start_value(Object *obj)
>      pci_bus_get_w64_range(h->bus, &w64);
>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
>      if (!value && s->pci_hole64_fix) {
> -        value = pc_pci_hole64_start();
> +        value = pc_pci_hole64_start(s->mch.pci_hole64_size);
>      }
> +    /* This returns @value when not on AMD */
> +    value = pc_pci_hole64_start_aligned(value, s->mch.pci_hole64_size);
>      return value;
>  }
>  
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 73b8e2900c72..b924aef3a218 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -217,7 +217,8 @@ void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
>                      MemoryRegion **ram_memory);
> -uint64_t pc_pci_hole64_start(void);
> +uint64_t pc_pci_hole64_start(uint64_t size);
> +uint64_t pc_pci_hole64_start_aligned(uint64_t value, uint64_t size);
>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>  void pc_basic_device_init(struct PCMachineState *pcms,
>                            ISABus *isa_bus, qemu_irq *gsi,




* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23 11:39       ` Igor Mammedov
@ 2021-06-23 13:04         ` Joao Martins
  2021-06-28 14:32           ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Joao Martins @ 2021-06-23 13:04 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini



On 6/23/21 12:39 PM, Igor Mammedov wrote:
> On Wed, 23 Jun 2021 10:37:38 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/23/21 8:11 AM, Igor Mammedov wrote:
>>> On Tue, 22 Jun 2021 16:49:00 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> It is assumed that the whole GPA space is available to be
>>>> DMA addressable, within a given address space limit. Since
>>>> v5.4 based that is not true, and VFIO will validate whether
>>>> the selected IOVA is indeed valid i.e. not reserved by IOMMU
>>>> on behalf of some specific devices or platform-defined.
>>>>
>>>> AMD systems with an IOMMU are examples of such platforms and
>>>> particularly may export only these ranges as allowed:
>>>>
>>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
>>>>
>>>> We already know of accounting for the 4G hole, albeit if the
>>>> guest is big enough we will fail to allocate a >1010G given
>>>> the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
>>>>
>>>> When creating the region above 4G, take into account what
>>>> IOVAs are allowed by defining the known allowed ranges
>>>> and search for the next free IOVA ranges. When finding an
>>>> invalid IOVA we mark it as reserved and proceed to the
>>>> next allowed IOVA region.
>>>>
>>>> After accounting for the 1Tb hole on AMD hosts, mtree should
>>>> look like:
>>>>
>>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
>>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
>>>> 0000010000000000-000001037fffffff (prio 0, i/o):
>>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff  
>>>
>>> why not push whole ram-above-4g above 1Tb mark
>>> when RAM is sufficiently large (regardless of used host),
>>> instead of creating yet another hole and all complexity it brings along?
>>>   
>>
>> There's the problem with CMOS which describes memory above 4G, part of the
>> reason I cap it to the 1TB minus the reserved range i.e. for AMD, CMOS would
>> only describe up to 1T.
>>
>> But should we not care about that then it's an option, I suppose.
> we probably do not care about CMOS with such large RAM,
> as long as QEMU generates a correct E820 (CMOS mattered only with old SeaBIOS,
> which used it for generating the memory map)
> 
OK, good to know.

Any extension on CMOS would probably also be out of spec.

>> We would waste 1Tb of address space because of 12G, and btw the logic here
>> is not so different from the 4G hole's; in fact, it could probably share
>> code with it.
> the main reason I'm looking for an alternative is the complexity
> that making a hole brings in. At this point, we can't do anything
> with the 4G hole as it's already there, but we can try to avoid that
> for high RAM and keep the rules there as simple as they are now.
> 
Right. But for what it's worth, that complexity is spread across two parts:

1) dealing with a sparse RAM model (with more than one hole)

2) offsetting everything else that assumes a linear RAM map.

I don't think that shifting the start of RAM to after the 1Tb boundary would
let us get away without solving item 2 -- which personally is where I find this
a tad more hairy. So it would probably make this patch smaller, but
not change much how spread out the changes get.

> Also partitioning/splitting main RAM is one of the things that
> gets in the way of converting it to the PC-DIMMs model.
> 
Can you expand on that? (a link to a series is enough)

> Losing 1Tb of address space might be acceptable on a host
> that can handle such amounts of RAM
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23 12:09       ` Igor Mammedov
@ 2021-06-23 13:07         ` Joao Martins
  2021-06-28 13:25           ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Joao Martins @ 2021-06-23 13:07 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini



On 6/23/21 1:09 PM, Igor Mammedov wrote:
> On Wed, 23 Jun 2021 10:51:59 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/23/21 10:03 AM, Igor Mammedov wrote:
>>> On Tue, 22 Jun 2021 16:49:00 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> It is assumed that the whole GPA space is available to be
>>>> DMA addressable, within a given address space limit. Since
>>>> v5.4 that is no longer true, and VFIO will validate whether
>>>> the selected IOVA is indeed valid, i.e. not reserved by the
>>>> IOMMU on behalf of specific devices or the platform.
>>>>
>>>> AMD systems with an IOMMU are examples of such platforms and
>>>> particularly may export only these ranges as allowed:
>>>>
>>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
>>>>
>>>> We already account for the 4G hole, but if the guest is big
>>>> enough we will fail to allocate >1010G of RAM, given the
>>>> ~12G hole at the 1Tb boundary reserved for HyperTransport.
>>>>
>>>> When creating the region above 4G, take into account which
>>>> IOVAs are allowed by defining the known allowed ranges
>>>> and searching for the next free IOVA range. When finding an
>>>> invalid IOVA we mark it as reserved and proceed to the
>>>> next allowed IOVA region.
>>>>
>>>> After accounting for the 1Tb hole on AMD hosts, mtree should
>>>> look like:
>>>>
>>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
>>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
>>>> 0000010000000000-000001037fffffff (prio 0, i/o):
>>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff  
>>>
>>> You are talking here about GPAs, which are a guest-specific thing,
>>> and then somehow it becomes tied to the host. For bystanders it's
>>> not clear from the above commit message how both are related.
>>> I'd add an explicit explanation of how the AMD host is related to GPAs
>>> and clarify where you are talking about the guest/host side.
>>>   
>> OK, makes sense.
>>
>> Perhaps using IOVA makes it easier to understand. I said GPA because
>> there's an 1:1 mapping between GPA and IOVA (if you're not using vIOMMU).
> 
> IOVA may be a too broad term, maybe explain it in terms of GPA and HPA
> and why it does matter on each side (host/guest)
> 

I used the term IOVA specifically because it applies to both host IOVA and
guest IOVA (the same rules apply, as this is not special-cased for VMs). So,
regardless of whether we have guest-mode page tables, or just host
IOMMU page tables, this address range should be reserved and not used.

> 
>>> also what about use cases:
>>>  * start QEMU with Intel cpu model on AMD host with intel's iommu  
>>
>> In principle it would be less likely to occur. But you would still need
>> to mark the same range as reserved. The limitation is on DMA occurring
>> on those IOVAs (host or guest) coinciding with that range, so you would
>> want to inform the guest that at least those should be avoided.
>>
>>>  * start QEMU with AMD cpu model and AMD's iommu on Intel host  
>>
>> Here you would probably only mark the range, solely for honoring how hardware
>> is usually represented. But really, on Intel, nothing stops you from exposing the
>> aforementioned range as RAM.
>>
>>>  * start QEMU in TCG mode on AMD host (mostly from the qtest point of view)
>>>   
>> This one is tricky. Because you can hotplug a VFIO device later on,
>> I opted for always marking the reserved range. If you don't use VFIO you're good, but
>> otherwise you would still need the range reserved. But I am not sure how qtest is used
>> today for testing huge guests.
> I do not know if there are VFIO tests in qtest (probably nope, since that
> could require a host configured for that), but we can add a test
> for this memory quirk (assuming phys-bits won't get in the way)
> 

	Joao



* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-23 11:58     ` Igor Mammedov
@ 2021-06-23 13:15       ` Joao Martins
  0 siblings, 0 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-23 13:15 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Auger Eric,
	Alex Williamson, Suravee Suthikulpanit, Paolo Bonzini



On 6/23/21 12:58 PM, Igor Mammedov wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/22/21 10:16 PM, Alex Williamson wrote:
>>> On Tue, 22 Jun 2021 16:48:59 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> Hey,
>>>>
>>>> This series lets Qemu properly spawn i386 guests with >= 1Tb with VFIO, particularly
>>>> when running on AMD systems with an IOMMU.
>>>>
>>>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>>>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>>>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>>>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>>>> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>>>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>>>
>>>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>>>  -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
>>>> of the failure:
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>>>> 	failed to setup container for group 258: memory listener initialization failed:
>>>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>>>
>>>> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
>>>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>>>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>>>> as documented on the links down below.
>>>>
>>>> This series tries to address that by dealing with this AMD-specific 1Tb hole,
>>>> similarly to how we deal with the 4G hole today in x86 in general. It is split
>>>> as follows:
>>>>
>>>> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>>>>            which gets used too in other parts of pc/acpi besides MR creation. The
>>>> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
>>>> 	   Intel. We walk the allowed ranges for memory above 4G, and
>>>> 	   add an E820_RESERVED type every time we find a hole (which is at the
>>>> 	   1TB boundary).
>>>> 	   
>>>> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
>>>> 	   understand that it doesn't cover the non-x86 host case running TCG.
>>>>
>>>> 	   Additionally, as an alternative to the hardcoded ranges we use today,
>>>> 	   VFIO could advertise the platform's valid IOVA ranges without necessarily
>>>> 	   requiring a PCI device to be added to the vfio container. That would
>>>> 	   mean fetching the valid IOVA ranges from VFIO, rather than hardcoding
>>>> 	   them. But sadly, that wouldn't work for older hypervisors.  
>>>
>>>
>>> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
>>> 0x00000000fee00000 0x00000000feefffff msi
>>> 0x000000fd00000000 0x000000ffffffffff reserved
>>>   
>> Yeap, I am aware.
>>
>> The VFIO advertising extension was just because we already advertise the above info,
>> albeit behind a non-empty vfio container; e.g. we already use that in
>> collect_usable_iova_ranges().
>>
>>> Ideally we might take that into account on all hosts, but of course
>>> then we run into massive compatibility issues when we consider
>>> migration.  We run into similar problems when people try to assign
>>> devices to non-x86 TCG hosts, where the arch doesn't have a natural
>>> memory hole overlapping the msi range.
>>>
>>> The issue here is similar to trying to find a set of supported CPU
>>> flags across hosts, QEMU only has visibility to the host where it runs,
>>> an upper level tool needs to be able to pass through information about
>>> compatibility to all possible migration targets.  
>>
>> I agree with your general sentiment (and idea), but are we sure this is really something
>> as dynamic, needing a common denominator, as CPU features? The memory map looks to be
>> deeply embedded in the devices (ARM) or machine model (x86) that we pass in, and doesn't
>> change very often. pc/q35 is one very good example, because it hasn't changed since its
>> inception [a decade?] (and this limitation only affects multi-socket AMD machines with an
>> IOMMU and more than 1Tb of RAM). Additionally, there might be architectural impositions,
>> like on x86 where CMOS seems to tie in with memory above certain boundaries. Unless by
>> migration targets you mean to also cover migration between Intel and AMD hosts (which may
>> need to keep the reserved range in the common denominator nonetheless).
>>
>>> Towards that end, we
>>> should probably have command line options that either allow to specify
>>> specific usable or reserved GPA address ranges.  For example something
>>> like:
>>> 	--reserved-mem-ranges=host
>>>
>>> Or explicitly:
>>>
>>> 	--reserved-mem-ranges=13G@1010G,1M@4078M
> 
> if we can do without adding any option at all it will be even better
> since user/mgmt won't need to care about it as well.
> 
Yeap.

But should folks want the --reserved-mem-ranges approach, perhaps the default ought to be
'host', letting the user customize the memory map if deemed necessary.

>>>   
>> I like the added option, particularly because it lets everyone workaround similar issues.
>> I remember a series before that had similar issues on ARM (but can't remember now what it
>> was).
>>
>>>> * patch 2 - 5: cover the remaining parts of the surrounding the mem map, particularly
>>>> 	       ACPI SRAT ranges, CMOS, hotplug as well as the PCI 64-bit hole.
>>>>
>>>> * patch 6: Add a machine property which is disabled for older machine types (<=6.0)
>>>> 	   to keep things as is.
>>>>
>>>> The 'consequence' of this approach is that we may need more than the default
>>>> phys-bits e.g. a guest with 1024G, will have ~13G be put after the 1TB
>>>> address, consequently needing 41 phys-bits as opposed to the default of 40.
>>>> I can't figure out a reasonable way to establish the required phys-bits we
>>>> need for the memory map in a dynamic way, especially considering that
>>>> today there's already a precedent to depend on the user to pick the right value
>>>> of phys-bits (regardless of this series).
>>>>
>>>> Additionally, the reserved region is always added regardless of whether we have
>>>> VFIO devices to cover the VFIO device hotplug case.  
>>>
>>> Various migration issues as you note later in the series.
>>>   
>> /me nods
>>
>>>> Other options considered:
>>>>
>>>> a) Consider the reserved range part of RAM, and just marking it as
>>>> E820_RESERVED without SPA allocated for it. So a -m 1024G guest would
>>>> only allocate 1010G of RAM and the remaining would be marked reserved.
>>>> This is not what we do today for the 4G hole, i.e. the RAM
>>>> actually allocated matches the value specified by the user and is thus
>>>> fully available to the guest (IIUC).
> 
> it's partially true; we don't care about small MMIO regions that
> overlay on top of low memory. But concealing RAM behind a large PCI
> hole would be a significant waste (especially when we are speaking
> about the PCI hole below 4GB)
> 
> I wonder how this works on real hardware,
> i.e. does the memory controller remap physical RAM at the 1T hole region,
> just hide it, or just not place any DIMMs there?
> 
In real hardware you lose 12G of RAM IIUC (and the range is marked as reserved). But
whether the said range is actually backed by SPA might depend on the CPU family. Suravee
is CCed and can probably keep me honest here :)

>>>> b) Avoid VFIO DMA_MAP ioctl() calls to the reserved range. Similar to a) but done at a
>>>> later stage when RAM MRs are already allocated at the invalid GPAs. Albeit that
>>>> alone wouldn't fix the way MRs are laid out, which is fundamentally where the
>>>> problem is.  
>>>
>>> Data corruption with b) should the guest ever use memory within this
>>> range as a DMA target.  Thanks,
>>>   
>> Yeap.
>>
>>> Alex
>>>    
>>>> The proposed approach in this series works regardless of the kernel, and
>>>> relevant for old and new Qemu.
>>>>
>>>> Open to alternatives/comments/suggestions that I should pursue instead.
>>>>
>>>> 	Joao
>>>>
>>>> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
>>>> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
>>>>
>>>> Joao Martins (6):
>>>>   i386/pc: Account IOVA reserved ranges above 4G boundary
>>>>   i386/pc: Round up the hotpluggable memory within valid IOVA ranges
>>>>   pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary
>>>>   i386/pc: Keep PCI 64-bit hole within usable IOVA space
>>>>   i386/acpi: Fix SRAT ranges in accordance to usable IOVA
>>>>   i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs
>>>>
>>>>  hw/i386/acpi-build.c  |  22 ++++-
>>>>  hw/i386/pc.c          | 206 +++++++++++++++++++++++++++++++++++++++---
>>>>  hw/i386/pc_piix.c     |   2 +
>>>>  hw/i386/pc_q35.c      |   2 +
>>>>  hw/pci-host/i440fx.c  |   4 +-
>>>>  hw/pci-host/q35.c     |   4 +-
>>>>  include/hw/i386/pc.h  |  62 ++++++++++++-
>>>>  include/hw/i386/x86.h |   4 +
>>>>  target/i386/cpu.h     |   3 +
>>>>  9 files changed, 288 insertions(+), 21 deletions(-)
>>>>  
>>>   
>>
> 



* Re: [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space
  2021-06-23 12:30   ` Igor Mammedov
@ 2021-06-23 13:22     ` Joao Martins
  2021-06-28 15:37       ` Igor Mammedov
  2021-06-23 16:33     ` Laszlo Ersek
  1 sibling, 1 reply; 38+ messages in thread
From: Joao Martins @ 2021-06-23 13:22 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini, lersek



On 6/23/21 1:30 PM, Igor Mammedov wrote:
> On Tue, 22 Jun 2021 16:49:03 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> pci_memory initialized by q35 and i440fx is set to a range
>> of 0 .. UINT64_MAX, and as a consequence when ACPI and pci-host
>> pick the hole64_start it does not account for allowed IOVA ranges.
>>
>> Rather than blindly returning, round up the hole64_start
>> value to the allowable IOVA range, such that it accounts for
>> the 1Tb hole *on AMD*. On Intel it returns the input value
>> for hole64 start.
> 
> wouldn't that work only in the case where the guest firmware hadn't
> mapped any PCI BARs above 4Gb (possibly in a disallowed region)?
> 
> Looking at Seabios, it uses reserved_memory_end as the start
> of PCI64 window. Not sure about OVMF,
>  CCing Laszlo.
> 
Hmmm, perhaps only in the case where I don't advertise the reserved
region (i.e. mem size not being big enough to let the guest know via
the e820). But then it begs the question: should we always
advertise the said HT region as reserved, regardless of memory size?

>> Suggested-by: David Edmondson <david.edmondson@oracle.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c         | 17 +++++++++++++++--
>>  hw/pci-host/i440fx.c |  4 +++-
>>  hw/pci-host/q35.c    |  4 +++-
>>  include/hw/i386/pc.h |  3 ++-
>>  4 files changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 2e2ea82a4661..65885cc16037 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -1141,7 +1141,7 @@ void pc_memory_init(PCMachineState *pcms,
>>   * The 64bit pci hole starts after "above 4G RAM" and
>>   * potentially the space reserved for memory hotplug.
>>   */
>> -uint64_t pc_pci_hole64_start(void)
>> +uint64_t pc_pci_hole64_start(uint64_t size)
>>  {
>>      PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> @@ -1155,12 +1155,25 @@ uint64_t pc_pci_hole64_start(void)
>>              hole64_start += memory_region_size(&ms->device_memory->mr);
>>          }
>>      } else {
>> -        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
>> +        if (!x86ms->above_1t_mem_size) {
>> +            hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
>> +        } else {
>> +            hole64_start = x86ms->above_1t_maxram_start;
>> +        }
>>      }
> 
>> +    hole64_start = allowed_round_up(hole64_start, size);
> 
> I'm not sure, but might it cause problems if there were BARs placed
> by firmware in the region below the rounded-up value?
> (i.e. the ACPI description, which uses the PCI_HOST_PROP_PCI_HOLE_START
> property, won't match whatever the firmware programmed, due to rounding
> pushing the hole start up)
> 
But wouldn't the problem then be the firmware ignoring the E820 map, and
thus placing regions where it shouldn't? Unless of course, the range wasn't
advertised, like I mentioned earlier.

>>      return ROUND_UP(hole64_start, 1 * GiB);
>>  }
>>  
>> +uint64_t pc_pci_hole64_start_aligned(uint64_t start, uint64_t size)
>> +{
>> +    if (nb_iova_ranges == DEFAULT_NR_USABLE_IOVAS) {
>> +        return start;
>> +    }
>> +    return allowed_round_up(start, size);
>> +}
>> +
>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus)
>>  {
>>      DeviceState *dev = NULL;
>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>> index 28c9bae89944..e8eaebfe1034 100644
>> --- a/hw/pci-host/i440fx.c
>> +++ b/hw/pci-host/i440fx.c
>> @@ -163,8 +163,10 @@ static uint64_t i440fx_pcihost_get_pci_hole64_start_value(Object *obj)
>>      pci_bus_get_w64_range(h->bus, &w64);
>>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
>>      if (!value && s->pci_hole64_fix) {
>> -        value = pc_pci_hole64_start();
>> +        value = pc_pci_hole64_start(s->pci_hole64_size);
>>      }
>> +    /* This returns @value when not on AMD */
>> +    value = pc_pci_hole64_start_aligned(value, s->pci_hole64_size);
>>      return value;
>>  }
>>  
>> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
>> index 2eb729dff585..d556eb965ddb 100644
>> --- a/hw/pci-host/q35.c
>> +++ b/hw/pci-host/q35.c
>> @@ -126,8 +126,10 @@ static uint64_t q35_host_get_pci_hole64_start_value(Object *obj)
>>      pci_bus_get_w64_range(h->bus, &w64);
>>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
>>      if (!value && s->pci_hole64_fix) {
>> -        value = pc_pci_hole64_start();
>> +        value = pc_pci_hole64_start(s->mch.pci_hole64_size);
>>      }
>> +    /* This returns @value when not on AMD */
>> +    value = pc_pci_hole64_start_aligned(value, s->mch.pci_hole64_size);
>>      return value;
>>  }
>>  
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 73b8e2900c72..b924aef3a218 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -217,7 +217,8 @@ void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>>                      MemoryRegion **ram_memory);
>> -uint64_t pc_pci_hole64_start(void);
>> +uint64_t pc_pci_hole64_start(uint64_t size);
>> +uint64_t pc_pci_hole64_start_aligned(uint64_t value, uint64_t size);
>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>>  void pc_basic_device_init(struct PCMachineState *pcms,
>>                            ISABus *isa_bus, qemu_irq *gsi,
> 



* Re: [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space
  2021-06-23 12:30   ` Igor Mammedov
  2021-06-23 13:22     ` Joao Martins
@ 2021-06-23 16:33     ` Laszlo Ersek
  2021-06-25 17:19       ` Joao Martins
  1 sibling, 1 reply; 38+ messages in thread
From: Laszlo Ersek @ 2021-06-23 16:33 UTC (permalink / raw)
  To: Igor Mammedov, Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Marcel Apfelbaum, Paolo Bonzini,
	Dr. David Alan Gilbert

Adding Marcel and Dave.

Adding Alex (seriously, a vfio- / iommu-related patch set without Alex on CC?)

comments below

On 06/23/21 14:30, Igor Mammedov wrote:
> On Tue, 22 Jun 2021 16:49:03 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> pci_memory initialized by q35 and i440fx is set to a range
>> of 0 .. UINT64_MAX, and as a consequence when ACPI and pci-host
>> pick the hole64_start it does not account for allowed IOVA ranges.
>>
>> Rather than blindly returning, round up the hole64_start
>> value to the allowable IOVA range, such that it accounts for
>> the 1Tb hole *on AMD*. On Intel it returns the input value
>> for hole64 start.
> 
> wouldn't that work only in the case where the guest firmware hadn't
> mapped any PCI BARs above 4Gb (possibly in a disallowed region)?
> 
> Looking at Seabios, it uses reserved_memory_end as the start
> of PCI64 window. Not sure about OVMF,
>  CCing Laszlo.

(thanks for the CC)

Yes, see "OvmfPkg/PlatformPei/MemDetect.c":

  //
  // The "etc/reserved-memory-end" fw_cfg file, when present, contains an
  // absolute, exclusive end address for the memory hotplug area. This area
  // starts right at the end of the memory above 4GB. The 64-bit PCI host
  // aperture must be placed above it.
  //
  Status = QemuFwCfgFindFile ("etc/reserved-memory-end", &FwCfgItem,
             &FwCfgSize);


> 
>> Suggested-by: David Edmondson <david.edmondson@oracle.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c         | 17 +++++++++++++++--
>>  hw/pci-host/i440fx.c |  4 +++-
>>  hw/pci-host/q35.c    |  4 +++-
>>  include/hw/i386/pc.h |  3 ++-
>>  4 files changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 2e2ea82a4661..65885cc16037 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -1141,7 +1141,7 @@ void pc_memory_init(PCMachineState *pcms,
>>   * The 64bit pci hole starts after "above 4G RAM" and
>>   * potentially the space reserved for memory hotplug.
>>   */
>> -uint64_t pc_pci_hole64_start(void)
>> +uint64_t pc_pci_hole64_start(uint64_t size)
>>  {
>>      PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> @@ -1155,12 +1155,25 @@ uint64_t pc_pci_hole64_start(void)
>>              hole64_start += memory_region_size(&ms->device_memory->mr);
>>          }
>>      } else {
>> -        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
>> +        if (!x86ms->above_1t_mem_size) {
>> +            hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
>> +        } else {
>> +            hole64_start = x86ms->above_1t_maxram_start;
>> +        }
>>      }
> 
>> +    hole64_start = allowed_round_up(hole64_start, size);
> 
> I'm not sure, but might it cause problems if there were BARs placed
> by firmware in the region below the rounded-up value?
> (i.e. the ACPI description, which uses the PCI_HOST_PROP_PCI_HOLE_START
> property, won't match whatever the firmware programmed, due to rounding
> pushing the hole start up)

This code is a hornets' nest. Marcel and MST too surely remember the troubles between it and OVMF from the past years.

The firmware assigns BARs, then distinctly later, the firmware temporarily reenables MMIO decoding for all PCI devices, and fetches the ACPI tables blob via fw_cfg. At that point, the ACPI content is re-generated by QEMU, which (due to all PCI devices having MMIO decoding enabled) involves computing "bounding boxes" over the programmed BARs (32-bit and 64-bit). Said "bounding boxes" are exposed in the _CRS. If the bounding boxes exposed in the _CRS are shrunk after the BAR assignment by the firmware, the guest OS will almost certainly reject at least some pre-existent BAR assignments by the firmware, by virtue of them falling outside of the _CRS windows.

If you want to keep the 64-bit PCI MMIO *aperture* restricted to a particular area, then you need to steer the guest firmware to allocate the BARs from that area in the first place.

In OVMF, we have an experimental fw_cfg knob for controlling the *size* of the aperture. Search the above-referenced OVMF source file for the string "opt/ovmf/X-PciMmio64Mb".

Regarding the *base* of the aperture, its calculation in OVMF is already non-trivial; please refer to the GetFirstNonAddress() function in the same file. I'm not sure what you're trying to do, but you might need another piece of guest enlightenment (fw_cfg?) for steering OVMF to place the base at a higher address (if that's what you want to do). If you can recognize the AMD IOMMU limitation in the guest without explicit paravirt stuff, then migrating part of the logic to OVMF could be easier (not that I look forward to reviewing related firmware patches, TBH).

The topic overlaps with the vCPU physical address width problem, for which QEMU guests cannot trust the otherwise appropriate CPUID leaf. Refer to the following upstream TianoCore BZ for details: <https://bugzilla.tianocore.org/show_bug.cgi?id=2796>.

Hmmmmm. How about this workaround. Might be worth a shot. The 64-bit PCI MMIO aperture in OVMF is "naturally aligned":

  //
  // The 64-bit PCI host aperture should also be "naturally" aligned. The
  // alignment is determined by rounding the size of the aperture down to the
  // next smaller or equal power of two. That is, align the aperture by the
  // largest BAR size that can fit into it.
  //
  Pci64Base = ALIGN_VALUE (Pci64Base, GetPowerOfTwo64 (Pci64Size));

This means that, if you specify a huge enough count of megabytes for the *size*, for example 1TB in total:

  -fw_cfg opt/ovmf/X-PciMmio64Mb,string=$((1024*1024))

then you'll also end up with a high-enough *base* (1 TB). From the blurb, I reckon that should work, without any patches at all for QEMU?

(OVMF currently allows a 16 TB aperture size via the above knob:

  //
  // See if the user specified the number of megabytes for the 64-bit PCI host
  // aperture. Accept an aperture size up to 16TB.
  //
)

Thanks
Laszlo


> 
>>      return ROUND_UP(hole64_start, 1 * GiB);
>>  }
>>  
>> +uint64_t pc_pci_hole64_start_aligned(uint64_t start, uint64_t size)
>> +{
>> +    if (nb_iova_ranges == DEFAULT_NR_USABLE_IOVAS) {
>> +        return start;
>> +    }
>> +    return allowed_round_up(start, size);
>> +}
>> +
>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus)
>>  {
>>      DeviceState *dev = NULL;
>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>> index 28c9bae89944..e8eaebfe1034 100644
>> --- a/hw/pci-host/i440fx.c
>> +++ b/hw/pci-host/i440fx.c
>> @@ -163,8 +163,10 @@ static uint64_t i440fx_pcihost_get_pci_hole64_start_value(Object *obj)
>>      pci_bus_get_w64_range(h->bus, &w64);
>>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
>>      if (!value && s->pci_hole64_fix) {
>> -        value = pc_pci_hole64_start();
>> +        value = pc_pci_hole64_start(s->pci_hole64_size);
>>      }
>> +    /* This returns @value when not on AMD */
>> +    value = pc_pci_hole64_start_aligned(value, s->pci_hole64_size);
>>      return value;
>>  }
>>  
>> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
>> index 2eb729dff585..d556eb965ddb 100644
>> --- a/hw/pci-host/q35.c
>> +++ b/hw/pci-host/q35.c
>> @@ -126,8 +126,10 @@ static uint64_t q35_host_get_pci_hole64_start_value(Object *obj)
>>      pci_bus_get_w64_range(h->bus, &w64);
>>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
>>      if (!value && s->pci_hole64_fix) {
>> -        value = pc_pci_hole64_start();
>> +        value = pc_pci_hole64_start(s->mch.pci_hole64_size);
>>      }
>> +    /* This returns @value when not on AMD */
>> +    value = pc_pci_hole64_start_aligned(value, s->mch.pci_hole64_size);
>>      return value;
>>  }
>>  
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 73b8e2900c72..b924aef3a218 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -217,7 +217,8 @@ void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>>                      MemoryRegion **ram_memory);
>> -uint64_t pc_pci_hole64_start(void);
>> +uint64_t pc_pci_hole64_start(uint64_t size);
>> +uint64_t pc_pci_hole64_start_aligned(uint64_t value, uint64_t size);
>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>>  void pc_basic_device_init(struct PCMachineState *pcms,
>>                            ISABus *isa_bus, qemu_irq *gsi,
> 




* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-23  7:40   ` David Edmondson
@ 2021-06-23 19:13     ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2021-06-23 19:13 UTC (permalink / raw)
  To: David Edmondson
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, Auger Eric, Suravee Suthikulpanit,
	Paolo Bonzini, Igor Mammedov, Joao Martins

On Wed, 23 Jun 2021 08:40:56 +0100
David Edmondson <dme@dme.org> wrote:

> On Tuesday, 2021-06-22 at 15:16:29 -06, Alex Williamson wrote:
> 
> >> 	   Additionally, as an alternative to the hardcoded ranges we use today,
> >> 	   VFIO could advertise the platform's valid IOVA ranges without necessarily
> >> 	   requiring a PCI device to be added to the vfio container. That would
> >> 	   mean fetching the valid IOVA ranges from VFIO, rather than hardcoding
> >> 	   them. But sadly, that wouldn't work for older hypervisors.  
> >
> >
> > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > 0x00000000fee00000 0x00000000feefffff msi
> > 0x000000fd00000000 0x000000ffffffffff reserved
> >
> > Ideally we might take that into account on all hosts, but of course
> > then we run into massive compatibility issues when we consider
> > migration.  We run into similar problems when people try to assign
> > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > memory hole overlapping the msi range.
> >
> > The issue here is similar to trying to find a set of supported CPU
> > flags across hosts, QEMU only has visibility to the host where it runs,
> > an upper level tool needs to be able to pass through information about
> > compatibility to all possible migration targets.  Towards that end, we
> > should probably have command line options that allow specifying either
> > usable or reserved GPA address ranges.  For example, something
> > like:
> > 	--reserved-mem-ranges=host
> >
> > Or explicitly:
> >
> > 	--reserved-mem-ranges=13G@1010G,1M@4078M  
> 
> Would this not naturally be a property of a machine model?

That's an option too, maybe a better one given that we already know how
to manage different machine types.  We probably could not adapt the QEMU
v6.1 versions of q35 and 440fx to include this reserved range given the
physical address width downsides Joao mentions.  The above was meant to
be more of an explicit method to match the VM address space to the host
versus the implicit mechanism of the machine type.  A new machine type
at least makes the decision a user policy rather than a magic artifact
of the processor vendor where the VM was first created, but it also
doesn't have the flexibility of the extra command line options.  Thanks,

Alex




* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-23  9:30   ` Joao Martins
  2021-06-23 11:58     ` Igor Mammedov
@ 2021-06-23 19:27     ` Alex Williamson
  2021-06-24  9:22       ` Dr. David Alan Gilbert
  2021-06-25 16:54       ` Joao Martins
  1 sibling, 2 replies; 38+ messages in thread
From: Alex Williamson @ 2021-06-23 19:27 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Auger Eric,
	Suravee Suthikulpanit, Paolo Bonzini, Igor Mammedov

On Wed, 23 Jun 2021 10:30:29 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/22/21 10:16 PM, Alex Williamson wrote:
> > On Tue, 22 Jun 2021 16:48:59 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> Hey,
> >>
> >> This series lets QEMU properly spawn i386 guests with >= 1Tb of memory with VFIO, particularly
> >> when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is valid and
> >> will return -EINVAL in such cases. On x86, Intel hosts aren't particularly
> >> affected by this extra validation. But AMD systems with IOMMU have a hole in
> >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> >> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> >> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fail this check and hence return
> >>  -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
> >> of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> >> 	failed to setup container for group 258: memory listener initialization failed:
> >> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
> >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> >> as documented on the links down below.
> >>
> >> This series tries to address that by dealing with this AMD-specific 1Tb hole,
> >> similarly to how we deal with the 4G hole today on x86 in general. It is split
> >> as follows:
> >>
> >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> >>            which gets used too in other parts of pc/acpi besides MR creation. The
> >> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
> >> 	   Intel. We walk the allowed ranges for memory above 4G, and
> >> 	   add an E820_RESERVED type every time we find a hole (which is at the
> >> 	   1TB boundary).
> >> 	   
> >> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
> >> 	   understand that it doesn't cover the non-x86 host case running TCG.
> >>
> >> 	   Additionally, as an alternative to the hardcoded ranges we use today,
> >> 	   VFIO could advertise the platform's valid IOVA ranges without necessarily
> >> 	   requiring a PCI device to be added to the vfio container. That would
> >> 	   mean fetching the valid IOVA ranges from VFIO rather than hardcoding
> >> 	   them. But sadly, that wouldn't work for older hypervisors.  
> > 
> > 
> > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > 0x00000000fee00000 0x00000000feefffff msi
> > 0x000000fd00000000 0x000000ffffffffff reserved
> >   
> Yeap, I am aware.
> 
> The VFIO advertising extension was just because we already advertise the above info,
> although behind a non-empty vfio container; we seem to use that, for example, in
> collect_usable_iova_ranges().

VFIO can't guess what groups you'll use to mark reserved ranges in an
empty container.  Each group might have unique ranges.  A container
enforcing ranges unrelated to the groups/devices in use doesn't make
sense.
 
> > Ideally we might take that into account on all hosts, but of course
> > then we run into massive compatibility issues when we consider
> > migration.  We run into similar problems when people try to assign
> > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > memory hole overlapping the msi range.
> > 
> > The issue here is similar to trying to find a set of supported CPU
> > flags across hosts, QEMU only has visibility to the host where it runs,
> > an upper level tool needs to be able to pass through information about
> > compatibility to all possible migration targets.  
> 
> I agree with your generic sentiment (and idea) but are we sure this is really as
> dynamic, needing a common denominator, as CPU features? The memory map looks to be deeply embedded
> in the devices (ARM) or machine model (x86) that we pass in and doesn't change very often.
> pc/q35 is one very good example, because this hasn't changed since its inception [a
> decade?] (and this limitation is there only for any multi-socket AMD machine with IOMMU
> with more than 1Tb). Additionally, there might be architectural impositions like on x86
> e.g. CMOS seems to tie in with memory above certain boundaries. Unless by migration
> targets you mean to also cover migrating between Intel and AMD hosts (which may need to
> keep the reserved range in the common denominator nonetheless)

I like the flexibility that being able to specify reserved ranges would
provide, but I agree that the machine memory map is usually deeply
embedded into the arch code and would probably be difficult to
generalize.  Cross vendor migration should be a consideration and only
an inter-system management policy could specify the importance of that.

Perhaps as David mentioned, this is really a machine type issue, where
the address width downsides you've noted might be sufficient reason
to introduce a new machine type that includes this memory hole.  That
would likely be the more traditional solution to this issue.  Thanks,

Alex




* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-23 19:27     ` Alex Williamson
@ 2021-06-24  9:22       ` Dr. David Alan Gilbert
  2021-06-25 16:54       ` Joao Martins
  1 sibling, 0 replies; 38+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-24  9:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Auger Eric,
	Suravee Suthikulpanit, Igor Mammedov, Paolo Bonzini,
	Joao Martins

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> > On 6/22/21 10:16 PM, Alex Williamson wrote:
> > > On Tue, 22 Jun 2021 16:48:59 +0100
> > > Joao Martins <joao.m.martins@oracle.com> wrote:
> > >   
> > >> Hey,
> > >>
> > >> This series lets QEMU properly spawn i386 guests with >= 1Tb of memory with VFIO, particularly
> > >> when running on AMD systems with an IOMMU.
> > >>
> > >> Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is valid and
> > >> will return -EINVAL in such cases. On x86, Intel hosts aren't particularly
> > >> affected by this extra validation. But AMD systems with IOMMU have a hole in
> > >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> > >> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> > >> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> > >>
> > >> VFIO DMA_MAP calls in this IOVA address range fail this check and hence return
> > >>  -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
> > >> of the failure:
> > >>
> > >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> > >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> > >> 	failed to setup container for group 258: memory listener initialization failed:
> > >> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> > >>
> > >> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
> > >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> > >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> > >> as documented on the links down below.
> > >>
> > >> This series tries to address that by dealing with this AMD-specific 1Tb hole,
> > >> similarly to how we deal with the 4G hole today on x86 in general. It is split
> > >> as follows:
> > >>
> > >> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
> > >>            which gets used too in other parts of pc/acpi besides MR creation. The
> > >> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
> > >> 	   Intel. We walk the allowed ranges for memory above 4G, and
> > >> 	   add an E820_RESERVED type every time we find a hole (which is at the
> > >> 	   1TB boundary).
> > >> 	   
> > >> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
> > >> 	   understand that it doesn't cover the non-x86 host case running TCG.
> > >>
> > >> 	   Additionally, as an alternative to the hardcoded ranges we use today,
> > >> 	   VFIO could advertise the platform's valid IOVA ranges without necessarily
> > >> 	   requiring a PCI device to be added to the vfio container. That would
> > >> 	   mean fetching the valid IOVA ranges from VFIO rather than hardcoding
> > >> 	   them. But sadly, that wouldn't work for older hypervisors.  
> > > 
> > > 
> > > $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
> > > 0x00000000fee00000 0x00000000feefffff msi
> > > 0x000000fd00000000 0x000000ffffffffff reserved
> > >   
> > Yeap, I am aware.
> > 
> > The VFIO advertising extension was just because we already advertise the above info,
> > although behind a non-empty vfio container; we seem to use that, for example, in
> > collect_usable_iova_ranges().
> 
> VFIO can't guess what groups you'll use to mark reserved ranges in an
> empty container.  Each group might have unique ranges.  A container
> enforcing ranges unrelated to the groups/devices in use doesn't make
> sense.
>  
> > > Ideally we might take that into account on all hosts, but of course
> > > then we run into massive compatibility issues when we consider
> > > migration.  We run into similar problems when people try to assign
> > > devices to non-x86 TCG hosts, where the arch doesn't have a natural
> > > memory hole overlapping the msi range.
> > > 
> > > The issue here is similar to trying to find a set of supported CPU
> > > flags across hosts, QEMU only has visibility to the host where it runs,
> > > an upper level tool needs to be able to pass through information about
> > > compatibility to all possible migration targets.  
> > 
> > I agree with your generic sentiment (and idea) but are we sure this is really as
> > dynamic, needing a common denominator, as CPU features? The memory map looks to be deeply embedded
> > in the devices (ARM) or machine model (x86) that we pass in and doesn't change very often.
> > pc/q35 is one very good example, because this hasn't changed since its inception [a
> > decade?] (and this limitation is there only for any multi-socket AMD machine with IOMMU
> > with more than 1Tb). Additionally, there might be architectural impositions like on x86
> > e.g. CMOS seems to tie in with memory above certain boundaries. Unless by migration
> > targets you mean to also cover migrating between Intel and AMD hosts (which may need to
> > keep the reserved range in the common denominator nonetheless)
> 
> I like the flexibility that being able to specify reserved ranges would
> provide, but I agree that the machine memory map is usually deeply
> embedded into the arch code and would probably be difficult to
> generalize.  Cross vendor migration should be a consideration and only
> an inter-system management policy could specify the importance of that.

On x86 at least, the cross vendor part doesn't seem to be an issue; I
wouldn't expect an Intel->AMD migration to work reliably anyway.

> Perhaps as David mentioned, this is really a machine type issue, where
> the address width downsides you've noted might be sufficient reason
> to introduce a new machine type that includes this memory hole.  That
> would likely be the more traditional solution to this issue.  Thanks,

To me this seems a combination of machine type+CPU model; perhaps what
we're looking at here is having a list of holes, which can be
contributed to by any of:
  a) The machine type
  b) The CPU model
  c) an extra command line option like the one you list

Dave

> Alex
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23  9:03   ` Igor Mammedov
  2021-06-23  9:51     ` Joao Martins
@ 2021-06-24  9:32     ` Dr. David Alan Gilbert
  2021-06-28 14:42       ` Igor Mammedov
  1 sibling, 1 reply; 38+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-24  9:32 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini, Joao Martins

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Tue, 22 Jun 2021 16:49:00 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> > It is assumed that the whole GPA space is available to be
> > DMA addressable, within a given address space limit. Since
> > v5.4 that is no longer true, and VFIO will validate whether
> > the selected IOVA is indeed valid, i.e. not reserved by the IOMMU
> > on behalf of specific devices or the platform.
> > 
> > AMD systems with an IOMMU are examples of such platforms and
> > particularly may export only these ranges as allowed:
> > 
> > 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> > 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> > 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
> > 
> > We already account for the 4G hole, but if the
> > guest is big enough we will fail to allocate >1010G given
> > the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
> > 
> > When creating the region above 4G, take into account what
> > IOVAs are allowed by defining the known allowed ranges
> > and search for the next free IOVA ranges. When finding an
> > invalid IOVA we mark it as reserved and proceed to the
> > next allowed IOVA region.
> > 
> > After accounting for the 1Tb hole on AMD hosts, mtree should
> > look like:
> > 
> > 0000000100000000-000000fcffffffff (prio 0, i/o):
> > 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> > 0000010000000000-000001037fffffff (prio 0, i/o):
> > 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff
> 
> You are talking here about GPAs, which are a guest-specific thing,
> and then somehow they become tied to the host. For bystanders it's
> not clear from the above commit message how the two are related.
> I'd add an explicit explanation of how an AMD host relates to GPAs,
> and clarify where you are talking about the guest vs. the host side.
> 
> also what about usecases:
>  * start QEMU with Intel cpu model on AMD host with intel's iommu

Does that work?

>  * start QEMU with AMD cpu model and AMD's iommu on Intel host

Isn't the case we commonly do:
   * start QEMU with AMD cpu model on AMD host with Intel iommu?

Dave

>  * start QEMU in TCG mode on AMD host (mostly from a qtest point of view)
> 
> > Co-developed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > ---
> >  hw/i386/pc.c         | 103 +++++++++++++++++++++++++++++++++++++++----
> >  include/hw/i386/pc.h |  57 ++++++++++++++++++++++++
> >  target/i386/cpu.h    |   3 ++
> >  3 files changed, 154 insertions(+), 9 deletions(-)
> > 
> > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > index c6d8d0d84d91..52a5473ba846 100644
> > --- a/hw/i386/pc.c
> > +++ b/hw/i386/pc.c
> > @@ -91,6 +91,7 @@
> >  #include "qapi/qmp/qerror.h"
> >  #include "e820_memory_layout.h"
> >  #include "fw_cfg.h"
> > +#include "target/i386/cpu.h"
> >  #include "trace.h"
> >  #include CONFIG_DEVICES
> >  
> > @@ -860,6 +861,93 @@ void xen_load_linux(PCMachineState *pcms)
> >      x86ms->fw_cfg = fw_cfg;
> >  }
> >  
> > +struct GPARange usable_iova_ranges[] = {
> > +    { .start = 4 * GiB, .end = UINT64_MAX, .name = "ram-above-4g" },
> > +
> > +/*
> > + * AMD systems with an IOMMU have an additional hole close to the
> > + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> > + * on kernel version, VFIO may or may not let you DMA map those ranges.
> > + * Starting v5.4 we validate it, and can't create guests on AMD machines
> > + * with certain memory sizes. The range is:
> > + *
> > + * FD_0000_0000h - FF_FFFF_FFFFh
> > + *
> > + * The ranges represent the following:
> > + *
> > + * Base Address   Top Address  Use
> > + *
> > + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> > + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> > + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> > + * FD_F910_0000h FD_F91F_FFFFh System Management
> > + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> > + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> > + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> > + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> > + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> > + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> > + *
> > + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> > + * Table 3: Special Address Controls (GPA) for more information.
> > + */
> > +#define DEFAULT_NR_USABLE_IOVAS 1
> > +#define AMD_MAX_PHYSADDR_BELOW_1TB  0xfcffffffff
> > +    { .start = 1 * TiB, .end = UINT64_MAX, .name = "ram-above-1t" },
> > +};
> > +
> > + uint32_t nb_iova_ranges = DEFAULT_NR_USABLE_IOVAS;
> > +
> > +static void init_usable_iova_ranges(void)
> > +{
> > +    uint32_t eax, vendor[3];
> > +
> > +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> > +    if (IS_AMD_VENDOR(vendor)) {
> > +        usable_iova_ranges[0].end = AMD_MAX_PHYSADDR_BELOW_1TB;
> > +        nb_iova_ranges++;
> > +    }
> > +}
> > +
> > +static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
> > +                                hwaddr base, hwaddr size, hwaddr offset)
> > +{
> > +    hwaddr start, region_size, resv_start, resv_end;
> > +    struct GPARange *range;
> > +    MemoryRegion *region;
> > +    uint32_t index;
> > +
> > +    for_each_usable_range(index, base, size, range, start, region_size) {
> > +        region = g_malloc(sizeof(*region));
> > +        memory_region_init_alias(region, NULL, range->name, ram,
> > +                                 offset, region_size);
> > +        memory_region_add_subregion(system_memory, start, region);
> > +        e820_add_entry(start, region_size, E820_RAM);
> > +
> > +        assert(size >= region_size);
> > +        if (size == region_size) {
> > +            return;
> > +        }
> > +
> > +        /*
> > +         * There's memory left to create a region for, so there should be
> > +         * another valid IOVA range left.  Creating the reserved region
> > +         * would also be pointless.
> > +         */
> > +        if (index + 1 == nb_iova_ranges) {
> > +            return;
> > +        }
> > +
> > +        resv_start = start + region_size;
> > +        resv_end = usable_iova_ranges[index + 1].start;
> > +
> > +        /* Create a reserved region in the IOVA hole. */
> > +        e820_add_entry(resv_start, resv_end - resv_start, E820_RESERVED);
> > +
> > +        offset += region_size;
> > +    }
> > +}
> > +
> >  void pc_memory_init(PCMachineState *pcms,
> >                      MemoryRegion *system_memory,
> >                      MemoryRegion *rom_memory,
> > @@ -867,7 +955,7 @@ void pc_memory_init(PCMachineState *pcms,
> >  {
> >      int linux_boot, i;
> >      MemoryRegion *option_rom_mr;
> > -    MemoryRegion *ram_below_4g, *ram_above_4g;
> > +    MemoryRegion *ram_below_4g;
> >      FWCfgState *fw_cfg;
> >      MachineState *machine = MACHINE(pcms);
> >      MachineClass *mc = MACHINE_GET_CLASS(machine);
> > @@ -877,6 +965,8 @@ void pc_memory_init(PCMachineState *pcms,
> >      assert(machine->ram_size == x86ms->below_4g_mem_size +
> >                                  x86ms->above_4g_mem_size);
> >  
> > +    init_usable_iova_ranges();
> > +
> >      linux_boot = (machine->kernel_filename != NULL);
> >  
> >      /*
> > @@ -888,16 +978,11 @@ void pc_memory_init(PCMachineState *pcms,
> >      memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
> >                               0, x86ms->below_4g_mem_size);
> >      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> > +
> >      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> >      if (x86ms->above_4g_mem_size > 0) {
> > -        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> > -        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> > -                                 machine->ram,
> > -                                 x86ms->below_4g_mem_size,
> > -                                 x86ms->above_4g_mem_size);
> > -        memory_region_add_subregion(system_memory, 0x100000000ULL,
> > -                                    ram_above_4g);
> > -        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
> > +        add_memory_region(system_memory, machine->ram, 4 * GiB,
> > +                          x86ms->above_4g_mem_size, x86ms->below_4g_mem_size);
> >      }
> >  
> >      if (!pcmc->has_reserved_memory &&
> > diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> > index 1522a3359a93..73b8e2900c72 100644
> > --- a/include/hw/i386/pc.h
> > +++ b/include/hw/i386/pc.h
> > @@ -151,6 +151,63 @@ void pc_guest_info_init(PCMachineState *pcms);
> >  #define PCI_HOST_BELOW_4G_MEM_SIZE     "below-4g-mem-size"
> >  #define PCI_HOST_ABOVE_4G_MEM_SIZE     "above-4g-mem-size"
> >  
> > +struct GPARange {
> > +    uint64_t start;
> > +    uint64_t end;
> > +#define range_len(r) ((r).end - (r).start + 1)
> > +
> > +    const char *name;
> > +};
> > +
> > +extern uint32_t nb_iova_ranges;
> > +extern struct GPARange usable_iova_ranges[];
> > +
> > +static inline void next_iova_range_index(uint32_t i, hwaddr base, hwaddr size,
> > +                                         hwaddr *start, hwaddr *region_size,
> > +                                         uint32_t *index, struct GPARange **r)
> > +{
> > +    struct GPARange *range;
> > +
> > +    while (i < nb_iova_ranges) {
> > +        range = &usable_iova_ranges[i];
> > +        if (range->end >= base) {
> > +            break;
> > +        }
> > +        i++;
> > +    }
> > +
> > +    *index = i;
> > +    /* index is out of bounds or no region left to process */
> > +    if (i >= nb_iova_ranges || !size) {
> > +        return;
> > +    }
> > +
> > +    *start = MAX(range->start, base);
> > +    *region_size = MIN(range->end - *start + 1, size);
> > +    *r = range;
> > +}
> > +
> > +/*
> > + * for_each_usable_range() - iterates over usable IOVA regions
> > + *
> > + * @i:      range index within usable_iova_ranges
> > + * @base:   requested address we want to use
> > + * @size:   total size of the region with @base
> > + * @r:      range indexed by @i within usable_iova_ranges
> > + * @s:      calculated usable iova address base
> > + * @i_size: calculated usable iova region size starting @s
> > + *
> > + * Use this iterator to walk through usable GPA ranges. Platforms
> > + * such as AMD with IOMMU capable hardware restrict certain address
> > + * ranges for Hyper Transport. This iterator helper lets user avoid
> > + * using those said reserved ranges.
> > + */
> > +#define for_each_usable_range(i, base, size, r, s, i_size) \
> > +    for (s = 0, i_size = 0, r = NULL, \
> > +         next_iova_range_index(0, base, size, &s, &i_size, &i, &r); \
> > +         i < nb_iova_ranges && size != 0; \
> > +         size -= i_size, \
> > +         next_iova_range_index(i + 1, base, size, &s, &i_size, &i, &r))
> >  
> >  void pc_pci_as_mapping_init(Object *owner, MemoryRegion *system_memory,
> >                              MemoryRegion *pci_address_space);
> > diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> > index 1e11071d817b..f8f15a4523c6 100644
> > --- a/target/i386/cpu.h
> > +++ b/target/i386/cpu.h
> > @@ -869,6 +869,9 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
> >  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> >                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> >                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> > +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> > +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> > +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
> >  
> >  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
> >  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU
  2021-06-23 19:27     ` Alex Williamson
  2021-06-24  9:22       ` Dr. David Alan Gilbert
@ 2021-06-25 16:54       ` Joao Martins
  1 sibling, 0 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-25 16:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Auger Eric,
	Suravee Suthikulpanit, Paolo Bonzini, Igor Mammedov



On 6/23/21 8:27 PM, Alex Williamson wrote:
> On Wed, 23 Jun 2021 10:30:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/22/21 10:16 PM, Alex Williamson wrote:
>>> On Tue, 22 Jun 2021 16:48:59 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> Hey,
>>>>
>>>> This series lets QEMU properly spawn i386 guests with >= 1Tb of memory with VFIO, particularly
>>>> when running on AMD systems with an IOMMU.
>>>>
>>>> Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is valid and
>>>> will return -EINVAL in such cases. On x86, Intel hosts aren't particularly
>>>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>>>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>>>> here  FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>>>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>>>
>>>> VFIO DMA_MAP calls in this IOVA address range fail this check and hence return
>>>>  -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
>>>> of the failure:
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>>>> 	failed to setup container for group 258: memory listener initialization failed:
>>>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>>>
>>>> Prior to v5.4, we could map using these IOVAs *but* that's still not the right thing
>>>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>>>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>>>> as documented on the links down below.
>>>>
>>>> This series tries to address that by dealing with this AMD-specific 1Tb hole,
>>>> similarly to how we deal with the 4G hole today on x86 in general. It is split
>>>> as follows:
>>>>
>>>> * patch 1: initialize the valid IOVA ranges above 4G, adding an iterator
>>>>            which gets used too in other parts of pc/acpi besides MR creation. The
>>>> 	   allowed IOVA *only* changes if it's an AMD host, so no change for
>>>> 	   Intel. We walk the allowed ranges for memory above 4G, and
>>>> 	   add an E820_RESERVED type every time we find a hole (which is at the
>>>> 	   1TB boundary).
>>>> 	   
>>>> 	   NOTE: For purposes of this RFC, I rely on cpuid in hw/i386/pc.c but I
>>>> 	   understand that it doesn't cover the non-x86 host case running TCG.
>>>>
>>>> 	   Additionally, as an alternative to the hardcoded ranges we use today,
>>>> 	   VFIO could advertise the platform's valid IOVA ranges without necessarily
>>>> 	   requiring a PCI device to be added to the vfio container. That would
>>>> 	   mean fetching the valid IOVA ranges from VFIO rather than hardcoding
>>>> 	   them. But sadly, that wouldn't work for older hypervisors.  
>>>
>>>
>>> $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u
>>> 0x00000000fee00000 0x00000000feefffff msi
>>> 0x000000fd00000000 0x000000ffffffffff reserved
>>>   
>> Yeap, I am aware.
>>
>> The VFIO advertising extension was just because we already advertise the above info,
>> although behind a non-empty vfio container; we seem to use that, for example, in
>> collect_usable_iova_ranges().
> 
> VFIO can't guess what groups you'll use to mark reserved ranges in an
> empty container.  Each group might have unique ranges.  A container
> enforcing ranges unrelated to the groups/devices in use doesn't make
> sense.
>  
Hmm OK, I see.

The suggestion was made because the AMD IOMMU seems to mark these ranges
reserved, for both the MSI range and HyperTransport, regardless of
device/group specifics. See amd_iommu_get_resv_regions().

So I thought that something else for advertising platform-wide reserved ranges would be
appropriate rather than replicating the same said info on the various groups
reserved_regions sysfs entry.

>>> Ideally we might take that into account on all hosts, but of course
>>> then we run into massive compatibility issues when we consider
>>> migration.  We run into similar problems when people try to assign
>>> devices to non-x86 TCG hosts, where the arch doesn't have a natural
>>> memory hole overlapping the msi range.
>>>
>>> The issue here is similar to trying to find a set of supported CPU
>>> flags across hosts, QEMU only has visibility to the host where it runs,
>>> an upper level tool needs to be able to pass through information about
>>> compatibility to all possible migration targets.  
>>
>> I agree with your generic sentiment (and idea) but are we sure this is really something as
>> dynamic/needing-denominator like CPU Features? The memory map looks to be deeply embedded
>> in the devices (ARM) or machine model (x86) that we pass in and doesn't change very often.
>> pc/q35 is one very good example, because this hasn't changed since its inception [a
>> decade?] (and this limitation is there only for any multi-socket AMD machine with IOMMU
>> with more than 1Tb). Additionally, there might be architectural impositions like on x86
>> e.g. CMOS seems to tie in with memory above certain boundaries. Unless by a migration
>> targets, you mean to also cover you migrate between Intel and AMD hosts (which may need to
>> keep the reserved range nonetheless in the common denominator)
> 
> I like the flexibility that being able to specify reserved ranges would
> provide, 

/me nods

> but I agree that the machine memory map is usually deeply
> embedded into the arch code and would probably be difficult to
> generalize.  Cross vendor migration should be a consideration and only
> an inter-system management policy could specify the importance of that.
> 
> Perhaps as David mentioned, this is really a machine type issue, where
> the address width downsides you've noted might be sufficient reason
> to introduce a new machine type that includes this memory hole.  That
> would likely be the more traditional solution to this issue.

Maybe there could be a generic facility that stores/manages the reserved ranges, and then
the different machines can provide a default set depending on the running target
heuristics (AMD x86, generic x86, ARM, etc). To some extent it means tracking reserved
ranges, rather than tracking usable IOVA ranges as I do here (and moving that somewhere
that's not x86 specific).


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space
  2021-06-23 16:33     ` Laszlo Ersek
@ 2021-06-25 17:19       ` Joao Martins
  0 siblings, 0 replies; 38+ messages in thread
From: Joao Martins @ 2021-06-25 17:19 UTC (permalink / raw)
  To: Laszlo Ersek, Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Marcel Apfelbaum, Paolo Bonzini,
	Dr. David Alan Gilbert



On 6/23/21 5:33 PM, Laszlo Ersek wrote:
> Adding Marcel and Dave.
> 
> Adding Alex (seriously, a vfio- / iommu-related patch set without Alex on CC?)
> 
> comments below
> 
> On 06/23/21 14:30, Igor Mammedov wrote:
>> On Tue, 22 Jun 2021 16:49:03 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> pci_memory initialized by q35 and i440fx is set to a range
>>> of 0 .. UINT64_MAX, and as a consequence when ACPI and pci-host
>>> pick the hole64_start it does not account for allowed IOVA ranges.
>>>
>>> Rather than blindly returning, round up the hole64_start
>>> value to the allowable IOVA range, such that it accounts for
>>> the 1Tb hole *on AMD*. On Intel it returns the input value
>>> for hole64 start.
>>
>> wouldn't that work only in case where guest firmware hadn't
>> mapped any PCI bars above 4Gb (possibly in not allowed region).
>>
>> Looking at Seabios, it uses reserved_memory_end as the start
>> of PCI64 window. Not sure about OVMF,
>>  CCing Laszlo.
> 
> (thanks for the CC)
> 
> Yes, see "OvmfPkg/PlatformPei/MemDetect.c":
> 
>   //
>   // The "etc/reserved-memory-end" fw_cfg file, when present, contains an
>   // absolute, exclusive end address for the memory hotplug area. This area
>   // starts right at the end of the memory above 4GB. The 64-bit PCI host
>   // aperture must be placed above it.
>   //
>   Status = QemuFwCfgFindFile ("etc/reserved-memory-end", &FwCfgItem,
>              &FwCfgSize);
> 
> 
'etc/reserved-memory-end' is *I think* advertised correctly after this series (taking the
HT range into consideration).

The exception is when memory is small enough that we don't need to deal with the HT hole.
That could well leave firmware unaware of the hole it should avoid (with a big
enough aperture). Maybe I should advertise it nonetheless, considering the hole sits below
the default phys-bits/maxphysaddr of 1Tb.

>>
>>> Suggested-by: David Edmondson <david.edmondson@oracle.com>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>  hw/i386/pc.c         | 17 +++++++++++++++--
>>>  hw/pci-host/i440fx.c |  4 +++-
>>>  hw/pci-host/q35.c    |  4 +++-
>>>  include/hw/i386/pc.h |  3 ++-
>>>  4 files changed, 23 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>> index 2e2ea82a4661..65885cc16037 100644
>>> --- a/hw/i386/pc.c
>>> +++ b/hw/i386/pc.c
>>> @@ -1141,7 +1141,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>   * The 64bit pci hole starts after "above 4G RAM" and
>>>   * potentially the space reserved for memory hotplug.
>>>   */
>>> -uint64_t pc_pci_hole64_start(void)
>>> +uint64_t pc_pci_hole64_start(uint64_t size)
>>>  {
>>>      PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>> @@ -1155,12 +1155,25 @@ uint64_t pc_pci_hole64_start(void)
>>>              hole64_start += memory_region_size(&ms->device_memory->mr);
>>>          }
>>>      } else {
>>> -        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
>>> +        if (!x86ms->above_1t_mem_size) {
>>> +            hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
>>> +        } else {
>>> +            hole64_start = x86ms->above_1t_maxram_start;
>>> +        }
>>>      }
>>
>>> +    hole64_start = allowed_round_up(hole64_start, size);
>>
>> I'm not sure but, might it cause problems if there were BARs placed
>> by firmware in region below rounded up value?
>> (i.e. ACPI description which uses PCI_HOST_PROP_PCI_HOLE_START property
>> won't match whatever firmware programmed due to rounding pushing hole
>> start up)
> 
> This code is a hornets' nest. Marcel and MST too surely remember the troubles between it and OVMF from the past years.
> 
Thanks for the explanation below, appreciate it.

> The firmware assigns BARs, then distinctly later, the firmware temporarily reenables MMIO decoding for all PCI devices, and fetches the ACPI tables blob via fw_cfg. At that point, the ACPI content is re-generated by QEMU, which (due to all PCI devices having MMIO decoding enabled) involves computing "bounding boxes" over the programmed BARs (32-bit and 64-bit). Said "bounding boxes" are exposed in the _CRS. If the bounding boxes exposed in the _CRS are shrunk after the BAR assignment by the firmware, the guest OS will almost certainly reject at least some pre-existent BAR assignments by the firmware, by virtue of them falling outside of the _CRS windows.
> 
I think I am doing that here? That is, limiting the said bounding boxes.

At least with this series, OVMF doesn't seem to be doing the wrong thing when looking at
the configured BARs.

> If you want to keep the 64-bit PCI MMIO *aperture* restricted to a particular area, then you need to steer the guest firmware to allocate the BARs from that area in the first place.
> 
Hmmm, why aren't E820_RESERVED entries enough? Those are exposed here.

You already look at E820 RAM entries, but so far we assume that reserved entries always
come at the end, or via that fw_cfg entry.

> In OVMF, we have an experimental fw_cfg knob for controlling the *size* of the aperture. Search the above-referenced OVMF source file for the string "opt/ovmf/X-PciMmio64Mb".
> 
> Regarding the *base* of the aperture, its calculation in OVMF is already non-trivial; please refer to the GetFirstNonAddress() function in the same file. I'm not sure what you're trying to do, but you might need another piece of guest enlightenment (fw_cfg?) for steering OVMF to place the base at a higher address (if that's what you want to do). If you can recognize the AMD IOMMU limitation in the guest without explicit paravirt stuff, then migrating part of the logic to OVMF could be easier (not that I look forward to reviewing related firmware patches, TBH).
> 
> The topic overlaps with the vCPU physical address width problem, for which QEMU guests cannot trust the otherwise appropriate CPUID leaf. Refer to the following upstream TianoCore BZ for details: <https://bugzilla.tianocore.org/show_bug.cgi?id=2796>.
> 
> Hmmmmm. How about this workaround. Might be worth a shot. The 64-bit PCI MMIO aperture in OVMF is "naturally aligned":
> 
>   //
>   // The 64-bit PCI host aperture should also be "naturally" aligned. The
>   // alignment is determined by rounding the size of the aperture down to the
>   // next smaller or equal power of two. That is, align the aperture by the
>   // largest BAR size that can fit into it.
>   //
>   Pci64Base = ALIGN_VALUE (Pci64Base, GetPowerOfTwo64 (Pci64Size));
> 
> This means that, if you specify a huge enough count of megabytes for the *size*, for example 1TB in total:
> 
>   -fw_cfg opt/ovmf/X-PciMmio64Mb,string=$((1024*1024))
> 
> then you'll also end up with a high-enough *base* (1 TB). From the blurb, I reckon that should work, without any patches at all for QEMU?
> 
Yeah, albeit like you said it feels like a workaround. Firmware should either properly
parse what's reserved, or the machine model should provide a representative memory map
that lets the guest/firmware know it should avoid the said range (either implicitly or
explicitly).

> (OVMF currently allows a 16 TB aperture size via the above knob:
> 
>   //
>   // See if the user specified the number of megabytes for the 64-bit PCI host
>   // aperture. Accept an aperture size up to 16TB.
>   //
> )
> 
> Thanks
> Laszlo
> 
> 
Thanks!
>>
>>>      return ROUND_UP(hole64_start, 1 * GiB);
>>>  }
>>>  
>>> +uint64_t pc_pci_hole64_start_aligned(uint64_t start, uint64_t size)
>>> +{
>>> +    if (nb_iova_ranges == DEFAULT_NR_USABLE_IOVAS) {
>>> +        return start;
>>> +    }
>>> +    return allowed_round_up(start, size);
>>> +}
>>> +
>>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus)
>>>  {
>>>      DeviceState *dev = NULL;
>>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>>> index 28c9bae89944..e8eaebfe1034 100644
>>> --- a/hw/pci-host/i440fx.c
>>> +++ b/hw/pci-host/i440fx.c
>>> @@ -163,8 +163,10 @@ static uint64_t i440fx_pcihost_get_pci_hole64_start_value(Object *obj)
>>>      pci_bus_get_w64_range(h->bus, &w64);
>>>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
>>>      if (!value && s->pci_hole64_fix) {
>>> -        value = pc_pci_hole64_start();
>>> +        value = pc_pci_hole64_start(s->pci_hole64_size);
>>>      }
>>> +    /* This returns @value when not on AMD */
>>> +    value = pc_pci_hole64_start_aligned(value, s->pci_hole64_size);
>>>      return value;
>>>  }
>>>  
>>> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
>>> index 2eb729dff585..d556eb965ddb 100644
>>> --- a/hw/pci-host/q35.c
>>> +++ b/hw/pci-host/q35.c
>>> @@ -126,8 +126,10 @@ static uint64_t q35_host_get_pci_hole64_start_value(Object *obj)
>>>      pci_bus_get_w64_range(h->bus, &w64);
>>>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
>>>      if (!value && s->pci_hole64_fix) {
>>> -        value = pc_pci_hole64_start();
>>> +        value = pc_pci_hole64_start(s->mch.pci_hole64_size);
>>>      }
>>> +    /* This returns @value when not on AMD */
>>> +    value = pc_pci_hole64_start_aligned(value, s->mch.pci_hole64_size);
>>>      return value;
>>>  }
>>>  
>>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>>> index 73b8e2900c72..b924aef3a218 100644
>>> --- a/include/hw/i386/pc.h
>>> +++ b/include/hw/i386/pc.h
>>> @@ -217,7 +217,8 @@ void pc_memory_init(PCMachineState *pcms,
>>>                      MemoryRegion *system_memory,
>>>                      MemoryRegion *rom_memory,
>>>                      MemoryRegion **ram_memory);
>>> -uint64_t pc_pci_hole64_start(void);
>>> +uint64_t pc_pci_hole64_start(uint64_t size);
>>> +uint64_t pc_pci_hole64_start_aligned(uint64_t value, uint64_t size);
>>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>>>  void pc_basic_device_init(struct PCMachineState *pcms,
>>>                            ISABus *isa_bus, qemu_irq *gsi,
>>
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23 13:07         ` Joao Martins
@ 2021-06-28 13:25           ` Igor Mammedov
  2021-06-28 13:43             ` Joao Martins
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2021-06-28 13:25 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On Wed, 23 Jun 2021 14:07:29 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/23/21 1:09 PM, Igor Mammedov wrote:
> > On Wed, 23 Jun 2021 10:51:59 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 6/23/21 10:03 AM, Igor Mammedov wrote:  
> >>> On Tue, 22 Jun 2021 16:49:00 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>     
> >>>> It is assumed that the whole GPA space is available to be
> >>>> DMA addressable, within a given address space limit. Since
> >>>> v5.4 based that is not true, and VFIO will validate whether
> >>>> the selected IOVA is indeed valid i.e. not reserved by IOMMU
> >>>> on behalf of some specific devices or platform-defined.
> >>>>
> >>>> AMD systems with an IOMMU are examples of such platforms and
> >>>> particularly may export only these ranges as allowed:
> >>>>
> >>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
> >>>>
> >>>> We already know of accounting for the 4G hole, albeit if the
> >>>> guest is big enough we will fail to allocate a >1010G given
> >>>> the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
> >>>>
> >>>> When creating the region above 4G, take into account what
> >>>> IOVAs are allowed by defining the known allowed ranges
> >>>> and search for the next free IOVA ranges. When finding a
> >>>> invalid IOVA we mark them as reserved and proceed to the
> >>>> next allowed IOVA region.
> >>>>
> >>>> After accounting for the 1Tb hole on AMD hosts, mtree should
> >>>> look like:
> >>>>
> >>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
> >>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> >>>> 0000010000000000-000001037fffffff (prio 0, i/o):
> >>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff    
> >>>
> >>> You are talking here about GPA which is guest specific thing
> >>> and then somehow it becomes tied to host. For bystanders it's
> >>> not clear from above commit message how both are related.
> >>> I'd add here an explicit explanation how AMD host is related GPAs
> >>> and clarify where you are talking about guest/host side.
> >>>     
> >> OK, makes sense.
> >>
> >> Perhaps using IOVA makes it easier to understand. I said GPA because
> >> there's an 1:1 mapping between GPA and IOVA (if you're not using vIOMMU).  
> > 
> > IOVA may be a too broad term, maybe explain it in terms of GPA and HPA
> > and why it does matter on each side (host/guest)
> >   
> 
I used the term IOVA specifically because it is applicable to host IOVA or
guest IOVA (the same rules apply, as this is not special-cased for VMs). So,
regardless of whether we have guest-mode page tables or just host
IOMMU page tables, this address range should be reserved and not used.

IOVA doesn't make it any clearer; on the contrary, it's more confusing.

And does the host's HPA matter at all? (If the host's firmware isn't broken,
it should never use nor advertise the 1Tb hole.) So we are probably talking
here about GPA only.
   
> >>> also what about usecases:
> >>>  * start QEMU with Intel cpu model on AMD host with intel's iommu    
> >>
> >> In principle it would be less likely to occur. But you would still need
> >> to mark the same range as reserved. The limitation is on DMA occuring
> >> on those IOVAs (host or guest) coinciding with that range, so you would
> >> want to inform the guest that at least those should be avoided.
> >>  
> >>>  * start QEMU with AMD cpu model and AMD's iommu on Intel host    
> >>
> >> Here you would probably only mark the range, solely for honoring how hardware
> >> is usually represented. But really, on Intel, nothing stops you from exposing the
> >> aforementioned range as RAM.
> >>  
> >>>  * start QEMU in TCG mode on AMD host (mostly form qtest point ot view)
> >>>     
> >> This one is tricky. Because you can hotplug a VFIO device later on,
> >> I opted for always marking the reserved range. If you don't use VFIO you're good, but
> >> otherwise you would still need reserved. But I am not sure how qtest is used
> >> today for testing huge guests.  
> > I do not know if there are VFIO tests in qtest (probably nope, since that
> > could require a host configured for that), but we can add a test
> > for his memory quirk (assuming phys-bits won't get in the way)
> >   
> 
> 	Joao
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-28 13:25           ` Igor Mammedov
@ 2021-06-28 13:43             ` Joao Martins
  2021-06-28 15:21               ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Joao Martins @ 2021-06-28 13:43 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini



On 6/28/21 2:25 PM, Igor Mammedov wrote:
> On Wed, 23 Jun 2021 14:07:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/23/21 1:09 PM, Igor Mammedov wrote:
>>> On Wed, 23 Jun 2021 10:51:59 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> On 6/23/21 10:03 AM, Igor Mammedov wrote:  
>>>>> On Tue, 22 Jun 2021 16:49:00 +0100
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>     
>>>>>> It is assumed that the whole GPA space is available to be
>>>>>> DMA addressable, within a given address space limit. Since
>>>>>> v5.4 based that is not true, and VFIO will validate whether
>>>>>> the selected IOVA is indeed valid i.e. not reserved by IOMMU
>>>>>> on behalf of some specific devices or platform-defined.
>>>>>>
>>>>>> AMD systems with an IOMMU are examples of such platforms and
>>>>>> particularly may export only these ranges as allowed:
>>>>>>
>>>>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>>>>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>>>>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
>>>>>>
>>>>>> We already know of accounting for the 4G hole, albeit if the
>>>>>> guest is big enough we will fail to allocate a >1010G given
>>>>>> the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
>>>>>>
>>>>>> When creating the region above 4G, take into account what
>>>>>> IOVAs are allowed by defining the known allowed ranges
>>>>>> and search for the next free IOVA ranges. When finding a
>>>>>> invalid IOVA we mark them as reserved and proceed to the
>>>>>> next allowed IOVA region.
>>>>>>
>>>>>> After accounting for the 1Tb hole on AMD hosts, mtree should
>>>>>> look like:
>>>>>>
>>>>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
>>>>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
>>>>>> 0000010000000000-000001037fffffff (prio 0, i/o):
>>>>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff    
>>>>>
>>>>> You are talking here about GPA which is guest specific thing
>>>>> and then somehow it becomes tied to host. For bystanders it's
>>>>> not clear from above commit message how both are related.
>>>>> I'd add here an explicit explanation how AMD host is related GPAs
>>>>> and clarify where you are talking about guest/host side.
>>>>>     
>>>> OK, makes sense.
>>>>
>>>> Perhaps using IOVA makes it easier to understand. I said GPA because
>>>> there's an 1:1 mapping between GPA and IOVA (if you're not using vIOMMU).  
>>>
>>> IOVA may be a too broad term, maybe explain it in terms of GPA and HPA
>>> and why it does matter on each side (host/guest)
>>>   
>>
>> I used the term IOVA specifically because it is applicable to host IOVA or
>> guest IOVA (the same rules apply, as this is not special-cased for VMs). So,
>> regardless of whether we have guest-mode page tables or just host
>> IOMMU page tables, this address range should be reserved and not used.
> 
> IOVA doesn't make it any clearer; on the contrary, it's more confusing.
> 
> And does the host's HPA matter at all? (If the host's firmware isn't broken,
> it should never use nor advertise the 1Tb hole.)
> So we are probably talking here about GPA only.
> 
For the case in point for the series, yes it's only GPA that we care about.

Perhaps I misunderstood your earlier comment where you said how HPAs were
affected, so I was trying to encompass the problem statement in a Guest/Host
agnostic manner by using IOVA given this is all related to IOMMU reserved ranges.
I'll stick to GPA to avoid any confusion -- as that's what matters for this series.

>>>>> also what about usecases:
>>>>>  * start QEMU with Intel cpu model on AMD host with intel's iommu    
>>>>
>>>> In principle it would be less likely to occur. But you would still need
>>>> to mark the same range as reserved. The limitation is on DMA occuring
>>>> on those IOVAs (host or guest) coinciding with that range, so you would
>>>> want to inform the guest that at least those should be avoided.
>>>>  
>>>>>  * start QEMU with AMD cpu model and AMD's iommu on Intel host    
>>>>
>>>> Here you would probably only mark the range, solely for honoring how hardware
>>>> is usually represented. But really, on Intel, nothing stops you from exposing the
>>>> aforementioned range as RAM.
>>>>  
>>>>>  * start QEMU in TCG mode on AMD host (mostly form qtest point ot view)
>>>>>     
>>>> This one is tricky. Because you can hotplug a VFIO device later on,
>>>> I opted for always marking the reserved range. If you don't use VFIO you're good, but
>>>> otherwise you would still need reserved. But I am not sure how qtest is used
>>>> today for testing huge guests.  
>>> I do not know if there are VFIO tests in qtest (probably nope, since that
>>> could require a host configured for that), but we can add a test
>>> for his memory quirk (assuming phys-bits won't get in the way)
>>>   
>>
>> 	Joao
>>
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-23 13:04         ` Joao Martins
@ 2021-06-28 14:32           ` Igor Mammedov
  2021-08-06 10:41             ` Joao Martins
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2021-06-28 14:32 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On Wed, 23 Jun 2021 14:04:19 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/23/21 12:39 PM, Igor Mammedov wrote:
> > On Wed, 23 Jun 2021 10:37:38 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 6/23/21 8:11 AM, Igor Mammedov wrote:  
> >>> On Tue, 22 Jun 2021 16:49:00 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>     
> >>>> It is assumed that the whole GPA space is available to be
> >>>> DMA addressable, within a given address space limit. Since
> >>>> v5.4 based that is not true, and VFIO will validate whether
> >>>> the selected IOVA is indeed valid i.e. not reserved by IOMMU
> >>>> on behalf of some specific devices or platform-defined.
> >>>>
> >>>> AMD systems with an IOMMU are examples of such platforms and
> >>>> particularly may export only these ranges as allowed:
> >>>>
> >>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
> >>>>
> >>>> We already know of accounting for the 4G hole, albeit if the
> >>>> guest is big enough we will fail to allocate a >1010G given
> >>>> the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
> >>>>
> >>>> When creating the region above 4G, take into account what
> >>>> IOVAs are allowed by defining the known allowed ranges
> >>>> and search for the next free IOVA ranges. When finding a
> >>>> invalid IOVA we mark them as reserved and proceed to the
> >>>> next allowed IOVA region.
> >>>>
> >>>> After accounting for the 1Tb hole on AMD hosts, mtree should
> >>>> look like:
> >>>>
> >>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
> >>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> >>>> 0000010000000000-000001037fffffff (prio 0, i/o):
> >>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff    
> >>>
> >>> why not push whole ram-above-4g above 1Tb mark
> >>> when RAM is sufficiently large (regardless of used host),
> >>> instead of creating yet another hole and all complexity it brings along?
> >>>     
> >>
> >> There's the problem with CMOS which describes memory above 4G, part of the
> >> reason I cap it to the 1TB minus the reserved range i.e. for AMD, CMOS would
> >> only describe up to 1T.
> >>
> >> But should we not care about that then it's an option, I suppose.  
> > we probably do not care about CMOS with so large RAM,
> > as long as QEMU generates correct E820 (cmos mattered only with old Seabios
> > which used it for generating memory map)
> >   
> OK, good to know.
> 
> Any extension on CMOS would probably also be out of spec.
RAM size in CMOS is approximate at best, as it doesn't account for all holes;
anyway, it's irrelevant if we aren't changing the RAM size.

> >> We would waste 1Tb of address space because of 12G, and btw the logic here
> >> is not so different than the 4G hole, in fact could probably share this
> >> with it.  
> > the main reason I'm looking for an alternative is the complexity
> > that making a hole brings in. At this point, we can't do anything
> > about the 4G hole as it's already there, but we can try to avoid that
> > for high RAM and keep the rules there as simple as they are now.
> >   
> Right. But for what it's worth, that complexity is spread in two parts:
> 
> 1) dealing with a sparse RAM model (with more than one hole)
> 
> 2) offsetting everything else that assumes a linear RAM map.
> 
> I don't think that even if we shift the start of RAM to after the 1TB boundary
> we would get away without solving item 2 -- which personally is where I find this
> a tad bit more hairy. So it would probably make this patch's complexity smaller, but
> not vary much in how spread out the changes get.

You are already shifting the hotplug area behind 1Tb in [2/6],
so to be consistent just do the same for the 4Gb+ alias as well.
That will automatically solve the issues in patch 4/6, and all
that at the cost of several lines in one place vs this 200+ LOC series.
Both approaches are a kludge, but shifting everything over the 1Tb mark
is the simplest one.

If there were/is actual demand for sparse/non-linear RAM layouts,
I'd switch to pc-dimm based RAM (i.e. generalize it to handle
RAM below 4G and let users specify their own memory map,
see below for more).

> > Also, partitioning/splitting main RAM is one of the things that
> > gets in the way of converting it to the PC-DIMM model.
> >   
> Can you expand on that? (a link to a series is enough)
There is no series so far, only a rough idea.

Current initial RAM, with its rather arbitrary splitting rules,
doesn't map very well onto the pc-dimm device model.
Unfortunately I don't see a way to convert initial RAM to the
pc-dimm model without breaking CLI/ABI/migration.

As a way forward, we can restrict legacy layout to old
machine versions, and switch to pc-dimm based initial RAM
for new machine versions.

Then users will be able to specify the GPA where each pc-dimm shall
be mapped. For reserving GPA ranges we can change the current GPA
allocator. Alternatively, a slightly more flexible approach could be
a dummy memory device that consumes a reserved range so that no one
can assign a pc-dimm there; this way a user can define arbitrary
(subject to alignment restrictions we put on it) sparse memory
layouts without modifying QEMU for yet another hole.
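A user-defined sparse layout under that model might look something like this. This is a hypothetical command line: pc-dimm's addr property exists today, but pc-dimm-based *initial* RAM and the dummy reserved-range device do not:

```sh
# hypothetical: explicit GPA placement of RAM as pc-dimms,
# leaving the HT hole at 1010G..1Tb unpopulated
qemu-system-x86_64 -machine q35 -m 1G,maxmem=3T,slots=8 \
  -object memory-backend-ram,id=m0,size=1008G \
  -device pc-dimm,memdev=m0,addr=0x100000000 \
  -object memory-backend-ram,id=m1,size=512G \
  -device pc-dimm,memdev=m1,addr=0x10000000000
```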


> > Losing 1Tb of address space might be acceptable on a host
> > that can handle such amounts of RAM
> >   
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-24  9:32     ` Dr. David Alan Gilbert
@ 2021-06-28 14:42       ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2021-06-28 14:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini, Joao Martins

On Thu, 24 Jun 2021 10:32:05 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Igor Mammedov (imammedo@redhat.com) wrote:
> > On Tue, 22 Jun 2021 16:49:00 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> > > It is assumed that the whole GPA space is available to be
> > > DMA addressable, within a given address space limit. Since
> > > v5.4 based that is not true, and VFIO will validate whether
> > > the selected IOVA is indeed valid i.e. not reserved by IOMMU
> > > on behalf of some specific devices or platform-defined.
> > > 
> > > AMD systems with an IOMMU are examples of such platforms and
> > > particularly may export only these ranges as allowed:
> > > 
> > > 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> > > 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> > > 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
> > > 
> > > We already know of accounting for the 4G hole, albeit if the
> > > guest is big enough we will fail to allocate a >1010G given
> > > the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
> > > 
> > > When creating the region above 4G, take into account what
> > > IOVAs are allowed by defining the known allowed ranges
> > > and search for the next free IOVA ranges. When finding a
> > > invalid IOVA we mark them as reserved and proceed to the
> > > next allowed IOVA region.
> > > 
> > > After accounting for the 1Tb hole on AMD hosts, mtree should
> > > look like:
> > > 
> > > 0000000100000000-000000fcffffffff (prio 0, i/o):
> > > 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> > > 0000010000000000-000001037fffffff (prio 0, i/o):
> > > 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff  
> > 
> > You are talking here about GPA which is guest specific thing
> > and then somehow it becomes tied to host. For bystanders it's
> > not clear from above commit message how both are related.
> > I'd add here an explicit explanation how AMD host is related GPAs
> > and clarify where you are talking about guest/host side.
> > 
> > also what about usecases:
> >  * start QEMU with Intel cpu model on AMD host with intel's iommu  

I vaguely recall KVM fixing up vendor_id to host's CPU type
without redefining the rest of CPU model but I can't find it anymore.
Though I'm not sure how guest kernel would behave in this mixed case,
guest might get confused.

> Does that work?
> 
> >  * start QEMU with AMD cpu model and AMD's iommu on Intel host  
> 
> Isn't the case we commonly do:
>    * start QEMU with AMD cpu model on AMD host with Intel iommu?
> Dave
> 
> > >  * start QEMU in TCG mode on AMD host (mostly from qtest point of view)
> >   
> > > Co-developed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > > Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> > > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > > ---
> > >  hw/i386/pc.c         | 103 +++++++++++++++++++++++++++++++++++++++----
> > >  include/hw/i386/pc.h |  57 ++++++++++++++++++++++++
> > >  target/i386/cpu.h    |   3 ++
> > >  3 files changed, 154 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > > index c6d8d0d84d91..52a5473ba846 100644
> > > --- a/hw/i386/pc.c
> > > +++ b/hw/i386/pc.c
> > > @@ -91,6 +91,7 @@
> > >  #include "qapi/qmp/qerror.h"
> > >  #include "e820_memory_layout.h"
> > >  #include "fw_cfg.h"
> > > +#include "target/i386/cpu.h"
> > >  #include "trace.h"
> > >  #include CONFIG_DEVICES
> > >  
> > > @@ -860,6 +861,93 @@ void xen_load_linux(PCMachineState *pcms)
> > >      x86ms->fw_cfg = fw_cfg;
> > >  }
> > >  
> > > +struct GPARange usable_iova_ranges[] = {
> > > +    { .start = 4 * GiB, .end = UINT64_MAX, .name = "ram-above-4g" },
> > > +
> > > +/*
> > > + * AMD systems with an IOMMU have an additional hole close to the
> > > + * 1Tb boundary, made of special GPAs that cannot be DMA mapped.
> > > + * Depending on kernel version, VFIO may or may not let you DMA map
> > > + * those ranges. Starting with v5.4 it validates them, so we can't
> > > + * create guests on AMD machines with certain memory sizes. The range is:
> > > + *
> > > + * FD_0000_0000h - FF_FFFF_FFFFh
> > > + *
> > > + * The ranges represent the following:
> > > + *
> > > + * Base Address   Top Address  Use
> > > + *
> > > + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> > > + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> > > + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> > > + * FD_F910_0000h FD_F91F_FFFFh System Management
> > > + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> > > + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> > > + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> > > + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> > > + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> > > + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> > > + *
> > > + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> > > + * Table 3: Special Address Controls (GPA) for more information.
> > > + */
> > > +#define DEFAULT_NR_USABLE_IOVAS 1
> > > +#define AMD_MAX_PHYSADDR_BELOW_1TB  0xfcffffffff
> > > +    { .start = 1 * TiB, .end = UINT64_MAX, .name = "ram-above-1t" },
> > > +};
> > > +
> > > +uint32_t nb_iova_ranges = DEFAULT_NR_USABLE_IOVAS;
> > > +
> > > +static void init_usable_iova_ranges(void)
> > > +{
> > > +    uint32_t eax, vendor[3];
> > > +
> > > +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> > > +    if (IS_AMD_VENDOR(vendor)) {
> > > +        usable_iova_ranges[0].end = AMD_MAX_PHYSADDR_BELOW_1TB;
> > > +        nb_iova_ranges++;
> > > +    }
> > > +}
> > > +
> > > +static void add_memory_region(MemoryRegion *system_memory, MemoryRegion *ram,
> > > +                                hwaddr base, hwaddr size, hwaddr offset)
> > > +{
> > > +    hwaddr start, region_size, resv_start, resv_end;
> > > +    struct GPARange *range;
> > > +    MemoryRegion *region;
> > > +    uint32_t index;
> > > +
> > > +    for_each_usable_range(index, base, size, range, start, region_size) {
> > > +        region = g_malloc(sizeof(*region));
> > > +        memory_region_init_alias(region, NULL, range->name, ram,
> > > +                                 offset, region_size);
> > > +        memory_region_add_subregion(system_memory, start, region);
> > > +        e820_add_entry(start, region_size, E820_RAM);
> > > +
> > > +        assert(size >= region_size);
> > > +        if (size == region_size) {
> > > +            return;
> > > +        }
> > > +
> > > +        /*
> > > +         * There's memory left to create a region for, so there should be
> > > +         * another valid IOVA range left.  Creating the reserved region
> > > +         * would also be pointless.
> > > +         */
> > > +        if (index + 1 == nb_iova_ranges) {
> > > +            return;
> > > +        }
> > > +
> > > +        resv_start = start + region_size;
> > > +        resv_end = usable_iova_ranges[index + 1].start;
> > > +
> > > +        /* Create a reserved region in the IOVA hole. */
> > > +        e820_add_entry(resv_start, resv_end - resv_start, E820_RESERVED);
> > > +
> > > +        offset += region_size;
> > > +    }
> > > +}
> > > +
> > >  void pc_memory_init(PCMachineState *pcms,
> > >                      MemoryRegion *system_memory,
> > >                      MemoryRegion *rom_memory,
> > > @@ -867,7 +955,7 @@ void pc_memory_init(PCMachineState *pcms,
> > >  {
> > >      int linux_boot, i;
> > >      MemoryRegion *option_rom_mr;
> > > -    MemoryRegion *ram_below_4g, *ram_above_4g;
> > > +    MemoryRegion *ram_below_4g;
> > >      FWCfgState *fw_cfg;
> > >      MachineState *machine = MACHINE(pcms);
> > >      MachineClass *mc = MACHINE_GET_CLASS(machine);
> > > @@ -877,6 +965,8 @@ void pc_memory_init(PCMachineState *pcms,
> > >      assert(machine->ram_size == x86ms->below_4g_mem_size +
> > >                                  x86ms->above_4g_mem_size);
> > >  
> > > +    init_usable_iova_ranges();
> > > +
> > >      linux_boot = (machine->kernel_filename != NULL);
> > >  
> > >      /*
> > > @@ -888,16 +978,11 @@ void pc_memory_init(PCMachineState *pcms,
> > >      memory_region_init_alias(ram_below_4g, NULL, "ram-below-4g", machine->ram,
> > >                               0, x86ms->below_4g_mem_size);
> > >      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> > > +
> > >      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> > >      if (x86ms->above_4g_mem_size > 0) {
> > > -        ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> > > -        memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> > > -                                 machine->ram,
> > > -                                 x86ms->below_4g_mem_size,
> > > -                                 x86ms->above_4g_mem_size);
> > > -        memory_region_add_subregion(system_memory, 0x100000000ULL,
> > > -                                    ram_above_4g);
> > > -        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
> > > +        add_memory_region(system_memory, machine->ram, 4 * GiB,
> > > +                          x86ms->above_4g_mem_size, x86ms->below_4g_mem_size);
> > >      }
> > >  
> > >      if (!pcmc->has_reserved_memory &&
> > > diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> > > index 1522a3359a93..73b8e2900c72 100644
> > > --- a/include/hw/i386/pc.h
> > > +++ b/include/hw/i386/pc.h
> > > @@ -151,6 +151,63 @@ void pc_guest_info_init(PCMachineState *pcms);
> > >  #define PCI_HOST_BELOW_4G_MEM_SIZE     "below-4g-mem-size"
> > >  #define PCI_HOST_ABOVE_4G_MEM_SIZE     "above-4g-mem-size"
> > >  
> > > +struct GPARange {
> > > +    uint64_t start;
> > > +    uint64_t end;
> > > +#define range_len(r) ((r).end - (r).start + 1)
> > > +
> > > +    const char *name;
> > > +};
> > > +
> > > +extern uint32_t nb_iova_ranges;
> > > +extern struct GPARange usable_iova_ranges[];
> > > +
> > > +static inline void next_iova_range_index(uint32_t i, hwaddr base, hwaddr size,
> > > +                                         hwaddr *start, hwaddr *region_size,
> > > +                                         uint32_t *index, struct GPARange **r)
> > > +{
> > > +    struct GPARange *range;
> > > +
> > > +    while (i < nb_iova_ranges) {
> > > +        range = &usable_iova_ranges[i];
> > > +        if (range->end >= base) {
> > > +            break;
> > > +        }
> > > +        i++;
> > > +    }
> > > +
> > > +    *index = i;
> > > +    /* index is out of bounds or no region left to process */
> > > +    if (i >= nb_iova_ranges || !size) {
> > > +        return;
> > > +    }
> > > +
> > > +    *start = MAX(range->start, base);
> > > +    *region_size = MIN(range->end - *start + 1, size);
> > > +    *r = range;
> > > +}
> > > +
> > > +/*
> > > + * for_each_usable_range() - iterates over usable IOVA regions
> > > + *
> > > + * @i:      range index within usable_iova_ranges
> > > + * @base:   requested address we want to use
> > > + * @size:   total size of the region with @base
> > > + * @r:      range indexed by @i within usable_iova_ranges
> > > + * @s:      calculated usable iova address base
> > > + * @i_size: calculated usable iova region size starting @s
> > > + *
> > > + * Use this iterator to walk through usable GPA ranges. Platforms
> > > + * such as AMD hosts with IOMMU-capable hardware reserve certain
> > > + * address ranges for HyperTransport. This iterator helper lets the
> > > + * caller avoid those reserved ranges.
> > > + */
> > > +#define for_each_usable_range(i, base, size, r, s, i_size) \
> > > +    for (s = 0, i_size = 0, r = NULL, \
> > > +         next_iova_range_index(0, base, size, &s, &i_size, &i, &r); \
> > > +         i < nb_iova_ranges && size != 0; \
> > > +         size -= i_size, \
> > > +         next_iova_range_index(i + 1, base, size, &s, &i_size, &i, &r))
> > >  
> > >  void pc_pci_as_mapping_init(Object *owner, MemoryRegion *system_memory,
> > >                              MemoryRegion *pci_address_space);
> > > diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> > > index 1e11071d817b..f8f15a4523c6 100644
> > > --- a/target/i386/cpu.h
> > > +++ b/target/i386/cpu.h
> > > @@ -869,6 +869,9 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
> > >  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> > >                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> > >                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> > > +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> > > +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> > > +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
> > >  
> > >  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
> > >  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */  
> > 
> >   
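For bystanders, the splitting done by add_memory_region()/for_each_usable_range() above can be modelled with a small toy sketch (my own names and helper, not the QEMU implementation):

```python
GiB = 1 << 30
TiB = 1 << 40
AMD_MAX_PHYSADDR_BELOW_1TB = 0xfcffffffff

# Allowed GPA ranges on an AMD host, as defined by the patch:
# [4G .. 0xfcffffffff] and [1T .. 2^64 - 1]
USABLE = [(4 * GiB, AMD_MAX_PHYSADDR_BELOW_1TB), (1 * TiB, (1 << 64) - 1)]

def split_above_4g(size, offset):
    """Return (gpa_start, region_size, ram_offset) triples for the
    above-4G RAM, skipping the reserved HyperTransport hole."""
    regions = []
    for start, end in USABLE:
        if size == 0:
            break
        rsize = min(end - start + 1, size)
        regions.append((start, rsize, offset))
        offset += rsize
        size -= rsize
    return regions
```

With a 1Tb guest and 2G below 4G (size = 0xff80000000, offset = 0x80000000) this reproduces the two aliases shown in the mtree output quoted above.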



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-28 13:43             ` Joao Martins
@ 2021-06-28 15:21               ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2021-06-28 15:21 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

On Mon, 28 Jun 2021 14:43:48 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/28/21 2:25 PM, Igor Mammedov wrote:
> > On Wed, 23 Jun 2021 14:07:29 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 6/23/21 1:09 PM, Igor Mammedov wrote:  
> >>> On Wed, 23 Jun 2021 10:51:59 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>     
> >>>> On 6/23/21 10:03 AM, Igor Mammedov wrote:    
> >>>>> On Tue, 22 Jun 2021 16:49:00 +0100
> >>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>>>       
> >>>>>> It is assumed that the whole GPA space is available to be
> >>>>>> DMA addressable, within a given address space limit. Since
> >>>>>> v5.4 that is no longer true, and VFIO will validate whether
> >>>>>> the selected IOVA is indeed valid, i.e. not reserved by the
> >>>>>> IOMMU on behalf of specific devices or the platform.
> >>>>>>
> >>>>>> AMD systems with an IOMMU are examples of such platforms and
> >>>>>> particularly may export only these ranges as allowed:
> >>>>>>
> >>>>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >>>>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >>>>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Eb)
> >>>>>>
> >>>>>> We already know to account for the 4G hole, but if the
> >>>>>> guest is big enough we will fail to allocate >1010G given
> >>>>>> the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
> >>>>>>
> >>>>>> When creating the region above 4G, take into account which
> >>>>>> IOVAs are allowed by defining the known allowed ranges
> >>>>>> and searching for the next free IOVA range. When finding an
> >>>>>> invalid IOVA we mark it as reserved and proceed to the
> >>>>>> next allowed IOVA region.
> >>>>>>
> >>>>>> After accounting for the 1Tb hole on AMD hosts, mtree should
> >>>>>> look like:
> >>>>>>
> >>>>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
> >>>>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
> >>>>>> 0000010000000000-000001037fffffff (prio 0, i/o):
> >>>>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff      
> >>>>>
> >>>>> You are talking here about GPA which is guest specific thing
> >>>>> and then somehow it becomes tied to host. For bystanders it's
> >>>>> not clear from above commit message how both are related.
> >>>>> I'd add here an explicit explanation how AMD host is related GPAs
> >>>>> and clarify where you are talking about guest/host side.
> >>>>>       
> >>>> OK, makes sense.
> >>>>
> >>>> Perhaps using IOVA makes it easier to understand. I said GPA because
> >>>> there's an 1:1 mapping between GPA and IOVA (if you're not using vIOMMU).    
> >>>
> >>> IOVA may be too broad a term; maybe explain it in terms of GPA and HPA
> >>> and why it matters on each side (host/guest)
> >>>     
> >>
> >> I used the term IOVA specially because that is applicable to Host IOVA or
> >> Guest IOVA (same rules apply as this is not special cased for VMs). So,
> >> regardless of whether we have guest mode page tables, or just host
> >> iommu page tables, this address range should be reserved and not used.  
> > 
> > IOVA doesn't make it any clearer; on the contrary, it's more confusing.
> > 
> > And does the host's HPA matter at all? (if the host's firmware isn't broken,
> > it should never use nor advertise the 1Tb hole).
> > So we are probably talking here about GPA only.
> >   
> For the case in point for the series, yes it's only GPA that we care about.
> 
> Perhaps I misunderstood your earlier comment where you said how HPAs were
> affected, so I was trying to encompass the problem statement in a Guest/Host
> agnostic manner by using IOVA given this is all related to IOMMU reserved ranges.
> I'll stick to GPA to avoid any confusion -- as that's what matters for this series.

Even better is to add here a reference to spec where it says so.

> 
> >>>>> also what about usecases:
> >>>>>  * start QEMU with Intel cpu model on AMD host with intel's iommu      
> >>>>
> >>>> In principle it would be less likely to occur. But you would still need
> >>>> to mark the same range as reserved. The limitation is on DMA occurring
> >>>> on those IOVAs (host or guest) coinciding with that range, so you would
> >>>> want to inform the guest that at least those should be avoided.
> >>>>    
> >>>>>  * start QEMU with AMD cpu model and AMD's iommu on Intel host      
> >>>>
> >>>> Here you would probably only mark the range, solely for honoring how hardware
> >>>> is usually represented. But really, on Intel, nothing stops you from exposing the
> >>>> aforementioned range as RAM.
> >>>>    
> >>>>>  * start QEMU in TCG mode on AMD host (mostly from qtest point of view)
> >>>>>       
> >>>> This one is tricky. Because you can hotplug a VFIO device later on,
> >>>> I opted for always marking the reserved range. If you don't use VFIO you're good, but
> >>>> otherwise you would still need the reserved range. But I am not sure how qtest is used
> >>>> today for testing huge guests.    
> >>> I do not know if there are VFIO tests in qtest (probably nope, since that
> >>> could require a host configured for that), but we can add a test
> >>> for this memory quirk (assuming phys-bits won't get in the way)
> >>>     
> >>
> >> 	Joao
> >>  
> >   
> 




* Re: [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space
  2021-06-23 13:22     ` Joao Martins
@ 2021-06-28 15:37       ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2021-06-28 15:37 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini, lersek

On Wed, 23 Jun 2021 14:22:29 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/23/21 1:30 PM, Igor Mammedov wrote:
> > On Tue, 22 Jun 2021 16:49:03 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> pci_memory initialized by q35 and i440fx is set to a range
> >> of 0 .. UINT64_MAX, and as a consequence when ACPI and pci-host
> >> pick the hole64_start it does not account for allowed IOVA ranges.
> >>
> >> Rather than blindly returning, round up the hole64_start
> >> value to the allowable IOVA range, such that it accounts for
> >> the 1Tb hole *on AMD*. On Intel it returns the input value
> >> for hole64 start.  
> > 
> > wouldn't that work only in case where guest firmware hadn't
> > mapped any PCI bars above 4Gb (possibly in not allowed region).
> > 
> > Looking at Seabios, it uses reserved_memory_end as the start
> > of PCI64 window. Not sure about OVMF,
> >  CCing Laszlo.
> >   
> Hmmm, perhaps only in the case that I don't advertise the reserved
> region (i.e. mem size not being big enough to let the guest know in
> the e820). But that begs the question: should we then always
> advertise the said HT region as reserved regardless of memory size?

Where in the firmware will that e820 reserved entry be accounted for?
All I see in Seabios is qemu_cfg_e820() which accounts for E820_RAM (sometimes)
and there the high RAM is hardwired to 4G. Then later on it might get used
to setup the start of 64-bit PCI hole if etc/reserved-memory-end
doesn't exist.

And for E820_RAM to work, the entry with the highest address has to be
the last in the table, which is fragile and only a question of
time before someone breaks it.
(well, if it's explicitly specified in the ACPI spec, I won't object,
provided it's properly documented)
So the firmware side will also need changes to account for this,
and make it independent of the 4G starting point.
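To illustrate the fragility (a toy model, not actual SeaBIOS code; the entry values are made up):

```python
E820_RAM, E820_RESERVED = 1, 2
GiB = 1 << 30
TiB = 1 << 40

def high_ram_end(e820):
    """The fragile rule: derive top-of-RAM from the *last* E820_RAM entry,
    assuming it is also the highest one."""
    for base, size, typ in reversed(e820):
        if typ == E820_RAM:
            return base + size
    return 0

# Entries sorted by address: the rule happens to give the right answer.
ok = [(0, 2 * GiB, E820_RAM),
      (4 * GiB, 1008 * GiB, E820_RAM),
      (0xfd00000000, 12 * GiB, E820_RESERVED),
      (1 * TiB, 14 * GiB, E820_RAM)]

# Same entries with the RAM entries reordered: wrong top-of-RAM.
bad = [ok[3], ok[0], ok[1], ok[2]]
```

high_ram_end(bad) silently reports 1012G instead of 1T + 14G.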

For Seabios it might be better to make use of the fwcfg file Laszlo
suggested in his reply than to rely on e820 & reserved-memory-end;
either way Seabios would need related patches for this to work
properly. For OVMF you can get away with QEMU only patches,
if using that knob.

(PS:
I'm not PCI expert so there might be more issues hidden there
)

> >> Suggested-by: David Edmondson <david.edmondson@oracle.com>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  hw/i386/pc.c         | 17 +++++++++++++++--
> >>  hw/pci-host/i440fx.c |  4 +++-
> >>  hw/pci-host/q35.c    |  4 +++-
> >>  include/hw/i386/pc.h |  3 ++-
> >>  4 files changed, 23 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index 2e2ea82a4661..65885cc16037 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -1141,7 +1141,7 @@ void pc_memory_init(PCMachineState *pcms,
> >>   * The 64bit pci hole starts after "above 4G RAM" and
> >>   * potentially the space reserved for memory hotplug.
> >>   */
> >> -uint64_t pc_pci_hole64_start(void)
> >> +uint64_t pc_pci_hole64_start(uint64_t size)
> >>  {
> >>      PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
> >>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >> @@ -1155,12 +1155,25 @@ uint64_t pc_pci_hole64_start(void)
> >>              hole64_start += memory_region_size(&ms->device_memory->mr);
> >>          }
> >>      } else {
> >> -        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
> >> +        if (!x86ms->above_1t_mem_size) {
> >> +            hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
> >> +        } else {
> >> +            hole64_start = x86ms->above_1t_maxram_start;
> >> +        }
> >>      }  
> >   
> >> +    hole64_start = allowed_round_up(hole64_start, size);  
> > 
> > I'm not sure but, might it cause problems if there were BARs placed
> > by firmware in region below rounded up value?
> > (i.e. ACPI description which uses PCI_HOST_PROP_PCI_HOLE_START property
> > won't match whatever firmware programmed due to rounding pushing hole
> > start up)
> >   
> But wouldn't the problem then be the firmware ignoring the E820 and putting
> a region where it shouldn't? Unless, of course, it wasn't advertised,
> like I mentioned earlier.
> 
> >>      return ROUND_UP(hole64_start, 1 * GiB);
> >>  }
> >>  
> >> +uint64_t pc_pci_hole64_start_aligned(uint64_t start, uint64_t size)
> >> +{
> >> +    if (nb_iova_ranges == DEFAULT_NR_USABLE_IOVAS) {
> >> +        return start;
> >> +    }
> >> +    return allowed_round_up(start, size);
> >> +}
> >> +
> >>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus)
> >>  {
> >>      DeviceState *dev = NULL;
> >> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
> >> index 28c9bae89944..e8eaebfe1034 100644
> >> --- a/hw/pci-host/i440fx.c
> >> +++ b/hw/pci-host/i440fx.c
> >> @@ -163,8 +163,10 @@ static uint64_t i440fx_pcihost_get_pci_hole64_start_value(Object *obj)
> >>      pci_bus_get_w64_range(h->bus, &w64);
> >>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
> >>      if (!value && s->pci_hole64_fix) {
> >> -        value = pc_pci_hole64_start();
> >> +        value = pc_pci_hole64_start(s->pci_hole64_size);
> >>      }
> >> +    /* This returns @value when not on AMD */
> >> +    value = pc_pci_hole64_start_aligned(value, s->pci_hole64_size);
> >>      return value;
> >>  }
> >>  
> >> diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
> >> index 2eb729dff585..d556eb965ddb 100644
> >> --- a/hw/pci-host/q35.c
> >> +++ b/hw/pci-host/q35.c
> >> @@ -126,8 +126,10 @@ static uint64_t q35_host_get_pci_hole64_start_value(Object *obj)
> >>      pci_bus_get_w64_range(h->bus, &w64);
> >>      value = range_is_empty(&w64) ? 0 : range_lob(&w64);
> >>      if (!value && s->pci_hole64_fix) {
> >> -        value = pc_pci_hole64_start();
> >> +        value = pc_pci_hole64_start(s->mch.pci_hole64_size);
> >>      }
> >> +    /* This returns @value when not on AMD */
> >> +    value = pc_pci_hole64_start_aligned(value, s->mch.pci_hole64_size);
> >>      return value;
> >>  }
> >>  
> >> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> >> index 73b8e2900c72..b924aef3a218 100644
> >> --- a/include/hw/i386/pc.h
> >> +++ b/include/hw/i386/pc.h
> >> @@ -217,7 +217,8 @@ void pc_memory_init(PCMachineState *pcms,
> >>                      MemoryRegion *system_memory,
> >>                      MemoryRegion *rom_memory,
> >>                      MemoryRegion **ram_memory);
> >> -uint64_t pc_pci_hole64_start(void);
> >> +uint64_t pc_pci_hole64_start(uint64_t size);
> >> +uint64_t pc_pci_hole64_start_aligned(uint64_t value, uint64_t size);
> >>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
> >>  void pc_basic_device_init(struct PCMachineState *pcms,
> >>                            ISABus *isa_bus, qemu_irq *gsi,  
> >   
> 




* Re: [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary
  2021-06-28 14:32           ` Igor Mammedov
@ 2021-08-06 10:41             ` Joao Martins
  0 siblings, 0 replies; 38+ messages in thread
From: Joao Martins @ 2021-08-06 10:41 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson,
	Suravee Suthikulpanit, Paolo Bonzini

[sorry for the thread necromancy -- got preempted by other stuff
 and am slowly coming back to this]

On 6/28/21 3:32 PM, Igor Mammedov wrote:
> On Wed, 23 Jun 2021 14:04:19 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/23/21 12:39 PM, Igor Mammedov wrote:
>>> On Wed, 23 Jun 2021 10:37:38 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> On 6/23/21 8:11 AM, Igor Mammedov wrote:  
>>>>> On Tue, 22 Jun 2021 16:49:00 +0100
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>     
> >>>>>> It is assumed that the whole GPA space is available to be
> >>>>>> DMA addressable, within a given address space limit. Since
> >>>>>> v5.4 that is no longer true, and VFIO will validate whether
> >>>>>> the selected IOVA is indeed valid, i.e. not reserved by the
> >>>>>> IOMMU on behalf of specific devices or the platform.
>>>>>>
>>>>>> AMD systems with an IOMMU are examples of such platforms and
>>>>>> particularly may export only these ranges as allowed:
>>>>>>
>>>>>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>>>>>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>>>>>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb)
>>>>>>
> >>>>>> We already know to account for the 4G hole, but if the
> >>>>>> guest is big enough we will fail to allocate >1010G given
> >>>>>> the ~12G hole at the 1Tb boundary, reserved for HyperTransport.
>>>>>>
> >>>>>> When creating the region above 4G, take into account which
> >>>>>> IOVAs are allowed by defining the known allowed ranges
> >>>>>> and searching for the next free IOVA range. When finding an
> >>>>>> invalid IOVA we mark it as reserved and proceed to the
> >>>>>> next allowed IOVA region.
>>>>>>
>>>>>> After accounting for the 1Tb hole on AMD hosts, mtree should
>>>>>> look like:
>>>>>>
>>>>>> 0000000100000000-000000fcffffffff (prio 0, i/o):
>>>>>> 	alias ram-above-4g @pc.ram 0000000080000000-000000fc7fffffff
>>>>>> 0000010000000000-000001037fffffff (prio 0, i/o):
>>>>>> 	alias ram-above-1t @pc.ram 000000fc80000000-000000ffffffffff    
>>>>>
>>>>> why not push whole ram-above-4g above 1Tb mark
>>>>> when RAM is sufficiently large (regardless of used host),
>>>>> instead of creating yet another hole and all complexity it brings along?
>>>>>     
>>>>
>>>> There's the problem with CMOS, which describes memory above 4G; that is part
>>>> of the reason I cap it to 1TB minus the reserved range, i.e. for AMD, CMOS
>>>> would only describe up to 1T.
>>>>
>>>> But should we not care about that then it's an option, I suppose.  
>>> we probably do not care about CMOS with so large RAM,
>>> as long as QEMU generates a correct E820 (CMOS mattered only with old
>>> Seabios, which used it to generate the memory map)
>>>   
>> OK, good to know.
>>
>> Any extension on CMOS would probably also be out of spec.
> ram size in CMOS is approximate at best as it doesn't account for all holes;
> anyway, it's irrelevant if we aren't changing RAM size.
> 
/me nods

>>>> We would waste 1Tb of address space because of 12G, and btw the logic here
>>>> is not so different than the 4G hole, in fact could probably share this
>>>> with it.  
>>> the main reason I'm looking for an alternative is the complexity
>>> that making a hole brings in. At this point, we can't do anything
>>> about the 4G hole as it's already there, but we can try to avoid that
>>> for high RAM and keep the rules there as simple as they are now.
>>>   
>> Right. But for what is worth, that complexity is spread in two parts:
>>
>> 1) dealing with a sparse RAM model (with more than one hole)
>>
>> 2) offsetting everything else that assumes a linear RAM map.
>>
>> I don't think that even if we shift the start of RAM to after the 1TB boundary
>> we would get away without solving item 2 -- which personally is where I find this
>> a tad bit more hairy. So it would probably make this patch's complexity smaller, but
>> not vary much in how spread out the changes get.
> 
> you are already shifting hotplug area behind 1Tb in [2/6],
> so to be consistent just do the same for the 4Gb+ alias as well.
> That will automatically solve issues in patch 4/6, and all
> that at cost of several lines in one place vs this 200+ LOC series.
> Both approaches are a kludge but shifting everything over 1Tb mark
> is the simplest one.
> 

I understand that shifting the above-4g-mem region to start at 1T would
be simpler and a lot less to maintain, considering we can eliminate
most of the complexity of finding holes in which to place RAM ranges.

Let me try your suggestion.

But I'm still somewhat uneasy that we end up with a gigantic hole from
[<4G .. 1T]. I have nothing to base this on *right now*, so it is nothing
more than a vague concern, but hopefully there aren't assumptions elsewhere
in OSes/firmware and the like.
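To make the suggestion concrete (toy numbers and my own sketch, not code from the series):

```python
GiB = 1 << 30
TiB = 1 << 40
HT_HOLE = (0xfd00000000, 0xffffffffff)  # reserved HyperTransport range

def memory_map_shifted(ram_size, below_4g):
    """The suggestion above: instead of splitting, place the whole
    above-4G RAM alias at the 1Tb mark when it would cross the HT hole."""
    regions = [(0, below_4g)]
    above = ram_size - below_4g
    if above:
        regions.append((1 * TiB, above))
    return regions
```

Every region then trivially avoids the hole, at the cost of the gigantic [<4G .. 1T] gap.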


> If there were/is actual demand for sparse/non linear RAM layouts,
> I'd switch to pc-dimm based RAM (i.e. generalize it to handle
> RAM below 4G and let users specify their own memory map,
> see below for more).
> 
It seems less error-prone if we learn to deal with these oddities
with the pc-dimm based RAM (long term). At least it would open up
the other avenues Alex suggested in the cover letter. And more importantly,
it lets us represent all this closer to what hardware actually does (which is
what I was attempting with this series).

>>> Also partitioning/splitting main RAM is one of the things that
>>> gets in the way converting it to PC-DIMMMs model.
>>>   
>> Can you expand on that? (a link to a series is enough)
> There is no series so far, only a rough idea.
> 
> current initial RAM, with its rather arbitrary splitting rules,
> doesn't map very well onto the pc-dimm device model.
> Unfortunately I don't see a way to convert initial RAM to the
> pc-dimm model without breaking CLI/ABI/migration.
> 
> As a way forward, we can restrict legacy layout to old
> machine versions, and switch to pc-dimm based initial RAM
> for new machine versions.
> 
> Then users will be able to specify the GPA where each pc-dimm shall
> be mapped. For reserving GPA ranges we can change the current GPA
> allocator. Alternatively a bit more flexible approach could be
> a dummy memory device that would consume a reserved range so
> that no one could assign a pc-dimm there; this way the user can define
> arbitrary (subject to alignment restrictions we put on it) sparse
> memory layouts without modifying QEMU for yet another hole.
> 
Yeap -- sounds a whole lot more flexible longterm.

> 
> >>> Losing 1Tb of address space might be acceptable on a host
>>> that can handle such amounts of RAM
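A sketch of the allocator idea above (hypothetical, not QEMU code): if reserved ranges are simply absent from the free list, no pc-dimm can ever be placed in a hole.

```python
GiB = 1 << 30
TiB = 1 << 40

def alloc_gpa(free_ranges, size, align):
    """First-fit GPA allocation over explicitly free ranges; the HT hole
    (or any dummy reserved device) just never appears in free_ranges."""
    for start, end in free_ranges:
        base = (start + align - 1) & ~(align - 1)  # align is a power of two
        if base + size - 1 <= end:
            return base
    return None

# Free GPA space on an AMD host: below and above the HT hole.
FREE = [(4 * GiB, 0xfcffffffff), (1 * TiB, 2 * TiB - 1)]
```

A request that no longer fits below the hole lands at the 1Tb mark automatically, with no hole-specific logic in the allocator itself.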





end of thread, other threads:[~2021-08-06 10:43 UTC | newest]

Thread overview: 38+ messages
2021-06-22 15:48 [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Joao Martins
2021-06-22 15:49 ` [PATCH RFC 1/6] i386/pc: Account IOVA reserved ranges above 4G boundary Joao Martins
2021-06-23  7:11   ` Igor Mammedov
2021-06-23  9:37     ` Joao Martins
2021-06-23 11:39       ` Igor Mammedov
2021-06-23 13:04         ` Joao Martins
2021-06-28 14:32           ` Igor Mammedov
2021-08-06 10:41             ` Joao Martins
2021-06-23  9:03   ` Igor Mammedov
2021-06-23  9:51     ` Joao Martins
2021-06-23 12:09       ` Igor Mammedov
2021-06-23 13:07         ` Joao Martins
2021-06-28 13:25           ` Igor Mammedov
2021-06-28 13:43             ` Joao Martins
2021-06-28 15:21               ` Igor Mammedov
2021-06-24  9:32     ` Dr. David Alan Gilbert
2021-06-28 14:42       ` Igor Mammedov
2021-06-22 15:49 ` [PATCH RFC 2/6] i386/pc: Round up the hotpluggable memory within valid IOVA ranges Joao Martins
2021-06-22 15:49 ` [PATCH RFC 3/6] pc/cmos: Adjust CMOS above 4G memory size according to 1Tb boundary Joao Martins
2021-06-22 15:49 ` [PATCH RFC 4/6] i386/pc: Keep PCI 64-bit hole within usable IOVA space Joao Martins
2021-06-23 12:30   ` Igor Mammedov
2021-06-23 13:22     ` Joao Martins
2021-06-28 15:37       ` Igor Mammedov
2021-06-23 16:33     ` Laszlo Ersek
2021-06-25 17:19       ` Joao Martins
2021-06-22 15:49 ` [PATCH RFC 5/6] i386/acpi: Fix SRAT ranges in accordance to usable IOVA Joao Martins
2021-06-22 15:49 ` [PATCH RFC 6/6] i386/pc: Add a machine property for AMD-only enforcing of valid IOVAs Joao Martins
2021-06-23  9:18   ` Igor Mammedov
2021-06-23  9:59     ` Joao Martins
2021-06-22 21:16 ` [PATCH RFC 0/6] i386/pc: Fix creation of >= 1Tb guests on AMD systems with IOMMU Alex Williamson
2021-06-23  7:40   ` David Edmondson
2021-06-23 19:13     ` Alex Williamson
2021-06-23  9:30   ` Joao Martins
2021-06-23 11:58     ` Igor Mammedov
2021-06-23 13:15       ` Joao Martins
2021-06-23 19:27     ` Alex Williamson
2021-06-24  9:22       ` Dr. David Alan Gilbert
2021-06-25 16:54       ` Joao Martins
