* [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
@ 2022-02-23 18:44 Joao Martins
  2022-02-23 18:44 ` [PATCH v3 1/6] hw/i386: add 4g boundary start to X86MachineState Joao Martins
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-23 18:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

RFCv2[3] -> v3:

* Add missing brackets in single line statement, in patch 5 (David)
* Change ranges printf to use PRIx64, in patch 5 (David)
* Move the check to after changing above_4g_mem_start, in patch 5 (David)
* Make the check generic and move it to pc_memory_init() rather than being
specific to AMD, as the check is useful to capture invalid phys-bits
configs (patch 5, Igor).
* Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor)
* Consider pci_hole64_size in patch 4 (Igor)
* To consider pci_hole64_size in the max used addr we need to get it from the pci-host,
so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") or
qdev_new("q35") call to before pc_memory_init().
* Consider sgx_epc.size in max used address, in patch 4 (Igor)
* Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor)
* Keep warn_report() in patch 5, as erroring out would break a few x86_64 qtests
due to the pci_hole64 accounting surpassing the maximum physical address allowed
by phys-bits.

Thanks Igor/David for the comments!

---

This series lets QEMU spawn i386 guests with >= 1010G of RAM with VFIO,
particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is valid and
returns -EINVAL in those cases. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with an IOMMU have a hole at
the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
at FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3, for what those addresses mean.

VFIO DMA_MAP calls with IOVAs in this address range fail that check and hence return
-EINVAL, consequently failing the creation of guests bigger than 1010G. Example
of the failure:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)

In the example above, the pc.ram region above 4G spans the GPAs 0x100000000 to
0x10030000000, which crosses the reserved HT range starting at 0xfd00000000,
hence the -EINVAL.

Prior to v5.4 we could map these IOVAs, *but* that's still not the right thing
to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2]),
as documented in the links below.

This small series tries to address that by dealing with this AMD-specific 1Tb hole;
rather than handling it like the 4G hole, it instead relocates RAM above 4G
to start above 1T whenever the maximum RAM range crosses the HT reserved range.
It is organized as follows:

patch 1: Introduce an @above_4g_mem_start field, defaulting to 4 GiB, as the
         starting address of the RAM above 4G

patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
             to get access to pci_hole64_size. The actual pci-host
             initialization is kept as is; only the qdev_new is moved.

patch 4: Change @above_4g_mem_start to 1TiB if we are on AMD and the max
possible address crosses the HT region.

patch 5: Warn the user if phys-bits is too low

patch 6: Ensure valid IOVAs only on new machine types, but not older
ones (<= v6.2.0)

The 'consequence' of this approach is that we may need more than the default
phys-bits, e.g. a guest with >1010G will have most of its RAM after the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40
(TCG_PHYS_BITS). Today there's already a precedent of depending on the user to
pick the right value of phys-bits (regardless of this series), so we warn in
case phys-bits isn't enough. Finally, CMOS loses its meaning for the above-4G
RAM blocks, but it was mentioned over RFC that CMOS is only useful for very
old SeaBIOS.
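
As a rough worked example (hypothetical numbers, ignoring device memory, SGX and
the 1 GiB rounding): a 1024G guest with ~3G of RAM below 4G ends up with

  above_4g_mem_size  ~ 1024G - 3G     = 1021G
  above_4g_mem_start = 1T (relocated) = 1024G
  last RAM address   ~ 1024G + 1021G  = 2045G

2045G is beyond the 1024G addressable with 40 phys-bits, but within 2^41, hence
the need for 41.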

Additionally, the reserved region is added to E820 if the relocation is done.

Alternative options considered (in RFC[0]):

a) Dealing with the 1T hole like the 4G hole -- which also more closely
represents what hardware does.

Thanks,
	Joao

Older Changelog,

RFC[0] -> RFCv2[3]:

* At Igor's suggestion in one of the patches, I reworked the series entirely,
and, more or less as he was thinking, it is far simpler to relocate the
ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
and less intrusive. (patch 1 & 2)
* Check phys-bits is big enough prior to relocating (new patch 3)
* Remove the machine property; it's only internal and set by the new machine
version (Igor, patch 4).
* Clarify whether it's GPA or HPA, for a clearer meaning (Igor, patch 2)
* Add IOMMU SDM in the commit message (Igor, patch 2)

[0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.martins@oracle.com/
[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
[3] https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.martins@oracle.com/T/#u


Joao Martins (6):
  hw/i386: add 4g boundary start to X86MachineState
  i386/pc: create pci-host qdev prior to pc_memory_init()
  i386/pc: pass pci_hole64_size to pc_memory_init()
  i386/pc: relocate 4g start to 1T where applicable
  i386/pc: warn if phys-bits is too low
  i386/pc: restrict AMD only enforcing of valid IOVAs to new machine
    type

 hw/i386/acpi-build.c         |   2 +-
 hw/i386/pc.c                 | 107 +++++++++++++++++++++++++++++++++--
 hw/i386/pc_piix.c            |  12 +++-
 hw/i386/pc_q35.c             |  14 ++++-
 hw/i386/sgx.c                |   2 +-
 hw/i386/x86.c                |   1 +
 hw/pci-host/i440fx.c         |  10 +++-
 include/hw/i386/pc.h         |   4 +-
 include/hw/i386/x86.h        |   3 +
 include/hw/pci-host/i440fx.h |   3 +-
 target/i386/cpu.h            |   4 ++
 11 files changed, 146 insertions(+), 16 deletions(-)

-- 
2.17.2




* [PATCH v3 1/6] hw/i386: add 4g boundary start to X86MachineState
  2022-02-23 18:44 [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
@ 2022-02-23 18:44 ` Joao Martins
  2022-02-23 18:44 ` [PATCH v3 2/6] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-23 18:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

Rather than hardcoding the 4G boundary everywhere, introduce an
X86MachineState field @above_4g_mem_start and use it
accordingly.

This is in preparation for relocating ram-above-4g to
dynamically start at 1T on AMD platforms.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/acpi-build.c  | 2 +-
 hw/i386/pc.c          | 9 +++++----
 hw/i386/sgx.c         | 2 +-
 hw/i386/x86.c         | 1 +
 include/hw/i386/x86.h | 3 +++
 5 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index ebd47aa26fd8..4bf54ccdab91 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2063,7 +2063,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
                 build_srat_memory(table_data, mem_base, mem_len, i - 1,
                                   MEM_AFFINITY_ENABLED);
             }
-            mem_base = 1ULL << 32;
+            mem_base = x86ms->above_4g_mem_start;
             mem_len = next_base - x86ms->below_4g_mem_size;
             next_base = mem_base + mem_len;
         }
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index c8696ac01e85..7de0e87f4a3f 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -837,9 +837,10 @@ void pc_memory_init(PCMachineState *pcms,
                                  machine->ram,
                                  x86ms->below_4g_mem_size,
                                  x86ms->above_4g_mem_size);
-        memory_region_add_subregion(system_memory, 0x100000000ULL,
+        memory_region_add_subregion(system_memory, x86ms->above_4g_mem_start,
                                     ram_above_4g);
-        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
+        e820_add_entry(x86ms->above_4g_mem_start, x86ms->above_4g_mem_size,
+                       E820_RAM);
     }
 
     if (pcms->sgx_epc.size != 0) {
@@ -880,7 +881,7 @@ void pc_memory_init(PCMachineState *pcms,
             machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
         } else {
             machine->device_memory->base =
-                0x100000000ULL + x86ms->above_4g_mem_size;
+                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
         }
 
         machine->device_memory->base =
@@ -972,7 +973,7 @@ uint64_t pc_pci_hole64_start(void)
     } else if (pcms->sgx_epc.size != 0) {
             hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
     } else {
-        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
+        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
     }
 
     return ROUND_UP(hole64_start, 1 * GiB);
diff --git a/hw/i386/sgx.c b/hw/i386/sgx.c
index a2b318dd9387..164ee1ddb8de 100644
--- a/hw/i386/sgx.c
+++ b/hw/i386/sgx.c
@@ -295,7 +295,7 @@ void pc_machine_init_sgx_epc(PCMachineState *pcms)
         return;
     }
 
-    sgx_epc->base = 0x100000000ULL + x86ms->above_4g_mem_size;
+    sgx_epc->base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
 
     memory_region_init(&sgx_epc->mr, OBJECT(pcms), "sgx-epc", UINT64_MAX);
     memory_region_add_subregion(get_system_memory(), sgx_epc->base,
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index b84840a1bb99..912e96718ee8 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1319,6 +1319,7 @@ static void x86_machine_initfn(Object *obj)
     x86ms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
     x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
     x86ms->bus_lock_ratelimit = 0;
+    x86ms->above_4g_mem_start = 0x100000000ULL;
 }
 
 static void x86_machine_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index a145a303703f..ec6ead296064 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -58,6 +58,9 @@ struct X86MachineState {
     /* RAM information (sizes, addresses, configuration): */
     ram_addr_t below_4g_mem_size, above_4g_mem_size;
 
+    /* Start address of the initial RAM above 4G */
+    ram_addr_t above_4g_mem_start;
+
     /* CPU and apic information: */
     bool apic_xrupt_override;
     unsigned pci_irq_mask;
-- 
2.17.2




* [PATCH v3 2/6] i386/pc: create pci-host qdev prior to pc_memory_init()
  2022-02-23 18:44 [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-02-23 18:44 ` [PATCH v3 1/6] hw/i386: add 4g boundary start to X86MachineState Joao Martins
@ 2022-02-23 18:44 ` Joao Martins
  2022-02-23 18:44 ` [PATCH v3 3/6] i386/pc: pass pci_hole64_size " Joao Martins
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-23 18:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

At the start of pc_memory_init() we usually pass a range of
0..UINT64_MAX as pci_memory, when really it's 2G (i440fx) or
32G (q35). To get the real user value, we need the pci-host's
property for the default pci_hole64_size. Thus, to get that,
create the qdev prior to memory init to better estimate the
max used/phys addr.

This is in preparation for determining that host-phys-bits are
enough and also for pci-hole64-size to be considered when relocating
ram-above-4g to be at 1T (on AMD platforms).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc_piix.c            | 5 ++++-
 hw/i386/pc_q35.c             | 6 +++---
 hw/pci-host/i440fx.c         | 3 +--
 include/hw/pci-host/i440fx.h | 2 +-
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index d9b344248dac..9ff49e672628 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
     MemoryRegion *pci_memory;
     MemoryRegion *rom_memory;
     ram_addr_t lowmem;
+    DeviceState *i440fx_dev;
 
     /*
      * Calculate ram split, for memory below and above 4G.  It's a bit
@@ -164,9 +165,11 @@ static void pc_init1(MachineState *machine,
         pci_memory = g_new(MemoryRegion, 1);
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
+        i440fx_dev = qdev_new(host_type);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
+        i440fx_dev = NULL;
     }
 
     pc_guest_info_init(pcms);
@@ -199,7 +202,7 @@ static void pc_init1(MachineState *machine,
 
         pci_bus = i440fx_init(host_type,
                               pci_type,
-                              &i440fx_state,
+                              i440fx_dev, &i440fx_state,
                               system_memory, system_io, machine->ram_size,
                               x86ms->below_4g_mem_size,
                               x86ms->above_4g_mem_size,
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 1780f79bc127..2881afd75a82 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -203,12 +203,12 @@ static void pc_q35_init(MachineState *machine)
                             pcms->smbios_entry_point_type);
     }
 
-    /* allocate ram and load rom/bios */
-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
-
     /* create pci host bus */
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
 
+    /* allocate ram and load rom/bios */
+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
+
     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
                              OBJECT(ram_memory), NULL);
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index e08716142b6e..5c1bab5c58ed 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -238,6 +238,7 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
 }
 
 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
+                    DeviceState *dev,
                     PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
@@ -247,7 +248,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_address_space,
                     MemoryRegion *ram_memory)
 {
-    DeviceState *dev;
     PCIBus *b;
     PCIDevice *d;
     PCIHostState *s;
@@ -255,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
     unsigned i;
     I440FXState *i440fx;
 
-    dev = qdev_new(host_type);
     s = PCI_HOST_BRIDGE(dev);
     b = pci_root_bus_new(dev, NULL, pci_address_space,
                          address_space_io, 0, TYPE_PCI_BUS);
diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
index f068aaba8fda..c4710445e30a 100644
--- a/include/hw/pci-host/i440fx.h
+++ b/include/hw/pci-host/i440fx.h
@@ -36,7 +36,7 @@ struct PCII440FXState {
 #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
 
 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
-                    PCII440FXState **pi440fx_state,
+                    DeviceState *dev, PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
                     ram_addr_t ram_size,
-- 
2.17.2




* [PATCH v3 3/6] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-02-23 18:44 [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-02-23 18:44 ` [PATCH v3 1/6] hw/i386: add 4g boundary start to X86MachineState Joao Martins
  2022-02-23 18:44 ` [PATCH v3 2/6] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
@ 2022-02-23 18:44 ` Joao Martins
  2022-02-23 18:44 ` [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable Joao Martins
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-23 18:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

Use the pre-initialized pci-host qdev and fetch the
pci-hole64-size into pc_memory_init()'s newly added argument.
piix needs a bit of care given all the !pci_enabled() cases
and the fact that pci_hole64_size is private to i440fx.

This is in preparation for determining that host-phys-bits are
enough and for pci-hole64-size to be considered when relocating
ram-above-4g to be at 1T (on AMD platforms).
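
The resulting call order in pc_init1() is then roughly the following
(a condensed sketch of the flow after this patch, not a literal copy of the code):

    i440fx_dev = qdev_new(host_type);           /* created, not realized yet */
    hole64_size = i440fx_pci_hole64_size(i440fx_dev);
    pc_memory_init(pcms, system_memory, rom_memory, &ram_memory, hole64_size);
    ...
    pci_bus = i440fx_init(host_type, pci_type, i440fx_dev, &i440fx_state, ...);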

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c                 | 3 ++-
 hw/i386/pc_piix.c            | 5 ++++-
 hw/i386/pc_q35.c             | 8 +++++++-
 hw/pci-host/i440fx.c         | 7 +++++++
 include/hw/i386/pc.h         | 3 ++-
 include/hw/pci-host/i440fx.h | 1 +
 6 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 7de0e87f4a3f..360f4e10001b 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -805,7 +805,8 @@ void xen_load_linux(PCMachineState *pcms)
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory)
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size)
 {
     int linux_boot, i;
     MemoryRegion *option_rom_mr;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 9ff49e672628..5a608e30e28f 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
     MemoryRegion *pci_memory;
     MemoryRegion *rom_memory;
     ram_addr_t lowmem;
+    uint64_t hole64_size;
     DeviceState *i440fx_dev;
 
     /*
@@ -166,10 +167,12 @@ static void pc_init1(MachineState *machine,
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
         i440fx_dev = qdev_new(host_type);
+        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
         i440fx_dev = NULL;
+        hole64_size = 0;
     }
 
     pc_guest_info_init(pcms);
@@ -186,7 +189,7 @@ static void pc_init1(MachineState *machine,
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         pc_memory_init(pcms, system_memory,
-                       rom_memory, &ram_memory);
+                       rom_memory, &ram_memory, hole64_size);
     } else {
         pc_system_flash_cleanup_unused(pcms);
         if (machine->kernel_filename != NULL) {
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 2881afd75a82..c81d21d1ebb4 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -138,6 +138,7 @@ static void pc_q35_init(MachineState *machine)
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     bool acpi_pcihp;
     bool keep_pci_slot_hpc;
+    uint64_t pci_hole64_size = 0;
 
     /* Check whether RAM fits below 4G (leaving 1/2 GByte for IO memory
      * and 256 Mbytes for PCI Express Enhanced Configuration Access Mapping
@@ -206,8 +207,13 @@ static void pc_q35_init(MachineState *machine)
     /* create pci host bus */
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
 
+    if (pcmc->pci_enabled) {
+        pci_hole64_size = q35_host->mch.pci_hole64_size;
+    }
+
     /* allocate ram and load rom/bios */
-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
+                   pci_hole64_size);
 
     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index 5c1bab5c58ed..c5cc28250d5c 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -237,6 +237,13 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
     }
 }
 
+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
+{
+        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
+
+        return i440fx->pci_hole64_size;
+}
+
 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     DeviceState *dev,
                     PCII440FXState **pi440fx_state,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 9c9f4ac74810..d8b9c4ebd748 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory);
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size);
 uint64_t pc_pci_hole64_start(void);
 DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
 void pc_basic_device_init(struct PCMachineState *pcms,
diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
index c4710445e30a..1299d6a2b0e4 100644
--- a/include/hw/pci-host/i440fx.h
+++ b/include/hw/pci-host/i440fx.h
@@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_memory,
                     MemoryRegion *ram_memory);
 
+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
 
 #endif
-- 
2.17.2




* [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-23 18:44 [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (2 preceding siblings ...)
  2022-02-23 18:44 ` [PATCH v3 3/6] i386/pc: pass pci_hole64_size " Joao Martins
@ 2022-02-23 18:44 ` Joao Martins
  2022-02-23 21:22   ` Michael S. Tsirkin
  2022-02-24 14:27   ` Joao Martins
  2022-02-23 18:44 ` [PATCH v3 5/6] i386/pc: warn if phys-bits is too low Joao Martins
  2022-02-23 18:44 ` [PATCH v3 6/6] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
  5 siblings, 2 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-23 18:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, except for a
tiny region before 4G. Since Linux v5.4, VFIO validates
whether the selected GPA is indeed valid, i.e. not reserved by
the IOMMU on behalf of some specific devices or by platform-defined
restrictions, and fails the ioctl(VFIO_DMA_MAP) with -EINVAL
otherwise.

AMD systems with an IOMMU are examples of such platforms and
particularly may only have these ranges as allowed:

	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])

We already account for the 4G hole, but if the guest is big
enough we will fail to create a guest with >1010G due to the
~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

[*] there is another reserved region unrelated to HT that exists
at the 256T boundary in Fam 17h according to Errata #1286,
documented also in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence it
cannot be used as GPAs either. In those cases, rather than
establishing the start of ram-above-4g at 4G, relocate it instead
to 1Tb. See the AMD IOMMU spec, section 2.1.2 "IOMMU Logical
Topology", for more information on the underlying restriction of
IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done, we also add the reserved HT
range to e820 as reserved.

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c      | 79 +++++++++++++++++++++++++++++++++++++++++++++++
 target/i386/cpu.h |  4 +++
 2 files changed, 83 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 360f4e10001b..6e4f5c87a2e5 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -802,6 +802,78 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN       0x800
 #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
 
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static hwaddr x86_max_phys_addr(PCMachineState *pcms,
+                                uint64_t pci_hole64_size)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    ram_addr_t device_mem_size = 0;
+    hwaddr base;
+
+    if (pcmc->has_reserved_memory &&
+       (machine->ram_size < machine->maxram_size)) {
+        device_mem_size = machine->maxram_size - machine->ram_size;
+    }
+
+    base = ROUND_UP(x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
+                    pcms->sgx_epc.size, 1 * GiB);
+
+    return base + device_mem_size + pci_hole64_size;
+}
+
+static void x86_update_above_4g_mem_start(PCMachineState *pcms,
+                                          uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    uint32_t eax, vendor[3];
+
+    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
+    if (!IS_AMD_VENDOR(vendor)) {
+        return;
+    }
+
+    if (x86_max_phys_addr(pcms, pci_hole64_size) < AMD_HT_START) {
+        return;
+    }
+
+    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -822,6 +894,8 @@ void pc_memory_init(PCMachineState *pcms,
 
     linux_boot = (machine->kernel_filename != NULL);
 
+    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
+
     /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
@@ -832,6 +906,11 @@ void pc_memory_init(PCMachineState *pcms,
                              0, x86ms->below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
+
+    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
+        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+    }
+
     if (x86ms->above_4g_mem_size > 0) {
         ram_above_4g = g_malloc(sizeof(*ram_above_4g));
         memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 9911d7c8711b..1acebc569b02 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
 #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
                          (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
                          (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
+#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
+                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
+                         (vendor[2]) == CPUID_VENDOR_AMD_3)
+
 
 #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
-- 
2.17.2




* [PATCH v3 5/6] i386/pc: warn if phys-bits is too low
  2022-02-23 18:44 [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (3 preceding siblings ...)
  2022-02-23 18:44 ` [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable Joao Martins
@ 2022-02-23 18:44 ` Joao Martins
  2022-02-24 14:42   ` Joao Martins
  2022-02-23 18:44 ` [PATCH v3 6/6] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
  5 siblings, 1 reply; 29+ messages in thread
From: Joao Martins @ 2022-02-23 18:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

The default phys-bits on QEMU is TCG_PHYS_BITS (40), which is enough
to address 1Tb (0xff ffff ffff). On AMD platforms, if a
ram-above-4g relocation happens and the CPU wasn't configured
with big enough phys-bits, warn the user. It isn't exactly a
catastrophic failure: the guest will still boot, but most
likely won't be able to use more than ~4G of RAM.
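
For example (an illustrative command line, not prescribed by this patch), a
guest hitting the warning could be given enough address bits with:

    qemu-system-x86_64 -m 1024G -cpu EPYC,phys-bits=41 ...

or with -cpu host,host-phys-bits=on so the guest mirrors the host's physical
address bits.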

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 6e4f5c87a2e5..11598a0a39e4 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -888,6 +888,7 @@ void pc_memory_init(PCMachineState *pcms,
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr maxphysaddr, maxusedaddr;
 
     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
@@ -896,6 +897,15 @@ void pc_memory_init(PCMachineState *pcms,
 
     x86_update_above_4g_mem_start(pcms, pci_hole64_size);
 
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    maxusedaddr = x86_max_phys_addr(pcms, pci_hole64_size);
+    if (maxphysaddr < maxusedaddr) {
+        warn_report("Address space above 4G at %"PRIx64"-%"PRIx64
+                    " phys-bits too low (%u)",
+                    x86ms->above_4g_mem_start, maxusedaddr,
+                    X86_CPU(first_cpu)->phys_bits);
+    }
+
     /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
-- 
2.17.2




* [PATCH v3 6/6] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-02-23 18:44 [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (4 preceding siblings ...)
  2022-02-23 18:44 ` [PATCH v3 5/6] i386/pc: warn if phys-bits is too low Joao Martins
@ 2022-02-23 18:44 ` Joao Martins
  5 siblings, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-23 18:44 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

The added enforcement is only relevant in the case of AMD, where the
range right before the 1TB is restricted and cannot be DMA mapped
by the kernel, consequently leading to IOMMU INVALID_DEVICE_REQUEST
or possibly other kinds of IOMMU events in the AMD IOMMU.

There's also a case where it may make sense to disable the
IOVA relocation/validation: when migrating from a
non-valid-IOVA-aware QEMU to one that supports it.

Relocating RAM regions to after the 1Tb hole has consequences for
guest ABI because we are changing the memory mapping, so make
sure that only new machine types enforce it, but not older ones.
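
As an illustration (hypothetical invocations on an AMD host with an IOMMU,
assuming a guest large enough to cross the HT hole), the relocation then only
applies to the new machine types:

    qemu-system-x86_64 -machine pc-q35-7.0 -m 1100G ...   (ram-above-4g starts at 1T)
    qemu-system-x86_64 -machine pc-q35-6.2 -m 1100G ...   (keeps the old layout at 4G)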

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c         | 6 ++++++
 hw/i386/pc_piix.c    | 2 ++
 hw/i386/pc_q35.c     | 2 ++
 include/hw/i386/pc.h | 1 +
 4 files changed, 11 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 11598a0a39e4..ef0a5325f98a 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -859,9 +859,14 @@ static hwaddr x86_max_phys_addr(PCMachineState *pcms,
 static void x86_update_above_4g_mem_start(PCMachineState *pcms,
                                           uint64_t pci_hole64_size)
 {
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
     uint32_t eax, vendor[3];
 
+    if (!pcmc->enforce_valid_iova) {
+        return;
+    }
+
     host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
     if (!IS_AMD_VENDOR(vendor)) {
         return;
@@ -1804,6 +1809,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
     pcmc->has_reserved_memory = true;
     pcmc->kvmclock_enabled = true;
     pcmc->enforce_aligned_dimm = true;
+    pcmc->enforce_valid_iova = true;
     /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
      * to be used at the moment, 32K should be enough for a while.  */
     pcmc->acpi_data_size = 0x20000 + 0x8000;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 5a608e30e28f..c322f9494384 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -435,9 +435,11 @@ DEFINE_I440FX_MACHINE(v7_0, "pc-i440fx-7.0", NULL,
 
 static void pc_i440fx_6_2_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_i440fx_7_0_machine_options(m);
     m->alias = NULL;
     m->is_default = false;
+    pcmc->enforce_valid_iova = false;
     compat_props_add(m->compat_props, hw_compat_6_2, hw_compat_6_2_len);
     compat_props_add(m->compat_props, pc_compat_6_2, pc_compat_6_2_len);
 }
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index c81d21d1ebb4..53ed6df1e0e0 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -379,8 +379,10 @@ DEFINE_Q35_MACHINE(v7_0, "pc-q35-7.0", NULL,
 
 static void pc_q35_6_2_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_q35_7_0_machine_options(m);
     m->alias = NULL;
+    pcmc->enforce_valid_iova = false;
     compat_props_add(m->compat_props, hw_compat_6_2, hw_compat_6_2_len);
     compat_props_add(m->compat_props, pc_compat_6_2, pc_compat_6_2_len);
 }
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index d8b9c4ebd748..914340750498 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -117,6 +117,7 @@ struct PCMachineClass {
     bool has_reserved_memory;
     bool enforce_aligned_dimm;
     bool broken_reserved_end;
+    bool enforce_valid_iova;
 
     /* generate legacy CPU hotplug AML */
     bool legacy_cpu_hotplug;
-- 
2.17.2




* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-23 18:44 ` [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable Joao Martins
@ 2022-02-23 21:22   ` Michael S. Tsirkin
  2022-02-23 23:35     ` Joao Martins
  2022-02-24 14:27   ` Joao Martins
  1 sibling, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2022-02-23 21:22 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Richard Henderson, qemu-devel, Daniel Jordan,
	David Edmondson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Igor Mammedov, Suravee Suthikulpanit

On Wed, Feb 23, 2022 at 06:44:53PM +0000, Joao Martins wrote:
> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, except for a
> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

Could you point me to which driver then reserves the
other regions on Linux for AMD platforms?

> 
> [*] there is another reserved region unrelated to HT that exists
> at the 256T boundary in Fam 17h according to Errata #1286,
> documented also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done, we also add the reserved HT
> e820 range as reserved.
> 
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c      | 79 +++++++++++++++++++++++++++++++++++++++++++++++
>  target/i386/cpu.h |  4 +++
>  2 files changed, 83 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 360f4e10001b..6e4f5c87a2e5 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -802,6 +802,78 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
> +                                uint64_t pci_hole64_size)
> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    MachineState *machine = MACHINE(pcms);
> +    ram_addr_t device_mem_size = 0;
> +    hwaddr base;
> +
> +    if (pcmc->has_reserved_memory &&
> +       (machine->ram_size < machine->maxram_size)) {
> +        device_mem_size = machine->maxram_size - machine->ram_size;
> +    }
> +
> +    base = ROUND_UP(x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> +                    pcms->sgx_epc.size, 1 * GiB);
> +
> +    return base + device_mem_size + pci_hole64_size;
> +}
> +
> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> +                                          uint64_t pci_hole64_size)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    uint32_t eax, vendor[3];
> +
> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> +    if (!IS_AMD_VENDOR(vendor)) {
> +        return;
> +    }

Wait a sec, should this actually be tying things to the host CPU ID?
It's really about what we present to the guest though,
isn't it?

Also, can't we tie this to whether the AMD IOMMU is present?


> +
> +    if (x86_max_phys_addr(pcms, pci_hole64_size) < AMD_HT_START) {
> +        return;
> +    }
> +
> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -822,6 +894,8 @@ void pc_memory_init(PCMachineState *pcms,
>  
>      linux_boot = (machine->kernel_filename != NULL);
>  
> +    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
> +
>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -832,6 +906,11 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +
> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +    }
> +
>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 9911d7c8711b..1acebc569b02 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
> +
>  
>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
> -- 
> 2.17.2




* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-23 21:22   ` Michael S. Tsirkin
@ 2022-02-23 23:35     ` Joao Martins
  2022-02-24 16:07       ` Joao Martins
  0 siblings, 1 reply; 29+ messages in thread
From: Joao Martins @ 2022-02-23 23:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Richard Henderson, qemu-devel, Daniel Jordan,
	David Edmondson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Igor Mammedov, Suravee Suthikulpanit

On 2/23/22 21:22, Michael S. Tsirkin wrote:
> On Wed, Feb 23, 2022 at 06:44:53PM +0000, Joao Martins wrote:
>> It is assumed that the whole GPA space is available to be DMA
>> addressable, within a given address space limit, except for a
>> tiny region before the 4G. Since Linux v5.4, VFIO validates
>> whether the selected GPA is indeed valid i.e. not reserved by
>> IOMMU on behalf of some specific devices or platform-defined
>> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>>  -EINVAL.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may only have these ranges as allowed:
>>
>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
>>
>> We already account for the 4G hole, albeit if the guest is big
>> enough we will fail to allocate a guest with  >1010G due to the
>> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> Could you point me to which driver then reserves the
> other regions on Linux for AMD platforms?
> 
It's only two regions. The 4G hole, whose use is the same on AMD[0] and Intel[1]
(part of that hole is the IOMMU MSI reserved range), and the 1T hole, which is
reserved for HyperTransport[2]. This is hardware behaviour, so drivers just mark
them reserved and avoid using them at all.

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/amd/iommu.c#n2203
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/intel/iommu.c#n5328
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/amd/iommu.c#n2210

Now, the 256T hole on AMD isn't reserved anywhere, and the only code reference I
can give you is the KVM selftests, which had issues before[4] that Paolo fixed.
The errata also gives a glimpse[3].

[3] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
[4]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c8cc43c1eae2910ac96daa4216e0fb3391ad0504

>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
>> +                                uint64_t pci_hole64_size)
>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    MachineState *machine = MACHINE(pcms);
>> +    ram_addr_t device_mem_size = 0;
>> +    hwaddr base;
>> +
>> +    if (pcmc->has_reserved_memory &&
>> +       (machine->ram_size < machine->maxram_size)) {
>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>> +    }
>> +
>> +    base = ROUND_UP(x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>> +                    pcms->sgx_epc.size, 1 * GiB);
>> +
>> +    return base + device_mem_size + pci_hole64_size;
>> +}
>> +
>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>> +                                          uint64_t pci_hole64_size)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    uint32_t eax, vendor[3];
>> +
>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>> +    if (!IS_AMD_VENDOR(vendor)) {
>> +        return;
>> +    }
> 
> Wait a sec, should this actually be tying things to the host CPU ID?
> It's really about what we present to the guest though,
> isn't it?
> 

Using cpuid was the easier catch-all, without going into
Linux UAPI specifics. But it doesn't have to be tied to that; it is only
relevant for systems with an IOMMU present.

> Also, can't we tie this to whether the AMD IOMMU is present?
> 
I think so, I can add that. Something like an amd_iommu_exists() helper
in util/vfio-helpers.c which checks whether there are any sysfs child entries
that start with ivhd in /sys/class/iommu/. Given that this HT region has been
hardcoded in the IOMMU reserved regions since v4.11 (up to latest), I don't
think it's even worth checking that the range exists in:

	/sys/kernel/iommu_groups/0/reserved_regions

(Also that sysfs ABI is >= 4.11 only)
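
A minimal sketch of what such a helper could look like (assuming GLib's GDir
API and the /sys/class/iommu layout described above; the name and placement
are illustrative, not the final patch):

#include "qemu/osdep.h"

/*
 * Illustrative sketch: returns true if any entry under /sys/class/iommu
 * starts with "ivhd", which is how the AMD IOMMU driver names its devices.
 */
static bool amd_iommu_exists(void)
{
    GDir *dir = g_dir_open("/sys/class/iommu", 0, NULL);
    const char *name;
    bool found = false;

    if (!dir) {
        return false;
    }
    while ((name = g_dir_read_name(dir)) != NULL) {
        if (g_str_has_prefix(name, "ivhd")) {
            found = true;
            break;
        }
    }
    g_dir_close(dir);
    return found;
}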



* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-23 18:44 ` [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable Joao Martins
  2022-02-23 21:22   ` Michael S. Tsirkin
@ 2022-02-24 14:27   ` Joao Martins
  1 sibling, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-24 14:27 UTC (permalink / raw)
  To: qemu-devel, Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Ani Sinha,
	Paolo Bonzini, Suravee Suthikulpanit

On 2/23/22 18:44, Joao Martins wrote:
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 360f4e10001b..6e4f5c87a2e5 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -802,6 +802,78 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
> +                                uint64_t pci_hole64_size)
> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    MachineState *machine = MACHINE(pcms);
> +    ram_addr_t device_mem_size = 0;
> +    hwaddr base;
> +

I am adding this extra check for 32-bit coverage (phys-bits=32)

+    if (!x86ms->above_4g_mem_size) {
+        /*
+         * 32-bit pci hole goes from
+         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+         */
+        return IO_APIC_DEFAULT_ADDRESS - 1;
+    }



* Re: [PATCH v3 5/6] i386/pc: warn if phys-bits is too low
  2022-02-23 18:44 ` [PATCH v3 5/6] i386/pc: warn if phys-bits is too low Joao Martins
@ 2022-02-24 14:42   ` Joao Martins
  0 siblings, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-24 14:42 UTC (permalink / raw)
  To: qemu-devel, Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Ani Sinha,
	Paolo Bonzini, Suravee Suthikulpanit

On 2/23/22 18:44, Joao Martins wrote:
> @@ -896,6 +897,15 @@ void pc_memory_init(PCMachineState *pcms,
>  
>      x86_update_above_4g_mem_start(pcms, pci_hole64_size);
>  
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    maxusedaddr = x86_max_phys_addr(pcms, pci_hole64_size);
> +    if (maxphysaddr < maxusedaddr) {
> +        warn_report("Address space above 4G at %"PRIx64"-%"PRIx64
> +                    " phys-bits too low (%u)",
> +                    x86ms->above_4g_mem_start, maxusedaddr,
> +                    X86_CPU(first_cpu)->phys_bits);
> +    }
> +
And in addition to the change in patch 4, for 32-bit I will change this
to an error_report(...) and exit right after, updating the commit message
accordingly. The error message also changes slightly, given that it was
too specific to the above-4G region. All qtests pass.

The diff is below:

@@ -904,6 +905,16 @@ void pc_memory_init(PCMachineState *pcms,

     x86_update_above_4g_mem_start(pcms, pci_hole64_size);

+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    maxusedaddr = x86_max_phys_addr(pcms, pci_hole64_size);
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u)",
+                     maxphysaddr, maxusedaddr,
+                     X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
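
For the record, with e.g. phys-bits=40 and a guest whose RAM gets relocated above
1T, that would print something along these lines (addresses made up purely for
illustration):

    qemu-system-x86_64: Address space limit 0xffffffffff < 0x11fffffffff phys-bits too low (40)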


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-23 23:35     ` Joao Martins
@ 2022-02-24 16:07       ` Joao Martins
  2022-02-24 17:23         ` Michael S. Tsirkin
  0 siblings, 1 reply; 29+ messages in thread
From: Joao Martins @ 2022-02-24 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Richard Henderson, qemu-devel, Daniel Jordan,
	David Edmondson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Igor Mammedov, Suravee Suthikulpanit

On 2/23/22 23:35, Joao Martins wrote:
> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>> +                                          uint64_t pci_hole64_size)
>>> +{
>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>> +    uint32_t eax, vendor[3];
>>> +
>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>> +        return;
>>> +    }
>>
>> Wait a sec, should this actually be tying things to the host CPU ID?
>> It's really about what we present to the guest though,
>> isn't it?
> 
> It was the easier catch all to use cpuid without going into
> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> for systems with an IOMMU present.
> 
>> Also, can't we tie this to whether the AMD IOMMU is present?
>>
> I think so, I can add that. Something like a amd_iommu_exists() helper
> in util/vfio-helpers.c which checks if there's any sysfs child entries
> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
> even worth checking the range exists in:
> 
> 	/sys/kernel/iommu_groups/0/reserved_regions
> 
> (Also that sysfs ABI is >= 4.11 only)

Here's what I have staged in my local tree, to address your comment.

Naturally, the first chunk is what's affected by this patch; the rest is a
predecessor patch to introduce qemu_amd_iommu_is_present(). It seems to pass
all the tests and whatnot.

I am not entirely sure this is the right place to put such a helper, so I'm open
to suggestions. Wrt the naming of the helper, I tried to follow the rest of the
file's style.

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index a9be5d33a291..2ea4430d5dcc 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
                                           uint64_t pci_hole64_size)
 {
     X86MachineState *x86ms = X86_MACHINE(pcms);
-    uint32_t eax, vendor[3];

-    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
-    if (!IS_AMD_VENDOR(vendor)) {
+    if (!qemu_amd_iommu_is_present()) {
         return;
     }

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 7bcce3bceb0f..eb4ea071ecec 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
  */
 size_t qemu_get_host_physmem(void);

+/**
+ * qemu_amd_iommu_is_present:
+ *
+ * Operating system agnostic way of querying if an AMD IOMMU
+ * is present.
+ *
+ */
+bool qemu_amd_iommu_is_present(void);
+
 /*
  * Toggle write/execute on the pages marked MAP_JIT
  * for the current thread.
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f2be7321c59f..54cef21217c4 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
 #endif
     return 0;
 }
+
+bool qemu_amd_iommu_is_present(void)
+{
+    bool found = false;
+#ifdef CONFIG_LINUX
+    struct dirent *entry;
+    char *path;
+    DIR *dir;
+
+    path = g_strdup_printf("/sys/class/iommu");
+    dir = opendir(path);
+    if (!dir) {
+        g_free(path);
+        return found;
+    }
+
+    do {
+            entry = readdir(dir);
+            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
+                found = true;
+                break;
+            }
+    } while (entry);
+
+    g_free(path);
+    closedir(dir);
+#endif
+    return found;
+}
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index af559ef3398d..c08826e7e19b 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -652,3 +652,8 @@ size_t qemu_get_host_physmem(void)
     }
     return 0;
 }
+
+bool qemu_amd_iommu_is_present(void)
+{
+    return false;
+}
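
Just as a sketch of an alternative (not what I have staged), a GLib-flavoured
variant of the scan would avoid the raw dirent handling:

bool qemu_amd_iommu_is_present(void)
{
    bool found = false;
#ifdef CONFIG_LINUX
    const char *name;
    GDir *dir;

    /* look for the ivhd* entries the AMD IOMMU driver registers in sysfs */
    dir = g_dir_open("/sys/class/iommu", 0, NULL);
    if (!dir) {
        return false;
    }

    while ((name = g_dir_read_name(dir)) != NULL) {
        if (g_str_has_prefix(name, "ivhd")) {
            found = true;
            break;
        }
    }

    g_dir_close(dir);
#endif
    return found;
}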


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 16:07       ` Joao Martins
@ 2022-02-24 17:23         ` Michael S. Tsirkin
  2022-02-24 17:54           ` Joao Martins
  0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2022-02-24 17:23 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Richard Henderson, qemu-devel, Daniel Jordan,
	David Edmondson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Igor Mammedov, Suravee Suthikulpanit

On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> On 2/23/22 23:35, Joao Martins wrote:
> > On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>> +                                          uint64_t pci_hole64_size)
> >>> +{
> >>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>> +    uint32_t eax, vendor[3];
> >>> +
> >>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>> +        return;
> >>> +    }
> >>
> >> Wait a sec, should this actually be tying things to the host CPU ID?
> >> It's really about what we present to the guest though,
> >> isn't it?
> > 
> > It was the easier catch all to use cpuid without going into
> > Linux UAPI specifics. But it doesn't have to tie in there, it is only
> > for systems with an IOMMU present.
> > 
> >> Also, can't we tie this to whether the AMD IOMMU is present?
> >>
> > I think so, I can add that. Something like a amd_iommu_exists() helper
> > in util/vfio-helpers.c which checks if there's any sysfs child entries
> > that start with ivhd in /sys/class/iommu/. Given that this HT region is
> > hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
> > even worth checking the range exists in:
> > 
> > 	/sys/kernel/iommu_groups/0/reserved_regions
> > 
> > (Also that sysfs ABI is >= 4.11 only)
> 
> Here's what I have staged in local tree, to address your comment.
> 
> Naturally the first chunk is what's affected by this patch the rest is a
> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> all the tests and what not.
> 
> I am not entirely sure this is the right place to put such a helper, open
> to suggestions. wrt to the naming of the helper, I tried to follow the rest
> of the file's style.
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index a9be5d33a291..2ea4430d5dcc 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>                                            uint64_t pci_hole64_size)
>  {
>      X86MachineState *x86ms = X86_MACHINE(pcms);
> -    uint32_t eax, vendor[3];
> 
> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> -    if (!IS_AMD_VENDOR(vendor)) {
> +    if (!qemu_amd_iommu_is_present()) {
>          return;
>      }
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 7bcce3bceb0f..eb4ea071ecec 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>   */
>  size_t qemu_get_host_physmem(void);
> 
> +/**
> + * qemu_amd_iommu_is_present:
> + *
> + * Operating system agnostic way of querying if an AMD IOMMU
> + * is present.
> + *
> + */
> +bool qemu_amd_iommu_is_present(void);
> +
>  /*
>   * Toggle write/execute on the pages marked MAP_JIT
>   * for the current thread.
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index f2be7321c59f..54cef21217c4 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>  #endif
>      return 0;
>  }
> +
> +bool qemu_amd_iommu_is_present(void)
> +{
> +    bool found = false;
> +#ifdef CONFIG_LINUX
> +    struct dirent *entry;
> +    char *path;
> +    DIR *dir;
> +
> +    path = g_strdup_printf("/sys/class/iommu");
> +    dir = opendir(path);
> +    if (!dir) {
> +        g_free(path);
> +        return found;
> +    }
> +
> +    do {
> +            entry = readdir(dir);
> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> +                found = true;
> +                break;
> +            }
> +    } while (entry);
> +
> +    g_free(path);
> +    closedir(dir);
> +#endif
> +    return found;
> +}

why are we checking whether an AMD IOMMU is present
on the host? Isn't what we care about whether it's
present in the VM? What we are reserving after all
is a range of GPAs, not actual host PA's ...



> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index af559ef3398d..c08826e7e19b 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -652,3 +652,8 @@ size_t qemu_get_host_physmem(void)
>      }
>      return 0;
>  }
> +
> +bool qemu_amd_iommu_is_present(void)
> +{
> +    return false;
> +}



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 17:23         ` Michael S. Tsirkin
@ 2022-02-24 17:54           ` Joao Martins
  2022-02-24 18:30             ` Michael S. Tsirkin
  0 siblings, 1 reply; 29+ messages in thread
From: Joao Martins @ 2022-02-24 17:54 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Richard Henderson, qemu-devel, Daniel Jordan,
	David Edmondson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Igor Mammedov, Suravee Suthikulpanit

On 2/24/22 17:23, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>> On 2/23/22 23:35, Joao Martins wrote:
>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>> +                                          uint64_t pci_hole64_size)
>>>>> +{
>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>> +    uint32_t eax, vendor[3];
>>>>> +
>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>> +        return;
>>>>> +    }
>>>>
>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>> It's really about what we present to the guest though,
>>>> isn't it?
>>>
>>> It was the easier catch all to use cpuid without going into
>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
>>> for systems with an IOMMU present.
>>>
>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>
>>> I think so, I can add that. Something like a amd_iommu_exists() helper
>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
>>> even worth checking the range exists in:
>>>
>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>
>>> (Also that sysfs ABI is >= 4.11 only)
>>
>> Here's what I have staged in local tree, to address your comment.
>>
>> Naturally the first chunk is what's affected by this patch the rest is a
>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>> all the tests and what not.
>>
>> I am not entirely sure this is the right place to put such a helper, open
>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
>> of the file's style.
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index a9be5d33a291..2ea4430d5dcc 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>                                            uint64_t pci_hole64_size)
>>  {
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>> -    uint32_t eax, vendor[3];
>>
>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>> -    if (!IS_AMD_VENDOR(vendor)) {
>> +    if (!qemu_amd_iommu_is_present()) {
>>          return;
>>      }
>>
>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>> index 7bcce3bceb0f..eb4ea071ecec 100644
>> --- a/include/qemu/osdep.h
>> +++ b/include/qemu/osdep.h
>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>>   */
>>  size_t qemu_get_host_physmem(void);
>>
>> +/**
>> + * qemu_amd_iommu_is_present:
>> + *
>> + * Operating system agnostic way of querying if an AMD IOMMU
>> + * is present.
>> + *
>> + */
>> +bool qemu_amd_iommu_is_present(void);
>> +
>>  /*
>>   * Toggle write/execute on the pages marked MAP_JIT
>>   * for the current thread.
>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>> index f2be7321c59f..54cef21217c4 100644
>> --- a/util/oslib-posix.c
>> +++ b/util/oslib-posix.c
>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>>  #endif
>>      return 0;
>>  }
>> +
>> +bool qemu_amd_iommu_is_present(void)
>> +{
>> +    bool found = false;
>> +#ifdef CONFIG_LINUX
>> +    struct dirent *entry;
>> +    char *path;
>> +    DIR *dir;
>> +
>> +    path = g_strdup_printf("/sys/class/iommu");
>> +    dir = opendir(path);
>> +    if (!dir) {
>> +        g_free(path);
>> +        return found;
>> +    }
>> +
>> +    do {
>> +            entry = readdir(dir);
>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
>> +                found = true;
>> +                break;
>> +            }
>> +    } while (entry);
>> +
>> +    g_free(path);
>> +    closedir(dir);
>> +#endif
>> +    return found;
>> +}
> 
> why are we checking whether an AMD IOMMU is present
> on the host? 
> Isn't what we care about whether it's
> present in the VM? What we are reserving after all
> is a range of GPAs, not actual host PA's ...
> 
Right.

But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
so we need to avoid mapping that portion of the address space entirely
and use some other portion of the GPA space. This has to do with the
host IOMMU, which is where the DMA mapping of guest PAs takes place.
So, if you do not have a host IOMMU, you can't service guest DMA/PCIe
via VFIO through the host IOMMU, and therefore you don't have this
problem.

Whether one uses a guest IOMMU or not does not affect the result; that
would be more a case of mimicking real hardware than fixing the issue
this series addresses.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 17:54           ` Joao Martins
@ 2022-02-24 18:30             ` Michael S. Tsirkin
  2022-02-24 19:44               ` Joao Martins
  2022-02-25  3:52               ` Jason Wang
  0 siblings, 2 replies; 29+ messages in thread
From: Michael S. Tsirkin @ 2022-02-24 18:30 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> >> On 2/23/22 23:35, Joao Martins wrote:
> >>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>> +                                          uint64_t pci_hole64_size)
> >>>>> +{
> >>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>> +    uint32_t eax, vendor[3];
> >>>>> +
> >>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>> +        return;
> >>>>> +    }
> >>>>
> >>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>> It's really about what we present to the guest though,
> >>>> isn't it?
> >>>
> >>> It was the easier catch all to use cpuid without going into
> >>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> >>> for systems with an IOMMU present.
> >>>
> >>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>
> >>> I think so, I can add that. Something like a amd_iommu_exists() helper
> >>> in util/vfio-helpers.c which checks if there's any sysfs child entries
> >>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> >>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
> >>> even worth checking the range exists in:
> >>>
> >>> 	/sys/kernel/iommu_groups/0/reserved_regions
> >>>
> >>> (Also that sysfs ABI is >= 4.11 only)
> >>
> >> Here's what I have staged in local tree, to address your comment.
> >>
> >> Naturally the first chunk is what's affected by this patch the rest is a
> >> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> >> all the tests and what not.
> >>
> >> I am not entirely sure this is the right place to put such a helper, open
> >> to suggestions. wrt to the naming of the helper, I tried to follow the rest
> >> of the file's style.
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index a9be5d33a291..2ea4430d5dcc 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>                                            uint64_t pci_hole64_size)
> >>  {
> >>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >> -    uint32_t eax, vendor[3];
> >>
> >> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >> -    if (!IS_AMD_VENDOR(vendor)) {
> >> +    if (!qemu_amd_iommu_is_present()) {
> >>          return;
> >>      }
> >>
> >> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >> index 7bcce3bceb0f..eb4ea071ecec 100644
> >> --- a/include/qemu/osdep.h
> >> +++ b/include/qemu/osdep.h
> >> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>   */
> >>  size_t qemu_get_host_physmem(void);
> >>
> >> +/**
> >> + * qemu_amd_iommu_is_present:
> >> + *
> >> + * Operating system agnostic way of querying if an AMD IOMMU
> >> + * is present.
> >> + *
> >> + */
> >> +bool qemu_amd_iommu_is_present(void);
> >> +
> >>  /*
> >>   * Toggle write/execute on the pages marked MAP_JIT
> >>   * for the current thread.
> >> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >> index f2be7321c59f..54cef21217c4 100644
> >> --- a/util/oslib-posix.c
> >> +++ b/util/oslib-posix.c
> >> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>  #endif
> >>      return 0;
> >>  }
> >> +
> >> +bool qemu_amd_iommu_is_present(void)
> >> +{
> >> +    bool found = false;
> >> +#ifdef CONFIG_LINUX
> >> +    struct dirent *entry;
> >> +    char *path;
> >> +    DIR *dir;
> >> +
> >> +    path = g_strdup_printf("/sys/class/iommu");
> >> +    dir = opendir(path);
> >> +    if (!dir) {
> >> +        g_free(path);
> >> +        return found;
> >> +    }
> >> +
> >> +    do {
> >> +            entry = readdir(dir);
> >> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >> +                found = true;
> >> +                break;
> >> +            }
> >> +    } while (entry);
> >> +
> >> +    g_free(path);
> >> +    closedir(dir);
> >> +#endif
> >> +    return found;
> >> +}
> > 
> > why are we checking whether an AMD IOMMU is present
> > on the host? 
> > Isn't what we care about whether it's
> > present in the VM? What we are reserving after all
> > is a range of GPAs, not actual host PA's ...
> > 
> Right.
> 
> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> and so we need to not map that portion of address space entirely
> and use some other portion of the GPA-space. This has to
> do with host IOMMU which is where the DMA mapping of guest PA takes
> place. So, if you do not have an host IOMMU, you can't
> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
> don't need this problem.
> 
> Whether one uses a guest IOMMU or not does not affect the result,
> it would be more of a case of mimicking real hardware not fixing
> the issue of this series.


Hmm okay ... my first reaction was to say let's put it under VFIO then.
And ideally move the logic reporting reserved ranges there too.
However, I think vdpa has the same issue too.
CC Jason for an opinion.
Also, some concerns
- I think this changes memory layout for working VMs not using VFIO.
  Better preserve the old layout for old machine types?

- You mention the phys bits issue very briefly, and it's pretty
  concerning.  Do we maybe want to also disable the work around if phys
  bits is too small? Also, we'd need to test a bunch of old
  guests to see what happens. Which guests were tested? 

-- 
MST



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 18:30             ` Michael S. Tsirkin
@ 2022-02-24 19:44               ` Joao Martins
  2022-02-24 19:54                 ` Michael S. Tsirkin
  2022-02-25  3:52               ` Jason Wang
  1 sibling, 1 reply; 29+ messages in thread
From: Joao Martins @ 2022-02-24 19:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On 2/24/22 18:30, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>> +{
>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>> +
>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>
>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>> It's really about what we present to the guest though,
>>>>>> isn't it?
>>>>>
>>>>> It was the easier catch all to use cpuid without going into
>>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
>>>>> for systems with an IOMMU present.
>>>>>
>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>
>>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
>>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
>>>>> even worth checking the range exists in:
>>>>>
>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>>>
>>>>> (Also that sysfs ABI is >= 4.11 only)
>>>>
>>>> Here's what I have staged in local tree, to address your comment.
>>>>
>>>> Naturally the first chunk is what's affected by this patch the rest is a
>>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>>>> all the tests and what not.
>>>>
>>>> I am not entirely sure this is the right place to put such a helper, open
>>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
>>>> of the file's style.
>>>>
>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>> index a9be5d33a291..2ea4430d5dcc 100644
>>>> --- a/hw/i386/pc.c
>>>> +++ b/hw/i386/pc.c
>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>                                            uint64_t pci_hole64_size)
>>>>  {
>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>> -    uint32_t eax, vendor[3];
>>>>
>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>> -    if (!IS_AMD_VENDOR(vendor)) {
>>>> +    if (!qemu_amd_iommu_is_present()) {
>>>>          return;
>>>>      }
>>>>
>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
>>>> --- a/include/qemu/osdep.h
>>>> +++ b/include/qemu/osdep.h
>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>>>>   */
>>>>  size_t qemu_get_host_physmem(void);
>>>>
>>>> +/**
>>>> + * qemu_amd_iommu_is_present:
>>>> + *
>>>> + * Operating system agnostic way of querying if an AMD IOMMU
>>>> + * is present.
>>>> + *
>>>> + */
>>>> +bool qemu_amd_iommu_is_present(void);
>>>> +
>>>>  /*
>>>>   * Toggle write/execute on the pages marked MAP_JIT
>>>>   * for the current thread.
>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>>> index f2be7321c59f..54cef21217c4 100644
>>>> --- a/util/oslib-posix.c
>>>> +++ b/util/oslib-posix.c
>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>>>>  #endif
>>>>      return 0;
>>>>  }
>>>> +
>>>> +bool qemu_amd_iommu_is_present(void)
>>>> +{
>>>> +    bool found = false;
>>>> +#ifdef CONFIG_LINUX
>>>> +    struct dirent *entry;
>>>> +    char *path;
>>>> +    DIR *dir;
>>>> +
>>>> +    path = g_strdup_printf("/sys/class/iommu");
>>>> +    dir = opendir(path);
>>>> +    if (!dir) {
>>>> +        g_free(path);
>>>> +        return found;
>>>> +    }
>>>> +
>>>> +    do {
>>>> +            entry = readdir(dir);
>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
>>>> +                found = true;
>>>> +                break;
>>>> +            }
>>>> +    } while (entry);
>>>> +
>>>> +    g_free(path);
>>>> +    closedir(dir);
>>>> +#endif
>>>> +    return found;
>>>> +}
>>>
>>> why are we checking whether an AMD IOMMU is present
>>> on the host? 
>>> Isn't what we care about whether it's
>>> present in the VM? What we are reserving after all
>>> is a range of GPAs, not actual host PA's ...
>>>
>> Right.
>>
>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>> and so we need to not map that portion of address space entirely
>> and use some other portion of the GPA-space. This has to
>> do with host IOMMU which is where the DMA mapping of guest PA takes
>> place. So, if you do not have an host IOMMU, you can't
>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
>> don't need this problem.
>>
>> Whether one uses a guest IOMMU or not does not affect the result,
>> it would be more of a case of mimicking real hardware not fixing
>> the issue of this series.
> 
> 
> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> And ideally move the logic reporting reserved ranges there too.
> However, I think vdpa has the same issue too.
> CC Jason for an opinion.

It does have the same problem.

This is not specific to VFIO; it applies to anything that uses the IOMMU.
It's just that VFIO doesn't let you shoot yourself in the foot :)

vDPA would need to validate iova ranges as well to prevent mapping on
reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
map request is within a valid iova range"). Now you could argue that
it's an IOMMU core problem, maybe.

But I see this as a separate problem, because regardless of said
validation, guest provisioning is still going to fail for guests with
>=1010G of max GPA, and even if it doesn't fail you shouldn't be letting
the guest DMA map those ranges in the first place (e.g. you could see
spurious INVALID_DEVICE_REQUEST fault events from the IOMMU if DMA is
attempted over the first megabyte of that 1T hole).
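
(For reference, the type1 check from that commit is conceptually just a
containment test over the list of usable IOVA ranges -- paraphrased here from
memory, not the literal kernel code:

static bool dma_map_in_valid_iova(struct list_head *iova_list,
                                  dma_addr_t start, dma_addr_t end)
{
    struct vfio_iova *node;

    /* the request must fall entirely inside one usable range, i.e.
     * outside reserved regions such as the HT hole */
    list_for_each_entry(node, iova_list, list) {
        if (start >= node->start && end <= node->end) {
            return true;
        }
    }

    /* no ranges tracked == no restrictions */
    return list_empty(iova_list);
}

so any DMA_MAP touching FD_0000_0000h - FF_FFFF_FFFFh gets refused with -EINVAL.)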

> Also, some concerns
> - I think this changes memory layout for working VMs not using VFIO.
>   Better preserve the old layout for old machine types?
> 
Oh definitely, and I do that in this series. See the last commit, and
in the past it was also a machine-property exposed to userspace.
Otherwise I would be breaking (badly) compat/migration.

I would like to emphasize that these memory layout changes are *exclusive* to,
and happen *only* for, hosts on AMD with guest memory bigger than ~950G-~1010G.
It's not every guest, but a fairly limited subset of big-memory guests that
are not the norm.

The phys-bits property errors might give us a bit more trouble, though, so
it might help to be more conservative.
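
(To make the "exclusive" part concrete, the gist of the relocation -- simplified,
and glossing over the hotplug-memory and 64-bit PCI hole accounting -- is only:

    /* only if RAM above 4G would otherwise run into the HT hole does the
     * start of that region move up to 1T */
    if (x86ms->above_4g_mem_start + x86ms->above_4g_mem_size >= AMD_HT_START) {
        x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
    }

Anything below that threshold keeps the exact same layout as today.)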

> - You mention the phys bits issue very briefly, and it's pretty
>   concerning.  Do we maybe want to also disable the work around if phys
>   bits is too small? 

We are doing that here (well, in v4), as suggested by Igor. Note that in this series
it's a warning, but in v4 I have it as an error, together with the fixes for the
32-bit issues that I found through qtest.

I share the same concern as you over making this an error because of compatibility.
Perhaps we could go back to the previous version, turn it back into a warning, and
additionally even disable the relocation altogether. Or, if desired, even make it
depend on a machine property to enable it.
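
(Hypothetically -- the property name here is invented just for illustration -- that
could end up as an opt-in boolean, e.g.:

    -machine q35,amd-1tb-relocate=on

kept off for old machine types so existing layouts stay untouched.)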

> Also, we'd need to test a bunch of old
>   guests to see what happens. Which guests were tested? 
> 
Do you envision that old guests would run into (or care about) this sort of 1TB config?
I've mainly been testing more modern Linux guests (>= 4.14) over this last set of
versions, and in the past Windows 2012+. Let me be extra extra pedantic on this part
for the next submission (and report back if something odd happens).


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 19:44               ` Joao Martins
@ 2022-02-24 19:54                 ` Michael S. Tsirkin
  2022-02-24 20:04                   ` Joao Martins
  0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2022-02-24 19:54 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
> On 2/24/22 18:30, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> >> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> >>>> On 2/23/22 23:35, Joao Martins wrote:
> >>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>> +                                          uint64_t pci_hole64_size)
> >>>>>>> +{
> >>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>> +
> >>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>> +        return;
> >>>>>>> +    }
> >>>>>>
> >>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>>>> It's really about what we present to the guest though,
> >>>>>> isn't it?
> >>>>>
> >>>>> It was the easier catch all to use cpuid without going into
> >>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> >>>>> for systems with an IOMMU present.
> >>>>>
> >>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>>>
> >>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
> >>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
> >>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> >>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
> >>>>> even worth checking the range exists in:
> >>>>>
> >>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
> >>>>>
> >>>>> (Also that sysfs ABI is >= 4.11 only)
> >>>>
> >>>> Here's what I have staged in local tree, to address your comment.
> >>>>
> >>>> Naturally the first chunk is what's affected by this patch the rest is a
> >>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> >>>> all the tests and what not.
> >>>>
> >>>> I am not entirely sure this is the right place to put such a helper, open
> >>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
> >>>> of the file's style.
> >>>>
> >>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>> index a9be5d33a291..2ea4430d5dcc 100644
> >>>> --- a/hw/i386/pc.c
> >>>> +++ b/hw/i386/pc.c
> >>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>                                            uint64_t pci_hole64_size)
> >>>>  {
> >>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>> -    uint32_t eax, vendor[3];
> >>>>
> >>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>> -    if (!IS_AMD_VENDOR(vendor)) {
> >>>> +    if (!qemu_amd_iommu_is_present()) {
> >>>>          return;
> >>>>      }
> >>>>
> >>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>> index 7bcce3bceb0f..eb4ea071ecec 100644
> >>>> --- a/include/qemu/osdep.h
> >>>> +++ b/include/qemu/osdep.h
> >>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>>>   */
> >>>>  size_t qemu_get_host_physmem(void);
> >>>>
> >>>> +/**
> >>>> + * qemu_amd_iommu_is_present:
> >>>> + *
> >>>> + * Operating system agnostic way of querying if an AMD IOMMU
> >>>> + * is present.
> >>>> + *
> >>>> + */
> >>>> +bool qemu_amd_iommu_is_present(void);
> >>>> +
> >>>>  /*
> >>>>   * Toggle write/execute on the pages marked MAP_JIT
> >>>>   * for the current thread.
> >>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>> index f2be7321c59f..54cef21217c4 100644
> >>>> --- a/util/oslib-posix.c
> >>>> +++ b/util/oslib-posix.c
> >>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>>>  #endif
> >>>>      return 0;
> >>>>  }
> >>>> +
> >>>> +bool qemu_amd_iommu_is_present(void)
> >>>> +{
> >>>> +    bool found = false;
> >>>> +#ifdef CONFIG_LINUX
> >>>> +    struct dirent *entry;
> >>>> +    char *path;
> >>>> +    DIR *dir;
> >>>> +
> >>>> +    path = g_strdup_printf("/sys/class/iommu");
> >>>> +    dir = opendir(path);
> >>>> +    if (!dir) {
> >>>> +        g_free(path);
> >>>> +        return found;
> >>>> +    }
> >>>> +
> >>>> +    do {
> >>>> +            entry = readdir(dir);
> >>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >>>> +                found = true;
> >>>> +                break;
> >>>> +            }
> >>>> +    } while (entry);
> >>>> +
> >>>> +    g_free(path);
> >>>> +    closedir(dir);
> >>>> +#endif
> >>>> +    return found;
> >>>> +}
> >>>
> >>> why are we checking whether an AMD IOMMU is present
> >>> on the host? 
> >>> Isn't what we care about whether it's
> >>> present in the VM? What we are reserving after all
> >>> is a range of GPAs, not actual host PA's ...
> >>>
> >> Right.
> >>
> >> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> >> and so we need to not map that portion of address space entirely
> >> and use some other portion of the GPA-space. This has to
> >> do with host IOMMU which is where the DMA mapping of guest PA takes
> >> place. So, if you do not have an host IOMMU, you can't
> >> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
> >> don't need this problem.
> >>
> >> Whether one uses a guest IOMMU or not does not affect the result,
> >> it would be more of a case of mimicking real hardware not fixing
> >> the issue of this series.
> > 
> > 
> > Hmm okay ... my first reaction was to say let's put it under VFIO then.
> > And ideally move the logic reporting reserved ranges there too.
> > However, I think vdpa has the same issue too.
> > CC Jason for an opinion.
> 
> It does have the same problem.
> 
> This is not specific to VFIO, it's to anything that uses the iommu.

Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
that we don't have a global "enable-vfio" flag since vfio devices
can be hot-plugged. I think this is not the first time we wanted
something like this, right Alex?

> It's
> just that VFIO doesn't let you shoot in the foot :)
> 
> vDPA would need to validate iova ranges as well to prevent mapping on
> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
> map request is within a valid iova range"). Now you could argue that
> it's an IOMMU core problem, maybe.
> 
> But I this as a separate problem,
> because regardless of said validation, your guest provisioning
> is still going to fail for guests with >=1010G of max GPA and even if
> it doesn't fail you shouldn't be letting it DMA map those in the first
> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
> from IOMMU if DMA is attempted over the first megabyte of that 1T hole).

I wonder what's the status on a system without an IOMMU.

> > Also, some concerns
> > - I think this changes memory layout for working VMs not using VFIO.
> >   Better preserve the old layout for old machine types?
> > 
> Oh definitely, and I do that in this series. See the last commit, and
> in the past it was also a machine-property exposed to userspace.
> Otherwise I would be breaking (badly) compat/migration.
> 
> I would like to emphasize that these memory layout changes are *exclusive* and
> *only* for hosts on AMD with guests memory being bigger than ~950G-~1010G.
> It's not every guest, but a fairly limited subset of big-memory guests that
> are not the norm.
> 
> Albeit the phys-bits property errors might gives us a bit more trouble, so
> it might help being more conservative.
> 
> > - You mention the phys bits issue very briefly, and it's pretty
> >   concerning.  Do we maybe want to also disable the work around if phys
> >   bits is too small? 
> 
> We are doing that here (well, v4), as suggested by Igor. Note that in this series
> it's a warning, but v4 I have it as an error and with the 32-bit issues that
> I found through qtest.
> 
> I share the same concern as you over making this an error because of compatibility.
> Perhaps, we could go back to the previous version and turn back into a warning and
> additionally even disabling the relocation all together. Or if desired even
> depending on a machine-property to enable it.

I would be inclined to disable the relocation. And maybe block vfio? I'm
not 100% sure about that last one, but this can be a vfio decision to
make.

> > Also, we'd need to test a bunch of old
> >   guests to see what happens. Which guests were tested? 
> > 
> Do you envision that old guests would run (or care) into this sort of 1TB config? Mainly
> testing more modern guests on Linux (>= 4.14) over this last set of versions, and in the
> past Windows 2012+. Let me be extra extra pedantic on this part for the next submission
> (and report back if something odd happens).

Not 100% sure, but let's at least document it.

-- 
MST



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 19:54                 ` Michael S. Tsirkin
@ 2022-02-24 20:04                   ` Joao Martins
  2022-02-24 20:12                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 29+ messages in thread
From: Joao Martins @ 2022-02-24 20:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit



On 2/24/22 19:54, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>>>> +{
>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>> +
>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>> +        return;
>>>>>>>>> +    }
>>>>>>>>
>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>> It's really about what we present to the guest though,
>>>>>>>> isn't it?
>>>>>>>
>>>>>>> It was the easier catch all to use cpuid without going into
>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
>>>>>>> for systems with an IOMMU present.
>>>>>>>
>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>
>>>>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
>>>>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
>>>>>>> even worth checking the range exists in:
>>>>>>>
>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>
>>>>>>> (Also that sysfs ABI is >= 4.11 only)
>>>>>>
>>>>>> Here's what I have staged in local tree, to address your comment.
>>>>>>
>>>>>> Naturally the first chunk is what's affected by this patch the rest is a
>>>>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>>>>>> all the tests and what not.
>>>>>>
>>>>>> I am not entirely sure this is the right place to put such a helper, open
>>>>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
>>>>>> of the file's style.
>>>>>>
>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
>>>>>> --- a/hw/i386/pc.c
>>>>>> +++ b/hw/i386/pc.c
>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>                                            uint64_t pci_hole64_size)
>>>>>>  {
>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>> -    uint32_t eax, vendor[3];
>>>>>>
>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
>>>>>> +    if (!qemu_amd_iommu_is_present()) {
>>>>>>          return;
>>>>>>      }
>>>>>>
>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
>>>>>> --- a/include/qemu/osdep.h
>>>>>> +++ b/include/qemu/osdep.h
>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>>>>>>   */
>>>>>>  size_t qemu_get_host_physmem(void);
>>>>>>
>>>>>> +/**
>>>>>> + * qemu_amd_iommu_is_present:
>>>>>> + *
>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
>>>>>> + * is present.
>>>>>> + *
>>>>>> + */
>>>>>> +bool qemu_amd_iommu_is_present(void);
>>>>>> +
>>>>>>  /*
>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
>>>>>>   * for the current thread.
>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>>>>> index f2be7321c59f..54cef21217c4 100644
>>>>>> --- a/util/oslib-posix.c
>>>>>> +++ b/util/oslib-posix.c
>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>>>>>>  #endif
>>>>>>      return 0;
>>>>>>  }
>>>>>> +
>>>>>> +bool qemu_amd_iommu_is_present(void)
>>>>>> +{
>>>>>> +    bool found = false;
>>>>>> +#ifdef CONFIG_LINUX
>>>>>> +    struct dirent *entry;
>>>>>> +    char *path;
>>>>>> +    DIR *dir;
>>>>>> +
>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
>>>>>> +    dir = opendir(path);
>>>>>> +    if (!dir) {
>>>>>> +        g_free(path);
>>>>>> +        return found;
>>>>>> +    }
>>>>>> +
>>>>>> +    do {
>>>>>> +            entry = readdir(dir);
>>>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
>>>>>> +                found = true;
>>>>>> +                break;
>>>>>> +            }
>>>>>> +    } while (entry);
>>>>>> +
>>>>>> +    g_free(path);
>>>>>> +    closedir(dir);
>>>>>> +#endif
>>>>>> +    return found;
>>>>>> +}
>>>>>
>>>>> why are we checking whether an AMD IOMMU is present
>>>>> on the host? 
>>>>> Isn't what we care about whether it's
>>>>> present in the VM? What we are reserving after all
>>>>> is a range of GPAs, not actual host PA's ...
>>>>>
>>>> Right.
>>>>
>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>> and so we need to not map that portion of address space entirely
>>>> and use some other portion of the GPA-space. This has to
>>>> do with host IOMMU which is where the DMA mapping of guest PA takes
>>>> place. So, if you do not have an host IOMMU, you can't
>>>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
>>>> don't need this problem.
>>>>
>>>> Whether one uses a guest IOMMU or not does not affect the result,
>>>> it would be more of a case of mimicking real hardware not fixing
>>>> the issue of this series.
>>>
>>>
>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>> And ideally move the logic reporting reserved ranges there too.
>>> However, I think vdpa has the same issue too.
>>> CC Jason for an opinion.
>>
>> It does have the same problem.
>>
>> This is not specific to VFIO, it's to anything that uses the iommu.
> 
> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> that we don't have a global "enable-vfio" flag since vfio devices
> can be hot-plugged. I think this is not the first time we wanted
> something like this, right Alex?
> 
>> It's
>> just that VFIO doesn't let you shoot in the foot :)
>>
>> vDPA would need to validate iova ranges as well to prevent mapping on
>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
>> map request is within a valid iova range"). Now you could argue that
>> it's an IOMMU core problem, maybe.
>>
>> But I this as a separate problem,
>> because regardless of said validation, your guest provisioning
>> is still going to fail for guests with >=1010G of max GPA and even if
>> it doesn't fail you shouldn't be letting it DMA map those in the first
>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
>> from IOMMU if DMA is attempted over the first megabyte of that 1T hole).
> 
> I wonder what's the status on a system without an IOMMU.
> 
In theory you should be OK. It's also worth keeping in mind that AMD machines
with >=1T have this 12G hole marked as Reserved, so even DMA, at least when the
CPU is the initiator, should be fine in the worst-case scenario.

>>> Also, some concerns
>>> - I think this changes memory layout for working VMs not using VFIO.
>>>   Better preserve the old layout for old machine types?
>>>
>> Oh definitely, and I do that in this series. See the last commit, and
>> in the past it was also a machine-property exposed to userspace.
>> Otherwise I would be breaking (badly) compat/migration.
>>
>> I would like to emphasize that these memory layout changes are *exclusive* and
>> *only* for hosts on AMD with guests memory being bigger than ~950G-~1010G.
>> It's not every guest, but a fairly limited subset of big-memory guests that
>> are not the norm.
>>
>> Albeit the phys-bits property errors might gives us a bit more trouble, so
>> it might help being more conservative.
>>
>>> - You mention the phys bits issue very briefly, and it's pretty
>>>   concerning.  Do we maybe want to also disable the work around if phys
>>>   bits is too small? 
>>
>> We are doing that here (well, v4), as suggested by Igor. Note that in this series
>> it's a warning, but v4 I have it as an error and with the 32-bit issues that
>> I found through qtest.
>>
>> I share the same concern as you over making this an error because of compatibility.
>> Perhaps, we could go back to the previous version and turn back into a warning and
>> additionally even disabling the relocation all together. Or if desired even
>> depending on a machine-property to enable it.
> 
> I would be inclined to disable the relocation. And maybe block vfio? I'm
> not 100% sure about that last one, but this can be a vfio decision to
> make.
> 
I don't see this as a VFIO problem (VFIO is actually being a good citizen and doing the
right thing :)). From my perspective this fix is also useful for vDPA (which we also care
about), and thus will also let us avoid DMA mapping in these GPA ranges.

The reason I mention VFIO in the cover letter is that it's the one visible UAPI change
through which users will notice that their 1TB+ VMs suddenly break.

>>> Also, we'd need to test a bunch of old
>>>   guests to see what happens. Which guests were tested? 
>>>
>> Do you envision that old guests would run (or care) into this sort of 1TB config? Mainly
>> testing more modern guests on Linux (>= 4.14) over this last set of versions, and in the
>> past Windows 2012+. Let me be extra extra pedantic on this part for the next submission
>> (and report back if something odd happens).
> 
> Not 100% sure but let's at least document.
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 20:04                   ` Joao Martins
@ 2022-02-24 20:12                     ` Michael S. Tsirkin
  2022-02-24 20:34                       ` Joao Martins
  0 siblings, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2022-02-24 20:12 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
> 
> 
> On 2/24/22 19:54, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
> >> On 2/24/22 18:30, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> >>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> >>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> >>>>>> On 2/23/22 23:35, Joao Martins wrote:
> >>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>> +                                          uint64_t pci_hole64_size)
> >>>>>>>>> +{
> >>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>>>> +
> >>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>>> +        return;
> >>>>>>>>> +    }
> >>>>>>>>
> >>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>>>>>> It's really about what we present to the guest though,
> >>>>>>>> isn't it?
> >>>>>>>
> >>>>>>> It was the easier catch all to use cpuid without going into
> >>>>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> >>>>>>> for systems with an IOMMU present.
> >>>>>>>
> >>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>>>>>
> >>>>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
> >>>>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
> >>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> >>>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
> >>>>>>> even worth checking the range exists in:
> >>>>>>>
> >>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
> >>>>>>>
> >>>>>>> (Also that sysfs ABI is >= 4.11 only)
> >>>>>>
> >>>>>> Here's what I have staged in local tree, to address your comment.
> >>>>>>
> >>>>>> Naturally the first chunk is what's affected by this patch the rest is a
> >>>>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> >>>>>> all the tests and what not.
> >>>>>>
> >>>>>> I am not entirely sure this is the right place to put such a helper, open
> >>>>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
> >>>>>> of the file's style.
> >>>>>>
> >>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>>>> index a9be5d33a291..2ea4430d5dcc 100644
> >>>>>> --- a/hw/i386/pc.c
> >>>>>> +++ b/hw/i386/pc.c
> >>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>                                            uint64_t pci_hole64_size)
> >>>>>>  {
> >>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>> -    uint32_t eax, vendor[3];
> >>>>>>
> >>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>> +    if (!qemu_amd_iommu_is_present()) {
> >>>>>>          return;
> >>>>>>      }
> >>>>>>
> >>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
> >>>>>> --- a/include/qemu/osdep.h
> >>>>>> +++ b/include/qemu/osdep.h
> >>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>>>>>   */
> >>>>>>  size_t qemu_get_host_physmem(void);
> >>>>>>
> >>>>>> +/**
> >>>>>> + * qemu_amd_iommu_is_present:
> >>>>>> + *
> >>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
> >>>>>> + * is present.
> >>>>>> + *
> >>>>>> + */
> >>>>>> +bool qemu_amd_iommu_is_present(void);
> >>>>>> +
> >>>>>>  /*
> >>>>>>   * Toggle write/execute on the pages marked MAP_JIT
> >>>>>>   * for the current thread.
> >>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>>>> index f2be7321c59f..54cef21217c4 100644
> >>>>>> --- a/util/oslib-posix.c
> >>>>>> +++ b/util/oslib-posix.c
> >>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>>>>>  #endif
> >>>>>>      return 0;
> >>>>>>  }
> >>>>>> +
> >>>>>> +bool qemu_amd_iommu_is_present(void)
> >>>>>> +{
> >>>>>> +    bool found = false;
> >>>>>> +#ifdef CONFIG_LINUX
> >>>>>> +    struct dirent *entry;
> >>>>>> +    char *path;
> >>>>>> +    DIR *dir;
> >>>>>> +
> >>>>>> +    path = g_strdup_printf("/sys/class/iommu");
> >>>>>> +    dir = opendir(path);
> >>>>>> +    if (!dir) {
> >>>>>> +        g_free(path);
> >>>>>> +        return found;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    do {
> >>>>>> +            entry = readdir(dir);
> >>>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >>>>>> +                found = true;
> >>>>>> +                break;
> >>>>>> +            }
> >>>>>> +    } while (entry);
> >>>>>> +
> >>>>>> +    g_free(path);
> >>>>>> +    closedir(dir);
> >>>>>> +#endif
> >>>>>> +    return found;
> >>>>>> +}
> >>>>>
> >>>>> why are we checking whether an AMD IOMMU is present
> >>>>> on the host? 
> >>>>> Isn't what we care about whether it's
> >>>>> present in the VM? What we are reserving after all
> >>>>> is a range of GPAs, not actual host PA's ...
> >>>>>
> >>>> Right.
> >>>>
> >>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> >>>> and so we need to not map that portion of address space entirely
> >>>> and use some other portion of the GPA-space. This has to
> >>>> do with host IOMMU which is where the DMA mapping of guest PA takes
> >>>> place. So, if you do not have an host IOMMU, you can't
> >>>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
> >>>> don't need this problem.
> >>>>
> >>>> Whether one uses a guest IOMMU or not does not affect the result,
> >>>> it would be more of a case of mimicking real hardware not fixing
> >>>> the issue of this series.
> >>>
> >>>
> >>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> >>> And ideally move the logic reporting reserved ranges there too.
> >>> However, I think vdpa has the same issue too.
> >>> CC Jason for an opinion.
> >>
> >> It does have the same problem.
> >>
> >> This is not specific to VFIO, it's to anything that uses the iommu.
> > 
> > Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> > that we don't have a global "enable-vfio" flag since vfio devices
> > can be hot-plugged. I think this is not the first time we wanted
> > something like this, right Alex?
> > 
> >> It's
> >> just that VFIO doesn't let you shoot in the foot :)
> >>
> >> vDPA would need to validate iova ranges as well to prevent mapping on
> >> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
> >> map request is within a valid iova range"). Now you could argue that
> >> it's an IOMMU core problem, maybe.
> >>
> >> But I this as a separate problem,
> >> because regardless of said validation, your guest provisioning
> >> is still going to fail for guests with >=1010G of max GPA and even if
> >> it doesn't fail you shouldn't be letting it DMA map those in the first
> >> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
> >> from IOMMU if DMA is attempted over the first megabyte of that 1T hole).
> > 
> > I wonder what's the status on a system without an IOMMU.
> > 
> In theory you should be OK. Also it's worth keeping in mind that AMD machines
> with >=1T have this 12G hole marked as Reserved, so even DMA at last when CPU
> is the initiator should be fine on worst case scenario.

Not sure I understand this last sentence.

> >>> Also, some concerns
> >>> - I think this changes memory layout for working VMs not using VFIO.
> >>>   Better preserve the old layout for old machine types?
> >>>
> >> Oh definitely, and I do that in this series. See the last commit, and
> >> in the past it was also a machine-property exposed to userspace.
> >> Otherwise I would be breaking (badly) compat/migration.
> >>
> >> I would like to emphasize that these memory layout changes are *exclusive* and
> >> *only* for hosts on AMD with guests memory being bigger than ~950G-~1010G.
> >> It's not every guest, but a fairly limited subset of big-memory guests that
> >> are not the norm.
> >>
> >> Albeit the phys-bits property errors might gives us a bit more trouble, so
> >> it might help being more conservative.
> >>
> >>> - You mention the phys bits issue very briefly, and it's pretty
> >>>   concerning.  Do we maybe want to also disable the work around if phys
> >>>   bits is too small? 
> >>
> >> We are doing that here (well, v4), as suggested by Igor. Note that in this series
> >> it's a warning, but v4 I have it as an error and with the 32-bit issues that
> >> I found through qtest.
> >>
> >> I share the same concern as you over making this an error because of compatibility.
> >> Perhaps, we could go back to the previous version and turn back into a warning and
> >> additionally even disabling the relocation all together. Or if desired even
> >> depending on a machine-property to enable it.
> > 
> > I would be inclined to disable the relocation. And maybe block vfio? I'm
> > not 100% sure about that last one, but this can be a vfio decision to
> > make.
> > 
> I don't see this as a VFIO problem (VFIO is actually being a good citizen and doing the
> right thing :)). From my perspective this fix is also useful to vDPA (which we also care),
> and thus will also let us avoid DMA mapping in these GPA ranges.
> 
> The reason I mention VFIO in cover letter is that it's one visible UAPI change that
> users will notice that suddenly their 1TB+ VMs break.

What I mean is that most VMs don't use either VFIO or VDPA.
If we had VFIO,VDPA as an accelerator that needs to be listed
upfront when qemu is started, we could solve this with
a bit less risk by not changing anything for VMs without these two.

Alex what do you think about adding this?

Given we do not have such a thing right now, one can get
into a situation where phys bits limit CPU, then
more memory is supplied than would fit above reserved region, then
we stick the memory over the reserved region, and finally
then vfio device is added.

In this last configuration, should vfio device add fail?
Or just warn and try to continue? I think we can code this last
decision as part of vfio code and then Alex will get to decide ;)
And yes, a similar thing with vdpa.
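
For readers following along, the guard being discussed above (a warning in
this series, reportedly an error in v4) has roughly the following shape; an
illustrative sketch, not the code from the series:

/*
 * Compare the highest GPA the memory layout can produce against what the
 * configured phys-bits can address.  warn_report() is QEMU's, the rest is
 * illustrative.
 */
#include "qemu/osdep.h"
#include "qemu/error-report.h"

static void check_phys_bits(uint64_t maxusedaddr, unsigned phys_bits)
{
    uint64_t maxphysaddr = ((uint64_t)1 << phys_bits) - 1;

    if (maxphysaddr < maxusedaddr) {
        warn_report("Address space limit 0x%" PRIx64 " < 0x%" PRIx64
                    " phys-bits too low (%u)",
                    maxphysaddr, maxusedaddr, phys_bits);
    }
}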


> >>> Also, we'd need to test a bunch of old
> >>>   guests to see what happens. Which guests were tested? 
> >>>
> >> Do you envision that old guests would run (or care) into this sort of 1TB config? Mainly
> >> testing more modern guests on Linux (>= 4.14) over this last set of versions, and in the
> >> past Windows 2012+. Let me be extra extra pedantic on this part for the next submission
> >> (and report back if something odd happens).
> > 
> > Not 100% sure but let's at least document.
> > 



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 20:12                     ` Michael S. Tsirkin
@ 2022-02-24 20:34                       ` Joao Martins
  2022-02-24 21:40                         ` Alex Williamson
  2022-02-25  5:22                         ` Michael S. Tsirkin
  0 siblings, 2 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-24 20:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On 2/24/22 20:12, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
>> On 2/24/22 19:54, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
>>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>>>>>> +{
>>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>>>> +
>>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>>>> +        return;
>>>>>>>>>>> +    }
>>>>>>>>>>
>>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>>>> It's really about what we present to the guest though,
>>>>>>>>>> isn't it?
>>>>>>>>>
>>>>>>>>> It was the easier catch all to use cpuid without going into
>>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
>>>>>>>>> for systems with an IOMMU present.
>>>>>>>>>
>>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>>>
>>>>>>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
>>>>>>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
>>>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>>>>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
>>>>>>>>> even worth checking the range exists in:
>>>>>>>>>
>>>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>>>
>>>>>>>>> (Also that sysfs ABI is >= 4.11 only)
>>>>>>>>
>>>>>>>> Here's what I have staged in local tree, to address your comment.
>>>>>>>>
>>>>>>>> Naturally the first chunk is what's affected by this patch the rest is a
>>>>>>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>>>>>>>> all the tests and what not.
>>>>>>>>
>>>>>>>> I am not entirely sure this is the right place to put such a helper, open
>>>>>>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
>>>>>>>> of the file's style.
>>>>>>>>
>>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
>>>>>>>> --- a/hw/i386/pc.c
>>>>>>>> +++ b/hw/i386/pc.c
>>>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>                                            uint64_t pci_hole64_size)
>>>>>>>>  {
>>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>> -    uint32_t eax, vendor[3];
>>>>>>>>
>>>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>> +    if (!qemu_amd_iommu_is_present()) {
>>>>>>>>          return;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>>>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
>>>>>>>> --- a/include/qemu/osdep.h
>>>>>>>> +++ b/include/qemu/osdep.h
>>>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>>>>>>>>   */
>>>>>>>>  size_t qemu_get_host_physmem(void);
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * qemu_amd_iommu_is_present:
>>>>>>>> + *
>>>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
>>>>>>>> + * is present.
>>>>>>>> + *
>>>>>>>> + */
>>>>>>>> +bool qemu_amd_iommu_is_present(void);
>>>>>>>> +
>>>>>>>>  /*
>>>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
>>>>>>>>   * for the current thread.
>>>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>>>>>>> index f2be7321c59f..54cef21217c4 100644
>>>>>>>> --- a/util/oslib-posix.c
>>>>>>>> +++ b/util/oslib-posix.c
>>>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>>>>>>>>  #endif
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>> +
>>>>>>>> +bool qemu_amd_iommu_is_present(void)
>>>>>>>> +{
>>>>>>>> +    bool found = false;
>>>>>>>> +#ifdef CONFIG_LINUX
>>>>>>>> +    struct dirent *entry;
>>>>>>>> +    char *path;
>>>>>>>> +    DIR *dir;
>>>>>>>> +
>>>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
>>>>>>>> +    dir = opendir(path);
>>>>>>>> +    if (!dir) {
>>>>>>>> +        g_free(path);
>>>>>>>> +        return found;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    do {
>>>>>>>> +            entry = readdir(dir);
>>>>>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
>>>>>>>> +                found = true;
>>>>>>>> +                break;
>>>>>>>> +            }
>>>>>>>> +    } while (entry);
>>>>>>>> +
>>>>>>>> +    g_free(path);
>>>>>>>> +    closedir(dir);
>>>>>>>> +#endif
>>>>>>>> +    return found;
>>>>>>>> +}
>>>>>>>
>>>>>>> why are we checking whether an AMD IOMMU is present
>>>>>>> on the host? 
>>>>>>> Isn't what we care about whether it's
>>>>>>> present in the VM? What we are reserving after all
>>>>>>> is a range of GPAs, not actual host PA's ...
>>>>>>>
>>>>>> Right.
>>>>>>
>>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>>>> and so we need to not map that portion of address space entirely
>>>>>> and use some other portion of the GPA-space. This has to
>>>>>> do with host IOMMU which is where the DMA mapping of guest PA takes
>>>>>> place. So, if you do not have an host IOMMU, you can't
>>>>>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
>>>>>> don't need this problem.
>>>>>>
>>>>>> Whether one uses a guest IOMMU or not does not affect the result,
>>>>>> it would be more of a case of mimicking real hardware not fixing
>>>>>> the issue of this series.
>>>>>
>>>>>
>>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>>>> And ideally move the logic reporting reserved ranges there too.
>>>>> However, I think vdpa has the same issue too.
>>>>> CC Jason for an opinion.
>>>>
>>>> It does have the same problem.
>>>>
>>>> This is not specific to VFIO, it's to anything that uses the iommu.
>>>
>>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
>>> that we don't have a global "enable-vfio" flag since vfio devices
>>> can be hot-plugged. I think this is not the first time we wanted
>>> something like this, right Alex?
>>>
>>>> It's
>>>> just that VFIO doesn't let you shoot in the foot :)
>>>>
>>>> vDPA would need to validate iova ranges as well to prevent mapping on
>>>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
>>>> map request is within a valid iova range"). Now you could argue that
>>>> it's an IOMMU core problem, maybe.
>>>>
>>>> But I this as a separate problem,
>>>> because regardless of said validation, your guest provisioning
>>>> is still going to fail for guests with >=1010G of max GPA and even if
>>>> it doesn't fail you shouldn't be letting it DMA map those in the first
>>>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
>>>> from IOMMU if DMA is attempted over the first megabyte of that 1T hole).
>>>
>>> I wonder what's the status on a system without an IOMMU.
>>>
>> In theory you should be OK. Also it's worth keeping in mind that AMD machines
>> with >=1T have this 12G hole marked as Reserved, so even DMA at last when CPU
>> is the initiator should be fine on worst case scenario.
> 
> Not sure I understand this last sentence.
> 
I was trying to say that the E820 from hardware is marked as Reserved and any DMA
from/to endpoints from kernel-supplied CPU addresses will use those reserved ranges.

>>>>> Also, some concerns
>>>>> - I think this changes memory layout for working VMs not using VFIO.
>>>>>   Better preserve the old layout for old machine types?
>>>>>
>>>> Oh definitely, and I do that in this series. See the last commit, and
>>>> in the past it was also a machine-property exposed to userspace.
>>>> Otherwise I would be breaking (badly) compat/migration.
>>>>
>>>> I would like to emphasize that these memory layout changes are *exclusive* and
>>>> *only* for hosts on AMD with guests memory being bigger than ~950G-~1010G.
>>>> It's not every guest, but a fairly limited subset of big-memory guests that
>>>> are not the norm.
>>>>
>>>> Albeit the phys-bits property errors might gives us a bit more trouble, so
>>>> it might help being more conservative.
>>>>
>>>>> - You mention the phys bits issue very briefly, and it's pretty
>>>>>   concerning.  Do we maybe want to also disable the work around if phys
>>>>>   bits is too small? 
>>>>
>>>> We are doing that here (well, v4), as suggested by Igor. Note that in this series
>>>> it's a warning, but v4 I have it as an error and with the 32-bit issues that
>>>> I found through qtest.
>>>>
>>>> I share the same concern as you over making this an error because of compatibility.
>>>> Perhaps, we could go back to the previous version and turn back into a warning and
>>>> additionally even disabling the relocation all together. Or if desired even
>>>> depending on a machine-property to enable it.
>>>
>>> I would be inclined to disable the relocation. And maybe block vfio? I'm
>>> not 100% sure about that last one, but this can be a vfio decision to
>>> make.
>>>
>> I don't see this as a VFIO problem (VFIO is actually being a good citizen and doing the
>> right thing :)). From my perspective this fix is also useful to vDPA (which we also care),
>> and thus will also let us avoid DMA mapping in these GPA ranges.
>>
>> The reason I mention VFIO in cover letter is that it's one visible UAPI change that
>> users will notice that suddenly their 1TB+ VMs break.
> 
> What I mean is that most VMs don't use either VFIO or VDPA.
> If we had VFIO,VDPA as an accelerator that needs to be listed
> upfront when qemu is started, we could solve this with
> a bit less risk by not changing anything for VMs without these two.
> 
That wouldn't work for the vfio/vdpa hotplug case (which we do use), and is
part of the reason I chose to always avoid the 1TB hole. VFIO even tells you
what the allowed IOVA ranges are when you create the container.

The machine property, though, could communicate /the intent/ to add any VFIO
or vDPA devices in the future and maybe cover for that. That lets one avoid
any 'accelerator-only' problems and restricts the issues of this series to
VMs with VFIO/VDPA if the risk is a concern.

> Alex what do you think about adding this?
> 
> Given we do not have such a thing right now, one can get
> into a situation where phys bits limit CPU, then
> more memory is supplied than would fit above reserved region, then
> we stick the memory over the reserved region, and finally
> then vfio device is added.
> 
The current code considers the maximum possible address, taking into account
memory hotplug, the PCI hole64, etc. So we would need to know whether VFIO or
VDPA is going to be hotplugged at some point in the future during the guest
lifecycle to decide whether to alter the memory layout at guest provisioning.
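
To make that concrete, the decision being discussed boils down to something
like the sketch below (illustrative only, not the code from the series; the
constants are the reserved window just below 1TiB):

/*
 * Illustrative sketch, not the actual patch: if the highest GPA the layout
 * can ever reach (RAM, hotplug region, PCI hole64) would run into the
 * reserved window just below 1TiB, start the >4G RAM at 1TiB instead.
 */
#include <stdint.h>

#define HT_WINDOW_START 0xfd00000000ULL   /* 1TiB - 12GiB */
#define ABOVE_1TIB      0x10000000000ULL  /* 1TiB */

static uint64_t pick_above_4g_start(uint64_t above_4g_start, uint64_t max_gpa)
{
    return (max_gpa >= HT_WINDOW_START) ? ABOVE_1TIB : above_4g_start;
}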

> In this last configuration, should vfio device add fail?
> Or just warn and try to continue? I think we can code this last
> decision as part of vfio code and then Alex will get to decide ;)
> And yes, a similar thing with vdpa.
> 

Of all those cases I feel the machine property is better, and more flexible
than having VFIO/VDPA deal with a bad memory layout and discover at a late
stage that the user is doing something wrong (and thus fail the DMA_MAP
operation for those that do check for invalid IOVAs).
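
Purely as an illustration of that idea (the flag below is hypothetical and
does not exist in this series, nor is it proposed here as-is):

/*
 * Hypothetical sketch: relocate only when the user declared the intent to
 * use VFIO/vDPA at some point in the guest's lifetime.  The property name
 * is made up for illustration.
 */
static void maybe_update_above_4g_mem_start(PCMachineState *pcms,
                                            uint64_t pci_hole64_size)
{
    if (!pcms->expect_iommu_backed_devices) {  /* hypothetical flag */
        return;
    }
    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
}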


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 20:34                       ` Joao Martins
@ 2022-02-24 21:40                         ` Alex Williamson
  2022-02-25 12:36                           ` Joao Martins
  2022-02-25  5:22                         ` Michael S. Tsirkin
  1 sibling, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2022-02-24 21:40 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S. Tsirkin, Jason Wang,
	Richard Henderson, qemu-devel, Daniel Jordan, David Edmondson,
	Paolo Bonzini, Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On Thu, 24 Feb 2022 20:34:40 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/24/22 20:12, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:  
> >> On 2/24/22 19:54, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:  
> >>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:  
> >>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:  
> >>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:  
> >>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:  
> >>>>>>>> On 2/23/22 23:35, Joao Martins wrote:  
> >>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:  
> >>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>>>> +                                          uint64_t pci_hole64_size)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>>>>>> +
> >>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>>>>> +        return;
> >>>>>>>>>>> +    }  
> >>>>>>>>>>
> >>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>>>>>>>> It's really about what we present to the guest though,
> >>>>>>>>>> isn't it?  
> >>>>>>>>>
> >>>>>>>>> It was the easier catch all to use cpuid without going into
> >>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> >>>>>>>>> for systems with an IOMMU present.
> >>>>>>>>>  
> >>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>>>>>>>  
> >>>>>>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
> >>>>>>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
> >>>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> >>>>>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
> >>>>>>>>> even worth checking the range exists in:
> >>>>>>>>>
> >>>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
> >>>>>>>>>
> >>>>>>>>> (Also that sysfs ABI is >= 4.11 only)  
> >>>>>>>>
> >>>>>>>> Here's what I have staged in local tree, to address your comment.
> >>>>>>>>
> >>>>>>>> Naturally the first chunk is what's affected by this patch the rest is a
> >>>>>>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> >>>>>>>> all the tests and what not.
> >>>>>>>>
> >>>>>>>> I am not entirely sure this is the right place to put such a helper, open
> >>>>>>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
> >>>>>>>> of the file's style.
> >>>>>>>>
> >>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
> >>>>>>>> --- a/hw/i386/pc.c
> >>>>>>>> +++ b/hw/i386/pc.c
> >>>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>                                            uint64_t pci_hole64_size)
> >>>>>>>>  {
> >>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>> -    uint32_t eax, vendor[3];
> >>>>>>>>
> >>>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>> +    if (!qemu_amd_iommu_is_present()) {
> >>>>>>>>          return;
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
> >>>>>>>> --- a/include/qemu/osdep.h
> >>>>>>>> +++ b/include/qemu/osdep.h
> >>>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>>>>>>>   */
> >>>>>>>>  size_t qemu_get_host_physmem(void);
> >>>>>>>>
> >>>>>>>> +/**
> >>>>>>>> + * qemu_amd_iommu_is_present:
> >>>>>>>> + *
> >>>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
> >>>>>>>> + * is present.
> >>>>>>>> + *
> >>>>>>>> + */
> >>>>>>>> +bool qemu_amd_iommu_is_present(void);
> >>>>>>>> +
> >>>>>>>>  /*
> >>>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
> >>>>>>>>   * for the current thread.
> >>>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>>>>>> index f2be7321c59f..54cef21217c4 100644
> >>>>>>>> --- a/util/oslib-posix.c
> >>>>>>>> +++ b/util/oslib-posix.c
> >>>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>>>>>>>  #endif
> >>>>>>>>      return 0;
> >>>>>>>>  }
> >>>>>>>> +
> >>>>>>>> +bool qemu_amd_iommu_is_present(void)
> >>>>>>>> +{
> >>>>>>>> +    bool found = false;
> >>>>>>>> +#ifdef CONFIG_LINUX
> >>>>>>>> +    struct dirent *entry;
> >>>>>>>> +    char *path;
> >>>>>>>> +    DIR *dir;
> >>>>>>>> +
> >>>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
> >>>>>>>> +    dir = opendir(path);
> >>>>>>>> +    if (!dir) {
> >>>>>>>> +        g_free(path);
> >>>>>>>> +        return found;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    do {
> >>>>>>>> +            entry = readdir(dir);
> >>>>>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >>>>>>>> +                found = true;
> >>>>>>>> +                break;
> >>>>>>>> +            }
> >>>>>>>> +    } while (entry);
> >>>>>>>> +
> >>>>>>>> +    g_free(path);
> >>>>>>>> +    closedir(dir);
> >>>>>>>> +#endif
> >>>>>>>> +    return found;
> >>>>>>>> +}  
> >>>>>>>
> >>>>>>> why are we checking whether an AMD IOMMU is present
> >>>>>>> on the host? 
> >>>>>>> Isn't what we care about whether it's
> >>>>>>> present in the VM? What we are reserving after all
> >>>>>>> is a range of GPAs, not actual host PA's ...
> >>>>>>>  
> >>>>>> Right.
> >>>>>>
> >>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> >>>>>> and so we need to not map that portion of address space entirely
> >>>>>> and use some other portion of the GPA-space. This has to
> >>>>>> do with host IOMMU which is where the DMA mapping of guest PA takes
> >>>>>> place. So, if you do not have an host IOMMU, you can't
> >>>>>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
> >>>>>> don't need this problem.
> >>>>>>
> >>>>>> Whether one uses a guest IOMMU or not does not affect the result,
> >>>>>> it would be more of a case of mimicking real hardware not fixing
> >>>>>> the issue of this series.  
> >>>>>
> >>>>>
> >>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> >>>>> And ideally move the logic reporting reserved ranges there too.
> >>>>> However, I think vdpa has the same issue too.
> >>>>> CC Jason for an opinion.  
> >>>>
> >>>> It does have the same problem.
> >>>>
> >>>> This is not specific to VFIO, it's to anything that uses the iommu.  
> >>>
> >>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> >>> that we don't have a global "enable-vfio" flag since vfio devices
> >>> can be hot-plugged. I think this is not the first time we wanted
> >>> something like this, right Alex?


I would very much NOT like to see such a flag ever existing.


> >>>> It's
> >>>> just that VFIO doesn't let you shoot in the foot :)
> >>>>
> >>>> vDPA would need to validate iova ranges as well to prevent mapping on
> >>>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
> >>>> map request is within a valid iova range"). Now you could argue that
> >>>> it's an IOMMU core problem, maybe.
> >>>>
> >>>> But I this as a separate problem,
> >>>> because regardless of said validation, your guest provisioning
> >>>> is still going to fail for guests with >=1010G of max GPA and even if
> >>>> it doesn't fail you shouldn't be letting it DMA map those in the first
> >>>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
> >>>> from IOMMU if DMA is attempted over the first megabyte of that 1T hole).  
> >>>
> >>> I wonder what's the status on a system without an IOMMU.
> >>>  
> >> In theory you should be OK. Also it's worth keeping in mind that AMD machines
> >> with >=1T have this 12G hole marked as Reserved, so even DMA at last when CPU
> >> is the initiator should be fine on worst case scenario.  
> > 
> > Not sure I understand this last sentence.
> >   
> I was trying to say that the E820 from hardware is marked as Reserved and any DMA
> from/to endpoints from kernel-supplied CPU addresses will use those reserved ranges.
> 
> >>>>> Also, some concerns
> >>>>> - I think this changes memory layout for working VMs not using VFIO.
> >>>>>   Better preserve the old layout for old machine types?
> >>>>>  
> >>>> Oh definitely, and I do that in this series. See the last commit, and
> >>>> in the past it was also a machine-property exposed to userspace.
> >>>> Otherwise I would be breaking (badly) compat/migration.
> >>>>
> >>>> I would like to emphasize that these memory layout changes are *exclusive* and
> >>>> *only* for hosts on AMD with guests memory being bigger than ~950G-~1010G.
> >>>> It's not every guest, but a fairly limited subset of big-memory guests that
> >>>> are not the norm.
> >>>>
> >>>> Albeit the phys-bits property errors might gives us a bit more trouble, so
> >>>> it might help being more conservative.
> >>>>  
> >>>>> - You mention the phys bits issue very briefly, and it's pretty
> >>>>>   concerning.  Do we maybe want to also disable the work around if phys
> >>>>>   bits is too small?   
> >>>>
> >>>> We are doing that here (well, v4), as suggested by Igor. Note that in this series
> >>>> it's a warning, but v4 I have it as an error and with the 32-bit issues that
> >>>> I found through qtest.
> >>>>
> >>>> I share the same concern as you over making this an error because of compatibility.
> >>>> Perhaps, we could go back to the previous version and turn back into a warning and
> >>>> additionally even disabling the relocation all together. Or if desired even
> >>>> depending on a machine-property to enable it.  
> >>>
> >>> I would be inclined to disable the relocation. And maybe block vfio? I'm
> >>> not 100% sure about that last one, but this can be a vfio decision to
> >>> make.
> >>>  
> >> I don't see this as a VFIO problem (VFIO is actually being a good citizen and doing the
> >> right thing :)). From my perspective this fix is also useful to vDPA (which we also care),
> >> and thus will also let us avoid DMA mapping in these GPA ranges.
> >>
> >> The reason I mention VFIO in cover letter is that it's one visible UAPI change that
> >> users will notice that suddenly their 1TB+ VMs break.  
> > 
> > What I mean is that most VMs don't use either VFIO or VDPA.
> > If we had VFIO,VDPA as an accelerator that needs to be listed
> > upfront when qemu is started, we could solve this with
> > a bit less risk by not changing anything for VMs without these two.
> >   
> That wouldn't work for the vfio/vdpa hotplug case (which we do use), and is
> part of the reason I chose to always avoid the 1TB hole. VFIO even tells you
> what the allowed IOVA ranges are when you create the container.
> 
> The machine property, though, could communicate /the intent/ to add any VFIO
> or vDPA devices in the future and maybe cover for that. That lets one avoid
> any 'accelerator-only' problems and restricts the issues of this series to
> VMs with VFIO/VDPA if the risk is a concern.
> 
> > Alex what do you think about adding this?
> > 
> > Given we do not have such a thing right now, one can get
> > into a situation where phys bits limit CPU, then
> > more memory is supplied than would fit above reserved region, then
> > we stick the memory over the reserved region, and finally
> > then vfio device is added.
> >   
> The current code considers the maximum possible address, taking into account
> memory hotplug, the PCI hole64, etc. So we would need to know whether VFIO or
> VDPA is going to be hotplugged at some point in the future during the guest
> lifecycle to decide whether to alter the memory layout at guest provisioning.
> 
> > In this last configuration, should vfio device add fail?
> > Or just warn and try to continue? I think we can code this last
> > decision as part of vfio code and then Alex will get to decide ;)
> > And yes, a similar thing with vdpa.
> >   
> 
> Of all those cases I feel the machine property is better, and more flexible
> than having VFIO/VDPA deal with a bad memory layout and discover at a late
> stage that the user is doing something wrong (and thus fail the DMA_MAP
> operation for those that do check for invalid IOVAs).

The trouble is that anything we can glean from the host system where we
instantiate the VM is mostly meaningless relative to data center
orchestration.  We're relatively insulated from these sorts of issues
on x86 (apparently aside from this case); AIUI ARM is even worse about
having arbitrary reserved ranges within its IOVA space.

For a comprehensive solution, it's not a machine accelerator property
or an enable-such-and-such-functionality flag; it's the ability to specify
a VM memory map on the QEMU command line, plus data center orchestration
tools gaining insight across all their systems so they can specify a memory
layout that works regardless of how a VM might be migrated.  Maybe
there's a "host" option to that memory map command line option that
would take into account the common case of a static host or at least a
homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
to generate a least common denominator set such that the VM is
compatible with any host in the cluster.

On the device end, I really would prefer not to see device-driver-specific
enables where we simply cannot hot-add a device of the given type without a
pre-enabled VM.  Give the user visibility into and configurability of the
issue, and simply fail the device add (ideally with a useful error message)
if the device IOVA space cannot support the VM memory layout (this is what
vfio already does afaik).

When we have iommufd support common for vfio and vdpa, hopefully we'll
also be able to recommend a common means for learning about system and
IOMMU restrictions to IOVA spaces.  For now we have reserved_regions
reported in sysfs per IOMMU group:

 $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u | grep -v direct-relaxable
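
As a rough illustration of what a consumer could do with that ABI (a
stand-alone sketch, not VFIO or QEMU code; each line in those files is
"start end type"):

/*
 * Check a candidate GPA window against one IOMMU group's reserved_regions
 * (sysfs ABI available since Linux 4.11).
 */
#include <stdio.h>
#include <stdbool.h>
#include <inttypes.h>

static bool gpa_window_clashes(const char *group, uint64_t start, uint64_t end)
{
    char path[128], type[64];
    uint64_t rstart, rend;
    bool clash = false;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/kernel/iommu_groups/%s/reserved_regions", group);
    f = fopen(path, "r");
    if (!f) {
        return false;
    }
    while (fscanf(f, "%" SCNx64 " %" SCNx64 " %63s", &rstart, &rend, type) == 3) {
        if (start <= rend && end >= rstart) {
            clash = true;   /* candidate window overlaps a reserved region */
            break;
        }
    }
    fclose(f);
    return clash;
}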

Thanks,

Alex



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 18:30             ` Michael S. Tsirkin
  2022-02-24 19:44               ` Joao Martins
@ 2022-02-25  3:52               ` Jason Wang
  1 sibling, 0 replies; 29+ messages in thread
From: Jason Wang @ 2022-02-25  3:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Richard Henderson, qemu-devel, Daniel Jordan,
	David Edmondson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Igor Mammedov, Joao Martins, Suravee Suthikulpanit

On Fri, Feb 25, 2022 at 2:30 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> > On 2/24/22 17:23, Michael S. Tsirkin wrote:
> > > On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> > >> On 2/23/22 23:35, Joao Martins wrote:
> > >>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> > >>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> > >>>>> +                                          uint64_t pci_hole64_size)
> > >>>>> +{
> > >>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> > >>>>> +    uint32_t eax, vendor[3];
> > >>>>> +
> > >>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> > >>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> > >>>>> +        return;
> > >>>>> +    }
> > >>>>
> > >>>> Wait a sec, should this actually be tying things to the host CPU ID?
> > >>>> It's really about what we present to the guest though,
> > >>>> isn't it?
> > >>>
> > >>> It was the easier catch all to use cpuid without going into
> > >>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> > >>> for systems with an IOMMU present.
> > >>>
> > >>>> Also, can't we tie this to whether the AMD IOMMU is present?
> > >>>>
> > >>> I think so, I can add that. Something like a amd_iommu_exists() helper
> > >>> in util/vfio-helpers.c which checks if there's any sysfs child entries
> > >>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> > >>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
> > >>> even worth checking the range exists in:
> > >>>
> > >>>   /sys/kernel/iommu_groups/0/reserved_regions
> > >>>
> > >>> (Also that sysfs ABI is >= 4.11 only)
> > >>
> > >> Here's what I have staged in local tree, to address your comment.
> > >>
> > >> Naturally the first chunk is what's affected by this patch the rest is a
> > >> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> > >> all the tests and what not.
> > >>
> > >> I am not entirely sure this is the right place to put such a helper, open
> > >> to suggestions. wrt to the naming of the helper, I tried to follow the rest
> > >> of the file's style.
> > >>
> > >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > >> index a9be5d33a291..2ea4430d5dcc 100644
> > >> --- a/hw/i386/pc.c
> > >> +++ b/hw/i386/pc.c
> > >> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> > >>                                            uint64_t pci_hole64_size)
> > >>  {
> > >>      X86MachineState *x86ms = X86_MACHINE(pcms);
> > >> -    uint32_t eax, vendor[3];
> > >>
> > >> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> > >> -    if (!IS_AMD_VENDOR(vendor)) {
> > >> +    if (!qemu_amd_iommu_is_present()) {
> > >>          return;
> > >>      }
> > >>
> > >> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > >> index 7bcce3bceb0f..eb4ea071ecec 100644
> > >> --- a/include/qemu/osdep.h
> > >> +++ b/include/qemu/osdep.h
> > >> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> > >>   */
> > >>  size_t qemu_get_host_physmem(void);
> > >>
> > >> +/**
> > >> + * qemu_amd_iommu_is_present:
> > >> + *
> > >> + * Operating system agnostic way of querying if an AMD IOMMU
> > >> + * is present.
> > >> + *
> > >> + */
> > >> +bool qemu_amd_iommu_is_present(void);
> > >> +
> > >>  /*
> > >>   * Toggle write/execute on the pages marked MAP_JIT
> > >>   * for the current thread.
> > >> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > >> index f2be7321c59f..54cef21217c4 100644
> > >> --- a/util/oslib-posix.c
> > >> +++ b/util/oslib-posix.c
> > >> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> > >>  #endif
> > >>      return 0;
> > >>  }
> > >> +
> > >> +bool qemu_amd_iommu_is_present(void)
> > >> +{
> > >> +    bool found = false;
> > >> +#ifdef CONFIG_LINUX
> > >> +    struct dirent *entry;
> > >> +    char *path;
> > >> +    DIR *dir;
> > >> +
> > >> +    path = g_strdup_printf("/sys/class/iommu");
> > >> +    dir = opendir(path);
> > >> +    if (!dir) {
> > >> +        g_free(path);
> > >> +        return found;
> > >> +    }
> > >> +
> > >> +    do {
> > >> +            entry = readdir(dir);
> > >> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> > >> +                found = true;
> > >> +                break;
> > >> +            }
> > >> +    } while (entry);
> > >> +
> > >> +    g_free(path);
> > >> +    closedir(dir);
> > >> +#endif
> > >> +    return found;
> > >> +}
> > >
> > > why are we checking whether an AMD IOMMU is present
> > > on the host?
> > > Isn't what we care about whether it's
> > > present in the VM? What we are reserving after all
> > > is a range of GPAs, not actual host PA's ...
> > >
> > Right.
> >
> > But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> > and so we need to not map that portion of address space entirely
> > and use some other portion of the GPA-space. This has to
> > do with host IOMMU which is where the DMA mapping of guest PA takes
> > place. So, if you do not have an host IOMMU, you can't
> > service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
> > don't need this problem.
> >
> > Whether one uses a guest IOMMU or not does not affect the result,
> > it would be more of a case of mimicking real hardware not fixing
> > the issue of this series.
>
>
> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> And ideally move the logic reporting reserved ranges there too.
> However, I think vdpa has the same issue too.
> CC Jason for an opinion.

vDPA is even more complicated than VFIO as it allows the device to
reserve specific IOVA ranges:

static void vhost_vdpa_set_iova_range(struct vhost_vdpa *v)
{
        struct vdpa_iova_range *range = &v->range;
        struct vdpa_device *vdpa = v->vdpa;
        const struct vdpa_config_ops *ops = vdpa->config;

        if (ops->get_iova_range) {
                *range = ops->get_iova_range(vdpa);
...
}

I wonder if we need to expose those via the netlink protocol.
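
Whether or not it also belongs in the netlink management interface, the range
is already queryable per device on the vhost-vdpa chardev itself; a minimal
sketch of doing that from userspace (assumes the VHOST_VDPA_GET_IOVA_RANGE
uapi from Linux >= 5.10; the device path is illustrative, error handling
trimmed):

/* Not QEMU code, just a stand-alone illustration. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>        /* VHOST_VDPA_GET_IOVA_RANGE */
#include <linux/vhost_types.h>  /* struct vhost_vdpa_iova_range */

int main(void)
{
    struct vhost_vdpa_iova_range r;
    int fd = open("/dev/vhost-vdpa-0", O_RDWR);

    if (fd < 0 || ioctl(fd, VHOST_VDPA_GET_IOVA_RANGE, &r) < 0) {
        perror("VHOST_VDPA_GET_IOVA_RANGE");
        return 1;
    }
    printf("usable IOVA range: 0x%llx-0x%llx\n",
           (unsigned long long)r.first, (unsigned long long)r.last);
    close(fd);
    return 0;
}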

Thanks

> Also, some concerns
> - I think this changes memory layout for working VMs not using VFIO.
>   Better preserve the old layout for old machine types?
>
> - You mention the phys bits issue very briefly, and it's pretty
>   concerning.  Do we maybe want to also disable the work around if phys
>   bits is too small? Also, we'd need to test a bunch of old
>   guests to see what happens. Which guests were tested?
>
> --
> MST
>



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 20:34                       ` Joao Martins
  2022-02-24 21:40                         ` Alex Williamson
@ 2022-02-25  5:22                         ` Michael S. Tsirkin
  2022-02-25 12:36                           ` Joao Martins
  1 sibling, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2022-02-25  5:22 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On Thu, Feb 24, 2022 at 08:34:40PM +0000, Joao Martins wrote:
> On 2/24/22 20:12, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
> >> On 2/24/22 19:54, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
> >>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
> >>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
> >>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> >>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
> >>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
> >>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>>>> +                                          uint64_t pci_hole64_size)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>>>>>> +
> >>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>>>>> +        return;
> >>>>>>>>>>> +    }
> >>>>>>>>>>
> >>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
> >>>>>>>>>> It's really about what we present to the guest though,
> >>>>>>>>>> isn't it?
> >>>>>>>>>
> >>>>>>>>> It was the easier catch all to use cpuid without going into
> >>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> >>>>>>>>> for systems with an IOMMU present.
> >>>>>>>>>
> >>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
> >>>>>>>>>>
> >>>>>>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
> >>>>>>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
> >>>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> >>>>>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
> >>>>>>>>> even worth checking the range exists in:
> >>>>>>>>>
> >>>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
> >>>>>>>>>
> >>>>>>>>> (Also that sysfs ABI is >= 4.11 only)
> >>>>>>>>
> >>>>>>>> Here's what I have staged in local tree, to address your comment.
> >>>>>>>>
> >>>>>>>> Naturally the first chunk is what's affected by this patch the rest is a
> >>>>>>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> >>>>>>>> all the tests and what not.
> >>>>>>>>
> >>>>>>>> I am not entirely sure this is the right place to put such a helper, open
> >>>>>>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
> >>>>>>>> of the file's style.
> >>>>>>>>
> >>>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>>>>>> index a9be5d33a291..2ea4430d5dcc 100644
> >>>>>>>> --- a/hw/i386/pc.c
> >>>>>>>> +++ b/hw/i386/pc.c
> >>>>>>>> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>>>>>>>                                            uint64_t pci_hole64_size)
> >>>>>>>>  {
> >>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>>> -    uint32_t eax, vendor[3];
> >>>>>>>>
> >>>>>>>> -    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>>> -    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>>> +    if (!qemu_amd_iommu_is_present()) {
> >>>>>>>>          return;
> >>>>>>>>      }
> >>>>>>>>
> >>>>>>>> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> >>>>>>>> index 7bcce3bceb0f..eb4ea071ecec 100644
> >>>>>>>> --- a/include/qemu/osdep.h
> >>>>>>>> +++ b/include/qemu/osdep.h
> >>>>>>>> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> >>>>>>>>   */
> >>>>>>>>  size_t qemu_get_host_physmem(void);
> >>>>>>>>
> >>>>>>>> +/**
> >>>>>>>> + * qemu_amd_iommu_is_present:
> >>>>>>>> + *
> >>>>>>>> + * Operating system agnostic way of querying if an AMD IOMMU
> >>>>>>>> + * is present.
> >>>>>>>> + *
> >>>>>>>> + */
> >>>>>>>> +bool qemu_amd_iommu_is_present(void);
> >>>>>>>> +
> >>>>>>>>  /*
> >>>>>>>>   * Toggle write/execute on the pages marked MAP_JIT
> >>>>>>>>   * for the current thread.
> >>>>>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> >>>>>>>> index f2be7321c59f..54cef21217c4 100644
> >>>>>>>> --- a/util/oslib-posix.c
> >>>>>>>> +++ b/util/oslib-posix.c
> >>>>>>>> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> >>>>>>>>  #endif
> >>>>>>>>      return 0;
> >>>>>>>>  }
> >>>>>>>> +
> >>>>>>>> +bool qemu_amd_iommu_is_present(void)
> >>>>>>>> +{
> >>>>>>>> +    bool found = false;
> >>>>>>>> +#ifdef CONFIG_LINUX
> >>>>>>>> +    struct dirent *entry;
> >>>>>>>> +    char *path;
> >>>>>>>> +    DIR *dir;
> >>>>>>>> +
> >>>>>>>> +    path = g_strdup_printf("/sys/class/iommu");
> >>>>>>>> +    dir = opendir(path);
> >>>>>>>> +    if (!dir) {
> >>>>>>>> +        g_free(path);
> >>>>>>>> +        return found;
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    do {
> >>>>>>>> +            entry = readdir(dir);
> >>>>>>>> +            if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> >>>>>>>> +                found = true;
> >>>>>>>> +                break;
> >>>>>>>> +            }
> >>>>>>>> +    } while (entry);
> >>>>>>>> +
> >>>>>>>> +    g_free(path);
> >>>>>>>> +    closedir(dir);
> >>>>>>>> +#endif
> >>>>>>>> +    return found;
> >>>>>>>> +}
> >>>>>>>
> >>>>>>> why are we checking whether an AMD IOMMU is present
> >>>>>>> on the host? 
> >>>>>>> Isn't what we care about whether it's
> >>>>>>> present in the VM? What we are reserving after all
> >>>>>>> is a range of GPAs, not actual host PA's ...
> >>>>>>>
> >>>>>> Right.
> >>>>>>
> >>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> >>>>>> and so we need to not map that portion of address space entirely
> >>>>>> and use some other portion of the GPA-space. This has to
> >>>>>> do with host IOMMU which is where the DMA mapping of guest PA takes
> >>>>>> place. So, if you do not have an host IOMMU, you can't
> >>>>>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
> >>>>>> don't need this problem.
> >>>>>>
> >>>>>> Whether one uses a guest IOMMU or not does not affect the result,
> >>>>>> it would be more of a case of mimicking real hardware not fixing
> >>>>>> the issue of this series.
> >>>>>
> >>>>>
> >>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
> >>>>> And ideally move the logic reporting reserved ranges there too.
> >>>>> However, I think vdpa has the same issue too.
> >>>>> CC Jason for an opinion.
> >>>>
> >>>> It does have the same problem.
> >>>>
> >>>> This is not specific to VFIO, it's to anything that uses the iommu.
> >>>
> >>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> >>> that we don't have a global "enable-vfio" flag since vfio devices
> >>> can be hot-plugged. I think this is not the first time we wanted
> >>> something like this, right Alex?
> >>>
> >>>> It's
> >>>> just that VFIO doesn't let you shoot in the foot :)
> >>>>
> >>>> vDPA would need to validate iova ranges as well to prevent mapping on
> >>>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
> >>>> map request is within a valid iova range"). Now you could argue that
> >>>> it's an IOMMU core problem, maybe.
> >>>>
> >>>> But I this as a separate problem,
> >>>> because regardless of said validation, your guest provisioning
> >>>> is still going to fail for guests with >=1010G of max GPA and even if
> >>>> it doesn't fail you shouldn't be letting it DMA map those in the first
> >>>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
> >>>> from IOMMU if DMA is attempted over the first megabyte of that 1T hole).
> >>>
> >>> I wonder what's the status on a system without an IOMMU.
> >>>
> >> In theory you should be OK. Also it's worth keeping in mind that AMD machines
> >> with >=1T have this 12G hole marked as Reserved, so even DMA at last when CPU
> >> is the initiator should be fine on worst case scenario.
> > 
> > Not sure I understand this last sentence.
> > 
> I was trying to say that the E820 from hardware is marked as Reserved and any DMA
> from/to endpoints from kernel-supplied CPU addresses will use those reserved ranges.

meaning "will *not* use" I guess?

> >>>>> Also, some concerns
> >>>>> - I think this changes memory layout for working VMs not using VFIO.
> >>>>>   Better preserve the old layout for old machine types?
> >>>>>
> >>>> Oh definitely, and I do that in this series. See the last commit, and
> >>>> in the past it was also a machine-property exposed to userspace.
> >>>> Otherwise I would be breaking (badly) compat/migration.
> >>>>
> >>>> I would like to emphasize that these memory layout changes are *exclusive* and
> >>>> *only* for hosts on AMD with guests memory being bigger than ~950G-~1010G.
> >>>> It's not every guest, but a fairly limited subset of big-memory guests that
> >>>> are not the norm.
> >>>>
> >>>> Albeit the phys-bits property errors might gives us a bit more trouble, so
> >>>> it might help being more conservative.
> >>>>
> >>>>> - You mention the phys bits issue very briefly, and it's pretty
> >>>>>   concerning.  Do we maybe want to also disable the work around if phys
> >>>>>   bits is too small? 
> >>>>
> >>>> We are doing that here (well, v4), as suggested by Igor. Note that in this series
> >>>> it's a warning, but v4 I have it as an error and with the 32-bit issues that
> >>>> I found through qtest.
> >>>>
> >>>> I share the same concern as you over making this an error because of compatibility.
> >>>> Perhaps, we could go back to the previous version and turn back into a warning and
> >>>> additionally even disabling the relocation all together. Or if desired even
> >>>> depending on a machine-property to enable it.
> >>>
> >>> I would be inclined to disable the relocation. And maybe block vfio? I'm
> >>> not 100% sure about that last one, but this can be a vfio decision to
> >>> make.
> >>>
> >> I don't see this as a VFIO problem (VFIO is actually being a good citizen and doing the
> >> right thing :)). From my perspective this fix is also useful to vDPA (which we also care),
> >> and thus will also let us avoid DMA mapping in these GPA ranges.
> >>
> >> The reason I mention VFIO in cover letter is that it's one visible UAPI change that
> >> users will notice that suddenly their 1TB+ VMs break.
> > 
> > What I mean is that most VMs don't use either VFIO or VDPA.
> > If we had VFIO,VDPA as an accelerator that needs to be listed
> > upfront when qemu is started, we could solve this with
> > a bit less risk by not changing anything for VMs without these two.
> > 
> That wouldn't work for the vfio/vdpa hotplug case (which we do use), and is
> part of the reason I chose to always avoid the 1TB hole. VFIO even tells you
> what the allowed IOVA ranges are when you create the container.
> 
> The machine property, though, could communicate /the intent/ to add any VFIO
> or vDPA devices in the future and maybe cover for that. That lets one avoid
> any 'accelerator-only' problems and restricts the issues of this series to
> VMs with VFIO/VDPA if the risk is a concern.

Well, Alex nacked that idea (and I certainly trust him where VFIO is
concerned), so I guess for now we'll just live with the risk.

> > Alex what do you think about adding this?
> > 
> > Given we do not have such a thing right now, one can get
> > into a situation where phys bits limit CPU, then
> > more memory is supplied than would fit above reserved region, then
> > we stick the memory over the reserved region, and finally
> > then vfio device is added.
> > 
> The current code considers the maximum possible address, taking into account
> memory hotplug, the PCI hole64, etc. So we would need to know whether VFIO or
> VDPA is going to be hotplugged at some point in the future during the guest
> lifecycle to decide whether to alter the memory layout at guest provisioning.

Right. And given we currently have no way of knowing, we have to assume
yes. At least we can check whether qemu was configured with VFIO or VDPA.

> > In this last configuration, should vfio device add fail?
> > Or just warn and try to continue? I think we can code this last
> > decision as part of vfio code and then Alex will get to decide ;)
> > And yes, a similar thing with vdpa.
> > 
> 
> Of all those cases I feel the machine property is better, and more flexible
> than having VFIO/VDPA deal with a bad memory layout and discover at a late
> stage that the user is doing something wrong (and thus fail the DMA_MAP
> operation for those that do check for invalid IOVAs).

-- 
MST



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-24 21:40                         ` Alex Williamson
@ 2022-02-25 12:36                           ` Joao Martins
  2022-02-25 12:49                             ` Michael S. Tsirkin
  2022-02-25 16:15                             ` Alex Williamson
  0 siblings, 2 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-25 12:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eduardo Habkost, Michael S. Tsirkin, Jason Wang,
	Richard Henderson, qemu-devel, Daniel Jordan, David Edmondson,
	Paolo Bonzini, Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On 2/24/22 21:40, Alex Williamson wrote:
> On Thu, 24 Feb 2022 20:34:40 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 2/24/22 20:12, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:  
>>>> On 2/24/22 19:54, Michael S. Tsirkin wrote:  
>>>>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:  
>>>>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:  
>>>>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:  
>>>>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:  
>>>>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:  
>>>>>>>>>> On 2/23/22 23:35, Joao Martins wrote:  
>>>>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:  
>>>>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>>>>>> +        return;
>>>>>>>>>>>>> +    }  
>>>>>>>>>>>>
>>>>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>>>>>> It's really about what we present to the guest though,
>>>>>>>>>>>> isn't it?  
>>>>>>>>>>>
>>>>>>>>>>> It was the easier catch all to use cpuid without going into
>>>>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
>>>>>>>>>>> for systems with an IOMMU present.
>>>>>>>>>>>  
>>>>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>>>>>  
>>>>>>>>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
>>>>>>>>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
>>>>>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>>>>>>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
>>>>>>>>>>> even worth checking the range exists in:
>>>>>>>>>>>
>>>>>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>>>>>
>>>>>>>>>>> (Also that sysfs ABI is >= 4.11 only)  
>>>>>>>>>>
>>>>>>>>>> Here's what I have staged in local tree, to address your comment.
>>>>>>>>>>
>>>>>>>>>> Naturally the first chunk is what's affected by this patch the rest is a
>>>>>>>>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>>>>>>>>>> all the tests and what not.
>>>>>>>>>>
>>>>>>>>>> I am not entirely sure this is the right place to put such a helper, open
>>>>>>>>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
>>>>>>>>>> of the file's style.
>>>>>>>>>>

[snip]

>>>>>>>>>
>>>>>>>>> why are we checking whether an AMD IOMMU is present
>>>>>>>>> on the host? 
>>>>>>>>> Isn't what we care about whether it's
>>>>>>>>> present in the VM? What we are reserving after all
>>>>>>>>> is a range of GPAs, not actual host PA's ...
>>>>>>>>>  
>>>>>>>> Right.
>>>>>>>>
>>>>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>>>>>> and so we need to not map that portion of address space entirely
>>>>>>>> and use some other portion of the GPA-space. This has to
>>>>>>>> do with host IOMMU which is where the DMA mapping of guest PA takes
>>>>>>>> place. So, if you do not have an host IOMMU, you can't
>>>>>>>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
>>>>>>>> don't need this problem.
>>>>>>>>
>>>>>>>> Whether one uses a guest IOMMU or not does not affect the result,
>>>>>>>> it would be more of a case of mimicking real hardware not fixing
>>>>>>>> the issue of this series.  
>>>>>>>
>>>>>>>
>>>>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>>>>>> And ideally move the logic reporting reserved ranges there too.
>>>>>>> However, I think vdpa has the same issue too.
>>>>>>> CC Jason for an opinion.  
>>>>>>
>>>>>> It does have the same problem.
>>>>>>
>>>>>> This is not specific to VFIO, it's to anything that uses the iommu.  
>>>>>
>>>>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
>>>>> that we don't have a global "enable-vfio" flag since vfio devices
>>>>> can be hot-plugged. I think this is not the first time we wanted
>>>>> something like this, right Alex?
> 
> 
> I would very much NOT like to see such a flag ever existing.
> 
>>>>>>> Also, some concerns
>>>>>>> - I think this changes memory layout for working VMs not using VFIO.
>>>>>>>   Better preserve the old layout for old machine types?
>>>>>>>  
>>>>>> Oh definitely, and I do that in this series. See the last commit, and
>>>>>> in the past it was also a machine-property exposed to userspace.
>>>>>> Otherwise I would be breaking (badly) compat/migration.
>>>>>>
>>>>>> I would like to emphasize that these memory layout changes are *exclusive* and
>>>>>> *only* for hosts on AMD with guests memory being bigger than ~950G-~1010G.
>>>>>> It's not every guest, but a fairly limited subset of big-memory guests that
>>>>>> are not the norm.
>>>>>>
>>>>>> Albeit the phys-bits property errors might gives us a bit more trouble, so
>>>>>> it might help being more conservative.
>>>>>>  
>>>>>>> - You mention the phys bits issue very briefly, and it's pretty
>>>>>>>   concerning.  Do we maybe want to also disable the work around if phys
>>>>>>>   bits is too small?   
>>>>>>
>>>>>> We are doing that here (well, v4), as suggested by Igor. Note that in this series
>>>>>> it's a warning, but v4 I have it as an error and with the 32-bit issues that
>>>>>> I found through qtest.
>>>>>>
>>>>>> I share the same concern as you over making this an error because of compatibility.
>>>>>> Perhaps, we could go back to the previous version and turn back into a warning and
>>>>>> additionally even disabling the relocation all together. Or if desired even
>>>>>> depending on a machine-property to enable it.  
>>>>>
>>>>> I would be inclined to disable the relocation. And maybe block vfio? I'm
>>>>> not 100% sure about that last one, but this can be a vfio decision to
>>>>> make.
>>>>>  
>>>> I don't see this as a VFIO problem (VFIO is actually being a good citizen and doing the
>>>> right thing :)). From my perspective this fix is also useful to vDPA (which we also care),
>>>> and thus will also let us avoid DMA mapping in these GPA ranges.
>>>>
>>>> The reason I mention VFIO in cover letter is that it's one visible UAPI change that
>>>> users will notice that suddenly their 1TB+ VMs break.  
>>>
>>> What I mean is that most VMs don't use either VFIO or VDPA.
>>> If we had VFIO,VDPA as an accelerator that needs to be listed
>>> upfront when qemu is started, we could solve this with
>>> a bit less risk by not changing anything for VMs without these two.
>>>   
>> That wouldn't work for the vfio/vdpa hotplug case (which we do use),
>> and part of the reason I picked to always avoid the 1TB hole. VFIO even tells
>> you what are those allowed IOVA ranges when you create the container.
>>
>> The machine property, though, could communicate /the intent/ to add
>> any VFIO or vDPA devices in the future and maybe cover for that. That
>> let's one avoid any 'accelerator-only' problems and restrict the issues
>> of this series to VMs with VFIO/VDPA if the risk is a concern.
>>
>>> Alex what do you think about adding this?
>>>
>>> Given we do not have such a thing right now, one can get
>>> into a situation where phys bits limit CPU, then
>>> more memory is supplied than would fit above reserved region, then
>>> we stick the memory over the reserved region, and finally
>>> then vfio device is added.
>>>   
>> The current code considers the maximum possible address considering
>> memory hotplug, PCI hole64 and etc. So we would need to worry about
>> whether VFIO or VDPA is going to be hotplugged at some point in the
>> future during guest lifecycle, do decide to alter the memory layout
>> at guest provisioning.
>>
>>> In this last configuration, should vfio device add fail?
>>> Or just warn and try to continue? I think we can code this last
>>> decision as part of vfio code and then Alex will get to decide ;)
>>> And yes, a similar thing with vdpa.
>>>   
>>
>> Of all those cases I would feel the machine-property is better,
>> and more flexible than having VFIO/VDPA deal with a bad memory-layout and
>> discovering late stage that the user is doing something wrong (and thus
>> fail the DMA_MAP operation for those who do check invalid iovas)
> 
> The trouble is that anything we can glean from the host system where we
> instantiate the VM is mostly meaningless relative to data center
> orchestration.  We're relatively insulated from these sorts of issues
> on x86 (apparently aside from this case), AIUI ARM is even worse about
> having arbitrary reserved ranges within their IOVA space.
> 
In the multi-socket ARM servers we have, I haven't seen many
issues /yet/ with VFIO. I only have this reserved region:

0x0000000008000000 0x00000000080fffff msi

But of course ARM servers aren't very good representatives of the
shifting nature of other ARM machine models. ISTR some thread about GIC ITS
ranges being reserved by the IOMMU on some hardware. Perhaps that's what
you're referring to:

https://lore.kernel.org/qemu-devel/1510622154-17224-1-git-send-email-zhuyijun@huawei.com/

> For a comprehensive solution, it's not a machine accelerator property
> or enable such-and-such functionality flag, it's the ability to specify
> a VM memory map on the QEMU command line and data center orchestration
> tools gaining insight across all their systems to specify a memory
> layout that can work regardless of how a VM might be migrated. 
> Maybe
> there's a "host" option to that memory map command line option that
> would take into account the common case of a static host or at least
> homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
> to generate a least common denominator set such that the VM is
> compatible to any host in the cluster.
> 

I remember you brought up such an idea on the initial RFC. I do like that
option of adjusting the memory map... should any new restrictions appear in
the IOVA space, as opposed to having to change the machine code every time
that happens.


I am trying to approach this iteratively, starting by fixing AMD 1T+ guests
with something that hopefully is less painful to bear and unbreaks users doing
multi-TB guests on kernels >= 5.4. On < 5.4 it also avoids wrongly DMA mapping
bad IOVAs that may lead to the guest's own spurious failures.
For the long term, qemu would need some sort of handling of a configurable
sparse map of all guest RAM, which currently does not exist (it's stuffed
inside on a per-machine basis, as you're aware). What I am unsure about is the
churn associated with it (compat, migration, mem-hotplug, nvdimms,
memory-backends) versus the benefit if it's "just" one class of x86 platforms
(Intel not affected) -- which is what I find attractive about the smaller
change in the past 2 revisions.
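
To make that idea a bit more concrete, a configurable sparse map could be as
simple as an ordered list of GPA ranges with the holes left out. This is
entirely hypothetical, not an existing QEMU structure or option, and the
sizes below are only illustrative:

#include <stdint.h>

/* Hypothetical sparse guest-RAM layout: holes (PCI hole, AMD HT block)
 * are simply absent from the list. Not an existing QEMU interface. */
struct ram_range {
    uint64_t gpa;   /* guest physical start address */
    uint64_t size;  /* length in bytes */
};

static const struct ram_range example_layout[] = {
    { 0x0,              0x80000000ULL },   /* 2G below the PCI hole */
    { 0x100000000ULL,   0xfc00000000ULL }, /* 4G up to the HT block at 1T-12G */
    { 0x10000000000ULL, 0x4000000000ULL }, /* an extra 256G above 1T */
};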

> On the device end, I really would prefer not to see device driver
> specific enables and we simply cannot hot-add a device of the given
> type without a pre-enabled VM.  Give the user visibility and
> configurability to the issue and simply fail the device add (ideally
> with a useful error message) if the device IOVA space cannot support
> the VM memory layout (this is what vfio already does afaik).
> 
> When we have iommufd support common for vfio and vdpa, hopefully we'll
> also be able to recommend a common means for learning about system and
> IOMMU restrictions to IOVA spaces. 

Perhaps even advertising platform-wide regions (without a domain allocated)
that are common to any protection domain (for example, on x86 the MSI/HT
ranges hardcoded by Intel/AMD are one such case).

> For now we have reserved_regions
> reported in sysfs per IOMMU group:
> 
>  $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u | grep -v direct-relaxable

And hopefully iommufd will not want to allow iommu_map() on those reserved
IOVA regions, as opposed to letting that go through (essentially what VFIO
does). Unless of course there's actually a case where iommu_map() over
reserved regions is required (which I am not aware of).
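
As a rough illustration of the kind of reserved-IOVA check being discussed
(the struct and function names below are made up for the example; this is
not kernel code), a map request would be refused whenever it overlaps a
reserved range:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct resv_region {
    uint64_t start;
    uint64_t end;   /* inclusive, e.g. 0xfd00000000 - 0xffffffffff */
};

/* Sketch: reject [iova, iova + size) if it touches any reserved region.
 * Assumes size > 0. */
static bool iova_map_is_valid(uint64_t iova, uint64_t size,
                              const struct resv_region *resv, size_t nr)
{
    uint64_t last = iova + size - 1;

    for (size_t i = 0; i < nr; i++) {
        if (iova <= resv[i].end && last >= resv[i].start) {
            return false;   /* overlaps a reserved IOVA range */
        }
    }
    return true;
}

With the AMD HT block listed as one of those regions, any map touching
0xfd00000000-0xffffffffff would be refused, which is essentially the
behaviour described above.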


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-25  5:22                         ` Michael S. Tsirkin
@ 2022-02-25 12:36                           ` Joao Martins
  0 siblings, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-25 12:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On 2/25/22 05:22, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 08:34:40PM +0000, Joao Martins wrote:
>> On 2/24/22 20:12, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 08:04:48PM +0000, Joao Martins wrote:
>>>> On 2/24/22 19:54, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 24, 2022 at 07:44:26PM +0000, Joao Martins wrote:
>>>>>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 24, 2022 at 05:54:58PM +0000, Joao Martins wrote:
>>>>>>>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 24, 2022 at 04:07:22PM +0000, Joao Martins wrote:
>>>>>>>>>> On 2/23/22 23:35, Joao Martins wrote:
>>>>>>>>>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>>>>>>>>>>>> +                                          uint64_t pci_hole64_size)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>>>>>>>> +        return;
>>>>>>>>>>>>> +    }
>>>>>>>>>>>>
>>>>>>>>>>>> Wait a sec, should this actually be tying things to the host CPU ID?
>>>>>>>>>>>> It's really about what we present to the guest though,
>>>>>>>>>>>> isn't it?
>>>>>>>>>>>
>>>>>>>>>>> It was the easier catch all to use cpuid without going into
>>>>>>>>>>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
>>>>>>>>>>> for systems with an IOMMU present.
>>>>>>>>>>>
>>>>>>>>>>>> Also, can't we tie this to whether the AMD IOMMU is present?
>>>>>>>>>>>>
>>>>>>>>>>> I think so, I can add that. Something like a amd_iommu_exists() helper
>>>>>>>>>>> in util/vfio-helpers.c which checks if there's any sysfs child entries
>>>>>>>>>>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
>>>>>>>>>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think it's
>>>>>>>>>>> even worth checking the range exists in:
>>>>>>>>>>>
>>>>>>>>>>> 	/sys/kernel/iommu_groups/0/reserved_regions
>>>>>>>>>>>
>>>>>>>>>>> (Also that sysfs ABI is >= 4.11 only)
>>>>>>>>>>
>>>>>>>>>> Here's what I have staged in local tree, to address your comment.
>>>>>>>>>>
>>>>>>>>>> Naturally the first chunk is what's affected by this patch the rest is a
>>>>>>>>>> precedessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>>>>>>>>>> all the tests and what not.
>>>>>>>>>>
>>>>>>>>>> I am not entirely sure this is the right place to put such a helper, open
>>>>>>>>>> to suggestions. wrt to the naming of the helper, I tried to follow the rest
>>>>>>>>>> of the file's style.
>>>>>>>>>>

[snip]

>>>>>>>>>
>>>>>>>>> why are we checking whether an AMD IOMMU is present
>>>>>>>>> on the host? 
>>>>>>>>> Isn't what we care about whether it's
>>>>>>>>> present in the VM? What we are reserving after all
>>>>>>>>> is a range of GPAs, not actual host PA's ...
>>>>>>>>>
>>>>>>>> Right.
>>>>>>>>
>>>>>>>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>>>>>>>> and so we need to not map that portion of address space entirely
>>>>>>>> and use some other portion of the GPA-space. This has to
>>>>>>>> do with host IOMMU which is where the DMA mapping of guest PA takes
>>>>>>>> place. So, if you do not have an host IOMMU, you can't
>>>>>>>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
>>>>>>>> don't need this problem.
>>>>>>>>
>>>>>>>> Whether one uses a guest IOMMU or not does not affect the result,
>>>>>>>> it would be more of a case of mimicking real hardware not fixing
>>>>>>>> the issue of this series.
>>>>>>>
>>>>>>>
>>>>>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>>>>>> And ideally move the logic reporting reserved ranges there too.
>>>>>>> However, I think vdpa has the same issue too.
>>>>>>> CC Jason for an opinion.
>>>>>>
>>>>>> It does have the same problem.
>>>>>>
>>>>>> This is not specific to VFIO, it's to anything that uses the iommu.
>>>>>
>>>>> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
>>>>> that we don't have a global "enable-vfio" flag since vfio devices
>>>>> can be hot-plugged. I think this is not the first time we wanted
>>>>> something like this, right Alex?
>>>>>
>>>>>> It's
>>>>>> just that VFIO doesn't let you shoot in the foot :)
>>>>>>
>>>>>> vDPA would need to validate iova ranges as well to prevent mapping on
>>>>>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
>>>>>> map request is within a valid iova range"). Now you could argue that
>>>>>> it's an IOMMU core problem, maybe.
>>>>>>
>>>>>> But I this as a separate problem,
>>>>>> because regardless of said validation, your guest provisioning
>>>>>> is still going to fail for guests with >=1010G of max GPA and even if
>>>>>> it doesn't fail you shouldn't be letting it DMA map those in the first
>>>>>> place (e.g. you could see spurious INVALID_DEVICE_REQUEST fault events
>>>>>> from IOMMU if DMA is attempted over the first megabyte of that 1T hole).
>>>>>
>>>>> I wonder what's the status on a system without an IOMMU.
>>>>>
>>>> In theory you should be OK. Also it's worth keeping in mind that AMD machines
>>>> with >=1T have this 12G hole marked as Reserved, so even DMA at last when CPU
>>>> is the initiator should be fine on worst case scenario.
>>>
>>> Not sure I understand this last sentence.
>>>
>> I was trying to say that the E820 from hardware is marked as Reserved and any DMA
>> from/to endpoints from kernel-supplied CPU addresses will use those reserved ranges.
> 
> meaning "will *not* use" I guess?
> 
Yes, correct.

Sorry, I missed a word there. That's been happening to me quite a lot these days :(

>>>>>>> Also, some concerns
>>>>>>> - I think this changes memory layout for working VMs not using VFIO.
>>>>>>>   Better preserve the old layout for old machine types?
>>>>>>>
>>>>>> Oh definitely, and I do that in this series. See the last commit, and
>>>>>> in the past it was also a machine-property exposed to userspace.
>>>>>> Otherwise I would be breaking (badly) compat/migration.
>>>>>>
>>>>>> I would like to emphasize that these memory layout changes are *exclusive* and
>>>>>> *only* for hosts on AMD with guests memory being bigger than ~950G-~1010G.
>>>>>> It's not every guest, but a fairly limited subset of big-memory guests that
>>>>>> are not the norm.
>>>>>>
>>>>>> Albeit the phys-bits property errors might gives us a bit more trouble, so
>>>>>> it might help being more conservative.
>>>>>>
>>>>>>> - You mention the phys bits issue very briefly, and it's pretty
>>>>>>>   concerning.  Do we maybe want to also disable the work around if phys
>>>>>>>   bits is too small? 
>>>>>>
>>>>>> We are doing that here (well, v4), as suggested by Igor. Note that in this series
>>>>>> it's a warning, but v4 I have it as an error and with the 32-bit issues that
>>>>>> I found through qtest.
>>>>>>
>>>>>> I share the same concern as you over making this an error because of compatibility.
>>>>>> Perhaps, we could go back to the previous version and turn back into a warning and
>>>>>> additionally even disabling the relocation all together. Or if desired even
>>>>>> depending on a machine-property to enable it.
>>>>>
>>>>> I would be inclined to disable the relocation. And maybe block vfio? I'm
>>>>> not 100% sure about that last one, but this can be a vfio decision to
>>>>> make.
>>>>>
>>>> I don't see this as a VFIO problem (VFIO is actually being a good citizen and doing the
>>>> right thing :)). From my perspective this fix is also useful to vDPA (which we also care),
>>>> and thus will also let us avoid DMA mapping in these GPA ranges.
>>>>
>>>> The reason I mention VFIO in cover letter is that it's one visible UAPI change that
>>>> users will notice that suddenly their 1TB+ VMs break.
>>>
>>> What I mean is that most VMs don't use either VFIO or VDPA.
>>> If we had VFIO,VDPA as an accelerator that needs to be listed
>>> upfront when qemu is started, we could solve this with
>>> a bit less risk by not changing anything for VMs without these two.
>>>
>> That wouldn't work for the vfio/vdpa hotplug case (which we do use),
>> and part of the reason I picked to always avoid the 1TB hole. VFIO even tells
>> you what are those allowed IOVA ranges when you create the container.
>>
>> The machine property, though, could communicate /the intent/ to add
>> any VFIO or vDPA devices in the future and maybe cover for that. That
>> let's one avoid any 'accelerator-only' problems and restrict the issues
>> of this series to VMs with VFIO/VDPA if the risk is a concern.
> 
> Well Alex nacked that idea (and I certainly trust him where VFIO is
> concerned), I guess for now we'll just live with the risk.
> 

My reading was that he nacked a 'VFIO-only' global property, not necessarily
the machine property for valid-iova. Hmm, I might be misreading it, as at the
end of the day the result would be the same.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-25 12:36                           ` Joao Martins
@ 2022-02-25 12:49                             ` Michael S. Tsirkin
  2022-02-25 17:40                               ` Joao Martins
  2022-02-25 16:15                             ` Alex Williamson
  1 sibling, 1 reply; 29+ messages in thread
From: Michael S. Tsirkin @ 2022-02-25 12:49 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On Fri, Feb 25, 2022 at 12:36:24PM +0000, Joao Martins wrote:
> I am trying to approach this iteratively and starting by fixing AMD 1T+ guests
> with something that hopefully is less painful to bear and unbreaks users doing
> multi-TB guests on kernels >= 5.4. While for < 5.4 it would not wrongly be
> DMA mapping bad IOVAs that may lead guests own spurious failures.
> For the longterm, qemu would need some sort of handling of configurable a sparse
> map of all guest RAM which currently does not exist (and it's stuffed inside on a
> per-machine basis as you're aware). What I am unsure is the churn associated
> with it (compat, migration, mem-hotplug, nvdimms, memory-backends) versus benefit
> if it's "just" one class of x86 platforms (Intel not affected) -- which is what I find
> attractive with the past 2 revisions via smaller change.

Right. I pondered this for a while and I wonder whether you considered
making this depend on the guest cpu vendor and max phys bits.  Things
are easier to debug if the memory map is the same whatever the host. The
guest vendor typically matches the host cpu vendor after all, and there
just could be guests avoiding the reserved memory ranges on principle.

We'll need a bunch of code comments explaining all this hackery, as well
as machine type compat things, but that is par for the course.

Additionally, we could have a host check and then fail to init vdpa and
vfio devices if the memory map will make some memory inaccessible.

Does this sound reasonable to others? Alex? Joao?

-- 
MST
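
Purely as an illustration of the proposal above (the struct below is a
stand-in for the guest vCPU configuration, not QEMU's actual data
structures), the gate could boil down to something like:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of what is configured for the guest vCPUs. */
struct guest_cpu_cfg {
    char vendor[13];    /* CPUID vendor string, e.g. "AuthenticAMD" */
    unsigned phys_bits; /* physical address bits presented to the guest */
};

#define AMD_HT_START 0xfd00000000ULL    /* start of the reserved HT block */

/* Sketch: relocate only for AMD guests whose address space reaches the HT
 * block, and only if phys-bits can address memory placed above 1T. */
static bool should_relocate_above_1t(const struct guest_cpu_cfg *cpu,
                                     uint64_t max_used_gpa)
{
    if (strcmp(cpu->vendor, "AuthenticAMD") != 0) {
        return false;
    }
    if (max_used_gpa < AMD_HT_START) {
        return false;   /* RAM never reaches the reserved block */
    }
    return cpu->phys_bits > 40;     /* addressing 1T needs 41 bits */
}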



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-25 12:36                           ` Joao Martins
  2022-02-25 12:49                             ` Michael S. Tsirkin
@ 2022-02-25 16:15                             ` Alex Williamson
  2022-02-25 17:40                               ` Joao Martins
  1 sibling, 1 reply; 29+ messages in thread
From: Alex Williamson @ 2022-02-25 16:15 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S. Tsirkin, Jason Wang,
	Richard Henderson, qemu-devel, Daniel Jordan, David Edmondson,
	Paolo Bonzini, Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On Fri, 25 Feb 2022 12:36:24 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/24/22 21:40, Alex Williamson wrote:
> > On Thu, 24 Feb 2022 20:34:40 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >> Of all those cases I would feel the machine-property is better,
> >> and more flexible than having VFIO/VDPA deal with a bad memory-layout and
> >> discovering late stage that the user is doing something wrong (and thus
> >> fail the DMA_MAP operation for those who do check invalid iovas)  
> > 
> > The trouble is that anything we can glean from the host system where we
> > instantiate the VM is mostly meaningless relative to data center
> > orchestration.  We're relatively insulated from these sorts of issues
> > on x86 (apparently aside from this case), AIUI ARM is even worse about
> > having arbitrary reserved ranges within their IOVA space.
> >   
> In the multi-socket servers we have for ARM I haven't seen much
> issues /yet/ with VFIO. I only have this reserved region:
> 
> 0x0000000008000000 0x00000000080fffff msi
> 
> But of course ARM servers aren't very good representatives of the
> shifting nature of other ARM machine models. ISTR some thread about GIC ITS ranges
> being reserved by IOMMU in some hardware. Perhaps that's what you might
> be referring to about:
> 
> https://lore.kernel.org/qemu-devel/1510622154-17224-1-git-send-email-zhuyijun@huawei.com/


Right, and notice there also that the msi range is different.  On x86
the msi block is defined by the processor, not the platform and we have
commonality between Intel and AMD on that range.  We emulate the same
range in the guest, so for any x86 guest running on an x86 host, the
msi range is a non-issue because they overlap due to the architectural
standards.

How do you create an ARM guest that reserves a block at both 0x8000000
for your host and 0xc6000000 for the host in the above link?  Whatever
solution we develop to resolve that issue should equally apply to the
AMD reserved block:

0x000000fd00000000 0x000000ffffffffff reserved
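
As a sketch of what such a common solution could look like (names and
structure hypothetical), one could bump a proposed RAM window past any
reserved GPA block it would overlap, which covers both the ARM MSI/ITS
examples and the AMD block above:

#include <stddef.h>
#include <stdint.h>

struct gpa_block {
    uint64_t start;
    uint64_t end;   /* inclusive */
};

/* Sketch: move a RAM region of 'size' bytes starting at 'start' past any
 * reserved block it would overlap. Assumes resv[] is sorted by start and
 * size > 0. */
static uint64_t gpa_start_avoiding_reserved(uint64_t start, uint64_t size,
                                            const struct gpa_block *resv,
                                            size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        uint64_t last = start + size - 1;

        if (start <= resv[i].end && last >= resv[i].start) {
            start = resv[i].end + 1;    /* place RAM just above the block */
        }
    }
    return start;
}

With the AMD block above, the adjusted start lands exactly at 1T, matching
the relocation this series applies to RAM above 4G.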

> > For a comprehensive solution, it's not a machine accelerator property
> > or enable such-and-such functionality flag, it's the ability to specify
> > a VM memory map on the QEMU command line and data center orchestration
> > tools gaining insight across all their systems to specify a memory
> > layout that can work regardless of how a VM might be migrated. 
> > Maybe
> > there's a "host" option to that memory map command line option that
> > would take into account the common case of a static host or at least
> > homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
> > to generate a least common denominator set such that the VM is
> > compatible to any host in the cluster.
> >   
> 
> I remember you iterated over the initial RFC over such idea. I do like that
> option of adjusting memory map... should any new restrictions appear in the
> IOVA space appear as opposed to have to change the machine code everytime
> that happens.
> 
> 
> I am trying to approach this iteratively and starting by fixing AMD 1T+ guests
> with something that hopefully is less painful to bear and unbreaks users doing
> multi-TB guests on kernels >= 5.4. While for < 5.4 it would not wrongly be
> DMA mapping bad IOVAs that may lead guests own spurious failures.
> For the longterm, qemu would need some sort of handling of configurable a sparse
> map of all guest RAM which currently does not exist (and it's stuffed inside on a
> per-machine basis as you're aware). What I am unsure is the churn associated
> with it (compat, migration, mem-hotplug, nvdimms, memory-backends) versus benefit
> if it's "just" one class of x86 platforms (Intel not affected) -- which is what I find
> attractive with the past 2 revisions via smaller change.
> 
> > On the device end, I really would prefer not to see device driver
> > specific enables and we simply cannot hot-add a device of the given
> > type without a pre-enabled VM.  Give the user visibility and
> > configurability to the issue and simply fail the device add (ideally
> > with a useful error message) if the device IOVA space cannot support
> > the VM memory layout (this is what vfio already does afaik).
> > 
> > When we have iommufd support common for vfio and vdpa, hopefully we'll
> > also be able to recommend a common means for learning about system and
> > IOMMU restrictions to IOVA spaces.   
> 
> Perhaps even advertising platform-wide regions (without a domain allocated) that
> are common in any protection domain (for example on x86 this is one
> such case where MSI/HT ranges are hardcoded in Intel/AMD).
> 
> > For now we have reserved_regions
> > reported in sysfs per IOMMU group:
> > 
> >  $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u | grep -v direct-relaxable  
> 
> And hopefully iommufd might not want to allow iommu_map() on those reserved
> IOVA regions as opposed to letting that go through. Essentially what VFIO does. Unless of
> course there's actually a case where this is required to iommu_map reserved regions (which
> I don't know).

iommufd is being designed to support a direct replacement for the vfio
specific type1 IOMMU backend, so it will need to have this feature.
Allowing userspace to create invalid mappings would be irresponsible.

I'd tend to agree with MST's recommendation for a more piece-wise
solution, tie the memory map to the vCPU vendor rather than to some
property of the host to account for this reserved range on AMD.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-25 12:49                             ` Michael S. Tsirkin
@ 2022-02-25 17:40                               ` Joao Martins
  0 siblings, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-25 17:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Eduardo Habkost, Jason Wang, Richard Henderson, qemu-devel,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On 2/25/22 12:49, Michael S. Tsirkin wrote:
> On Fri, Feb 25, 2022 at 12:36:24PM +0000, Joao Martins wrote:
>> I am trying to approach this iteratively and starting by fixing AMD 1T+ guests
>> with something that hopefully is less painful to bear and unbreaks users doing
>> multi-TB guests on kernels >= 5.4. While for < 5.4 it would not wrongly be
>> DMA mapping bad IOVAs that may lead guests own spurious failures.
>> For the longterm, qemu would need some sort of handling of configurable sparse
>> map of all guest RAM which currently does not exist (and it's stuffed inside on a
>> per-machine basis as you're aware). What I am unsure is the churn associated
>> with it (compat, migration, mem-hotplug, nvdimms, memory-backends) versus benefit
>> if it's "just" one class of x86 platforms (Intel not affected) -- which is what I find
>> attractive with the past 2 revisions via smaller change.
> 
> Right. I pondered this for a while and I wonder whether you considered
> making this depend on the guest cpu vendor and max phys bits. 

Hmmm, I am considering phys-bits already (or +host-phys-bits) but not
max_host_phys_bits. But I am not sure the latter is relevant for this case.
phys-bits is what we need to gate on, as that's what's ultimately exposed to
the guest based on the various -cpu options. I can bring back what v2 and
prior did: check whether phys-bits are enough before relocating, and bail
out if not.

> Things
> are easier to debug if the memory map is the same whatever the host. The
> guest vendor typically matches the host cpu vendor after all, and there
> just could be guests avoiding the reserved memory ranges on principle.
> 
Regarding guest CPU vendor: if we gate on the guest CPU vendor alone, this
actually increases the span of guests it might affect, compared to just
checking for host AMD IOMMU existence. Checking for the AMD IOMMU would
exclude 1T AMD guests on hosts without an IOMMU, which do not need to
consider this HT reserved range.

I can restrict this solely to the guest CPU being AMD (assuming
-cpu host is covered too), if folks have mixed feelings about checking for
the host AMD IOMMU.

To be clear, checking the guest CPU vendor alone would not capture the case
of using -cpu {Skylake,...} on an AMD host, so the failure would occur just
the same. I assume you're OK with that.

> We'll need a bunch of code comments explaining all this hackery, as well
> as machine type compat things, but that is par for the course.
> 
> Additionally, we could have a host check and then fail to init vdpa and
> vfio devices if the memory map will make some memory inaccessible.
> 
> Does this sound reasonable to others? Alex? Joao?
> 
The earlier part sounds reasonable.

Regarding the device init failure logic, I think the only one that might need
checking is vDPA, as vfio already validates that sort of thing (on >= 5.4).
Albeit, given how agnostic this is to the -devices passed (the memory map
gets adjusted to make it work for vfio/vdpa regardless, whether gated solely
on the guest AMD vendor or on host AMD IOMMU existence) ... I am not sure
this is even needed.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable
  2022-02-25 16:15                             ` Alex Williamson
@ 2022-02-25 17:40                               ` Joao Martins
  0 siblings, 0 replies; 29+ messages in thread
From: Joao Martins @ 2022-02-25 17:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Eduardo Habkost, Michael S. Tsirkin, Jason Wang,
	Richard Henderson, qemu-devel, Daniel Jordan, David Edmondson,
	Paolo Bonzini, Ani Sinha, Igor Mammedov, Suravee Suthikulpanit



On 2/25/22 16:15, Alex Williamson wrote:
> On Fri, 25 Feb 2022 12:36:24 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 2/24/22 21:40, Alex Williamson wrote:
>>> On Thu, 24 Feb 2022 20:34:40 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> Of all those cases I would feel the machine-property is better,
>>>> and more flexible than having VFIO/VDPA deal with a bad memory-layout and
>>>> discovering late stage that the user is doing something wrong (and thus
>>>> fail the DMA_MAP operation for those who do check invalid iovas)  
>>>
>>> The trouble is that anything we can glean from the host system where we
>>> instantiate the VM is mostly meaningless relative to data center
>>> orchestration.  We're relatively insulated from these sorts of issues
>>> on x86 (apparently aside from this case), AIUI ARM is even worse about
>>> having arbitrary reserved ranges within their IOVA space.
>>>   
>> In the multi-socket servers we have for ARM I haven't seen much
>> issues /yet/ with VFIO. I only have this reserved region:
>>
>> 0x0000000008000000 0x00000000080fffff msi
>>
>> But of course ARM servers aren't very good representatives of the
>> shifting nature of other ARM machine models. ISTR some thread about GIC ITS ranges
>> being reserved by IOMMU in some hardware. Perhaps that's what you might
>> be referring to about:
>>
>> https://lore.kernel.org/qemu-devel/1510622154-17224-1-git-send-email-zhuyijun@huawei.com/
> 
> 
> Right, and notice there also that the msi range is different.  On x86
> the msi block is defined by the processor, not the platform and we have
> commonality between Intel and AMD on that range. 

Wasn't aware of that part, i.e. the ARM platform defining it as opposed to
it being architectural -- thanks for the insight.

> We emulate the same
> range in the guest, so for any x86 guest running on an x86 host, the
> msi range is a non-issue because they overlap due to the architectural
> standards.
> 
> How do you create an ARM guest that reserves a block at both 0x8000000
> for your host and 0xc6000000 for the host in the above link?  Whatever
> solution we develop to resolve that issue should equally apply to the
> AMD reserved block:
> 
> 0x000000fd00000000 0x000000ffffffffff reserved
> 
>>> For a comprehensive solution, it's not a machine accelerator property
>>> or enable such-and-such functionality flag, it's the ability to specify
>>> a VM memory map on the QEMU command line and data center orchestration
>>> tools gaining insight across all their systems to specify a memory
>>> layout that can work regardless of how a VM might be migrated. 
>>> Maybe
>>> there's a "host" option to that memory map command line option that
>>> would take into account the common case of a static host or at least
>>> homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
>>> to generate a least common denominator set such that the VM is
>>> compatible to any host in the cluster.
>>>   
>>
>> I remember you iterated over the initial RFC over such idea. I do like that
>> option of adjusting memory map... should any new restrictions appear in the
>> IOVA space appear as opposed to have to change the machine code everytime
>> that happens.
>>
>>
>> I am trying to approach this iteratively and starting by fixing AMD 1T+ guests
>> with something that hopefully is less painful to bear and unbreaks users doing
>> multi-TB guests on kernels >= 5.4. While for < 5.4 it would not wrongly be
>> DMA mapping bad IOVAs that may lead guests own spurious failures.
>> For the longterm, qemu would need some sort of handling of configurable a sparse
>> map of all guest RAM which currently does not exist (and it's stuffed inside on a
>> per-machine basis as you're aware). What I am unsure is the churn associated
>> with it (compat, migration, mem-hotplug, nvdimms, memory-backends) versus benefit
>> if it's "just" one class of x86 platforms (Intel not affected) -- which is what I find
>> attractive with the past 2 revisions via smaller change.
>>
>>> On the device end, I really would prefer not to see device driver
>>> specific enables and we simply cannot hot-add a device of the given
>>> type without a pre-enabled VM.  Give the user visibility and
>>> configurability to the issue and simply fail the device add (ideally
>>> with a useful error message) if the device IOVA space cannot support
>>> the VM memory layout (this is what vfio already does afaik).
>>>
>>> When we have iommufd support common for vfio and vdpa, hopefully we'll
>>> also be able to recommend a common means for learning about system and
>>> IOMMU restrictions to IOVA spaces.   
>>
>> Perhaps even advertising platform-wide regions (without a domain allocated) that
>> are common in any protection domain (for example on x86 this is one
>> such case where MSI/HT ranges are hardcoded in Intel/AMD).
>>
>>> For now we have reserved_regions
>>> reported in sysfs per IOMMU group:
>>>
>>>  $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u | grep -v direct-relaxable  
>>
>> And hopefully iommufd might not want to allow iommu_map() on those reserved
>> IOVA regions as opposed to letting that go through. Essentially what VFIO does. Unless of
>> course there's actually a case where this is required to iommu_map reserved regions (which
>> I don't know).
> 
> iommufd is being designed to support a direct replacement for the vfio
> specific type1 IOMMU backend, 

Yeap, I am aware :)

> so it will need to have this feature.
> Allowing userspace to create invalid mappings would be irresponsible.
> 
> I'd tend to agree with MST's recommendation for a more piece-wise
> solution, tie the memory map to the vCPU vendor rather than to some
> property of the host to account for this reserved range on AMD.  Thanks,

/me nods

Should we tie this to phys-bits? My opting for a particular property of the
host (AMD IOMMU existence versus just an AMD CPU) was actually to minimize
the span of guests this affects. The HT restriction we are talking about
doesn't apply to AMD platforms without an IOMMU enabled, regardless of
whether the host reserves that portion of memmap space for its own bidding.

Intel hosts can still try to expose some form of AMD guest CPU -- but perhaps
I am being paranoid in thinking that's something people do out there with
qemu.


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-02-25 18:26 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-23 18:44 [PATCH v3 0/6] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
2022-02-23 18:44 ` [PATCH v3 1/6] hw/i386: add 4g boundary start to X86MachineState Joao Martins
2022-02-23 18:44 ` [PATCH v3 2/6] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
2022-02-23 18:44 ` [PATCH v3 3/6] i386/pc: pass pci_hole64_size " Joao Martins
2022-02-23 18:44 ` [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable Joao Martins
2022-02-23 21:22   ` Michael S. Tsirkin
2022-02-23 23:35     ` Joao Martins
2022-02-24 16:07       ` Joao Martins
2022-02-24 17:23         ` Michael S. Tsirkin
2022-02-24 17:54           ` Joao Martins
2022-02-24 18:30             ` Michael S. Tsirkin
2022-02-24 19:44               ` Joao Martins
2022-02-24 19:54                 ` Michael S. Tsirkin
2022-02-24 20:04                   ` Joao Martins
2022-02-24 20:12                     ` Michael S. Tsirkin
2022-02-24 20:34                       ` Joao Martins
2022-02-24 21:40                         ` Alex Williamson
2022-02-25 12:36                           ` Joao Martins
2022-02-25 12:49                             ` Michael S. Tsirkin
2022-02-25 17:40                               ` Joao Martins
2022-02-25 16:15                             ` Alex Williamson
2022-02-25 17:40                               ` Joao Martins
2022-02-25  5:22                         ` Michael S. Tsirkin
2022-02-25 12:36                           ` Joao Martins
2022-02-25  3:52               ` Jason Wang
2022-02-24 14:27   ` Joao Martins
2022-02-23 18:44 ` [PATCH v3 5/6] i386/pc: warn if phys-bits is too low Joao Martins
2022-02-24 14:42   ` Joao Martins
2022-02-23 18:44 ` [PATCH v3 6/6] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
