* [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
@ 2022-05-20 10:45 Joao Martins
  2022-05-20 10:45 ` [PATCH v5 1/5] hw/i386: add 4g boundary start to X86MachineState Joao Martins
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Joao Martins @ 2022-05-20 10:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, Joao Martins

v4[5] -> v5:
* Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
* Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
commit message;

---

This series lets QEMU spawn i386 guests with >= 1010G of RAM with VFIO,
particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is valid and
returns -EINVAL when it is not. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with an IOMMU have a hole at
the 1TB boundary which is *reserved* for HyperTransport I/O addresses, located
here: FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3, for what those addresses mean.

VFIO DMA_MAP calls in this IOVA address range fail this check and hence return
-EINVAL, consequently failing the creation of guests bigger than 1010G. Example
of the failure:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
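
As a rough illustration of why the mapping above is rejected, the sketch below
checks a [iova, iova + size) range against the HT range quoted earlier. The
helper name iova_range_hits_ht() is made up for this example; it is not the
kernel's or QEMU's actual validation code:

```c
#include <stdbool.h>
#include <stdint.h>

/* HyperTransport reserved range as quoted in this cover letter. */
#define HT_START 0xfd00000000ULL
#define HT_END   0xffffffffffULL

/*
 * Hypothetical helper: returns true if [iova, iova + size) overlaps the
 * HT reserved range. Assumes iova + size does not wrap around 2^64.
 */
static bool iova_range_hits_ht(uint64_t iova, uint64_t size)
{
    uint64_t last;

    if (size == 0) {
        return false;
    }

    last = iova + size - 1;
    return iova <= HT_END && last >= HT_START;
}
```

With the arguments of the failing call (iova 0x100000000, size 0xff30000000)
the mapping ends at 0x1002fffffff, which is past FD_0000_0000h, so the helper
reports an overlap.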

Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
as documented on the links down below.

This small series tries to address that by dealing with this AMD-specific 1TB hole,
but rather than handling it like the 4G hole, it instead relocates RAM above 4G
to be above 1T if the maximum RAM range crosses the HT reserved range.
It is organized as follows:

patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
         address of the 4G boundary

patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
	     to get access to pci_hole64_size. The actual pci-host
	     initialization is kept as is; only the qdev_new is moved.

patch 4: Change @above_4g_mem_start to 1TiB if we are on AMD and the max
possible address crosses the HT region. Errors out if phys-bits is too
low, which is only the case for >=1010G configurations or something that
crosses the HT region.

patch 5: Ensure valid IOVAs only on new machine types, but not older
ones (<= v7.0.0)

The 'consequence' of this approach is that we may need more than the default
phys-bits, e.g. a guest with >1010G will have most of its RAM after the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40
(TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
pick the right value of phys-bits (regardless of this series), so we error out in
case phys-bits aren't enough. Finally, CMOS loses its meaning for the above-4G
RAM blocks, but it was mentioned during the RFC that CMOS is only useful for very
old SeaBIOS.
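
To make the 40-versus-41 bits arithmetic concrete, here is a minimal sketch
that counts how many address bits are needed to reach a given highest address.
phys_bits_needed() is a hypothetical helper for illustration only; the check
in patch 4 compares against the full max used address, which also includes
device memory and the 64-bit PCI hole:

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of physical address bits needed so that
 * max_addr is representable, i.e. the position of its highest set bit
 * plus one. Returns 0 for max_addr == 0.
 */
static unsigned phys_bits_needed(uint64_t max_addr)
{
    unsigned bits = 0;

    while (max_addr) {
        bits++;
        max_addr >>= 1;
    }

    return bits;
}
```

Anything below 1T (up to 0xffffffffff) fits in 40 bits, while any address at or
above 1T (0x10000000000) has bit 40 set and therefore needs at least 41.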

Additionally, the reserved region is added to E820 if the relocation is done.

Alternative options considered (in RFC[0]):

a) Dealing with the 1T hole like the 4G hole -- which also closely represents
what hardware does.

Thanks,
	Joao

Older Changelog,

v3[4] -> v4[5]:
(changes in patch 4 and 5 only)
* Rebased to 7.1.0, hence move compat machine attribute to <= 7.0.0 versions
* Check guest vCPU vendor rather than host CPU vendor (Michael Tsirkin)
* Squash previous patch 5 into patch 4 to tie in the phys-bits check
  into the relocate-4g-start logic: We now error out if the phys-bits
  aren't enough on configurations that require above-4g ram relocation. (Michael Tsirkin)
* Make the error message more explicit when phys-bits isn't enough to also
  mention: "cannot avoid AMD HT range"
* Add comments inside x86_update_above_4g_mem_start() explaining the
  logic behind it. (Michael Tsirkin)
* Tested on old guests with Linux 2.6.32/3.10/4.14.35/4.1 based kernels
  alongside Win2008/2K12/2K16/2K19 on configs spanning 1T and 2T (Michael Tsirkin).
  Validated -numa topologies too, as well as making sure qtests observe no regressions;

 Notes from v4:

* the machine attribute that enables this new logic (see last patch)
is called ::enforce_valid_iova since the RFC. Let me know if folks think it
is poorly named, and whether something a bit more obvious is preferred
(e.g. ::amd_relocate_1t).

* @mst one of your comments was to add "host checks" in vdpa/vfio devices.
In discussion with Alex and you over the last version of the patches it seems
that we weren't keen on making this device-specific or behind any machine
property flags (besides machine-compat). Just to reiterate there, making sure we do
the above-4g relocation requiring properly sized phys-bits and AMD as vCPU
vendor (as this series does) already ensures that this is going to be right for
offending configurations with a VDPA/VFIO device that might be
configured/hotplugged. Unless you were thinking that somehow vfio/vdpa devices
start poking into machine-specific details when we fail to relocate due to the
lack of phys-bits? Otherwise QEMU just doesn't have enough information to tell
what's a valid IOVA or not, in which case kernel vhost-iotlb/vhost-vdpa is the one
that needs fixing (as VFIO did in v5.4).

RFCv2[3] -> v3[4]:

* Add missing brackets in single line statement, in patch 5 (David)
* Change ranges printf to use PRIx64, in patch 5 (David)
* Move the check to after changing above_4g_mem_start, in patch 5 (David)
* Make the check generic and move it to pc_memory_init rather than being specific
to AMD, as the check is useful to capture invalid phys-bits
configs (patch 5, Igor).
* Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor)
* Consider pci_hole64_size in patch 4 (Igor)
* To consider pci_hole64_size in max used addr we need to get it from pci-host,
so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") or
qdev_new("q35") to be before pc_memory_init().
* Consider sgx_epc.size in max used address, in patch 4 (Igor)
* Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor)
* Keep warn_report() in patch 5, as erroring out will break a few x86_64 qtests
due to pci_hole64 accounting surpassing the maxphysaddr possible with phys-bits.

RFC[0] -> RFCv2[3]:

* At Igor's suggestion in one of the patches I reworked the series entirely,
and more or less as he was thinking it is far simpler to relocate the
ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
and less intrusive. (patch 1 & 2)
* Check phys-bits is big enough prior to relocating (new patch 3)
* Remove the machine property; it is now internal only and set by the new machine
version (Igor, patch 4).
* Clarify whether it's GPA or HPA to make the meaning clearer (Igor, patch 2)
* Add IOMMU SDM in the commit message (Igor, patch 2)

[0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.martins@oracle.com/
[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
[3] https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.martins@oracle.com/T/#u
[4] https://lore.kernel.org/all/20220223184455.9057-1-joao.m.martins@oracle.com/
[5] https://lore.kernel.org/qemu-devel/20220420201138.23854-1-joao.m.martins@oracle.com/

Joao Martins (5):
  hw/i386: add 4g boundary start to X86MachineState
  i386/pc: create pci-host qdev prior to pc_memory_init()
  i386/pc: pass pci_hole64_size to pc_memory_init()
  i386/pc: relocate 4g start to 1T where applicable
  i386/pc: restrict AMD only enforcing of valid IOVAs to new machine
    type

 hw/i386/acpi-build.c         |   2 +-
 hw/i386/pc.c                 | 126 +++++++++++++++++++++++++++++++++--
 hw/i386/pc_piix.c            |  12 +++-
 hw/i386/pc_q35.c             |  14 +++-
 hw/i386/sgx.c                |   2 +-
 hw/i386/x86.c                |   1 +
 hw/pci-host/i440fx.c         |  10 ++-
 include/hw/i386/pc.h         |   4 +-
 include/hw/i386/x86.h        |   3 +
 include/hw/pci-host/i440fx.h |   3 +-
 10 files changed, 161 insertions(+), 16 deletions(-)

-- 
2.17.2




* [PATCH v5 1/5] hw/i386: add 4g boundary start to X86MachineState
  2022-05-20 10:45 [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
@ 2022-05-20 10:45 ` Joao Martins
  2022-06-16 13:05   ` Igor Mammedov
  2022-05-20 10:45 ` [PATCH v5 2/5] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-05-20 10:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, Joao Martins

Rather than hardcoding the 4G boundary everywhere, introduce an
X86MachineState property @above_4g_mem_start and use it
accordingly.

This is in preparation for relocating ram-above-4g to
dynamically start at 1T on AMD platforms.
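
The effect of routing everything through one start address can be sketched
with a simplified stand-in for the machine state. struct mem_layout and the
helpers below are illustrative names, not QEMU types; the model only covers
the case with no device memory and no SGX EPC:

```c
#include <stdint.h>

#define GiB (1ULL << 30)

/* Illustrative stand-in for the relevant X86MachineState fields. */
struct mem_layout {
    uint64_t above_4g_mem_size;
    uint64_t above_4g_mem_start;
};

static void mem_layout_init(struct mem_layout *m)
{
    m->above_4g_mem_size = 0;
    m->above_4g_mem_start = 4 * GiB;    /* default: 0x100000000 */
}

/* Round addr up to the next 1 GiB boundary. */
static uint64_t round_up_1g(uint64_t addr)
{
    return (addr + GiB - 1) & ~(GiB - 1);
}

/* Start of the 64-bit PCI hole, derived from the start field. */
static uint64_t hole64_start(const struct mem_layout *m)
{
    return round_up_1g(m->above_4g_mem_start + m->above_4g_mem_size);
}
```

Because every user derives its address from above_4g_mem_start, changing that
single field later moves the RAM alias, the e820 entry and the hole start
together.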

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/acpi-build.c  | 2 +-
 hw/i386/pc.c          | 9 +++++----
 hw/i386/sgx.c         | 2 +-
 hw/i386/x86.c         | 1 +
 include/hw/i386/x86.h | 3 +++
 5 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index c125939ed6f9..3160b20c9574 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2120,7 +2120,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
                 build_srat_memory(table_data, mem_base, mem_len, i - 1,
                                   MEM_AFFINITY_ENABLED);
             }
-            mem_base = 1ULL << 32;
+            mem_base = x86ms->above_4g_mem_start;
             mem_len = next_base - x86ms->below_4g_mem_size;
             next_base = mem_base + mem_len;
         }
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 7c39c913355b..f7da1d5dd40d 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -832,9 +832,10 @@ void pc_memory_init(PCMachineState *pcms,
                                  machine->ram,
                                  x86ms->below_4g_mem_size,
                                  x86ms->above_4g_mem_size);
-        memory_region_add_subregion(system_memory, 0x100000000ULL,
+        memory_region_add_subregion(system_memory, x86ms->above_4g_mem_start,
                                     ram_above_4g);
-        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
+        e820_add_entry(x86ms->above_4g_mem_start, x86ms->above_4g_mem_size,
+                       E820_RAM);
     }
 
     if (pcms->sgx_epc.size != 0) {
@@ -875,7 +876,7 @@ void pc_memory_init(PCMachineState *pcms,
             machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
         } else {
             machine->device_memory->base =
-                0x100000000ULL + x86ms->above_4g_mem_size;
+                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
         }
 
         machine->device_memory->base =
@@ -1019,7 +1020,7 @@ uint64_t pc_pci_hole64_start(void)
     } else if (pcms->sgx_epc.size != 0) {
             hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
     } else {
-        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
+        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
     }
 
     return ROUND_UP(hole64_start, 1 * GiB);
diff --git a/hw/i386/sgx.c b/hw/i386/sgx.c
index a44d66ba2afc..09d9c7c73d9f 100644
--- a/hw/i386/sgx.c
+++ b/hw/i386/sgx.c
@@ -295,7 +295,7 @@ void pc_machine_init_sgx_epc(PCMachineState *pcms)
         return;
     }
 
-    sgx_epc->base = 0x100000000ULL + x86ms->above_4g_mem_size;
+    sgx_epc->base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
 
     memory_region_init(&sgx_epc->mr, OBJECT(pcms), "sgx-epc", UINT64_MAX);
     memory_region_add_subregion(get_system_memory(), sgx_epc->base,
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 78b05ab7a2d1..af3c790a2830 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1373,6 +1373,7 @@ static void x86_machine_initfn(Object *obj)
     x86ms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
     x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
     x86ms->bus_lock_ratelimit = 0;
+    x86ms->above_4g_mem_start = 0x100000000ULL;
 }
 
 static void x86_machine_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index 9089bdd99c3a..df82c5fd4252 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -56,6 +56,9 @@ struct X86MachineState {
     /* RAM information (sizes, addresses, configuration): */
     ram_addr_t below_4g_mem_size, above_4g_mem_size;
 
+    /* Start address of the initial RAM above 4G */
+    uint64_t above_4g_mem_start;
+
     /* CPU and apic information: */
     bool apic_xrupt_override;
     unsigned pci_irq_mask;
-- 
2.17.2




* [PATCH v5 2/5] i386/pc: create pci-host qdev prior to pc_memory_init()
  2022-05-20 10:45 [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-05-20 10:45 ` [PATCH v5 1/5] hw/i386: add 4g boundary start to X86MachineState Joao Martins
@ 2022-05-20 10:45 ` Joao Martins
  2022-06-16 13:21   ` Reviewed-by: Igor Mammedov
  2022-05-20 10:45 ` [PATCH v5 3/5] i386/pc: pass pci_hole64_size " Joao Martins
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-05-20 10:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, Joao Martins

At the start of pc_memory_init() we usually pass a range of
0..UINT64_MAX as pci_memory, when really it's 2G (i440fx) or
32G (q35). To get the real user value, we need to get the pci-host
passed property for the default pci_hole64_size. Thus, to get that,
create the qdev prior to memory init to make better estimations
of the max used/phys addr.

This is in preparation to determine that host-phys-bits are
enough and also for pci-hole64-size to be considered to relocate
ram-above-4g to be at 1T (on AMD platforms).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc_piix.c            | 5 ++++-
 hw/i386/pc_q35.c             | 6 +++---
 hw/pci-host/i440fx.c         | 3 +--
 include/hw/pci-host/i440fx.h | 2 +-
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 578e537b3525..12d4a279c793 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
     MemoryRegion *pci_memory;
     MemoryRegion *rom_memory;
     ram_addr_t lowmem;
+    DeviceState *i440fx_dev;
 
     /*
      * Calculate ram split, for memory below and above 4G.  It's a bit
@@ -164,9 +165,11 @@ static void pc_init1(MachineState *machine,
         pci_memory = g_new(MemoryRegion, 1);
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
+        i440fx_dev = qdev_new(host_type);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
+        i440fx_dev = NULL;
     }
 
     pc_guest_info_init(pcms);
@@ -199,7 +202,7 @@ static void pc_init1(MachineState *machine,
 
         pci_bus = i440fx_init(host_type,
                               pci_type,
-                              &i440fx_state,
+                              i440fx_dev, &i440fx_state,
                               system_memory, system_io, machine->ram_size,
                               x86ms->below_4g_mem_size,
                               x86ms->above_4g_mem_size,
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 42eb8b97079a..8d867bdb274a 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -203,12 +203,12 @@ static void pc_q35_init(MachineState *machine)
                             pcms->smbios_entry_point_type);
     }
 
-    /* allocate ram and load rom/bios */
-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
-
     /* create pci host bus */
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
 
+    /* allocate ram and load rom/bios */
+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
+
     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
                              OBJECT(ram_memory), NULL);
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index e08716142b6e..5c1bab5c58ed 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -238,6 +238,7 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
 }
 
 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
+                    DeviceState *dev,
                     PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
@@ -247,7 +248,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_address_space,
                     MemoryRegion *ram_memory)
 {
-    DeviceState *dev;
     PCIBus *b;
     PCIDevice *d;
     PCIHostState *s;
@@ -255,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
     unsigned i;
     I440FXState *i440fx;
 
-    dev = qdev_new(host_type);
     s = PCI_HOST_BRIDGE(dev);
     b = pci_root_bus_new(dev, NULL, pci_address_space,
                          address_space_io, 0, TYPE_PCI_BUS);
diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
index f068aaba8fda..c4710445e30a 100644
--- a/include/hw/pci-host/i440fx.h
+++ b/include/hw/pci-host/i440fx.h
@@ -36,7 +36,7 @@ struct PCII440FXState {
 #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
 
 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
-                    PCII440FXState **pi440fx_state,
+                    DeviceState *dev, PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
                     ram_addr_t ram_size,
-- 
2.17.2




* [PATCH v5 3/5] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-05-20 10:45 [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-05-20 10:45 ` [PATCH v5 1/5] hw/i386: add 4g boundary start to X86MachineState Joao Martins
  2022-05-20 10:45 ` [PATCH v5 2/5] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
@ 2022-05-20 10:45 ` Joao Martins
  2022-06-16 13:30   ` Igor Mammedov
  2022-05-20 10:45 ` [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable Joao Martins
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-05-20 10:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, Joao Martins

Use the pre-initialized pci-host qdev and fetch the
pci-hole64-size into the newly added pc_memory_init() argument.
piix needs a bit of care given all the !pci_enabled()
and that the pci_hole64_size is private to i440fx.

This is in preparation to determine that host-phys-bits are
enough and for pci-hole64-size to be considered to relocate
ram-above-4g to be at 1T (on AMD platforms).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c                 | 3 ++-
 hw/i386/pc_piix.c            | 5 ++++-
 hw/i386/pc_q35.c             | 8 +++++++-
 hw/pci-host/i440fx.c         | 7 +++++++
 include/hw/i386/pc.h         | 3 ++-
 include/hw/pci-host/i440fx.h | 1 +
 6 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index f7da1d5dd40d..af52d4ff89ef 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -799,7 +799,8 @@ void xen_load_linux(PCMachineState *pcms)
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory)
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size)
 {
     int linux_boot, i;
     MemoryRegion *option_rom_mr;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 12d4a279c793..57bb5b8f2aea 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
     MemoryRegion *pci_memory;
     MemoryRegion *rom_memory;
     ram_addr_t lowmem;
+    uint64_t hole64_size;
     DeviceState *i440fx_dev;
 
     /*
@@ -166,10 +167,12 @@ static void pc_init1(MachineState *machine,
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
         i440fx_dev = qdev_new(host_type);
+        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
         i440fx_dev = NULL;
+        hole64_size = 0;
     }
 
     pc_guest_info_init(pcms);
@@ -186,7 +189,7 @@ static void pc_init1(MachineState *machine,
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         pc_memory_init(pcms, system_memory,
-                       rom_memory, &ram_memory);
+                       rom_memory, &ram_memory, hole64_size);
     } else {
         pc_system_flash_cleanup_unused(pcms);
         if (machine->kernel_filename != NULL) {
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 8d867bdb274a..4d5c2fbd976b 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -138,6 +138,7 @@ static void pc_q35_init(MachineState *machine)
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     bool acpi_pcihp;
     bool keep_pci_slot_hpc;
+    uint64_t pci_hole64_size = 0;
 
     /* Check whether RAM fits below 4G (leaving 1/2 GByte for IO memory
      * and 256 Mbytes for PCI Express Enhanced Configuration Access Mapping
@@ -206,8 +207,13 @@ static void pc_q35_init(MachineState *machine)
     /* create pci host bus */
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
 
+    if (pcmc->pci_enabled) {
+        pci_hole64_size = q35_host->mch.pci_hole64_size;
+    }
+
     /* allocate ram and load rom/bios */
-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
+                   pci_hole64_size);
 
     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index 5c1bab5c58ed..c5cc28250d5c 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -237,6 +237,13 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
     }
 }
 
+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
+{
+        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
+
+        return i440fx->pci_hole64_size;
+}
+
 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     DeviceState *dev,
                     PCII440FXState **pi440fx_state,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index ffcac5121ed9..9c847faea2f8 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory);
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size);
 uint64_t pc_pci_hole64_start(void);
 DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
 void pc_basic_device_init(struct PCMachineState *pcms,
diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
index c4710445e30a..1299d6a2b0e4 100644
--- a/include/hw/pci-host/i440fx.h
+++ b/include/hw/pci-host/i440fx.h
@@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_memory,
                     MemoryRegion *ram_memory);
 
+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
 
 #endif
-- 
2.17.2




* [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-05-20 10:45 [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (2 preceding siblings ...)
  2022-05-20 10:45 ` [PATCH v5 3/5] i386/pc: pass pci_hole64_size " Joao Martins
@ 2022-05-20 10:45 ` Joao Martins
  2022-06-16 14:23   ` Igor Mammedov
  2022-05-20 10:45 ` [PATCH v5 5/5] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-05-20 10:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, Joao Martins

It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, except for a
tiny region before the 4G. Since Linux v5.4, VFIO validates
whether the selected GPA is indeed valid, i.e. not reserved by
IOMMU on behalf of some specific devices or platform-defined
restrictions, and otherwise fails the ioctl(VFIO_DMA_MAP) with
-EINVAL.

AMD systems with an IOMMU are examples of such platforms and
particularly may only have these ranges as allowed:

	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])

We already account for the 4G hole, albeit if the guest is big
enough we will fail to allocate a guest with >1010G due to the
~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

[*] there is another reserved region unrelated to HT that exists
at the 256T boundary in Fam 17h according to Errata #1286,
also documented in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence
cannot be used as GPAs either. In those cases, rather than
establishing the start of ram-above-4g to be 4G, relocate it
to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
Topology", for more information on the underlying restriction of
IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done, we also add the reserved HT
range to e820 as reserved.

Default phys-bits on QEMU is TCG_PHYS_ADDR_BITS (40) which is enough
to address 1Tb (0xff ffff ffff). On AMD platforms, if a
ram-above-4g relocation is needed and the CPU wasn't configured
with a big enough phys-bits, print an error message to the user
and exit instead of relocating the above-4g region.
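
The decision described in this commit message can be modelled in isolation
roughly as follows. This is a simplified sketch, not the code in the diff
below: struct layout and its fields are illustrative, device memory and SGX
EPC are left out, the "no RAM above 4G" special case is omitted, and failure
is returned to the caller instead of exiting. It assumes phys_bits < 64:

```c
#include <stdbool.h>
#include <stdint.h>

#define GiB                  (1ULL << 30)
#define AMD_HT_START         0xfd00000000ULL
#define AMD_HT_END           0xffffffffffULL
#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)

/* Illustrative stand-in for the machine state used by the check. */
struct layout {
    bool     amd_vcpu;
    unsigned phys_bits;
    uint64_t above_4g_mem_start;
    uint64_t above_4g_mem_size;
    uint64_t pci_hole64_size;
};

static uint64_t round_up(uint64_t v, uint64_t align)
{
    return (v + align - 1) / align * align;
}

/* Highest address used if ram-above-4g starts at @start (simplified). */
static uint64_t max_used_addr(const struct layout *l, uint64_t start)
{
    return round_up(start + l->above_4g_mem_size, GiB) + l->pci_hole64_size;
}

/* Returns false when relocation is needed but phys_bits is too small. */
static bool update_above_4g_start(struct layout *l)
{
    uint64_t maxphysaddr;

    if (!l->amd_vcpu) {
        return true;                /* only AMD vCPUs are affected */
    }
    if (max_used_addr(l, l->above_4g_mem_start) < AMD_HT_START) {
        return true;                /* layout stays below the HT range */
    }

    maxphysaddr = (1ULL << l->phys_bits) - 1;
    if (maxphysaddr < max_used_addr(l, AMD_ABOVE_1TB_START)) {
        return false;               /* cannot avoid the HT range */
    }

    l->above_4g_mem_start = AMD_ABOVE_1TB_START;
    return true;
}
```

For instance, under these assumptions an AMD guest with 1022G above 4G and a
32G pci-hole64 is rejected with 40 phys-bits, while with 42 phys-bits the
start moves to 0x10000000000.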

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index af52d4ff89ef..652ae8ff9ccf 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN       0x800
 #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
 
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static hwaddr x86_max_phys_addr(PCMachineState *pcms,
+                                hwaddr above_4g_mem_start,
+                                uint64_t pci_hole64_size)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    ram_addr_t device_mem_size = 0;
+    hwaddr base;
+
+    if (!x86ms->above_4g_mem_size) {
+       /*
+        * 32-bit pci hole goes from
+        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+        */
+        return IO_APIC_DEFAULT_ADDRESS - 1;
+    }
+
+    if (pcmc->has_reserved_memory &&
+       (machine->ram_size < machine->maxram_size)) {
+        device_mem_size = machine->maxram_size - machine->ram_size;
+    }
+
+    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
+                    pcms->sgx_epc.size, 1 * GiB);
+
+    return base + device_mem_size + pci_hole64_size;
+}
+
+static void x86_update_above_4g_mem_start(PCMachineState *pcms,
+                                          uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    CPUX86State *env = &X86_CPU(first_cpu)->env;
+    hwaddr start = x86ms->above_4g_mem_start;
+    hwaddr maxphysaddr, maxusedaddr;
+
+    /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (!IS_AMD_CPU(env)) {
+        return;
+    }
+
+    /* Bail out if max possible address does not cross HT range */
+    if (x86_max_phys_addr(pcms, start, pci_hole64_size) < AMD_HT_START) {
+        return;
+    }
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is required to be appropriately sized in order
+     * to proceed with the above-4g-region relocation and thus boot.
+     */
+    start = AMD_ABOVE_1TB_START;
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    maxusedaddr = x86_max_phys_addr(pcms, start, pci_hole64_size);
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u) cannot avoid AMD HT range",
+                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+
+
+    x86ms->above_4g_mem_start = start;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -817,6 +921,8 @@ void pc_memory_init(PCMachineState *pcms,
 
     linux_boot = (machine->kernel_filename != NULL);
 
+    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
+
     /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
@@ -827,6 +933,11 @@ void pc_memory_init(PCMachineState *pcms,
                              0, x86ms->below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
+
+    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
+        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+    }
+
     if (x86ms->above_4g_mem_size > 0) {
         ram_above_4g = g_malloc(sizeof(*ram_above_4g));
         memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
-- 
2.17.2
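
The relocation decision in this patch can be sketched outside of QEMU as
follows. This is only an illustration under stated assumptions: struct
mem_layout, round_up(), max_used_addr() and needs_relocation() are
hypothetical names that flatten the fields read by x86_max_phys_addr() and
x86_update_above_4g_mem_start(); the IOAPIC address is assumed to be
0xfec00000.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB                 (1ULL << 30)
#define AMD_HT_START        0xfd00000000ULL   /* 1012 GiB */
#define AMD_HT_END          0xffffffffffULL
#define AMD_ABOVE_1TB_START (AMD_HT_END + 1)  /* 1 TiB */
#define IOAPIC_ADDR_ASSUMED 0xfec00000ULL     /* assumed IOAPIC base */

/* Hypothetical flattened view of the inputs; not a QEMU type. */
struct mem_layout {
    uint64_t above_4g_mem_start;
    uint64_t above_4g_mem_size;
    uint64_t sgx_epc_size;
    uint64_t device_mem_size;   /* maxram_size - ram_size, or 0 */
    uint64_t pci_hole64_size;
};

/* Round v up to a power-of-two alignment. */
static uint64_t round_up(uint64_t v, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (v + align - 1) & ~(align - 1);
}

/* Highest address the guest may use if above-4G RAM begins at 'start'. */
static uint64_t max_used_addr(const struct mem_layout *l, uint64_t start)
{
    uint64_t base;

    if (!l->above_4g_mem_size) {
        /* Only the 32-bit PCI hole, which ends at the IOAPIC. */
        return IOAPIC_ADDR_ASSUMED - 1;
    }

    base = round_up(start + l->above_4g_mem_size + l->sgx_epc_size, GiB);
    return base + l->device_mem_size + l->pci_hole64_size;
}

/* True when the current layout reaches into the HT reserved range. */
static bool needs_relocation(const struct mem_layout *l)
{
    return max_used_addr(l, l->above_4g_mem_start) >= AMD_HT_START;
}
```

For example, a layout with 1008 GiB above 4G and a 32 GiB 64-bit PCI hole
reaches 1044 GiB and so crosses AMD_HT_START, while the same layout with
510 GiB above 4G stays below it and is left untouched.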



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v5 5/5] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-05-20 10:45 [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (3 preceding siblings ...)
  2022-05-20 10:45 ` [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable Joao Martins
@ 2022-05-20 10:45 ` Joao Martins
  2022-06-16 14:27   ` Igor Mammedov
  2022-06-08 10:37 ` [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-06-22 22:37 ` Alex Williamson
  6 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-05-20 10:45 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, Joao Martins

The added enforcing is only relevant in the case of AMD, where the
range right before 1TB is restricted and cannot be DMA mapped
by the kernel, consequently leading to IOMMU INVALID_DEVICE_REQUEST
or possibly other kinds of IOMMU events in the AMD IOMMU.

However, there is a case where it may make sense to disable the
IOVA relocation/validation: when migrating from a
non-valid-IOVA-aware qemu to one that supports it.

Relocating RAM regions to after the 1TB hole has consequences for
guest ABI because we are changing the memory mapping, so make
sure that only new machine types enforce it, but not older ones.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c         | 7 +++++--
 hw/i386/pc_piix.c    | 2 ++
 hw/i386/pc_q35.c     | 2 ++
 include/hw/i386/pc.h | 1 +
 4 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 652ae8ff9ccf..62f9af91f19f 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -862,6 +862,7 @@ static hwaddr x86_max_phys_addr(PCMachineState *pcms,
 static void x86_update_above_4g_mem_start(PCMachineState *pcms,
                                           uint64_t pci_hole64_size)
 {
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
     CPUX86State *env = &X86_CPU(first_cpu)->env;
     hwaddr start = x86ms->above_4g_mem_start;
@@ -870,9 +871,10 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
     /*
      * The HyperTransport range close to the 1T boundary is unique to AMD
      * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
-     * to above 1T to AMD vCPUs only.
+     * to above 1T to AMD vCPUs only. @enforce_valid_iova is only false in
+     * older machine types (<= 7.0) for compatibility purposes.
      */
-    if (!IS_AMD_CPU(env)) {
+    if (!IS_AMD_CPU(env) || !pcmc->enforce_valid_iova) {
         return;
     }
 
@@ -1881,6 +1883,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
     pcmc->has_reserved_memory = true;
     pcmc->kvmclock_enabled = true;
     pcmc->enforce_aligned_dimm = true;
+    pcmc->enforce_valid_iova = true;
     /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
      * to be used at the moment, 32K should be enough for a while.  */
     pcmc->acpi_data_size = 0x20000 + 0x8000;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 57bb5b8f2aea..74176a210d56 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -437,9 +437,11 @@ DEFINE_I440FX_MACHINE(v7_1, "pc-i440fx-7.1", NULL,
 
 static void pc_i440fx_7_0_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_i440fx_7_1_machine_options(m);
     m->alias = NULL;
     m->is_default = false;
+    pcmc->enforce_valid_iova = false;
     compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
     compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
 }
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 4d5c2fbd976b..bc38a6ba4c67 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -381,8 +381,10 @@ DEFINE_Q35_MACHINE(v7_1, "pc-q35-7.1", NULL,
 
 static void pc_q35_7_0_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_q35_7_1_machine_options(m);
     m->alias = NULL;
+    pcmc->enforce_valid_iova = false;
     compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
     compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
 }
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 9c847faea2f8..22119131eca7 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -117,6 +117,7 @@ struct PCMachineClass {
     bool has_reserved_memory;
     bool enforce_aligned_dimm;
     bool broken_reserved_end;
+    bool enforce_valid_iova;
 
     /* generate legacy CPU hotplug AML */
     bool legacy_cpu_hotplug;
-- 
2.17.2
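
The compat pattern used by this patch (class default on, older versioned
machine types switching it off) can be sketched in isolation as below.
This is a minimal illustration, not QEMU code: struct machine_class and
the three functions are hypothetical stand-ins for PCMachineClass,
pc_machine_class_init(), the *_7_0_machine_options() hooks and the early
return in x86_update_above_4g_mem_start().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PCMachineClass; only the new flag is modelled. */
struct machine_class {
    const char *name;
    bool enforce_valid_iova;
};

/* Class defaults: new machine types enforce valid IOVAs. */
static void machine_class_init_defaults(struct machine_class *mc,
                                        const char *name)
{
    assert(mc != NULL);
    mc->name = name;
    mc->enforce_valid_iova = true;
}

/* A 7.0-style machine type inherits the defaults, then opts out. */
static void machine_7_0_options(struct machine_class *mc)
{
    machine_class_init_defaults(mc, "sketch-7.0");
    mc->enforce_valid_iova = false;   /* keep the old guest memory map */
}

/* Gate mirroring the early return: AMD vCPU and flag both required. */
static bool should_relocate(const struct machine_class *mc, bool is_amd_vcpu)
{
    assert(mc != NULL);
    return is_amd_vcpu && mc->enforce_valid_iova;
}
```

With this arrangement a guest started with an older machine type keeps its
above-4G RAM at 4 GiB regardless of vCPU vendor, which is what keeps the
guest ABI stable across migration.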



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
  2022-05-20 10:45 [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (4 preceding siblings ...)
  2022-05-20 10:45 ` [PATCH v5 5/5] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
@ 2022-06-08 10:37 ` Joao Martins
  2022-06-22 22:37 ` Alex Williamson
  6 siblings, 0 replies; 32+ messages in thread
From: Joao Martins @ 2022-06-08 10:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit

On 5/20/22 11:45, Joao Martins wrote:
> v4[5] -> v5:
> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> commit message;
> 
> ---
> 
> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
> particularly when running on AMD systems with an IOMMU.
> 
> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> affected by this extra validation. But AMD systems with IOMMU have a hole in
> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> 
> VFIO DMA_MAP calls in this IOVA address range fail this check and hence return
> -EINVAL, consequently failing the creation of guests bigger than 1010G. Example
> of the failure:
> 
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> 	failed to setup container for group 258: memory listener initialization failed:
> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> 
> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> as documented on the links down below.
> 
> This small series addresses that by dealing with this AMD-specific 1TB hole,
> but rather than treating it like the 4G hole, it instead relocates RAM above 4G
> to be above 1T if the maximum RAM range crosses the HT reserved range.
> It is organized as follows:
> 
> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>          address of the 4G boundary
> 
> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> 	     to get access to pci_hole64_size. The actual pci-host
> 	     initialization is kept as is; only the qdev_new is moved.
> 
> patch 4: Change @above_4g_mem_start to 1TiB if we are on AMD and the max
> possible address crosses the HT region. Errors out if the phys-bits is too
> low, which is only the case for >=1010G configurations or something that
> crosses the HT region.
> 
> patch 5: Ensure valid IOVAs only on new machine types, but not older
> ones (<= v7.0.0)
> 
> The 'consequence' of this approach is that we may need more than the default
> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
> address, consequently needing 41 phys-bits as opposed to the default of 40
> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
> pick the right value of phys-bits (regardless of this series), so we warn in
> case phys-bits aren't enough. Finally, CMOS loses its meaning for the above 4G
> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
> old seabios.
> 
> Additionally, the reserved region is added to E820 if the relocation is done.
> 
> Alternative options considered (in RFC[0]):
> 
> a) Dealing with the 1T hole like the 4G hole -- which also represents what
> hardware closely does.
> 
> Thanks,
> 	Joao
> 

Ping?

> Older Changelog,
> 
> v3[4] -> v4[5]:
> (changes in patch 4 and 5 only)
> * Rebased to 7.1.0, hence move compat machine attribute to <= 7.0.0 versions
> * Check guest vCPU vendor rather than host CPU vendor (Michael Tsirkin)
> * Squash previous patch 5 into patch 4 to tie in the phys-bits check
>   into the relocate-4g-start logic: We now error out if the phys-bits
>   aren't enough on configurations that require above-4g ram relocation. (Michael Tsirkin)
> * Make the error message more explicit when phys-bits isn't enough to also
>   mention: "cannot avoid AMD HT range"
> * Add comments inside x86_update_above_4g_mem_start() explaining the
>   logic behind it. (Michael Tsirkin)
> * Tested on old guests with Linux 2.6.32/3.10/4.14.35/4.1 based kernels
>   alongside Win2008/2K12/2K16/2K19 on configs spanning 1T and 2T (Michael Tsirkin)
>   Validated -numa topologies too as well as making sure qtests observe no regressions;
> 
>  Notes from v4:
> 
> * the machine attribute that enables this new logic (see last patch)
> is called ::enforce_valid_iova since the RFC. Let me know if folks think it
> is poorly named, and whether something a bit more obvious is preferred
> (e.g. ::amd_relocate_1t).
> 
> * @mst one of the comments you said was to add "host checks" in vdpa/vfio devices.
> In discussion with Alex and you over the last version of the patches it seems
> that we weren't keen on making this device-specific or behind any machine
> property flags (besides machine-compat). Just to reiterate there, making sure we do
> the above-4g relocation requiring properly sized phys-bits and AMD as vCPU
> vendor (as this series) already ensures that this is going to be right for
> offending configuration with VDPA/VFIO device that might be
> configured/hotplugged. Unless you were thinking that somehow vfio/vdpa devices
> start poking into machine-specific details when we fail to relocate due to the
> lack of phys-bits? Otherwise Qemu, just doesn't have enough information to tell
> what's a valid IOVA or not, in which case kernel vhost-iotlb/vhost-vdpa is the one
> that needs fixing (as VFIO did in v5.4).
> 
> RFCv2[3] -> v3[4]:
> 
> * Add missing brackets in single line statement, in patch 5 (David)
> * Change ranges printf to use PRIx64, in patch 5 (David)
> * Move the check to after changing above_4g_mem_start, in patch 5 (David)
> * Make the check generic and move it to pc_memory_init rather being specific
> to AMD, as the check is useful to capture invalid phys-bits
> configs (patch 5, Igor).
> * Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor)
> * Consider pci_hole64_size in patch 4 (Igor)
> * To consider pci_hole64_size in max used addr we need to get it from pci-host,
> so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") or
> qdev_new("q35") to be before pc_memory_init().
> * Consider sgx_epc.size in max used address, in patch 4 (Igor)
> * Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor)
> * Keep warn_report() in patch 5, as erroring out will break a few x86_64 qtests
> due to pci_hole64 accounting surpassing the maxphysaddr that phys-bits allows.
> 
> RFC[0] -> RFCv2[3]:
> 
> * At Igor's suggestion in one of the patches I reworked the series entirely,
> and more or less as he was thinking it is far simpler to relocate the
> ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
> and less intrusive. (patch 1 & 2)
> * Check phys-bits is big enough prior to relocating (new patch 3)
> * Remove the machine property, and it's only internal and set by new machine
> version (Igor, patch 4).
> * Clarify whether it's GPA or HPA as a more clear meaning (Igor, patch 2)
> * Add IOMMU SDM in the commit message (Igor, patch 2)
> 
> [0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.martins@oracle.com/
> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> [3] https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.martins@oracle.com/T/#u
> [4] https://lore.kernel.org/all/20220223184455.9057-1-joao.m.martins@oracle.com/
> [5] https://lore.kernel.org/qemu-devel/20220420201138.23854-1-joao.m.martins@oracle.com/
> 
> Joao Martins (5):
>   hw/i386: add 4g boundary start to X86MachineState
>   i386/pc: create pci-host qdev prior to pc_memory_init()
>   i386/pc: pass pci_hole64_size to pc_memory_init()
>   i386/pc: relocate 4g start to 1T where applicable
>   i386/pc: restrict AMD only enforcing of valid IOVAs to new machine
>     type
> 
>  hw/i386/acpi-build.c         |   2 +-
>  hw/i386/pc.c                 | 126 +++++++++++++++++++++++++++++++++--
>  hw/i386/pc_piix.c            |  12 +++-
>  hw/i386/pc_q35.c             |  14 +++-
>  hw/i386/sgx.c                |   2 +-
>  hw/i386/x86.c                |   1 +
>  hw/pci-host/i440fx.c         |  10 ++-
>  include/hw/i386/pc.h         |   4 +-
>  include/hw/i386/x86.h        |   3 +
>  include/hw/pci-host/i440fx.h |   3 +-
>  10 files changed, 161 insertions(+), 16 deletions(-)
> 
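
The phys-bits requirement discussed in the cover letter can be worked
through with a small sketch. It assumes the same comparison the series
uses (maxphysaddr = (1 << phys_bits) - 1 must not be below the maximum
used address); min_phys_bits() and phys_bits_sufficient() are hypothetical
helper names, not QEMU functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Smallest phys-bits whose address limit covers max_used_addr. */
static unsigned min_phys_bits(uint64_t max_used_addr)
{
    unsigned bits = 1;

    while (bits < 64 && ((1ULL << bits) - 1) < max_used_addr) {
        bits++;
    }
    return bits;
}

/* The check applied before relocating: is the address space big enough? */
static bool phys_bits_sufficient(unsigned phys_bits, uint64_t max_used_addr)
{
    uint64_t maxphysaddr;

    assert(phys_bits >= 1 && phys_bits <= 63);  /* keep the shift defined */
    maxphysaddr = (1ULL << phys_bits) - 1;
    return maxphysaddr >= max_used_addr;
}
```

As a worked case: with above-4G RAM relocated to 1 TiB, 900 GiB of RAM
above 4G and no 64-bit PCI hole, the maximum used address is 1924 GiB, so
40 bits (1 TiB) are not enough and 41 are. Adding a 32 GiB hole to
1008 GiB of relocated RAM gives 2064 GiB, which exceeds 2 TiB and would
need 42 bits.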


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 1/5] hw/i386: add 4g boundary start to X86MachineState
  2022-05-20 10:45 ` [PATCH v5 1/5] hw/i386: add 4g boundary start to X86MachineState Joao Martins
@ 2022-06-16 13:05   ` Igor Mammedov
  2022-06-17 10:57     ` Joao Martins
  0 siblings, 1 reply; 32+ messages in thread
From: Igor Mammedov @ 2022-06-16 13:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Fri, 20 May 2022 11:45:28 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Rather than hardcoding the 4G boundary everywhere, introduce a
> X86MachineState property @above_4g_mem_start and use it
so far it's just a field, not a property (please fix the commit message)

> accordingly.
> 
> This is in preparation for relocating ram-above-4g to be
> dynamically start at 1T on AMD platforms.

possibly needs to be rebased on top of current master to include cxl_base

with comments fixed

Reviewed-by: Igor Mammedov <imammedo@redhat.com>

> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/acpi-build.c  | 2 +-
>  hw/i386/pc.c          | 9 +++++----
>  hw/i386/sgx.c         | 2 +-
>  hw/i386/x86.c         | 1 +
>  include/hw/i386/x86.h | 3 +++
>  5 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index c125939ed6f9..3160b20c9574 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2120,7 +2120,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>                  build_srat_memory(table_data, mem_base, mem_len, i - 1,
>                                    MEM_AFFINITY_ENABLED);
>              }
> -            mem_base = 1ULL << 32;
> +            mem_base = x86ms->above_4g_mem_start;
>              mem_len = next_base - x86ms->below_4g_mem_size;
>              next_base = mem_base + mem_len;
>          }
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 7c39c913355b..f7da1d5dd40d 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -832,9 +832,10 @@ void pc_memory_init(PCMachineState *pcms,
>                                   machine->ram,
>                                   x86ms->below_4g_mem_size,
>                                   x86ms->above_4g_mem_size);
> -        memory_region_add_subregion(system_memory, 0x100000000ULL,
> +        memory_region_add_subregion(system_memory, x86ms->above_4g_mem_start,
>                                      ram_above_4g);
> -        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
> +        e820_add_entry(x86ms->above_4g_mem_start, x86ms->above_4g_mem_size,
> +                       E820_RAM);
>      }
>  
>      if (pcms->sgx_epc.size != 0) {
> @@ -875,7 +876,7 @@ void pc_memory_init(PCMachineState *pcms,
>              machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
>          } else {
>              machine->device_memory->base =
> -                0x100000000ULL + x86ms->above_4g_mem_size;
> +                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>          }
>  
>          machine->device_memory->base =
> @@ -1019,7 +1020,7 @@ uint64_t pc_pci_hole64_start(void)
>      } else if (pcms->sgx_epc.size != 0) {
>              hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
>      } else {
> -        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
> +        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>      }
>  
>      return ROUND_UP(hole64_start, 1 * GiB);
> diff --git a/hw/i386/sgx.c b/hw/i386/sgx.c
> index a44d66ba2afc..09d9c7c73d9f 100644
> --- a/hw/i386/sgx.c
> +++ b/hw/i386/sgx.c
> @@ -295,7 +295,7 @@ void pc_machine_init_sgx_epc(PCMachineState *pcms)
>          return;
>      }
>  
> -    sgx_epc->base = 0x100000000ULL + x86ms->above_4g_mem_size;
> +    sgx_epc->base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>  
>      memory_region_init(&sgx_epc->mr, OBJECT(pcms), "sgx-epc", UINT64_MAX);
>      memory_region_add_subregion(get_system_memory(), sgx_epc->base,
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 78b05ab7a2d1..af3c790a2830 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -1373,6 +1373,7 @@ static void x86_machine_initfn(Object *obj)
>      x86ms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
>      x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
>      x86ms->bus_lock_ratelimit = 0;
> +    x86ms->above_4g_mem_start = 0x100000000ULL;

s/0x.../4 * GiB/

>  }
>  
>  static void x86_machine_class_init(ObjectClass *oc, void *data)
> diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
> index 9089bdd99c3a..df82c5fd4252 100644
> --- a/include/hw/i386/x86.h
> +++ b/include/hw/i386/x86.h
> @@ -56,6 +56,9 @@ struct X86MachineState {
>      /* RAM information (sizes, addresses, configuration): */
>      ram_addr_t below_4g_mem_size, above_4g_mem_size;
>  
> +    /* Start address of the initial RAM above 4G */
> +    uint64_t above_4g_mem_start;
> +
>      /* CPU and apic information: */
>      bool apic_xrupt_override;
>      unsigned pci_irq_mask;



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 2/5] i386/pc: create pci-host qdev prior to pc_memory_init()
  2022-05-20 10:45 ` [PATCH v5 2/5] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
@ 2022-06-16 13:21   ` Igor Mammedov
  2022-06-17 11:03     ` Joao Martins
  2022-06-20  7:12     ` Mark Cave-Ayland
  0 siblings, 2 replies; 32+ messages in thread
From: Igor Mammedov @ 2022-06-16 13:21 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Fri, 20 May 2022 11:45:29 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> At the start of pc_memory_init() we usually pass a range of
> 0..UINT64_MAX as pci_memory, when really its 2G (i440fx) or
> 32G (q35). To get the real user value, we need to get pci-host
> passed property for default pci_hole64_size. Thus to get that,
> create the qdev prior to memory init to better make estimations
> on max used/phys addr.
> 
> This is in preparation to determine that host-phys-bits are
> enough and also for pci-hole64-size to be considered to relocate
> ram-above-4g to be at 1T (on AMD platforms).

with comments below fixed
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc_piix.c            | 5 ++++-
>  hw/i386/pc_q35.c             | 6 +++---
>  hw/pci-host/i440fx.c         | 3 +--
>  include/hw/pci-host/i440fx.h | 2 +-
>  4 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index 578e537b3525..12d4a279c793 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
>      MemoryRegion *pci_memory;
>      MemoryRegion *rom_memory;
>      ram_addr_t lowmem;
> +    DeviceState *i440fx_dev;
>  
>      /*
>       * Calculate ram split, for memory below and above 4G.  It's a bit
> @@ -164,9 +165,11 @@ static void pc_init1(MachineState *machine,
>          pci_memory = g_new(MemoryRegion, 1);
>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>          rom_memory = pci_memory;
> +        i440fx_dev = qdev_new(host_type);
>      } else {
>          pci_memory = NULL;
>          rom_memory = system_memory;
> +        i440fx_dev = NULL;
>      }
>  
>      pc_guest_info_init(pcms);
> @@ -199,7 +202,7 @@ static void pc_init1(MachineState *machine,
>  
>          pci_bus = i440fx_init(host_type,
>                                pci_type,
> -                              &i440fx_state,
> +                              i440fx_dev, &i440fx_state,
confusing names; suggest renaming i440fx_state -> pci_i440fx and
i440fx_dev -> i440fx_host, or something like this

>                                system_memory, system_io, machine->ram_size,
>                                x86ms->below_4g_mem_size,
>                                x86ms->above_4g_mem_size,
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 42eb8b97079a..8d867bdb274a 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -203,12 +203,12 @@ static void pc_q35_init(MachineState *machine)
>                              pcms->smbios_entry_point_type);
>      }
>  
> -    /* allocate ram and load rom/bios */
> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
> -
>      /* create pci host bus */
>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>  
> +    /* allocate ram and load rom/bios */
> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
> +
>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>                               OBJECT(ram_memory), NULL);
> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
> index e08716142b6e..5c1bab5c58ed 100644
> --- a/hw/pci-host/i440fx.c
> +++ b/hw/pci-host/i440fx.c
> @@ -238,6 +238,7 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>  }
>  
>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,

does it still need 'host_type'?

> +                    DeviceState *dev,
>                      PCII440FXState **pi440fx_state,
>                      MemoryRegion *address_space_mem,
>                      MemoryRegion *address_space_io,
> @@ -247,7 +248,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>                      MemoryRegion *pci_address_space,
>                      MemoryRegion *ram_memory)
>  {
> -    DeviceState *dev;
>      PCIBus *b;
>      PCIDevice *d;
>      PCIHostState *s;
> @@ -255,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>      unsigned i;
>      I440FXState *i440fx;
>  
> -    dev = qdev_new(host_type);
>      s = PCI_HOST_BRIDGE(dev);
>      b = pci_root_bus_new(dev, NULL, pci_address_space,
>                           address_space_io, 0, TYPE_PCI_BUS);
> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
> index f068aaba8fda..c4710445e30a 100644
> --- a/include/hw/pci-host/i440fx.h
> +++ b/include/hw/pci-host/i440fx.h
> @@ -36,7 +36,7 @@ struct PCII440FXState {
>  #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
>  
>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> -                    PCII440FXState **pi440fx_state,
> +                    DeviceState *dev, PCII440FXState **pi440fx_state,
>                      MemoryRegion *address_space_mem,
>                      MemoryRegion *address_space_io,
>                      ram_addr_t ram_size,



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 3/5] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-05-20 10:45 ` [PATCH v5 3/5] i386/pc: pass pci_hole64_size " Joao Martins
@ 2022-06-16 13:30   ` Igor Mammedov
  2022-06-16 14:16     ` Michael S. Tsirkin
  2022-06-17 11:13     ` Joao Martins
  0 siblings, 2 replies; 32+ messages in thread
From: Igor Mammedov @ 2022-06-16 13:30 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Fri, 20 May 2022 11:45:30 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Use the pre-initialized pci-host qdev and fetch the
> pci-hole64-size into pc_memory_init() newly added argument.
> piix needs a bit of care given all the !pci_enabled()
> and that the pci_hole64_size is private to i440fx.
> 
> This is in preparation to determine that host-phys-bits are
> enough and for pci-hole64-size to be considered to relocate
> ram-above-4g to be at 1T (on AMD platforms).

modulo nit below

Reviewed-by: Igor Mammedov <imammedo@redhat.com>

> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c                 | 3 ++-
>  hw/i386/pc_piix.c            | 5 ++++-
>  hw/i386/pc_q35.c             | 8 +++++++-
>  hw/pci-host/i440fx.c         | 7 +++++++
>  include/hw/i386/pc.h         | 3 ++-
>  include/hw/pci-host/i440fx.h | 1 +
>  6 files changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index f7da1d5dd40d..af52d4ff89ef 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -799,7 +799,8 @@ void xen_load_linux(PCMachineState *pcms)
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> -                    MemoryRegion **ram_memory)
> +                    MemoryRegion **ram_memory,
> +                    uint64_t pci_hole64_size)
>  {
>      int linux_boot, i;
>      MemoryRegion *option_rom_mr;
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index 12d4a279c793..57bb5b8f2aea 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
>      MemoryRegion *pci_memory;
>      MemoryRegion *rom_memory;
>      ram_addr_t lowmem;
> +    uint64_t hole64_size;

init it to 0 right here to avoid the chance of an uninitialized variable running amok?

>      DeviceState *i440fx_dev;
>  
>      /*
> @@ -166,10 +167,12 @@ static void pc_init1(MachineState *machine,
>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>          rom_memory = pci_memory;
>          i440fx_dev = qdev_new(host_type);
> +        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
>      } else {
>          pci_memory = NULL;
>          rom_memory = system_memory;
>          i440fx_dev = NULL;
> +        hole64_size = 0;
>      }
>  
>      pc_guest_info_init(pcms);
> @@ -186,7 +189,7 @@ static void pc_init1(MachineState *machine,
>      /* allocate ram and load rom/bios */
>      if (!xen_enabled()) {
>          pc_memory_init(pcms, system_memory,
> -                       rom_memory, &ram_memory);
> +                       rom_memory, &ram_memory, hole64_size);
>      } else {
>          pc_system_flash_cleanup_unused(pcms);
>          if (machine->kernel_filename != NULL) {
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 8d867bdb274a..4d5c2fbd976b 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -138,6 +138,7 @@ static void pc_q35_init(MachineState *machine)
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>      bool acpi_pcihp;
>      bool keep_pci_slot_hpc;
> +    uint64_t pci_hole64_size = 0;
>  
>      /* Check whether RAM fits below 4G (leaving 1/2 GByte for IO memory
>       * and 256 Mbytes for PCI Express Enhanced Configuration Access Mapping
> @@ -206,8 +207,13 @@ static void pc_q35_init(MachineState *machine)
>      /* create pci host bus */
>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>  
> +    if (pcmc->pci_enabled) {
> +        pci_hole64_size = q35_host->mch.pci_hole64_size;
> +    }
> +
>      /* allocate ram and load rom/bios */
> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
> +                   pci_hole64_size);
>  
>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
> index 5c1bab5c58ed..c5cc28250d5c 100644
> --- a/hw/pci-host/i440fx.c
> +++ b/hw/pci-host/i440fx.c
> @@ -237,6 +237,13 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>      }
>  }
>  
> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
> +{
> +        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
> +
> +        return i440fx->pci_hole64_size;
> +}
> +
>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>                      DeviceState *dev,
>                      PCII440FXState **pi440fx_state,
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index ffcac5121ed9..9c847faea2f8 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> -                    MemoryRegion **ram_memory);
> +                    MemoryRegion **ram_memory,
> +                    uint64_t pci_hole64_size);
>  uint64_t pc_pci_hole64_start(void);
>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>  void pc_basic_device_init(struct PCMachineState *pcms,
> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
> index c4710445e30a..1299d6a2b0e4 100644
> --- a/include/hw/pci-host/i440fx.h
> +++ b/include/hw/pci-host/i440fx.h
> @@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>                      MemoryRegion *pci_memory,
>                      MemoryRegion *ram_memory);
>  
> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
>  
>  #endif




* Re: [PATCH v5 3/5] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-06-16 13:30   ` Igor Mammedov
@ 2022-06-16 14:16     ` Michael S. Tsirkin
  2022-06-17 11:13     ` Joao Martins
  1 sibling, 0 replies; 32+ messages in thread
From: Michael S. Tsirkin @ 2022-06-16 14:16 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Joao Martins, qemu-devel, Eduardo Habkost, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Suravee Suthikulpanit

On Thu, Jun 16, 2022 at 03:30:14PM +0200, Igor Mammedov wrote:
> On Fri, 20 May 2022 11:45:30 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> > Use the pre-initialized pci-host qdev and fetch the
> > pci-hole64-size into pc_memory_init() newly added argument.
> > piix needs a bit of care given all the !pci_enabled()
> > and that the pci_hole64_size is private to i440fx.
> > 
> > This is in preparation to determine that host-phys-bits are
> > enough and for pci-hole64-size to be considered to relocate
> > ram-above-4g to be at 1T (on AMD platforms).
> 
> modulo nit below
> 
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> 
> > 
> > Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> > ---
> >  hw/i386/pc.c                 | 3 ++-
> >  hw/i386/pc_piix.c            | 5 ++++-
> >  hw/i386/pc_q35.c             | 8 +++++++-
> >  hw/pci-host/i440fx.c         | 7 +++++++
> >  include/hw/i386/pc.h         | 3 ++-
> >  include/hw/pci-host/i440fx.h | 1 +
> >  6 files changed, 23 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > index f7da1d5dd40d..af52d4ff89ef 100644
> > --- a/hw/i386/pc.c
> > +++ b/hw/i386/pc.c
> > @@ -799,7 +799,8 @@ void xen_load_linux(PCMachineState *pcms)
> >  void pc_memory_init(PCMachineState *pcms,
> >                      MemoryRegion *system_memory,
> >                      MemoryRegion *rom_memory,
> > -                    MemoryRegion **ram_memory)
> > +                    MemoryRegion **ram_memory,
> > +                    uint64_t pci_hole64_size)
> >  {
> >      int linux_boot, i;
> >      MemoryRegion *option_rom_mr;
> > diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> > index 12d4a279c793..57bb5b8f2aea 100644
> > --- a/hw/i386/pc_piix.c
> > +++ b/hw/i386/pc_piix.c
> > @@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
> >      MemoryRegion *pci_memory;
> >      MemoryRegion *rom_memory;
> >      ram_addr_t lowmem;
> > +    uint64_t hole64_size;
> 
> > init it to 0 right here to avoid the chance of an uninitialized variable running amok?


I don't see why we should; compilers seem to be pretty good at warning
about these things nowadays.
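
To illustrate the pattern under discussion: hole64_size in pc_init1() is
declared without an initializer but assigned on both branches before its
first use. With GCC, dropping one of those assignments is the kind of
mistake -Wmaybe-uninitialized (part of -Wall when optimizing) is meant to
flag. A minimal standalone sketch, where pick_hole64_size() and its
parameters are hypothetical names and not QEMU code:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the pattern in pc_init1(): the local is
 * declared without an initializer and assigned on every branch.
 */
uint64_t pick_hole64_size(bool pci_enabled, uint64_t host_hole64_size)
{
    uint64_t hole64_size; /* deliberately not initialized here */

    if (pci_enabled) {
        hole64_size = host_hole64_size;
    } else {
        hole64_size = 0;
    }

    /* Both paths above assigned the variable, so this read is well defined. */
    return hole64_size;
}
```

Initializing at the declaration would also work; the trade-off is that it
can hide a branch that forgot to set the value.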

> >      DeviceState *i440fx_dev;
> >  
> >      /*
> > @@ -166,10 +167,12 @@ static void pc_init1(MachineState *machine,
> >          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
> >          rom_memory = pci_memory;
> >          i440fx_dev = qdev_new(host_type);
> > +        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
> >      } else {
> >          pci_memory = NULL;
> >          rom_memory = system_memory;
> >          i440fx_dev = NULL;
> > +        hole64_size = 0;
> >      }
> >  
> >      pc_guest_info_init(pcms);
> > @@ -186,7 +189,7 @@ static void pc_init1(MachineState *machine,
> >      /* allocate ram and load rom/bios */
> >      if (!xen_enabled()) {
> >          pc_memory_init(pcms, system_memory,
> > -                       rom_memory, &ram_memory);
> > +                       rom_memory, &ram_memory, hole64_size);
> >      } else {
> >          pc_system_flash_cleanup_unused(pcms);
> >          if (machine->kernel_filename != NULL) {
> > diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> > index 8d867bdb274a..4d5c2fbd976b 100644
> > --- a/hw/i386/pc_q35.c
> > +++ b/hw/i386/pc_q35.c
> > @@ -138,6 +138,7 @@ static void pc_q35_init(MachineState *machine)
> >      MachineClass *mc = MACHINE_GET_CLASS(machine);
> >      bool acpi_pcihp;
> >      bool keep_pci_slot_hpc;
> > +    uint64_t pci_hole64_size = 0;
> >  
> >      /* Check whether RAM fits below 4G (leaving 1/2 GByte for IO memory
> >       * and 256 Mbytes for PCI Express Enhanced Configuration Access Mapping
> > @@ -206,8 +207,13 @@ static void pc_q35_init(MachineState *machine)
> >      /* create pci host bus */
> >      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
> >  
> > +    if (pcmc->pci_enabled) {
> > +        pci_hole64_size = q35_host->mch.pci_hole64_size;
> > +    }
> > +
> >      /* allocate ram and load rom/bios */
> > -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
> > +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
> > +                   pci_hole64_size);
> >  
> >      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
> >      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
> > diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
> > index 5c1bab5c58ed..c5cc28250d5c 100644
> > --- a/hw/pci-host/i440fx.c
> > +++ b/hw/pci-host/i440fx.c
> > @@ -237,6 +237,13 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
> >      }
> >  }
> >  
> > +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
> > +{
> > +        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
> > +
> > +        return i440fx->pci_hole64_size;
> > +}
> > +
> >  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> >                      DeviceState *dev,
> >                      PCII440FXState **pi440fx_state,
> > diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> > index ffcac5121ed9..9c847faea2f8 100644
> > --- a/include/hw/i386/pc.h
> > +++ b/include/hw/i386/pc.h
> > @@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
> >  void pc_memory_init(PCMachineState *pcms,
> >                      MemoryRegion *system_memory,
> >                      MemoryRegion *rom_memory,
> > -                    MemoryRegion **ram_memory);
> > +                    MemoryRegion **ram_memory,
> > +                    uint64_t pci_hole64_size);
> >  uint64_t pc_pci_hole64_start(void);
> >  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
> >  void pc_basic_device_init(struct PCMachineState *pcms,
> > diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
> > index c4710445e30a..1299d6a2b0e4 100644
> > --- a/include/hw/pci-host/i440fx.h
> > +++ b/include/hw/pci-host/i440fx.h
> > @@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> >                      MemoryRegion *pci_memory,
> >                      MemoryRegion *ram_memory);
> >  
> > +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
> >  
> >  #endif




* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-05-20 10:45 ` [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable Joao Martins
@ 2022-06-16 14:23   ` Igor Mammedov
  2022-06-17 12:18     ` Joao Martins
  0 siblings, 1 reply; 32+ messages in thread
From: Igor Mammedov @ 2022-06-16 14:23 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Fri, 20 May 2022 11:45:31 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, expect for a
                                                   ^^^ typo?

> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> [*] there is another reserved region unrelated to HT that exists
> in the 256T boundaru in Fam 17h according to Errata #1286,
              ^ ditto

> documented also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done, we also add the reserved HT
> e820 range as reserved.
> 
> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> ram-above-4g relocation may be desired and the CPU wasn't configured
> with a big enough phys-bits, print an error message to the user
> and do not make the relocation of the above-4g-region if phys-bits
> is too low.
> 
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 111 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index af52d4ff89ef..652ae8ff9ccf 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,

s/x86_max_phys_addr/pc_max_used_gpa/

> +                                hwaddr above_4g_mem_start,
> +                                uint64_t pci_hole64_size)
> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    MachineState *machine = MACHINE(pcms);
> +    ram_addr_t device_mem_size = 0;
> +    hwaddr base;
> +
> +    if (!x86ms->above_4g_mem_size) {
> +       /*
> +        * 32-bit pci hole goes from
> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> +        */
> +        return IO_APIC_DEFAULT_ADDRESS - 1;

a lack of above_4g_mem doesn't mean the absence of device_mem_size or
anything else that's located above it.

> +    }
> +
> +    if (pcmc->has_reserved_memory &&
> +       (machine->ram_size < machine->maxram_size)) {
> +        device_mem_size = machine->maxram_size - machine->ram_size;
> +    }
> +
> +    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
> +                    pcms->sgx_epc.size, 1 * GiB);
> +
> +    return base + device_mem_size + pci_hole64_size;

it's not guaranteed that the pci64 hole starts right after device_mem,
but you are not the first to make this assumption in the code; maybe
instead of all of the above, use the existing
   pc_pci_hole64_start() + pci_hole64_size
to guesstimate the max address
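
A rough sketch of the arithmetic behind that suggestion, with the machine
state reduced to plain parameters. hole64_start_estimate() is a
hypothetical stand-in modelled on the fallback branch of
pc_pci_hole64_start() as it appears in the patch 1 diff (no device memory,
no SGX EPC); it is not the real function, and max_used_gpa_estimate() is
likewise an illustrative name:

```c
#include <stdint.h>

#define GIB (1ULL << 30)

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/*
 * Hypothetical model of the fallback branch of pc_pci_hole64_start():
 * the hole starts at the end of ram-above-4g, rounded up to 1 GiB.
 */
uint64_t hole64_start_estimate(uint64_t above_4g_mem_start,
                               uint64_t above_4g_mem_size)
{
    return round_up_pow2(above_4g_mem_start + above_4g_mem_size, GIB);
}

/* Estimated end of the 64-bit PCI hole: start of the hole plus its size. */
uint64_t max_used_gpa_estimate(uint64_t above_4g_mem_start,
                               uint64_t above_4g_mem_size,
                               uint64_t pci_hole64_size)
{
    return hole64_start_estimate(above_4g_mem_start, above_4g_mem_size) +
           pci_hole64_size;
}
```

For example, 1020 GiB of RAM above a 4 GiB start with a 32 GiB hole gives
an estimate of 1056 GiB, which is past the 1T boundary.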

> +}
> +
> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> +                                          uint64_t pci_hole64_size)

s/x86_update_above_4g_mem_start/pc_set_amd_above_4g_mem_start/

> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    CPUX86State *env = &X86_CPU(first_cpu)->env;
> +    hwaddr start = x86ms->above_4g_mem_start;
> +    hwaddr maxphysaddr, maxusedaddr;


> +    /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (!IS_AMD_CPU(env)) {
> +        return;
> +    }

move this to caller
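
One possible shape of that change, sketched with simplified stand-in
types. struct machine, cpu_is_amd and both function names are hypothetical;
the real code would test IS_AMD_CPU(env) and operate on the PC machine
state. The helper assumes it is only called for AMD vCPUs, and the vendor
test sits at the call site:

```c
#include <stdbool.h>
#include <stdint.h>

#define AMD_HT_START         0xfd00000000ULL
#define AMD_HT_END           0xffffffffffULL
#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)

/* Hypothetical, heavily reduced machine state for illustration only. */
struct machine {
    bool cpu_is_amd;
    uint64_t above_4g_mem_start;
};

/* Relocation helper: no vendor check in here any more. */
static void set_amd_above_4g_mem_start(struct machine *m,
                                       uint64_t max_used_gpa)
{
    /* Bail out if the highest used address does not reach the HT range. */
    if (max_used_gpa < AMD_HT_START) {
        return;
    }
    m->above_4g_mem_start = AMD_ABOVE_1TB_START;
}

/* Caller: decides whether the AMD-specific relocation applies at all. */
void memory_init(struct machine *m, uint64_t max_used_gpa)
{
    if (m->cpu_is_amd) {
        set_amd_above_4g_mem_start(m, max_used_gpa);
    }
}
```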

> +    /* Bail out if max possible address does not cross HT range */
> +    if (x86_max_phys_addr(pcms, start, pci_hole64_size) < AMD_HT_START) {
> +        return;
> +    }
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    start = AMD_ABOVE_1TB_START;
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    maxusedaddr = x86_max_phys_addr(pcms, start, pci_hole64_size);
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +
> +
> +    x86ms->above_4g_mem_start = start;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -817,6 +921,8 @@ void pc_memory_init(PCMachineState *pcms,
>  
>      linux_boot = (machine->kernel_filename != NULL);
>  
> +    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
> +
>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -827,6 +933,11 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +
> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +    }
probably it is not necessary, but it doesn't hurt
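
For reference, the size of the reserved entry follows directly from the
macros in the hunk above, and since the entry is only added once RAM has
been relocated above 1T, the same constants show why 40 phys-bits cannot
cover that case. A small standalone check (the ULL suffixes and the helper
names are choices made for this sketch, not part of the patch):

```c
#include <stdint.h>

#define AMD_HT_START         0xfd00000000ULL
#define AMD_HT_END           0xffffffffffULL
#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)

/* Start of the HyperTransport range, in GiB. */
uint64_t amd_ht_start_gib(void)
{
    return AMD_HT_START >> 30;
}

/* Size of the HyperTransport range, in GiB. */
uint64_t amd_ht_size_gib(void)
{
    return AMD_HT_SIZE >> 30;
}

/* Highest address reachable with the given number of physical bits. */
uint64_t max_phys_addr(unsigned phys_bits)
{
    return (1ULL << phys_bits) - 1;
}
```

With these definitions the range starts at 1012 GiB and spans 12 GiB, and
max_phys_addr(40) equals AMD_HT_END, so anything placed at or above
AMD_ABOVE_1TB_START needs more than 40 bits.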

>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",




* Re: [PATCH v5 5/5] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-05-20 10:45 ` [PATCH v5 5/5] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
@ 2022-06-16 14:27   ` Igor Mammedov
  2022-06-17 13:36     ` Joao Martins
  0 siblings, 1 reply; 32+ messages in thread
From: Igor Mammedov @ 2022-06-16 14:27 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Fri, 20 May 2022 11:45:32 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> The added enforcing is only relevant in the case of AMD where the
> range right before the 1TB is restricted and cannot be DMA mapped
> by the kernel consequently leading to IOMMU INVALID_DEVICE_REQUEST
> or possibly other kinds of IOMMU events in the AMD IOMMU.
> 
> Although, there's a case where it may make sense to disable the
> IOVA relocation/validation when migrating from a
> non-valid-IOVA-aware qemu to one that supports it.
> 
> Relocating RAM regions to after the 1Tb hole has consequences for
> guest ABI because we are changing the memory mapping, so make
> sure that only new machine types enforce it, not older ones.

is an old machine type with so much RAM going to work and not explode
even without an IOMMU?
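
For readers following the compatibility argument in the commit message,
the gating it describes can be sketched as below. The struct and function
names are simplified, hypothetical stand-ins for PCMachineClass and the
per-version machine option hooks; only the enforce_valid_iova flag name
comes from the patch:

```c
#include <stdbool.h>

/* Hypothetical reduction of PCMachineClass to the one flag of interest. */
struct pc_machine_class {
    bool enforce_valid_iova;
};

/* Default for new machine types: relocation is allowed. */
void class_init_defaults(struct pc_machine_class *c)
{
    c->enforce_valid_iova = true;
}

/* 7.0 and older: start from the defaults, then keep the old memory map. */
void machine_7_0_options(struct pc_machine_class *c)
{
    class_init_defaults(c);
    c->enforce_valid_iova = false;
}

/* Relocation is considered only for AMD vCPUs on machine types that opt in. */
bool should_consider_relocation(const struct pc_machine_class *c,
                                bool cpu_is_amd)
{
    return cpu_is_amd && c->enforce_valid_iova;
}
```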

> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c         | 7 +++++--
>  hw/i386/pc_piix.c    | 2 ++
>  hw/i386/pc_q35.c     | 2 ++
>  include/hw/i386/pc.h | 1 +
>  4 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 652ae8ff9ccf..62f9af91f19f 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -862,6 +862,7 @@ static hwaddr x86_max_phys_addr(PCMachineState *pcms,
>  static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>                                            uint64_t pci_hole64_size)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      CPUX86State *env = &X86_CPU(first_cpu)->env;
>      hwaddr start = x86ms->above_4g_mem_start;
> @@ -870,9 +871,10 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>      /*
>       * The HyperTransport range close to the 1T boundary is unique to AMD
>       * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> -     * to above 1T to AMD vCPUs only.
> +     * to above 1T to AMD vCPUs only. @enforce_valid_iova is only false in
> +     * older machine types (<= 7.0) for compatibility purposes.
>       */
> -    if (!IS_AMD_CPU(env)) {
> +    if (!IS_AMD_CPU(env) || !pcmc->enforce_valid_iova) {
>          return;
>      }
>  
> @@ -1881,6 +1883,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>      pcmc->has_reserved_memory = true;
>      pcmc->kvmclock_enabled = true;
>      pcmc->enforce_aligned_dimm = true;
> +    pcmc->enforce_valid_iova = true;
>      /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
>       * to be used at the moment, 32K should be enough for a while.  */
>      pcmc->acpi_data_size = 0x20000 + 0x8000;
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index 57bb5b8f2aea..74176a210d56 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -437,9 +437,11 @@ DEFINE_I440FX_MACHINE(v7_1, "pc-i440fx-7.1", NULL,
>  
>  static void pc_i440fx_7_0_machine_options(MachineClass *m)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>      pc_i440fx_7_1_machine_options(m);
>      m->alias = NULL;
>      m->is_default = false;
> +    pcmc->enforce_valid_iova = false;
>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>  }
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 4d5c2fbd976b..bc38a6ba4c67 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -381,8 +381,10 @@ DEFINE_Q35_MACHINE(v7_1, "pc-q35-7.1", NULL,
>  
>  static void pc_q35_7_0_machine_options(MachineClass *m)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>      pc_q35_7_1_machine_options(m);
>      m->alias = NULL;
> +    pcmc->enforce_valid_iova = false;
>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>  }
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 9c847faea2f8..22119131eca7 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -117,6 +117,7 @@ struct PCMachineClass {
>      bool has_reserved_memory;
>      bool enforce_aligned_dimm;
>      bool broken_reserved_end;
> +    bool enforce_valid_iova;
>  
>      /* generate legacy CPU hotplug AML */
>      bool legacy_cpu_hotplug;




* Re: [PATCH v5 1/5] hw/i386: add 4g boundary start to X86MachineState
  2022-06-16 13:05   ` Igor Mammedov
@ 2022-06-17 10:57     ` Joao Martins
  0 siblings, 0 replies; 32+ messages in thread
From: Joao Martins @ 2022-06-17 10:57 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 6/16/22 14:05, Igor Mammedov wrote:
> On Fri, 20 May 2022 11:45:28 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> Rather than hardcoding the 4G boundary everywhere, introduce a
>> X86MachineState property @above_4g_mem_start and use it
> so far it's just a field, not a property /fix commit message/
> 
Fixed.

>> accordingly.
>>
>> This is in preparation for relocating ram-above-4g to be
>> dynamically start at 1T on AMD platforms.
> 
> possibly needs to be rebased on top of current master to include cxl_base
> 
Yeap. I fixed the cxl_base as follows:

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 82cfafc1c3b6..a9d1bf95649a 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -930,7 +930,7 @@ void pc_memory_init(PCMachineState *pcms,
         } else if (pcms->sgx_epc.size != 0) {
             cxl_base = sgx_epc_above_4g_end(&pcms->sgx_epc);
         } else {
-            cxl_base = 0x100000000ULL + x86ms->above_4g_mem_size;
+            cxl_base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
         }

         e820_add_entry(cxl_base, cxl_size, E820_RESERVED);


> with comments fixed
> 
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> 

I added this -- Thanks a lot!

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/acpi-build.c  | 2 +-
>>  hw/i386/pc.c          | 9 +++++----
>>  hw/i386/sgx.c         | 2 +-
>>  hw/i386/x86.c         | 1 +
>>  include/hw/i386/x86.h | 3 +++
>>  5 files changed, 11 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
>> index c125939ed6f9..3160b20c9574 100644
>> --- a/hw/i386/acpi-build.c
>> +++ b/hw/i386/acpi-build.c
>> @@ -2120,7 +2120,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>>                  build_srat_memory(table_data, mem_base, mem_len, i - 1,
>>                                    MEM_AFFINITY_ENABLED);
>>              }
>> -            mem_base = 1ULL << 32;
>> +            mem_base = x86ms->above_4g_mem_start;
>>              mem_len = next_base - x86ms->below_4g_mem_size;
>>              next_base = mem_base + mem_len;
>>          }
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 7c39c913355b..f7da1d5dd40d 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -832,9 +832,10 @@ void pc_memory_init(PCMachineState *pcms,
>>                                   machine->ram,
>>                                   x86ms->below_4g_mem_size,
>>                                   x86ms->above_4g_mem_size);
>> -        memory_region_add_subregion(system_memory, 0x100000000ULL,
>> +        memory_region_add_subregion(system_memory, x86ms->above_4g_mem_start,
>>                                      ram_above_4g);
>> -        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
>> +        e820_add_entry(x86ms->above_4g_mem_start, x86ms->above_4g_mem_size,
>> +                       E820_RAM);
>>      }
>>  
>>      if (pcms->sgx_epc.size != 0) {
>> @@ -875,7 +876,7 @@ void pc_memory_init(PCMachineState *pcms,
>>              machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
>>          } else {
>>              machine->device_memory->base =
>> -                0x100000000ULL + x86ms->above_4g_mem_size;
>> +                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>>          }
>>  
>>          machine->device_memory->base =
>> @@ -1019,7 +1020,7 @@ uint64_t pc_pci_hole64_start(void)
>>      } else if (pcms->sgx_epc.size != 0) {
>>              hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
>>      } else {
>> -        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
>> +        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>>      }
>>  
>>      return ROUND_UP(hole64_start, 1 * GiB);
>> diff --git a/hw/i386/sgx.c b/hw/i386/sgx.c
>> index a44d66ba2afc..09d9c7c73d9f 100644
>> --- a/hw/i386/sgx.c
>> +++ b/hw/i386/sgx.c
>> @@ -295,7 +295,7 @@ void pc_machine_init_sgx_epc(PCMachineState *pcms)
>>          return;
>>      }
>>  
>> -    sgx_epc->base = 0x100000000ULL + x86ms->above_4g_mem_size;
>> +    sgx_epc->base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>>  
>>      memory_region_init(&sgx_epc->mr, OBJECT(pcms), "sgx-epc", UINT64_MAX);
>>      memory_region_add_subregion(get_system_memory(), sgx_epc->base,
>> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>> index 78b05ab7a2d1..af3c790a2830 100644
>> --- a/hw/i386/x86.c
>> +++ b/hw/i386/x86.c
>> @@ -1373,6 +1373,7 @@ static void x86_machine_initfn(Object *obj)
>>      x86ms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
>>      x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
>>      x86ms->bus_lock_ratelimit = 0;
>> +    x86ms->above_4g_mem_start = 0x100000000ULL;
> 
> s/0x.../4 * GiB/
> 
Fixed.
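
The two spellings denote the same value. A quick standalone check,
assuming GiB expands to a 64-bit 1 << 30 as in QEMU's qemu/units.h (GIB64
below is a local stand-in for this sketch, not the QEMU macro):

```c
#include <stdint.h>

/* Local stand-in for QEMU's GiB; 64-bit so that 4 * GIB64 cannot overflow. */
#define GIB64 (INT64_C(1) << 30)

/* Default start of ram-above-4g, written the way the review suggests. */
uint64_t above_4g_default_start(void)
{
    return 4 * GIB64;
}
```

Because the unit is 64 bits wide, the multiplication is also safe on
32-bit hosts.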

>>  }
>>  
>>  static void x86_machine_class_init(ObjectClass *oc, void *data)
>> diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
>> index 9089bdd99c3a..df82c5fd4252 100644
>> --- a/include/hw/i386/x86.h
>> +++ b/include/hw/i386/x86.h
>> @@ -56,6 +56,9 @@ struct X86MachineState {
>>      /* RAM information (sizes, addresses, configuration): */
>>      ram_addr_t below_4g_mem_size, above_4g_mem_size;
>>  
>> +    /* Start address of the initial RAM above 4G */
>> +    uint64_t above_4g_mem_start;
>> +
>>      /* CPU and apic information: */
>>      bool apic_xrupt_override;
>>      unsigned pci_irq_mask;
> 



* Re: [PATCH v5 2/5] i386/pc: create pci-host qdev prior to pc_memory_init()
  2022-06-16 13:21   ` Reviewed-by: Igor Mammedov
@ 2022-06-17 11:03     ` Joao Martins
  2022-06-20  7:12     ` Mark Cave-Ayland
  1 sibling, 0 replies; 32+ messages in thread
From: Joao Martins @ 2022-06-17 11:03 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 6/16/22 14:21, Igor Mammedov wrote:
> On Fri, 20 May 2022 11:45:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> At the start of pc_memory_init() we usually pass a range of
>> 0..UINT64_MAX as pci_memory, when really it's 2G (i440fx) or
>> 32G (q35). To get the real user value, we need to get pci-host
>> passed property for default pci_hole64_size. Thus to get that,
>> create the qdev prior to memory init to better make estimations
>> on max used/phys addr.
>>
>> This is in preparation to determine that host-phys-bits are
>> enough and also for pci-hole64-size to be considered to relocate
>> ram-above-4g to be at 1T (on AMD platforms).
> 
> with comments below fixed
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
>  
Having addressed your comments, I added this. Thanks!

>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc_piix.c            | 5 ++++-
>>  hw/i386/pc_q35.c             | 6 +++---
>>  hw/pci-host/i440fx.c         | 3 +--
>>  include/hw/pci-host/i440fx.h | 2 +-
>>  4 files changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index 578e537b3525..12d4a279c793 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
>>      MemoryRegion *pci_memory;
>>      MemoryRegion *rom_memory;
>>      ram_addr_t lowmem;
>> +    DeviceState *i440fx_dev;
>>  
>>      /*
>>       * Calculate ram split, for memory below and above 4G.  It's a bit
>> @@ -164,9 +165,11 @@ static void pc_init1(MachineState *machine,
>>          pci_memory = g_new(MemoryRegion, 1);
>>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>>          rom_memory = pci_memory;
>> +        i440fx_dev = qdev_new(host_type);
>>      } else {
>>          pci_memory = NULL;
>>          rom_memory = system_memory;
>> +        i440fx_dev = NULL;
>>      }
>>  
>>      pc_guest_info_init(pcms);
>> @@ -199,7 +202,7 @@ static void pc_init1(MachineState *machine,
>>  
>>          pci_bus = i440fx_init(host_type,
>>                                pci_type,
>> -                              &i440fx_state,
>> +                              i440fx_dev, &i440fx_state,
> confusing names, suggest to rename i440fx_state -> pci_i440fx and i440fx_dev -> i440fx_host
> or something like this
> 
I've changed i440fx_dev as that's what I add in this patch.

>>                                system_memory, system_io, machine->ram_size,
>>                                x86ms->below_4g_mem_size,
>>                                x86ms->above_4g_mem_size,
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 42eb8b97079a..8d867bdb274a 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -203,12 +203,12 @@ static void pc_q35_init(MachineState *machine)
>>                              pcms->smbios_entry_point_type);
>>      }
>>  
>> -    /* allocate ram and load rom/bios */
>> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
>> -
>>      /* create pci host bus */
>>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>>  
>> +    /* allocate ram and load rom/bios */
>> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
>> +
>>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>>                               OBJECT(ram_memory), NULL);
>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>> index e08716142b6e..5c1bab5c58ed 100644
>> --- a/hw/pci-host/i440fx.c
>> +++ b/hw/pci-host/i440fx.c
>> @@ -238,6 +238,7 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>>  }
>>  
>>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> 
> does it still need 'host_type'?
> 
I've removed it.

>> +                    DeviceState *dev,
>>                      PCII440FXState **pi440fx_state,
>>                      MemoryRegion *address_space_mem,
>>                      MemoryRegion *address_space_io,
>> @@ -247,7 +248,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>                      MemoryRegion *pci_address_space,
>>                      MemoryRegion *ram_memory)
>>  {
>> -    DeviceState *dev;
>>      PCIBus *b;
>>      PCIDevice *d;
>>      PCIHostState *s;
>> @@ -255,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>      unsigned i;
>>      I440FXState *i440fx;
>>  
>> -    dev = qdev_new(host_type);
>>      s = PCI_HOST_BRIDGE(dev);
>>      b = pci_root_bus_new(dev, NULL, pci_address_space,
>>                           address_space_io, 0, TYPE_PCI_BUS);
>> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
>> index f068aaba8fda..c4710445e30a 100644
>> --- a/include/hw/pci-host/i440fx.h
>> +++ b/include/hw/pci-host/i440fx.h
>> @@ -36,7 +36,7 @@ struct PCII440FXState {
>>  #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
>>  
>>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>> -                    PCII440FXState **pi440fx_state,
>> +                    DeviceState *dev, PCII440FXState **pi440fx_state,
>>                      MemoryRegion *address_space_mem,
>>                      MemoryRegion *address_space_io,
>>                      ram_addr_t ram_size,
> 



* Re: [PATCH v5 3/5] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-06-16 13:30   ` Igor Mammedov
  2022-06-16 14:16     ` Michael S. Tsirkin
@ 2022-06-17 11:13     ` Joao Martins
  2022-06-17 11:58       ` Igor Mammedov
  1 sibling, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-06-17 11:13 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 6/16/22 14:30, Igor Mammedov wrote:
> On Fri, 20 May 2022 11:45:30 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> Use the pre-initialized pci-host qdev and fetch the
>> pci-hole64-size into pc_memory_init() newly added argument.
>> piix needs a bit of care given all the !pci_enabled()
>> and that the pci_hole64_size is private to i440fx.
>>
>> This is in preparation to determine that host-phys-bits are
>> enough and for pci-hole64-size to be considered to relocate
>> ram-above-4g to be at 1T (on AMD platforms).
> 
> modulo nit blow
> 
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> 

I haven't tackled the initialization nit below but I would assume
you agree with the rest of the patch. Let me know if I should still
add the Rb tag.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c                 | 3 ++-
>>  hw/i386/pc_piix.c            | 5 ++++-
>>  hw/i386/pc_q35.c             | 8 +++++++-
>>  hw/pci-host/i440fx.c         | 7 +++++++
>>  include/hw/i386/pc.h         | 3 ++-
>>  include/hw/pci-host/i440fx.h | 1 +
>>  6 files changed, 23 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index f7da1d5dd40d..af52d4ff89ef 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -799,7 +799,8 @@ void xen_load_linux(PCMachineState *pcms)
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> -                    MemoryRegion **ram_memory)
>> +                    MemoryRegion **ram_memory,
>> +                    uint64_t pci_hole64_size)
>>  {
>>      int linux_boot, i;
>>      MemoryRegion *option_rom_mr;
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index 12d4a279c793..57bb5b8f2aea 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
>>      MemoryRegion *pci_memory;
>>      MemoryRegion *rom_memory;
>>      ram_addr_t lowmem;
>> +    uint64_t hole64_size;
> 
> init it to 0 right here to avoid chance of run amok uninitialized variable?
> 
I haven't done this given that mst disagreed, and because the code style of this
function mostly leaves the NULL initialization to the else clause of the conditional.
That is also part of the reason I haven't initialized @i440fx_dev (now @i440fx_host)
to NULL here. The place where we use hole64_size is also the same place where we use
@i440fx_host.
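
To illustrate the style argument, here is a minimal standalone sketch (not QEMU
code; pick_hole64_size() and its parameters are hypothetical stand-ins for the
pc_init1() logic). The value is assigned in both arms of the conditional rather
than at declaration. One assumed benefit is that a compiler warning such as
-Wmaybe-uninitialized can still flag a branch that forgets the assignment,
which an initializer at the declaration would hide.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the pc_init1() pattern: hole64_size is set in
 * each arm of the conditional instead of at its declaration.
 */
static uint64_t pick_hole64_size(bool pci_enabled, uint64_t host_hole64_size)
{
    uint64_t hole64_size;

    if (pci_enabled) {
        /* In the patch this value comes from i440fx_pci_hole64_size(). */
        hole64_size = host_hole64_size;
    } else {
        /* No PCI host bridge, so there is no 64-bit PCI hole. */
        hole64_size = 0;
    }

    return hole64_size;
}
```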

>>      DeviceState *i440fx_dev;
>>  
>>      /*
>> @@ -166,10 +167,12 @@ static void pc_init1(MachineState *machine,
>>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>>          rom_memory = pci_memory;
>>          i440fx_dev = qdev_new(host_type);
>> +        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
>>      } else {
>>          pci_memory = NULL;
>>          rom_memory = system_memory;
>>          i440fx_dev = NULL;
>> +        hole64_size = 0;
>>      }
>>  
>>      pc_guest_info_init(pcms);
>> @@ -186,7 +189,7 @@ static void pc_init1(MachineState *machine,
>>      /* allocate ram and load rom/bios */
>>      if (!xen_enabled()) {
>>          pc_memory_init(pcms, system_memory,
>> -                       rom_memory, &ram_memory);
>> +                       rom_memory, &ram_memory, hole64_size);
>>      } else {
>>          pc_system_flash_cleanup_unused(pcms);
>>          if (machine->kernel_filename != NULL) {
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 8d867bdb274a..4d5c2fbd976b 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -138,6 +138,7 @@ static void pc_q35_init(MachineState *machine)
>>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>>      bool acpi_pcihp;
>>      bool keep_pci_slot_hpc;
>> +    uint64_t pci_hole64_size = 0;
>>  
>>      /* Check whether RAM fits below 4G (leaving 1/2 GByte for IO memory
>>       * and 256 Mbytes for PCI Express Enhanced Configuration Access Mapping
>> @@ -206,8 +207,13 @@ static void pc_q35_init(MachineState *machine)
>>      /* create pci host bus */
>>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>>  
>> +    if (pcmc->pci_enabled) {
>> +        pci_hole64_size = q35_host->mch.pci_hole64_size;
>> +    }
>> +
>>      /* allocate ram and load rom/bios */
>> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
>> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
>> +                   pci_hole64_size);
>>  
>>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>> index 5c1bab5c58ed..c5cc28250d5c 100644
>> --- a/hw/pci-host/i440fx.c
>> +++ b/hw/pci-host/i440fx.c
>> @@ -237,6 +237,13 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>>      }
>>  }
>>  
>> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
>> +{
>> +        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
>> +
>> +        return i440fx->pci_hole64_size;
>> +}
>> +
>>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>                      DeviceState *dev,
>>                      PCII440FXState **pi440fx_state,
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index ffcac5121ed9..9c847faea2f8 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> -                    MemoryRegion **ram_memory);
>> +                    MemoryRegion **ram_memory,
>> +                    uint64_t pci_hole64_size);
>>  uint64_t pc_pci_hole64_start(void);
>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>>  void pc_basic_device_init(struct PCMachineState *pcms,
>> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
>> index c4710445e30a..1299d6a2b0e4 100644
>> --- a/include/hw/pci-host/i440fx.h
>> +++ b/include/hw/pci-host/i440fx.h
>> @@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>                      MemoryRegion *pci_memory,
>>                      MemoryRegion *ram_memory);
>>  
>> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
>>  
>>  #endif
> 



* Re: [PATCH v5 3/5] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-06-17 11:13     ` Joao Martins
@ 2022-06-17 11:58       ` Igor Mammedov
  0 siblings, 0 replies; 32+ messages in thread
From: Igor Mammedov @ 2022-06-17 11:58 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Fri, 17 Jun 2022 12:13:45 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/16/22 14:30, Igor Mammedov wrote:
> > On Fri, 20 May 2022 11:45:30 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> Use the pre-initialized pci-host qdev and fetch the
> >> pci-hole64-size into pc_memory_init() newly added argument.
> >> piix needs a bit of care given all the !pci_enabled()
> >> and that the pci_hole64_size is private to i440fx.
> >>
> >> This is in preparation to determine that host-phys-bits are
> >> enough and for pci-hole64-size to be considered to relocate
> >> ram-above-4g to be at 1T (on AMD platforms).  
> > 
> > modulo nit blow
> > 
> > Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> >   
> 
> I haven't tackled the initialization nit below but I would assume
> you agree with the rest of the patch. Let me know if I should still
> add the Rb tag.

My ack still stands
 
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  hw/i386/pc.c                 | 3 ++-
> >>  hw/i386/pc_piix.c            | 5 ++++-
> >>  hw/i386/pc_q35.c             | 8 +++++++-
> >>  hw/pci-host/i440fx.c         | 7 +++++++
> >>  include/hw/i386/pc.h         | 3 ++-
> >>  include/hw/pci-host/i440fx.h | 1 +
> >>  6 files changed, 23 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index f7da1d5dd40d..af52d4ff89ef 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -799,7 +799,8 @@ void xen_load_linux(PCMachineState *pcms)
> >>  void pc_memory_init(PCMachineState *pcms,
> >>                      MemoryRegion *system_memory,
> >>                      MemoryRegion *rom_memory,
> >> -                    MemoryRegion **ram_memory)
> >> +                    MemoryRegion **ram_memory,
> >> +                    uint64_t pci_hole64_size)
> >>  {
> >>      int linux_boot, i;
> >>      MemoryRegion *option_rom_mr;
> >> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> >> index 12d4a279c793..57bb5b8f2aea 100644
> >> --- a/hw/i386/pc_piix.c
> >> +++ b/hw/i386/pc_piix.c
> >> @@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
> >>      MemoryRegion *pci_memory;
> >>      MemoryRegion *rom_memory;
> >>      ram_addr_t lowmem;
> >> +    uint64_t hole64_size;  
> > 
> > init it to 0 right here to avoid chance of run amok uninitialized variable?
> >   
> I haven't done this given that mst disagreed, plus the fact that the code style of
> the function seems to place the NULL initialization mostly left to else conditional
> clause. Part of the reason I haven't inited @i440fx_dev to NULL here as well (now
> i440fx_host. The location we use hole64_size is also the same location we are using
> @i440fx_host.
> 
> >>      DeviceState *i440fx_dev;
> >>  
> >>      /*
> >> @@ -166,10 +167,12 @@ static void pc_init1(MachineState *machine,
> >>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
> >>          rom_memory = pci_memory;
> >>          i440fx_dev = qdev_new(host_type);
> >> +        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
> >>      } else {
> >>          pci_memory = NULL;
> >>          rom_memory = system_memory;
> >>          i440fx_dev = NULL;
> >> +        hole64_size = 0;
> >>      }
> >>  
> >>      pc_guest_info_init(pcms);
> >> @@ -186,7 +189,7 @@ static void pc_init1(MachineState *machine,
> >>      /* allocate ram and load rom/bios */
> >>      if (!xen_enabled()) {
> >>          pc_memory_init(pcms, system_memory,
> >> -                       rom_memory, &ram_memory);
> >> +                       rom_memory, &ram_memory, hole64_size);
> >>      } else {
> >>          pc_system_flash_cleanup_unused(pcms);
> >>          if (machine->kernel_filename != NULL) {
> >> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> >> index 8d867bdb274a..4d5c2fbd976b 100644
> >> --- a/hw/i386/pc_q35.c
> >> +++ b/hw/i386/pc_q35.c
> >> @@ -138,6 +138,7 @@ static void pc_q35_init(MachineState *machine)
> >>      MachineClass *mc = MACHINE_GET_CLASS(machine);
> >>      bool acpi_pcihp;
> >>      bool keep_pci_slot_hpc;
> >> +    uint64_t pci_hole64_size = 0;
> >>  
> >>      /* Check whether RAM fits below 4G (leaving 1/2 GByte for IO memory
> >>       * and 256 Mbytes for PCI Express Enhanced Configuration Access Mapping
> >> @@ -206,8 +207,13 @@ static void pc_q35_init(MachineState *machine)
> >>      /* create pci host bus */
> >>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
> >>  
> >> +    if (pcmc->pci_enabled) {
> >> +        pci_hole64_size = q35_host->mch.pci_hole64_size;
> >> +    }
> >> +
> >>      /* allocate ram and load rom/bios */
> >> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
> >> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
> >> +                   pci_hole64_size);
> >>  
> >>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
> >>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
> >> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
> >> index 5c1bab5c58ed..c5cc28250d5c 100644
> >> --- a/hw/pci-host/i440fx.c
> >> +++ b/hw/pci-host/i440fx.c
> >> @@ -237,6 +237,13 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
> >>      }
> >>  }
> >>  
> >> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
> >> +{
> >> +        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
> >> +
> >> +        return i440fx->pci_hole64_size;
> >> +}
> >> +
> >>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> >>                      DeviceState *dev,
> >>                      PCII440FXState **pi440fx_state,
> >> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> >> index ffcac5121ed9..9c847faea2f8 100644
> >> --- a/include/hw/i386/pc.h
> >> +++ b/include/hw/i386/pc.h
> >> @@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
> >>  void pc_memory_init(PCMachineState *pcms,
> >>                      MemoryRegion *system_memory,
> >>                      MemoryRegion *rom_memory,
> >> -                    MemoryRegion **ram_memory);
> >> +                    MemoryRegion **ram_memory,
> >> +                    uint64_t pci_hole64_size);
> >>  uint64_t pc_pci_hole64_start(void);
> >>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
> >>  void pc_basic_device_init(struct PCMachineState *pcms,
> >> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
> >> index c4710445e30a..1299d6a2b0e4 100644
> >> --- a/include/hw/pci-host/i440fx.h
> >> +++ b/include/hw/pci-host/i440fx.h
> >> @@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> >>                      MemoryRegion *pci_memory,
> >>                      MemoryRegion *ram_memory);
> >>  
> >> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
> >>  
> >>  #endif  
> >   
> 




* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-16 14:23   ` Igor Mammedov
@ 2022-06-17 12:18     ` Joao Martins
  2022-06-17 12:32       ` Igor Mammedov
  2022-06-17 16:12       ` Joao Martins
  0 siblings, 2 replies; 32+ messages in thread
From: Joao Martins @ 2022-06-17 12:18 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit



On 6/16/22 15:23, Igor Mammedov wrote:
> On Fri, 20 May 2022 11:45:31 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> It is assumed that the whole GPA space is available to be DMA
>> addressable, within a given address space limit, expect for a
>                                                    ^^^ typo?
> 
Yes, it should have been 'except'.

>> tiny region before the 4G. Since Linux v5.4, VFIO validates
>> whether the selected GPA is indeed valid i.e. not reserved by
>> IOMMU on behalf of some specific devices or platform-defined
>> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>>  -EINVAL.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may only have these ranges as allowed:
>>
>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
>>
>> We already account for the 4G hole, albeit if the guest is big
>> enough we will fail to allocate a guest with  >1010G due to the
>> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
>>
>> [*] there is another reserved region unrelated to HT that exists
>> in the 256T boundaru in Fam 17h according to Errata #1286,
>               ^ ditto
> 
Fixed.

>> documeted also in "Open-Source Register Reference for AMD Family
>> 17h Processors (PUB)"
>>
>> When creating the region above 4G, take into account that on AMD
>> platforms the HyperTransport range is reserved and hence it
>> cannot be used either as GPAs. On those cases rather than
>> establishing the start of ram-above-4g to be 4G, relocate instead
>> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
>> Topology", for more information on the underlying restriction of
>> IOVAs.
>>
>> After accounting for the 1Tb hole on AMD hosts, mtree should
>> look like:
>>
>> 0000000000000000-000000007fffffff (prio 0, i/o):
>> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
>> 0000010000000000-000001ff7fffffff (prio 0, i/o):
>> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
>>
>> If the relocation is done, we also add the the reserved HT
>> e820 range as reserved.
>>
>> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
>> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
>> ram-above-4g relocation may be desired and the CPU wasn't configured
>> with a big enough phys-bits, print an error message to the user
>> and do not make the relocation of the above-4g-region if phys-bits
>> is too low.
>>
>> Suggested-by: Igor Mammedov <imammedo@redhat.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 111 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index af52d4ff89ef..652ae8ff9ccf 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
>>  #define PC_ROM_ALIGN       0x800
>>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>>  
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
> 
> s/x86_max_phys_addr/pc_max_used_gpa/
> 
Fixed.

>> +                                hwaddr above_4g_mem_start,
>> +                                uint64_t pci_hole64_size)
>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    MachineState *machine = MACHINE(pcms);
>> +    ram_addr_t device_mem_size = 0;
>> +    hwaddr base;
>> +
>> +    if (!x86ms->above_4g_mem_size) {
>> +       /*
>> +        * 32-bit pci hole goes from
>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>> +        */
>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
> 
> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
> that's located above it.
> 

True. But the intent is to handle the 32-bit boundary, as one of the qtests was failing
otherwise. In that case we won't hit the 1T hole, hence it is a nop, unless we plan on
using pc_max_used_gpa() for something other than this.

The alternative would be to bail out early from pc_set_amd_above_4g_mem_start() if
!above_4g_mem_size. I guess in that case we can just remove pc_max_used_gpa()
and replace it with:

	max_used_gpa = pc_pci_hole64_start() + pci_hole64_size

which makes this even simpler. Thoughts?
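
As a rough standalone sketch of that alternative (not the actual patch;
needs_relocation() is a hypothetical helper, and hole64_start stands in for
whatever pc_pci_hole64_start() would return), the decision could reduce to:

```c
#include <stdbool.h>
#include <stdint.h>

#define AMD_HT_START 0xfd00000000ULL

/*
 * Hypothetical helper: decide whether ram-above-4g would need relocating
 * to above 1T. hole64_start is an assumed input standing in for the
 * value of pc_pci_hole64_start().
 */
static bool needs_relocation(uint64_t above_4g_mem_size,
                             uint64_t hole64_start,
                             uint64_t pci_hole64_size)
{
    uint64_t max_used_gpa;

    /* Early bail out, as proposed: nothing to do without RAM above 4G. */
    if (!above_4g_mem_size) {
        return false;
    }

    max_used_gpa = hole64_start + pci_hole64_size;

    /* Relocate only if the highest used address reaches the HT range. */
    return max_used_gpa >= AMD_HT_START;
}
```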

>> +    }
>> +
>> +    if (pcmc->has_reserved_memory &&
>> +       (machine->ram_size < machine->maxram_size)) {
>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>> +    }
>> +
>> +    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
>> +                    pcms->sgx_epc.size, 1 * GiB);
>> +
>> +    return base + device_mem_size + pci_hole64_size;
> 
> it's not guarantied that pci64 hole starts right away device_mem,
> but you are not 1st doing this assumption in code, maybe instead of
> all above use existing 
>    pc_pci_hole64_start() + pci_hole64_size
> to gestimate max address 
> 
I've switched the block above to that instead.

>> +}
>> +
>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>> +                                          uint64_t pci_hole64_size)
> 
> s/x86_update_above_4g_mem_start/pc_set_amd_above_4g_mem_start/
> 
Fixed.

>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    CPUX86State *env = &X86_CPU(first_cpu)->env;
>> +    hwaddr start = x86ms->above_4g_mem_start;
>> +    hwaddr maxphysaddr, maxusedaddr;
> 
> 
>> +    /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (!IS_AMD_CPU(env)) {
>> +        return;
>> +    }
> 
> move this to caller
> 
Done (same for the patch after this one):

-    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
+    /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(env)) {
+        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
+    }


>> +    /* Bail out if max possible address does not cross HT range */
>> +    if (x86_max_phys_addr(pcms, start, pci_hole64_size) < AMD_HT_START) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    start = AMD_ABOVE_1TB_START;
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    maxusedaddr = x86_max_phys_addr(pcms, start, pci_hole64_size);
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +
>> +    x86ms->above_4g_mem_start = start;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -817,6 +921,8 @@ void pc_memory_init(PCMachineState *pcms,
>>  
>>      linux_boot = (machine->kernel_filename != NULL);
>>  
>> +    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
>> +
>>      /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>> @@ -827,6 +933,11 @@ void pc_memory_init(PCMachineState *pcms,
>>                               0, x86ms->below_4g_mem_size);
>>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>> +
>> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
>> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +    }
> probably it is not necessary, but it doesn't hurt
> 

With the entry present, virtual firmware can make better decisions to avoid
reserved ranges.

I was actually thinking that we would add it anyway if phys_bits is >= 40.
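
To make the firmware point concrete, here is a small standalone sketch (not
QEMU or firmware code; struct sketch_e820_entry and overlaps_reserved() are
hypothetical names) of how a consumer of the e820 map could reject a candidate
range that overlaps the reserved HT entry. The AMD_HT_* values are the ones
from the patch; the type numbers follow the usual e820 convention (1 = RAM,
2 = reserved).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AMD_HT_START        0xfd00000000ULL
#define AMD_HT_END          0xffffffffffULL
#define AMD_ABOVE_1TB_START (AMD_HT_END + 1)
#define AMD_HT_SIZE         (AMD_ABOVE_1TB_START - AMD_HT_START)

enum { SKETCH_E820_RAM = 1, SKETCH_E820_RESERVED = 2 };

/* Hypothetical, simplified e820 entry. */
struct sketch_e820_entry {
    uint64_t address;
    uint64_t length;
    uint32_t type;
};

/* Return true if [start, start + len) overlaps any reserved entry. */
static bool overlaps_reserved(const struct sketch_e820_entry *map, size_t n,
                              uint64_t start, uint64_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].type != SKETCH_E820_RESERVED) {
            continue;
        }
        if (start < map[i].address + map[i].length &&
            map[i].address < start + len) {
            return true;
        }
    }
    return false;
}
```

With the HT range listed as reserved, a range placed at 0xfe00000000 would be
rejected, while one starting at AMD_ABOVE_1TB_START would not.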

>>      if (x86ms->above_4g_mem_size > 0) {
>>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> 



* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-17 12:18     ` Joao Martins
@ 2022-06-17 12:32       ` Igor Mammedov
  2022-06-17 13:33         ` Joao Martins
  2022-06-17 16:12       ` Joao Martins
  1 sibling, 1 reply; 32+ messages in thread
From: Igor Mammedov @ 2022-06-17 12:32 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Fri, 17 Jun 2022 13:18:38 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/16/22 15:23, Igor Mammedov wrote:
> > On Fri, 20 May 2022 11:45:31 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> It is assumed that the whole GPA space is available to be DMA
> >> addressable, within a given address space limit, expect for a  
> >                                                    ^^^ typo?
> >   
> Yes, it should have been 'except'.
> 
> >> tiny region before the 4G. Since Linux v5.4, VFIO validates
> >> whether the selected GPA is indeed valid i.e. not reserved by
> >> IOMMU on behalf of some specific devices or platform-defined
> >> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
> >>  -EINVAL.
> >>
> >> AMD systems with an IOMMU are examples of such platforms and
> >> particularly may only have these ranges as allowed:
> >>
> >> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> >>
> >> We already account for the 4G hole, albeit if the guest is big
> >> enough we will fail to allocate a guest with  >1010G due to the
> >> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> >>
> >> [*] there is another reserved region unrelated to HT that exists
> >> in the 256T boundaru in Fam 17h according to Errata #1286,  
> >               ^ ditto
> >   
> Fixed.
> 
> >> documeted also in "Open-Source Register Reference for AMD Family
> >> 17h Processors (PUB)"
> >>
> >> When creating the region above 4G, take into account that on AMD
> >> platforms the HyperTransport range is reserved and hence it
> >> cannot be used either as GPAs. On those cases rather than
> >> establishing the start of ram-above-4g to be 4G, relocate instead
> >> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> >> Topology", for more information on the underlying restriction of
> >> IOVAs.
> >>
> >> After accounting for the 1Tb hole on AMD hosts, mtree should
> >> look like:
> >>
> >> 0000000000000000-000000007fffffff (prio 0, i/o):
> >> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> >> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> >> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> >>
> >> If the relocation is done, we also add the the reserved HT
> >> e820 range as reserved.
> >>
> >> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
> >> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> >> ram-above-4g relocation may be desired and the CPU wasn't configured
> >> with a big enough phys-bits, print an error message to the user
> >> and do not make the relocation of the above-4g-region if phys-bits
> >> is too low.
> >>
> >> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  hw/i386/pc.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 111 insertions(+)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index af52d4ff89ef..652ae8ff9ccf 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -796,6 +796,110 @@ void xen_load_linux(PCMachineState *pcms)
> >>  #define PC_ROM_ALIGN       0x800
> >>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
> >>  
> >> +/*
> >> + * AMD systems with an IOMMU have an additional hole close to the
> >> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> >> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> >> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> >> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> >> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> >> + * The ranges reserved for Hyper-Transport are:
> >> + *
> >> + * FD_0000_0000h - FF_FFFF_FFFFh
> >> + *
> >> + * The ranges represent the following:
> >> + *
> >> + * Base Address   Top Address  Use
> >> + *
> >> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> >> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> >> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> >> + * FD_F910_0000h FD_F91F_FFFFh System Management
> >> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> >> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> >> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> >> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> >> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> >> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> >> + *
> >> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> >> + * Table 3: Special Address Controls (GPA) for more information.
> >> + */
> >> +#define AMD_HT_START         0xfd00000000UL
> >> +#define AMD_HT_END           0xffffffffffUL
> >> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> >> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> >> +
> >> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,  
> > 
> > s/x86_max_phys_addr/pc_max_used_gpa/
> >   
> Fixed.
> 
> >> +                                hwaddr above_4g_mem_start,
> >> +                                uint64_t pci_hole64_size)
> >> +{
> >> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >> +    MachineState *machine = MACHINE(pcms);
> >> +    ram_addr_t device_mem_size = 0;
> >> +    hwaddr base;
> >> +
> >> +    if (!x86ms->above_4g_mem_size) {
> >> +       /*
> >> +        * 32-bit pci hole goes from
> >> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >> +        */
> >> +        return IO_APIC_DEFAULT_ADDRESS - 1;  
> > 
> > lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
> > that's located above it.
> >   
> 
> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
> otherwise. We won't hit the 1T hole, hence a nop.

I don't get the reasoning, can you clarify it pls?

>  Unless we plan on using
> pc_max_used_gpa() for something else other than this.

Even if '!above_4g_mem_size', we can still have a hotpluggable memory region
present, and that can hit 1Tb. The same goes for pci64_hole if it's configured
large enough on the CLI.

It looks like a guesstimate we could use is to take pci64_hole_end as the max used GPA.
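
A minimal sketch of that guesstimate, assuming the 64-bit PCI hole is the
topmost region in the guest address space. The helper names below are
hypothetical and not part of the patch; only AMD_HT_START is taken from it.
The end of the hole is treated as exclusive, as in the patch's own comparison:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB          (1ULL << 30)
#define AMD_HT_START 0xfd00000000ULL   /* from the patch under review */

/* The HyperTransport range starts 12 GiB below the 1 TiB boundary. */
static_assert(AMD_HT_START == 1012 * GiB, "HT range starts at 1012 GiB");

/* Hypothetical helper: exclusive end of the 64-bit PCI hole. */
static uint64_t pci_hole64_end(uint64_t hole64_start, uint64_t hole64_size)
{
    return hole64_start + hole64_size;
}

/*
 * Hypothetical helper: true if anything up to the end of the 64-bit PCI
 * hole would overlap the AMD HyperTransport range. Note that this does
 * not look at above_4g_mem_size at all, so a large hole or a hotplug
 * region is accounted for even when there is no RAM above 4G.
 */
static bool crosses_amd_ht(uint64_t hole64_start, uint64_t hole64_size)
{
    return pci_hole64_end(hole64_start, hole64_size) > AMD_HT_START;
}
```

For instance, a hole starting at 4 GiB with the 32 GiB q35 default stays well
below the HT range, while the same start with a 1024 GiB hole crosses it.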

> 
> The alternative would be to just early bail out of pc_set_amd_above_4g_mem_start() if
> !above_4g_mem_size. And I guess in that case we can just remove pc_max_used_gpa()
> and replace with a:
> 
> 	max_used_gpa = pc_pci_hole64_start() + pci_hole64_size
> 
> Which makes this even simpler. thoughts?
> 
> >> +    }
> >> +
> >> +    if (pcmc->has_reserved_memory &&
> >> +       (machine->ram_size < machine->maxram_size)) {
> >> +        device_mem_size = machine->maxram_size - machine->ram_size;
> >> +    }
> >> +
> >> +    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
> >> +                    pcms->sgx_epc.size, 1 * GiB);
> >> +
> >> +    return base + device_mem_size + pci_hole64_size;  
> > 
> > it's not guarantied that pci64 hole starts right away device_mem,
> > but you are not 1st doing this assumption in code, maybe instead of
> > all above use existing 
> >    pc_pci_hole64_start() + pci_hole64_size
> > to gestimate max address 
> >   
> I've switched the block above to that instead.
> 
> >> +}
> >> +
> >> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >> +                                          uint64_t pci_hole64_size)  
> > 
> > s/x86_update_above_4g_mem_start/pc_set_amd_above_4g_mem_start/
> >   
> Fixed.
> 
> >> +{
> >> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >> +    CPUX86State *env = &X86_CPU(first_cpu)->env;
> >> +    hwaddr start = x86ms->above_4g_mem_start;
> >> +    hwaddr maxphysaddr, maxusedaddr;  
> > 
> >   
> >> +    /*
> >> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> >> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> >> +     * to above 1T to AMD vCPUs only.
> >> +     */
> >> +    if (!IS_AMD_CPU(env)) {
> >> +        return;
> >> +    }  
> > 
> > move this to caller
> >   
> Done (same for the patch after this one):
> 
> -    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
> +    /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(env)) {
> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> +    }
> 
> 
> >> +    /* Bail out if max possible address does not cross HT range */
> >> +    if (x86_max_phys_addr(pcms, start, pci_hole64_size) < AMD_HT_START) {
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> >> +     * So make sure phys-bits is required to be appropriately sized in order
> >> +     * to proceed with the above-4g-region relocation and thus boot.
> >> +     */
> >> +    start = AMD_ABOVE_1TB_START;
> >> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> >> +    maxusedaddr = x86_max_phys_addr(pcms, start, pci_hole64_size);
> >> +    if (maxphysaddr < maxusedaddr) {
> >> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> >> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> >> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> >> +        exit(EXIT_FAILURE);
> >> +    }
> >> +
> >> +
> >> +    x86ms->above_4g_mem_start = start;
> >> +}
> >> +
> >>  void pc_memory_init(PCMachineState *pcms,
> >>                      MemoryRegion *system_memory,
> >>                      MemoryRegion *rom_memory,
> >> @@ -817,6 +921,8 @@ void pc_memory_init(PCMachineState *pcms,
> >>  
> >>      linux_boot = (machine->kernel_filename != NULL);
> >>  
> >> +    x86_update_above_4g_mem_start(pcms, pci_hole64_size);
> >> +
> >>      /*
> >>       * Split single memory region and use aliases to address portions of it,
> >>       * done for backwards compatibility with older qemus.
> >> @@ -827,6 +933,11 @@ void pc_memory_init(PCMachineState *pcms,
> >>                               0, x86ms->below_4g_mem_size);
> >>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> >>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> >> +
> >> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> >> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> >> +    }  
> > probably it is not necessary, but it doesn't hurt
> >   
> 
> virtual firmware can make better decisions to avoid reserved ranges.
> 
> I was actually thinking that if phys_bits was >= 40 that we would
> anyways add it.
> 
> >>      if (x86ms->above_4g_mem_size > 0) {
> >>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> >>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",  
> >   
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-17 12:32       ` Igor Mammedov
@ 2022-06-17 13:33         ` Joao Martins
  2022-06-20 14:27           ` Igor Mammedov
  0 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-06-17 13:33 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 6/17/22 13:32, Igor Mammedov wrote:
> On Fri, 17 Jun 2022 13:18:38 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 6/16/22 15:23, Igor Mammedov wrote:
>>> On Fri, 20 May 2022 11:45:31 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> +                                hwaddr above_4g_mem_start,
>>>> +                                uint64_t pci_hole64_size)
>>>> +{
>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>> +    MachineState *machine = MACHINE(pcms);
>>>> +    ram_addr_t device_mem_size = 0;
>>>> +    hwaddr base;
>>>> +
>>>> +    if (!x86ms->above_4g_mem_size) {
>>>> +       /*
>>>> +        * 32-bit pci hole goes from
>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>> +        */
>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;  
>>>
>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
>>> that's located above it.
>>>   
>>
>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
>> otherwise. We won't hit the 1T hole, hence a nop.
> 
> I don't get the reasoning, can you clarify it pls?
> 

I was trying to say that what led me here was a couple of qtest failures (from v3->v4).

Before that, I was computing this based on pci_hole64. phys-bits=32 was, for example,
one of the test failures, as pci-hole64 sits above what 32 bits can address.

>>  Unless we plan on using
>> pc_max_used_gpa() for something else other than this.
> 
> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
> large enough on CLI.
> 
So hotpluggable memory seems to assume it sits above 4g mem.

pci_hole64 likewise, as it uses computations similar to hotplug's.

Unless I am misunderstanding something here.

> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
> 
I think this was what I had before (v3[0]) and it did not work.

Let me revisit this edge case again.

[0] https://lore.kernel.org/all/20220223184455.9057-5-joao.m.martins@oracle.com/


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 5/5] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-06-16 14:27   ` Igor Mammedov
@ 2022-06-17 13:36     ` Joao Martins
  0 siblings, 0 replies; 32+ messages in thread
From: Joao Martins @ 2022-06-17 13:36 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 6/16/22 15:27, Igor Mammedov wrote:
> On Fri, 20 May 2022 11:45:32 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> The added enforcing is only relevant in the case of AMD where the
>> range right before the 1TB is restricted and cannot be DMA mapped
>> by the kernel consequently leading to IOMMU INVALID_DEVICE_REQUEST
>> or possibly other kinds of IOMMU events in the AMD IOMMU.
>>
>> Although, there's a case where it may make sense to disable the
>> IOVA relocation/validation when migrating from a
>> non-valid-IOVA-aware qemu to one that supports it.
>>
>> Relocating RAM regions to after the 1Tb hole has consequences for
>> guest ABI because we are changing the memory mapping, so make
>> sure that only new machine types enforce it, but not older ones.
> 
> is old machine with so much ram going to work and not explode
> even without iommu?
> 
Depends on your definition of 'work'.

And that's the purpose of this patch: to still allow graceful
failures on hosts with different hypervisor kernel versions that
would use a versioned machine type (like pc-q35-7.0 or older).

e.g. if you boot a guest with pc-q35-7.0 on a 4.19 kernel it will boot,
whereas on a v5.14 kernel with the same pc-q35-7.0 the memory map would
stay the same, but it would fail, as a >= 5.4 kernel will validate
whether the IOVA is valid.

It will 'work' as before for old machine types, meaning you are dependent on the
kernel to validate IOVAs and to prevent DMA maps or not. Without an IOMMU enabled
you don't need this, but you also can't do VFIO (or the like, e.g. vDPA).
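
To make the compat behaviour concrete, here is a rough sketch of the decision
as I read it from the two patches. struct machine_model and above_4g_start()
are hypothetical names used only for illustration; the constants mirror the
ones defined in patch 4:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB                 (1ULL << 30)
#define AMD_HT_START        0xfd00000000ULL
#define AMD_ABOVE_1TB_START 0x10000000000ULL   /* AMD_HT_END + 1 */

static_assert(AMD_ABOVE_1TB_START == 1024 * GiB, "relocation target is 1 TiB");

/* Hypothetical stand-in for the one PCMachineClass field that matters here. */
struct machine_model {
    bool enforce_valid_iova;   /* false for machine types <= 7.0 */
};

/*
 * Hypothetical helper: where ram-above-4g starts.
 * Old machine types and non-AMD vCPUs keep the 4G start, so the guest
 * memory map (and hence guest ABI) is unchanged for them.
 */
static uint64_t above_4g_start(const struct machine_model *mc,
                               bool amd_vcpu, uint64_t max_used_gpa)
{
    if (!amd_vcpu || !mc->enforce_valid_iova) {
        return 4 * GiB;
    }
    if (max_used_gpa < AMD_HT_START) {
        return 4 * GiB;   /* never reaches the HT range, nothing to do */
    }
    return AMD_ABOVE_1TB_START;
}
```

With this model, a pc-q35-7.0 style machine always gets the 4G start no matter
how large the guest is, and it is then up to the host kernel whether the DMA
map of the HT range is rejected.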

>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c         | 7 +++++--
>>  hw/i386/pc_piix.c    | 2 ++
>>  hw/i386/pc_q35.c     | 2 ++
>>  include/hw/i386/pc.h | 1 +
>>  4 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 652ae8ff9ccf..62f9af91f19f 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -862,6 +862,7 @@ static hwaddr x86_max_phys_addr(PCMachineState *pcms,
>>  static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>                                            uint64_t pci_hole64_size)
>>  {
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      CPUX86State *env = &X86_CPU(first_cpu)->env;
>>      hwaddr start = x86ms->above_4g_mem_start;
>> @@ -870,9 +871,10 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>      /*
>>       * The HyperTransport range close to the 1T boundary is unique to AMD
>>       * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> -     * to above 1T to AMD vCPUs only.
>> +     * to above 1T to AMD vCPUs only. @enforce_valid_iova is only false in
>> +     * older machine types (<= 7.0) for compatibility purposes.
>>       */
>> -    if (!IS_AMD_CPU(env)) {
>> +    if (!IS_AMD_CPU(env) || !pcmc->enforce_valid_iova) {
>>          return;
>>      }
>>  
>> @@ -1881,6 +1883,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>>      pcmc->has_reserved_memory = true;
>>      pcmc->kvmclock_enabled = true;
>>      pcmc->enforce_aligned_dimm = true;
>> +    pcmc->enforce_valid_iova = true;
>>      /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
>>       * to be used at the moment, 32K should be enough for a while.  */
>>      pcmc->acpi_data_size = 0x20000 + 0x8000;
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index 57bb5b8f2aea..74176a210d56 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -437,9 +437,11 @@ DEFINE_I440FX_MACHINE(v7_1, "pc-i440fx-7.1", NULL,
>>  
>>  static void pc_i440fx_7_0_machine_options(MachineClass *m)
>>  {
>> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>>      pc_i440fx_7_1_machine_options(m);
>>      m->alias = NULL;
>>      m->is_default = false;
>> +    pcmc->enforce_valid_iova = false;
>>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>>  }
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 4d5c2fbd976b..bc38a6ba4c67 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -381,8 +381,10 @@ DEFINE_Q35_MACHINE(v7_1, "pc-q35-7.1", NULL,
>>  
>>  static void pc_q35_7_0_machine_options(MachineClass *m)
>>  {
>> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>>      pc_q35_7_1_machine_options(m);
>>      m->alias = NULL;
>> +    pcmc->enforce_valid_iova = false;
>>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>>  }
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 9c847faea2f8..22119131eca7 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -117,6 +117,7 @@ struct PCMachineClass {
>>      bool has_reserved_memory;
>>      bool enforce_aligned_dimm;
>>      bool broken_reserved_end;
>> +    bool enforce_valid_iova;
>>  
>>      /* generate legacy CPU hotplug AML */
>>      bool legacy_cpu_hotplug;
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-17 12:18     ` Joao Martins
  2022-06-17 12:32       ` Igor Mammedov
@ 2022-06-17 16:12       ` Joao Martins
  1 sibling, 0 replies; 32+ messages in thread
From: Joao Martins @ 2022-06-17 16:12 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 6/17/22 13:18, Joao Martins wrote:
> On 6/16/22 15:23, Igor Mammedov wrote:
>> On Fri, 20 May 2022 11:45:31 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> +    }
>>> +
>>> +    if (pcmc->has_reserved_memory &&
>>> +       (machine->ram_size < machine->maxram_size)) {
>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>>> +    }
>>> +
>>> +    base = ROUND_UP(above_4g_mem_start + x86ms->above_4g_mem_size +
>>> +                    pcms->sgx_epc.size, 1 * GiB);
>>> +
>>> +    return base + device_mem_size + pci_hole64_size;
>>
>> it's not guarantied that pci64 hole starts right away device_mem,
>> but you are not 1st doing this assumption in code, maybe instead of
>> all above use existing 
>>    pc_pci_hole64_start() + pci_hole64_size
>> to gestimate max address 
>>
> I've switched the block above to that instead.
> 

I had done this, but on a second look (and confirmed with testing) this
will crash, given that @device_memory isn't yet initialized. And even without
hotplug, CXL might have had issues.

The problem is largely that pc_pci_hole64_start(), which the above check relies
on, uses info we only populate later on in pc_memory_init(), and I don't think I can
move this to a later point, as I definitely don't want to re-initialize
MRs or anything.

So we might be left with manually calculating as I was doing in this patch,
but maybe try to arrange some form of new helper that shares some
logic with pc_pci_hole64_start().

uint64_t pc_pci_hole64_start(void)
{
    PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
    MachineState *ms = MACHINE(pcms);
    X86MachineState *x86ms = X86_MACHINE(pcms);
    uint64_t hole64_start = 0;

    if (pcms->cxl_devices_state.host_mr.addr) {
        hole64_start = pcms->cxl_devices_state.host_mr.addr +
            memory_region_size(&pcms->cxl_devices_state.host_mr);
        if (pcms->cxl_devices_state.fixed_windows) {
            GList *it;
            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
                CXLFixedWindow *fw = it->data;
                hole64_start = fw->mr.addr + memory_region_size(&fw->mr);
            }
        }
    /* the next line is the problem: ms->device_memory is not initialized yet */
    } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
        hole64_start = ms->device_memory->base;
        if (!pcmc->broken_reserved_end) {
            hole64_start += memory_region_size(&ms->device_memory->mr);
        }
    } else if (pcms->sgx_epc.size != 0) {
        hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
    } else {
        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
    }
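
As a rough idea of what such a helper could look like, below is a sketch that
estimates the start of the 64-bit hole purely from sizes that are known before
pc_memory_init() runs. It follows the manual calculation from this patch.
struct layout_inputs and estimate_hole64_start() are hypothetical names, and
CXL is deliberately left out, so treat it as an illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Hypothetical: plain numbers available before any MemoryRegion exists. */
struct layout_inputs {
    uint64_t above_4g_mem_start;
    uint64_t above_4g_mem_size;
    uint64_t sgx_epc_size;
    uint64_t ram_size;
    uint64_t maxram_size;
    bool has_reserved_memory;
};

/* Round v up to the next multiple of align (align must be non-zero). */
static uint64_t round_up(uint64_t v, uint64_t align)
{
    assert(align != 0);
    return (v + align - 1) / align * align;
}

/*
 * Hypothetical helper: estimate where the 64-bit PCI hole would start
 * without dereferencing device_memory or any other state that only
 * gets populated later in pc_memory_init().
 */
static uint64_t estimate_hole64_start(const struct layout_inputs *in)
{
    uint64_t device_mem_size = 0;

    if (in->has_reserved_memory && in->ram_size < in->maxram_size) {
        device_mem_size = in->maxram_size - in->ram_size;
    }

    return round_up(in->above_4g_mem_start + in->above_4g_mem_size +
                    in->sgx_epc_size, 1 * GiB) + device_mem_size;
}
```

Adding pci_hole64_size to the result gives the same estimate of the max used
address that the patch computes by hand.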



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 2/5] i386/pc: create pci-host qdev prior to pc_memory_init()
  2022-06-16 13:21   ` Reviewed-by: Igor Mammedov
  2022-06-17 11:03     ` Joao Martins
@ 2022-06-20  7:12     ` Mark Cave-Ayland
  1 sibling, 0 replies; 32+ messages in thread
From: Mark Cave-Ayland @ 2022-06-20  7:12 UTC (permalink / raw)
  To: Igor Mammedov, Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 16/06/2022 14:21, Igor Mammedov wrote:

> On Fri, 20 May 2022 11:45:29 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> At the start of pc_memory_init() we usually pass a range of
>> 0..UINT64_MAX as pci_memory, when really its 2G (i440fx) or
>> 32G (q35). To get the real user value, we need to get pci-host
>> passed property for default pci_hole64_size. Thus to get that,
>> create the qdev prior to memory init to better make estimations
>> on max used/phys addr.
>>
>> This is in preparation to determine that host-phys-bits are
>> enough and also for pci-hole64-size to be considered to relocate
>> ram-above-4g to be at 1T (on AMD platforms).
> 
> with comments below fixed
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
>   
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>   hw/i386/pc_piix.c            | 5 ++++-
>>   hw/i386/pc_q35.c             | 6 +++---
>>   hw/pci-host/i440fx.c         | 3 +--
>>   include/hw/pci-host/i440fx.h | 2 +-
>>   4 files changed, 9 insertions(+), 7 deletions(-)
>>
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index 578e537b3525..12d4a279c793 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
>>       MemoryRegion *pci_memory;
>>       MemoryRegion *rom_memory;
>>       ram_addr_t lowmem;
>> +    DeviceState *i440fx_dev;
>>   
>>       /*
>>        * Calculate ram split, for memory below and above 4G.  It's a bit
>> @@ -164,9 +165,11 @@ static void pc_init1(MachineState *machine,
>>           pci_memory = g_new(MemoryRegion, 1);
>>           memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>>           rom_memory = pci_memory;
>> +        i440fx_dev = qdev_new(host_type);
>>       } else {
>>           pci_memory = NULL;
>>           rom_memory = system_memory;
>> +        i440fx_dev = NULL;
>>       }
>>   
>>       pc_guest_info_init(pcms);
>> @@ -199,7 +202,7 @@ static void pc_init1(MachineState *machine,
>>   
>>           pci_bus = i440fx_init(host_type,
>>                                 pci_type,
>> -                              &i440fx_state,
>> +                              i440fx_dev, &i440fx_state,
> confusing names, suggest to rename i440fx_state -> pci_i440fx and i440fx_dev -> i440fx_host
> or something like this

It might be worth considering this series on top of Bernhard's patch here: 
https://lists.gnu.org/archive/html/qemu-devel/2022-06/msg02206.html.

>>                                 system_memory, system_io, machine->ram_size,
>>                                 x86ms->below_4g_mem_size,
>>                                 x86ms->above_4g_mem_size,
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 42eb8b97079a..8d867bdb274a 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -203,12 +203,12 @@ static void pc_q35_init(MachineState *machine)
>>                               pcms->smbios_entry_point_type);
>>       }
>>   
>> -    /* allocate ram and load rom/bios */
>> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
>> -
>>       /* create pci host bus */
>>       q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>>   
>> +    /* allocate ram and load rom/bios */
>> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
>> +
>>       object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>>       object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>>                                OBJECT(ram_memory), NULL);
>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>> index e08716142b6e..5c1bab5c58ed 100644
>> --- a/hw/pci-host/i440fx.c
>> +++ b/hw/pci-host/i440fx.c
>> @@ -238,6 +238,7 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>>   }
>>   
>>   PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> 
> does it still need 'host_type'?
> 
>> +                    DeviceState *dev,
>>                       PCII440FXState **pi440fx_state,
>>                       MemoryRegion *address_space_mem,
>>                       MemoryRegion *address_space_io,
>> @@ -247,7 +248,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>                       MemoryRegion *pci_address_space,
>>                       MemoryRegion *ram_memory)
>>   {
>> -    DeviceState *dev;
>>       PCIBus *b;
>>       PCIDevice *d;
>>       PCIHostState *s;
>> @@ -255,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>       unsigned i;
>>       I440FXState *i440fx;
>>   
>> -    dev = qdev_new(host_type);
>>       s = PCI_HOST_BRIDGE(dev);
>>       b = pci_root_bus_new(dev, NULL, pci_address_space,
>>                            address_space_io, 0, TYPE_PCI_BUS);
>> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
>> index f068aaba8fda..c4710445e30a 100644
>> --- a/include/hw/pci-host/i440fx.h
>> +++ b/include/hw/pci-host/i440fx.h
>> @@ -36,7 +36,7 @@ struct PCII440FXState {
>>   #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
>>   
>>   PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>> -                    PCII440FXState **pi440fx_state,
>> +                    DeviceState *dev, PCII440FXState **pi440fx_state,
>>                       MemoryRegion *address_space_mem,
>>                       MemoryRegion *address_space_io,
>>                       ram_addr_t ram_size,


ATB,

Mark.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-17 13:33         ` Joao Martins
@ 2022-06-20 14:27           ` Igor Mammedov
  2022-06-20 16:36             ` Joao Martins
  0 siblings, 1 reply; 32+ messages in thread
From: Igor Mammedov @ 2022-06-20 14:27 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Fri, 17 Jun 2022 14:33:02 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/17/22 13:32, Igor Mammedov wrote:
> > On Fri, 17 Jun 2022 13:18:38 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> On 6/16/22 15:23, Igor Mammedov wrote:  
> >>> On Fri, 20 May 2022 11:45:31 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>>> +                                hwaddr above_4g_mem_start,
> >>>> +                                uint64_t pci_hole64_size)
> >>>> +{
> >>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>> +    MachineState *machine = MACHINE(pcms);
> >>>> +    ram_addr_t device_mem_size = 0;
> >>>> +    hwaddr base;
> >>>> +
> >>>> +    if (!x86ms->above_4g_mem_size) {
> >>>> +       /*
> >>>> +        * 32-bit pci hole goes from
> >>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >>>> +        */
> >>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;    
> >>>
> >>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
> >>> that's located above it.
> >>>     
> >>
> >> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
> >> otherwise. We won't hit the 1T hole, hence a nop.  
> > 
> > I don't get the reasoning, can you clarify it pls?
> >   
> 
> I was trying to say that what lead me here was a couple of qtests failures (from v3->v4).
> 
> I was doing this before based on pci_hole64. phys-bits=32 was for example one
> of the test failures, and pci-hole64 sits above what 32-bit can reference.

If the user sets phys-bits=32, then nothing above 4Gb should work (be usable),
including above-4g RAM, the hotplug region, the pci64 hole, SGX or CXL.

And this doesn't look like an AMD-specific issue to me.

Perhaps do the phys-bits check as a separate patch
that will error out if max_used_gpa is above the phys-bits limit
(maybe at machine_done time), i.e. defining max_gpa and checking whether
it is compatible with the configured CPU are 2 different things.

(It might be that the tests need to be fixed too to account for it.)
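
Roughly what I mean by keeping the two things separate, as a sketch only. Both
function names are hypothetical, and where the check gets called from (e.g. a
machine_done notifier) is left open:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Highest address a CPU with the given number of physical bits can emit. */
static uint64_t max_phys_addr(unsigned phys_bits)
{
    assert(phys_bits >= 1 && phys_bits <= 64);
    return phys_bits == 64 ? UINT64_MAX : (1ULL << phys_bits) - 1;
}

/*
 * Hypothetical generic check, independent of how max_used_gpa was
 * computed and independent of the CPU vendor: the caller reports an
 * error when this returns false.
 */
static bool gpa_fits_phys_bits(uint64_t max_used_gpa, unsigned phys_bits)
{
    return max_used_gpa <= max_phys_addr(phys_bits);
}
```

With phys-bits=40 the limit is 0xffffffffff, so anything relocated to 1T or
above would be rejected, while phys-bits=41 would accept it.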

> >>  Unless we plan on using
> >> pc_max_used_gpa() for something else other than this.  
> > 
> > Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
> > present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
> > large enough on CLI.
> >   
> So hotpluggable memory seems to assume it sits above 4g mem.
> 
> pci_hole64 likewise as it uses similar computations as hotplug.
> 
> Unless I am misunderstanding something here.
> 
> > Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
> >   
> I think this was what I had before (v3[0]) and did not work.

That had been tied to the host's phys-bits directly, all in one patch,
and duplicated the existing pc_pci_hole64_start().
 
> Let me revisit this edge case again.
> 
> [0] https://lore.kernel.org/all/20220223184455.9057-5-joao.m.martins@oracle.com/
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-20 14:27           ` Igor Mammedov
@ 2022-06-20 16:36             ` Joao Martins
  2022-06-20 18:13               ` Joao Martins
  0 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-06-20 16:36 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 6/20/22 15:27, Igor Mammedov wrote:
> On Fri, 17 Jun 2022 14:33:02 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 6/17/22 13:32, Igor Mammedov wrote:
>>> On Fri, 17 Jun 2022 13:18:38 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 6/16/22 15:23, Igor Mammedov wrote:  
>>>>> On Fri, 20 May 2022 11:45:31 +0100
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>> +                                hwaddr above_4g_mem_start,
>>>>>> +                                uint64_t pci_hole64_size)
>>>>>> +{
>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>> +    MachineState *machine = MACHINE(pcms);
>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>> +    hwaddr base;
>>>>>> +
>>>>>> +    if (!x86ms->above_4g_mem_size) {
>>>>>> +       /*
>>>>>> +        * 32-bit pci hole goes from
>>>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>>>> +        */
>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;    
>>>>>
>>>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
>>>>> that's located above it.
>>>>>     
>>>>
>>>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
>>>> otherwise. We won't hit the 1T hole, hence a nop.  
>>>
>>> I don't get the reasoning, can you clarify it pls?
>>>   
>>
>> I was trying to say that what lead me here was a couple of qtests failures (from v3->v4).
>>
>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
>> of the test failures, and pci-hole64 sits above what 32-bit can reference.
> 
> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
> 
> and this doesn't look to me as AMD specific issue
> 
> perhaps do a phys-bits check as a separate patch
> that will error out if max_used_gpa is above phys-bits limit
> (maybe at machine_done time)
> (i.e. defining max_gpa and checking if compatible with configured cpu
> are 2 different things)
> 
> (it might be possible that tests need to be fixed too to account for it)
> 

My old notes (from v3) tell me with such a check these tests were exiting early thanks to
that error:

 1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
  killed by signal 6 SIGABRT
 4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
  killed by signal 6 SIGABRT
 7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
  killed by signal 6 SIGABRT
44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
  killed by signal 6 SIGABRT
45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
  killed by signal 6 SIGABRT

But the real reason these fail is not related to CPU phys-bits at all. They fail
because we don't handle the case where no pci_hole64 is supposed to exist (which
is what that other check is trying to do), e.g. a VM with -m 1G would
observe the same thing. That is, the computations after that conditional are all for the
pci hole64, which accounts for SGX/CXL/hotplug etc., and consequently the result is
*erroneously* bigger than what phys-bits=32 allows (by definition). So the error_report is
just telling me that pc_max_used_gpa() is incorrect without the
!x86ms->above_4g_mem_size check.
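
To make the distinction concrete, here is a minimal standalone sketch of the kind of
phys-bits check being discussed. It is not QEMU code: the toy_* names are hypothetical,
and 0xfec00000 is assumed to be the value of IO_APIC_DEFAULT_ADDRESS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of QEMU's IO_APIC_DEFAULT_ADDRESS; for illustration only. */
#define TOY_IO_APIC_DEFAULT_ADDRESS UINT64_C(0xfec00000)

/* Hypothetical helper: highest guest physical address reachable with phys_bits. */
static uint64_t toy_max_gpa_for_phys_bits(unsigned phys_bits)
{
    assert(phys_bits >= 1 && phys_bits <= 64);
    if (phys_bits == 64) {
        return UINT64_MAX;
    }
    return (UINT64_C(1) << phys_bits) - 1;
}

/* Hypothetical check: does the highest used GPA fit the configured phys-bits? */
static bool toy_gpa_fits_phys_bits(uint64_t max_used_gpa, unsigned phys_bits)
{
    return max_used_gpa <= toy_max_gpa_for_phys_bits(phys_bits);
}
```

With phys-bits=32, TOY_IO_APIC_DEFAULT_ADDRESS - 1 passes the check while any address at
or above 4 GiB does not, so a max-GPA value that wrongly includes a 64-bit hole would trip
it even on a guest with no memory above 4G.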

If you're not fond of:

+    if (!x86ms->above_4g_mem_size) {
+       /*
+        * 32-bit pci hole goes from
+        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+         */
+        return IO_APIC_DEFAULT_ADDRESS - 1;
+    }

Then what should I use instead of the above?

'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
also what is used for i440fx/q35 code. I could move it to a macro (e.g.
PCI_HOST_HOLE32_SIZE) to make it a bit more readable and less hardcoded. Or
perhaps your concern is with !x86ms->above_4g_mem_size, and I should
additionally check for the existence of hotplug/CXL/etc?

>>>>  Unless we plan on using
>>>> pc_max_used_gpa() for something else other than this.  
>>>
>>> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
>>> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
>>> large enough on CLI.
>>>   
>> So hotpluggable memory seems to assume it sits above 4g mem.
>>
>> pci_hole64 likewise as it uses similar computations as hotplug.
>>
>> Unless I am misunderstanding something here.
>>
>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
>>>   
>> I think this was what I had before (v3[0]) and did not work.
> 
> that had been tied to host's phys-bits directly, all in one patch
> and duplicating existing pc_pci_hole64_start().
>  

The duplication was my bad attempt at pc_max_used_gpa() in this patch.

I was thinking of something like extracting the computation of each start + size "tuple"
into functions -- e.g. pc_get_device_memory_range() for hotplug and maybe
pc_get_cxl_range() for CXL -- rather than assuming those values are already initialized on
the memory-region @base and its size.

See snippet below. Note I am missing CXL handling, but it gives you the idea.

But it is slightly more complex than what I had in this version :( and would require
anyone changing pc_memory_init() and pc_pci_hole64_start() to make sure they follow
similar logic.

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index fd088093b5d5..016bc65fcb4b 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -885,6 +885,34 @@ static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
     x86ms->above_4g_mem_start = start;
 }

+static void pc_get_device_memory_range(PCMachineState *pcms,
+                                       hwaddr *base,
+                                       hwaddr *device_mem_size)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    hwaddr addr, size;
+
+    size = machine->maxram_size - machine->ram_size;
+
+    if (pcms->sgx_epc.size != 0) {
+        addr = sgx_epc_above_4g_end(&pcms->sgx_epc);
+    } else {
+        addr = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+    }
+
+    if (pcmc->enforce_aligned_dimm) {
+        /* size device region assuming 1G page max alignment per slot */
+        size += (1 * GiB) * machine->ram_slots;
+    }
+
+    if (base)
+        *base = addr;
+    if (device_mem_size)
+        *device_mem_size = size;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -962,7 +990,7 @@ void pc_memory_init(PCMachineState *pcms,
     /* initialize device memory address space */
     if (pcmc->has_reserved_memory &&
         (machine->ram_size < machine->maxram_size)) {
-        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
+        ram_addr_t device_mem_size;

         if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
             error_report("unsupported amount of memory slots: %"PRIu64,
@@ -977,20 +1005,7 @@ void pc_memory_init(PCMachineState *pcms,
             exit(EXIT_FAILURE);
         }

-        if (pcms->sgx_epc.size != 0) {
-            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
-        } else {
-            machine->device_memory->base =
-                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
-        }
-
-        machine->device_memory->base =
-            ROUND_UP(machine->device_memory->base, 1 * GiB);
-
-        if (pcmc->enforce_aligned_dimm) {
-            /* size device region assuming 1G page max alignment per slot */
-            device_mem_size += (1 * GiB) * machine->ram_slots;
-        }
+        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);

         if ((machine->device_memory->base + device_mem_size) <
             device_mem_size) {
@@ -1053,6 +1068,27 @@ void pc_memory_init(PCMachineState *pcms,
     pcms->memhp_io_base = ACPI_MEMORY_HOTPLUG_BASE;
 }

+static uint64_t x86ms_pci_hole64_start(PCMachineState *pcms)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    uint64_t hole64_start, size;
+
+    if (pcmc->has_reserved_memory &&
+        (machine->ram_size < machine->maxram_size)) {
+        pc_get_device_memory_range(pcms, &hole64_start, &size);
+        if (!pcmc->broken_reserved_end) {
+            hole64_start += size;
+        }
+    } else if (pcms->sgx_epc.size != 0) {
+        hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
+    } else {
+        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+    }
+
+    return hole64_start;
+}
 /*
  * The 64bit pci hole starts after "above 4G RAM" and
  * potentially the space reserved for memory hotplug.
@@ -1062,18 +1098,17 @@ uint64_t pc_pci_hole64_start(void)
     PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     MachineState *ms = MACHINE(pcms);
-    X86MachineState *x86ms = X86_MACHINE(pcms);
     uint64_t hole64_start = 0;

-    if (pcmc->has_reserved_memory && ms->device_memory->base) {
+    if (pcmc->has_reserved_memory &&
+        ms->device_memory && ms->device_memory->base) {
         hole64_start = ms->device_memory->base;
         if (!pcmc->broken_reserved_end) {
             hole64_start += memory_region_size(&ms->device_memory->mr);
         }
-    } else if (pcms->sgx_epc.size != 0) {
-            hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
     } else {
-        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+        /* handles unpopulated memory regions */
+        hole64_start = x86ms_pci_hole64_start(pcms);
     }

     return ROUND_UP(hole64_start, 1 * GiB);
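
For readers who want to follow the arithmetic without the QEMU types, the computation in
the diff above can be modeled standalone roughly as below. This is only a sketch: the
toy_layout struct and the toy_* helpers are hypothetical stand-ins for the machine state
the diff reads, and the model keeps the 1 GiB rounding of the device memory base that
pc_memory_init() applied before this change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_GIB (UINT64_C(1) << 30)
/* Round n up to a multiple of d; d must be a power of two. */
#define TOY_ROUND_UP(n, d) (((n) + (d) - 1) & ~((d) - 1))

/* Hypothetical flattened view of the fields the diff reads. */
struct toy_layout {
    uint64_t above_4g_mem_start;
    uint64_t above_4g_mem_size;
    uint64_t sgx_epc_end;          /* 0 when there is no SGX EPC above 4G */
    uint64_t ram_size;
    uint64_t maxram_size;
    uint64_t ram_slots;
    bool has_reserved_memory;
    bool enforce_aligned_dimm;
    bool broken_reserved_end;
};

/* End of RAM above 4G, or end of the SGX EPC region when one exists. */
static uint64_t toy_ram_end(const struct toy_layout *l)
{
    if (l->sgx_epc_end != 0) {
        return l->sgx_epc_end;
    }
    return l->above_4g_mem_start + l->above_4g_mem_size;
}

/* Compute the hotplug (device memory) range without touching any memory region. */
static void toy_device_memory_range(const struct toy_layout *l,
                                    uint64_t *base, uint64_t *size)
{
    uint64_t sz;

    assert(l->maxram_size >= l->ram_size);
    sz = l->maxram_size - l->ram_size;
    if (l->enforce_aligned_dimm) {
        /* size device region assuming 1G page max alignment per slot */
        sz += TOY_GIB * l->ram_slots;
    }
    if (base) {
        *base = TOY_ROUND_UP(toy_ram_end(l), TOY_GIB);
    }
    if (size) {
        *size = sz;
    }
}

/* The 64-bit PCI hole starts after RAM above 4G and any hotplug region. */
static uint64_t toy_pci_hole64_start(const struct toy_layout *l)
{
    uint64_t start, size;

    if (l->has_reserved_memory && l->ram_size < l->maxram_size) {
        toy_device_memory_range(l, &start, &size);
        if (!l->broken_reserved_end) {
            start += size;
        }
    } else {
        start = toy_ram_end(l);
    }
    return TOY_ROUND_UP(start, TOY_GIB);
}
```

The point of the split is that toy_pci_hole64_start() gives the same answer whether or not
the device memory region has been initialized yet, which is what a max-used-GPA
computation done before pc_memory_init() would need.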



* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-20 16:36             ` Joao Martins
@ 2022-06-20 18:13               ` Joao Martins
  2022-06-28 12:38                 ` Igor Mammedov
  0 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-06-20 18:13 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On 6/20/22 17:36, Joao Martins wrote:
> On 6/20/22 15:27, Igor Mammedov wrote:
>> On Fri, 17 Jun 2022 14:33:02 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 6/17/22 13:32, Igor Mammedov wrote:
>>>> On Fri, 17 Jun 2022 13:18:38 +0100
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 6/16/22 15:23, Igor Mammedov wrote:  
>>>>>> On Fri, 20 May 2022 11:45:31 +0100
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>>> +                                hwaddr above_4g_mem_start,
>>>>>>> +                                uint64_t pci_hole64_size)
>>>>>>> +{
>>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>> +    MachineState *machine = MACHINE(pcms);
>>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>>> +    hwaddr base;
>>>>>>> +
>>>>>>> +    if (!x86ms->above_4g_mem_size) {
>>>>>>> +       /*
>>>>>>> +        * 32-bit pci hole goes from
>>>>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>>>>> +        */
>>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;    
>>>>>>
>>>>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
>>>>>> that's located above it.
>>>>>>     
>>>>>
>>>>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
>>>>> otherwise. We won't hit the 1T hole, hence a nop.  
>>>>
>>>> I don't get the reasoning, can you clarify it pls?
>>>>   
>>>
>>> I was trying to say that what lead me here was a couple of qtests failures (from v3->v4).
>>>
>>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
>>> of the test failures, and pci-hole64 sits above what 32-bit can reference.
>>
>> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
>> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
>>
>> and this doesn't look to me as AMD specific issue
>>
>> perhaps do a phys-bits check as a separate patch
>> that will error out if max_used_gpa is above phys-bits limit
>> (maybe at machine_done time)
>> (i.e. defining max_gpa and checking if compatible with configured cpu
>> are 2 different things)
>>
>> (it might be possible that tests need to be fixed too to account for it)
>>
> 
> My old notes (from v3) tell me with such a check these tests were exiting early thanks to
> that error:
> 
>  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
>   killed by signal 6 SIGABRT
>  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
>   killed by signal 6 SIGABRT
>  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
>   killed by signal 6 SIGABRT
> 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
>   killed by signal 6 SIGABRT
> 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
>   killed by signal 6 SIGABRT
> 
> But the real reason these fail is not related to CPU phys-bits at all. They fail
> because we don't handle the case where no pci_hole64 is supposed to exist (which
> is what that other check is trying to do), e.g. a VM with -m 1G would
> observe the same thing. That is, the computations after that conditional are all for the
> pci hole64, which accounts for SGX/CXL/hotplug etc., and consequently the result is
> *erroneously* bigger than what phys-bits=32 allows (by definition). So the error_report is
> just telling me that pc_max_used_gpa() is incorrect without the
> !x86ms->above_4g_mem_size check.
> 
> If you're not fond of:
> 
> +    if (!x86ms->above_4g_mem_size) {
> +       /*
> +        * 32-bit pci hole goes from
> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> +         */
> +        return IO_APIC_DEFAULT_ADDRESS - 1;
> +    }
> 
> Then what should I use instead of the above?
> 
> 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
> also what is used for i440fx/q35 code. I could move it to a macro (e.g.
> PCI_HOST_HOLE32_SIZE) to make it a bit more readable and less hardcoded. Or
> perhaps your concern is with !x86ms->above_4g_mem_size, and I should
> additionally check for the existence of hotplug/CXL/etc?
> 
>>>>>  Unless we plan on using
>>>>> pc_max_used_gpa() for something else other than this.  
>>>>
>>>> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
>>>> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
>>>> large enough on CLI.
>>>>   
>>> So hotpluggable memory seems to assume it sits above 4g mem.
>>>
>>> pci_hole64 likewise as it uses similar computations as hotplug.
>>>
>>> Unless I am misunderstanding something here.
>>>
>>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
>>>>   
>>> I think this was what I had before (v3[0]) and did not work.
>>
>> that had been tied to host's phys-bits directly, all in one patch
>> and duplicating existing pc_pci_hole64_start().
>>  
> 
> The duplication was my bad attempt at pc_max_used_gpa() in this patch.
> 
> I was thinking of something like extracting the computation of each start + size "tuple"
> into functions -- e.g. pc_get_device_memory_range() for hotplug and maybe
> pc_get_cxl_range() for CXL -- rather than assuming those values are already initialized on
> the memory-region @base and its size.
> 
> See snippet below. Note I am missing CXL handling, but it gives you the idea.
> 
> But it is slightly more complex than what I had in this version :( and would require
> anyone changing pc_memory_init() and pc_pci_hole64_start() to make sure they follow
> similar logic.
> 

Ignore previous snippet, here's a slightly cleaner version:

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 8eaa32ee2106..1d97c77a5eac 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -803,6 +803,43 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN       0x800
 #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)

+static void pc_get_device_memory_range(PCMachineState *pcms,
+                                       hwaddr *base,
+                                       hwaddr *device_mem_size)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    MachineState *machine = MACHINE(pcms);
+    hwaddr addr, size;
+
+    if (pcmc->has_reserved_memory &&
+        machine->device_memory && machine->device_memory->base) {
+        addr = machine->device_memory->base;
+        size = memory_region_size(&machine->device_memory->mr);
+        goto out;
+    }
+
+    /* uninitialized memory region */
+    size = machine->maxram_size - machine->ram_size;
+
+    if (pcms->sgx_epc.size != 0) {
+        addr = sgx_epc_above_4g_end(&pcms->sgx_epc);
+    } else {
+        addr = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+    }
+
+    if (pcmc->enforce_aligned_dimm) {
+        /* size device region assuming 1G page max alignment per slot */
+        size += (1 * GiB) * machine->ram_slots;
+    }
+
+out:
+    if (base)
+        *base = addr;
+    if (device_mem_size)
+        *device_mem_size = size;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -864,7 +901,7 @@ void pc_memory_init(PCMachineState *pcms,
     /* initialize device memory address space */
     if (pcmc->has_reserved_memory &&
         (machine->ram_size < machine->maxram_size)) {
-        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
+        ram_addr_t device_mem_size;

         if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
             error_report("unsupported amount of memory slots: %"PRIu64,
@@ -879,20 +916,7 @@ void pc_memory_init(PCMachineState *pcms,
             exit(EXIT_FAILURE);
         }

-        if (pcms->sgx_epc.size != 0) {
-            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
-        } else {
-            machine->device_memory->base =
-                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
-        }
-
-        machine->device_memory->base =
-            ROUND_UP(machine->device_memory->base, 1 * GiB);
-
-        if (pcmc->enforce_aligned_dimm) {
-            /* size device region assuming 1G page max alignment per slot */
-            device_mem_size += (1 * GiB) * machine->ram_slots;
-        }
+        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);

         if ((machine->device_memory->base + device_mem_size) <
             device_mem_size) {
@@ -965,12 +989,13 @@ uint64_t pc_pci_hole64_start(void)
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     MachineState *ms = MACHINE(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
-    uint64_t hole64_start = 0;
+    uint64_t hole64_start = 0, size = 0;

-    if (pcmc->has_reserved_memory && ms->device_memory->base) {
-        hole64_start = ms->device_memory->base;
+    if (pcmc->has_reserved_memory &&
+        (ms->ram_size < ms->maxram_size)) {
+        pc_get_device_memory_range(pcms, &hole64_start, &size);
         if (!pcmc->broken_reserved_end) {
-            hole64_start += memory_region_size(&ms->device_memory->mr);
+            hole64_start += size;
         }
     } else if (pcms->sgx_epc.size != 0) {
             hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);




* Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
  2022-05-20 10:45 [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (5 preceding siblings ...)
  2022-06-08 10:37 ` [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
@ 2022-06-22 22:37 ` Alex Williamson
  2022-06-22 23:18   ` Joao Martins
  6 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2022-06-22 22:37 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, wei.huang2, Dr. David Alan Gilbert

On Fri, 20 May 2022 11:45:27 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> v4[5] -> v5:
> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> commit message;
> 
> ---
> 
> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
> particularly when running on AMD systems with an IOMMU.
> 
> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> affected by this extra validation. But AMD systems with IOMMU have a hole in
> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> 
> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> of the failure:
> 
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> 	failed to setup container for group 258: memory listener initialization failed:
> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> 
> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> as documented on the links down below.
> 
> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
> to be above the 1T if the maximum RAM range crosses the HT reserved range.
> It is organized as following:
> 
> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>          address of the 4G boundary
> 
> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> 	     to get accessing to pci_hole64_size. The actual pci-host
> 	     initialization is kept as is, only the qdev_new.
> 
> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
> possible address acrosses the HT region. Errors out if the phys-bits is too
> low, which is only the case for >=1010G configurations or something that
> crosses the HT region.
> 
> patch 5: Ensure valid IOVAs only on new machine types, but not older
> ones (<= v7.0.0)
> 
> The 'consequence' of this approach is that we may need more than the default
> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
> address, consequently needing 41 phys-bits as opposed to the default of 40
> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
> pick the right value of phys-bits (regardless of this series), so we warn in
> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
> old seabios. 
> 
> Additionally, the reserved region is added to E820 if the relocation is done.

I was helping a user on irc yesterday who was assigning a bunch of GPUs
on an AMD system and was not specifying an increased PCI hole and
therefore was not triggering the relocation.  The result was that the
VM doesn't know about this special range and given their guest RAM
size, firmware was mapping GPU BARs overlapping this reserved range
anyway.  I didn't see any evidence that this user was doing anything
like booting with pci=nocrs to blatantly ignore the firmware provided
bus resources.

To avoid this sort of thing, shouldn't this hypertransport range always
be marked reserved regardless of whether the relocation is done?

vfio-pci won't generate a fatal error when MMIO mappings fail, so this
scenario can be rather subtle.  NB, it also did not resolve this user's
problem to specify the PCI hole size and activate the relocation, so
this was not necessarily the issue they were fighting, but I noted it
as an apparent gap in this series.  Thanks,

Alex




* Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
  2022-06-22 22:37 ` Alex Williamson
@ 2022-06-22 23:18   ` Joao Martins
  2022-06-23 16:03     ` Alex Williamson
  0 siblings, 1 reply; 32+ messages in thread
From: Joao Martins @ 2022-06-22 23:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, wei.huang2, Dr. David Alan Gilbert

On 6/22/22 23:37, Alex Williamson wrote:
> On Fri, 20 May 2022 11:45:27 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> v4[5] -> v5:
>> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
>> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
>> commit message;
>>
>> ---
>>
>> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
>> particularly when running on AMD systems with an IOMMU.
>>
>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>
>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
>> of the failure:
>>
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>> 	failed to setup container for group 258: memory listener initialization failed:
>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>
>> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>> as documented on the links down below.
>>
>> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
>> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
>> to be above the 1T if the maximum RAM range crosses the HT reserved range.
>> It is organized as following:
>>
>> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>>          address of the 4G boundary
>>
>> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
>> 	     to get accessing to pci_hole64_size. The actual pci-host
>> 	     initialization is kept as is, only the qdev_new.
>>
>> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
>> possible address acrosses the HT region. Errors out if the phys-bits is too
>> low, which is only the case for >=1010G configurations or something that
>> crosses the HT region.
>>
>> patch 5: Ensure valid IOVAs only on new machine types, but not older
>> ones (<= v7.0.0)
>>
>> The 'consequence' of this approach is that we may need more than the default
>> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
>> address, consequently needing 41 phys-bits as opposed to the default of 40
>> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
>> pick the right value of phys-bits (regardless of this series), so we warn in
>> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
>> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
>> old seabios. 
>>
>> Additionally, the reserved region is added to E820 if the relocation is done.
> 
> I was helping a user on irc yesterday who was assigning a bunch of GPUs
> on an AMD system and was not specifying an increased PCI hole and
> therefore was not triggering the relocation.  The result was that the
> VM doesn't know about this special range and given their guest RAM
> size, firmware was mapping GPU BARs overlapping this reserved range
> anyway.  I didn't see any evidence that this user was doing anything
> like booting with pci=nocrs to blatantly ignore the firmware provided
> bus resources.
> 
> To avoid this sort of thing, shouldn't this hypertransport range always
> be marked reserved regardless of whether the relocation is done?
> 
Yeap, I think that's the right thing to do. We were alluding to that in patch 4.

I can switch said patch to IS_AMD() together with a phys-bits check to add the
range to e820.

But in practice, right now, this is going to be merely informative and doesn't
change the outcome, as OVMF ignores reserved ranges if I understood that code
correctly.

Relocation is most effective at avoiding this reserved-range overlap issue
on guests with less than 1010GiB.
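
For reference, the relocation rule and its phys-bits consequence can be sketched as below.
This is a simplified standalone model, not the patch itself: the toy_* names are
hypothetical, and the max used GPA is taken as an input rather than derived from the
hotplug region or pci hole64.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GIB (UINT64_C(1) << 30)
#define TOY_TIB (UINT64_C(1) << 40)
/* Start of the HT reserved range, FD_0000_0000h (1012 GiB). */
#define TOY_AMD_HT_START UINT64_C(0xFD00000000)

/* Last address of a region that starts at start and spans size bytes. */
static uint64_t toy_last_gpa(uint64_t start, uint64_t size)
{
    assert(size > 0);
    return start + size - 1;
}

/* RAM above 4G moves to 1 TiB once the layout would reach the HT range. */
static uint64_t toy_above_4g_mem_start(uint64_t max_used_gpa)
{
    return max_used_gpa >= TOY_AMD_HT_START ? TOY_TIB : 4 * TOY_GIB;
}

/* Minimum number of physical address bits needed to address gpa. */
static unsigned toy_phys_bits_needed(uint64_t gpa)
{
    unsigned bits = 0;

    while (bits < 64 && (gpa >> bits) != 0) {
        bits++;
    }
    return bits;
}
```

For example, 1009 GiB of RAM placed at 4 GiB would end past 1012 GiB and so gets
relocated; placed at 1 TiB instead, its last address needs 41 bits rather than the
default 40.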

> vfio-pci won't generate a fatal error when MMIO mappings fail, so this
> scenario can be rather subtle.  NB, it also did not resolve this user's
> problem to specify the PCI hole size and activate the relocation, so
> this was not necessarily the issue they were fighting, but I noted it
> as an apparent gap in this series.  Thanks,

So I take it that even after the user expanded the PCI hole64 size, and thus
the GPU BARs were placed in a non-reserved range, they still saw the MMIO
mappings fail?

	Joao



* Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
  2022-06-22 23:18   ` Joao Martins
@ 2022-06-23 16:03     ` Alex Williamson
  2022-06-23 17:13       ` Joao Martins
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2022-06-23 16:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, wei.huang2, Dr. David Alan Gilbert

On Thu, 23 Jun 2022 00:18:06 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/22/22 23:37, Alex Williamson wrote:
> > On Fri, 20 May 2022 11:45:27 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> v4[5] -> v5:
> >> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> >> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> >> commit message;
> >>
> >> ---
> >>
> >> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
> >> particularly when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> >> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> >> affected by this extra validation. But AMD systems with IOMMU have a hole in
> >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> >> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> >> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
> >>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> >> of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> >> 	failed to setup container for group 258: memory listener initialization failed:
> >> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
> >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> >> as documented on the links down below.
> >>
> >> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
> >> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
> >> to be above the 1T if the maximum RAM range crosses the HT reserved range.
> >> It is organized as following:
> >>
> >> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
> >>          address of the 4G boundary
> >>
> >> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> >> 	     to get accessing to pci_hole64_size. The actual pci-host
> >> 	     initialization is kept as is, only the qdev_new.
> >>
> >> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
> >> possible address acrosses the HT region. Errors out if the phys-bits is too
> >> low, which is only the case for >=1010G configurations or something that
> >> crosses the HT region.
> >>
> >> patch 5: Ensure valid IOVAs only on new machine types, but not older
> >> ones (<= v7.0.0)
> >>
> >> The 'consequence' of this approach is that we may need more than the default
> >> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
> >> address, consequently needing 41 phys-bits as opposed to the default of 40
> >> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
> >> pick the right value of phys-bits (regardless of this series), so we warn in
> >> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
> >> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
> >> old seabios. 
> >>
> >> Additionally, the reserved region is added to E820 if the relocation is done.  
> > 
> > I was helping a user on irc yesterday who was assigning a bunch of GPUs
> > on an AMD system and was not specifying an increased PCI hole and
> > therefore was not triggering the relocation.  The result was that the
> > VM doesn't know about this special range and given their guest RAM
> > size, firmware was mapping GPU BARs overlapping this reserved range
> > anyway.  I didn't see any evidence that this user was doing anything
> > like booting with pci=nocrs to blatantly ignore the firmware provided
> > bus resources.
> > 
> > To avoid this sort of thing, shouldn't this hypertransport range always
> > be marked reserved regardless of whether the relocation is done?
> >   
> Yeap, I think that's the right thing to do. We were alluding to that in patch 4.
> 
> I can switch said patch to IS_AMD() together with a phys-bits check to add the
> range to e820.
> 
> But in practice, right now, this is going to be merely informative and doesn't
> change the outcome, as OVMF ignores reserved ranges if I understood that code
> correctly.

:-\

> relocation is most effective at avoiding this reserved-range overlapping issue
> on guests with less than a 1010GiB.

Do we need to do the relocation by default?

> > vfio-pci won't generate a fatal error when MMIO mappings fail, so this
> > scenario can be rather subtle.  NB, it also did not resolve this user's
> > problem to specify the PCI hole size and activate the relocation, so
> > this was not necessarily the issue they were fighting, but I noted it
> > as an apparent gap in this series.  Thanks,  
> 
> So I take it that even after the user expanded the PCI hole64 size and thus
> the GPU BARS were placed in a non-reserved range... still saw the MMIO
> mappings fail?

No, the mapping failures are resolved if the hole64 size is set, it's
just that there seem to be remaining issues that a device occasionally
gets into a bad state that isn't resolved by restarting the VM.
AFAICT, p2p mappings are not being used, so the faults were more of a
nuisance than actually contributing to the issues this user is working
through.  Thanks,

Alex




* Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
  2022-06-23 16:03     ` Alex Williamson
@ 2022-06-23 17:13       ` Joao Martins
  0 siblings, 0 replies; 32+ messages in thread
From: Joao Martins @ 2022-06-23 17:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson, Paolo Bonzini,
	Ani Sinha, Marcel Apfelbaum, Igor Mammedov,
	Suravee Suthikulpanit, wei.huang2, Dr. David Alan Gilbert

On 6/23/22 17:03, Alex Williamson wrote:
> On Thu, 23 Jun 2022 00:18:06 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 6/22/22 23:37, Alex Williamson wrote:
>>> On Fri, 20 May 2022 11:45:27 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> v4[5] -> v5:
>>>> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
>>>> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
>>>> commit message;
>>>>
>>>> ---
>>>>
>>>> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
>>>> particularly when running on AMD systems with an IOMMU.
>>>>
>>>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>>>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>>>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>>>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>>>> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>>>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>>>
>>>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>>>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
>>>> of the failure:
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>>>> 	failed to setup container for group 258: memory listener initialization failed:
>>>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>>>
>>>> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
>>>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>>>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>>>> as documented on the links down below.
>>>>
>>>> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
>>>> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
>>>> to be above the 1T if the maximum RAM range crosses the HT reserved range.
>>>> It is organized as following:
>>>>
>>>> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>>>>          address of the 4G boundary
>>>>
>>>> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
>>>> 	     to get accessing to pci_hole64_size. The actual pci-host
>>>> 	     initialization is kept as is, only the qdev_new.
>>>>
>>>> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
>>>> possible address acrosses the HT region. Errors out if the phys-bits is too
>>>> low, which is only the case for >=1010G configurations or something that
>>>> crosses the HT region.
>>>>
>>>> patch 5: Ensure valid IOVAs only on new machine types, but not older
>>>> ones (<= v7.0.0)
>>>>
>>>> The 'consequence' of this approach is that we may need more than the default
>>>> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
>>>> address, consequently needing 41 phys-bits as opposed to the default of 40
>>>> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
>>>> pick the right value of phys-bits (regardless of this series), so we warn in
>>>> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
>>>> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
>>>> old seabios. 
>>>>
>>>> Additionally, the reserved region is added to E820 if the relocation is done.  
>>>
>>> I was helping a user on irc yesterday who was assigning a bunch of GPUs
>>> on an AMD system and was not specifying an increased PCI hole and
>>> therefore was not triggering the relocation.  The result was that the
>>> VM doesn't know about this special range and given their guest RAM
>>> size, firmware was mapping GPU BARs overlapping this reserved range
>>> anyway.  I didn't see any evidence that this user was doing anything
>>> like booting with pci=nocrs to blatantly ignore the firmware provided
>>> bus resources.
>>>
>>> To avoid this sort of thing, shouldn't this hypertransport range always
>>> be marked reserved regardless of whether the relocation is done?
>>>   
>> Yeap, I think that's the right thing to do. We were alluding to that in patch 4.
>>
>> I can switch said patch to IS_AMD() together with a phys-bits check to add the
>> range to e820.
>>
>> But in practice, right now, this is going to be merely informative and doesn't
>> change the outcome, as OVMF ignores reserved ranges if I understood that code
>> correctly.
> 
> :-\
> 
>> relocation is most effective at avoiding this reserved-range overlapping issue
>> on guests with less than a 1010GiB.
> 
> Do we need to do the relocation by default?
> 
Given the dependency on phys-bits > 40 (TCG_PHYS_ADDR_BITS), maybe not.
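
To make the phys-bits dependency concrete, here is a small standalone sketch
(not code from the series; phys_bits_needed() and last_gpa() are hypothetical
names used only for illustration). It counts how many address bits are needed
to reach the last byte of a RAM block, so the default placement at 4 GiB can
be compared with a placement at 1 TiB:

```c
#include <stdint.h>

#define GIB (1ULL << 30)
#define TIB (1ULL << 40)

/* Hypothetical helper: number of address bits needed to reach max_gpa. */
static unsigned phys_bits_needed(uint64_t max_gpa)
{
    unsigned bits = 0;

    while (max_gpa) {
        bits++;
        max_gpa >>= 1;
    }
    return bits;
}

/* Last byte of a RAM block of 'size' bytes placed at 'start' (size > 0). */
static uint64_t last_gpa(uint64_t start, uint64_t size)
{
    return start + size - 1;
}
```

Under these assumptions, 1010 GiB placed at 4 GiB still ends below 1 TiB and
fits in 40 bits, while the same block placed at 1 TiB needs 41 bits.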

Plus this might not be the common case, considering that it is restricted to VMs that have
close to 1TB of memory (say, 768GB or more) *and* have VFs attached whose BARs are big
enough to cross the 1010G..1T reserved region.

... Unless we could get an idea of how much surplus the PCI hole64 size will need (based
on device BAR sizes) to understand if it's enough, and relocate based on that. Albeit in
QEMU, vfio comes at a late stage versus the memmap construction.
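
The relocation condition discussed above can be sketched as a plain interval
overlap test against the HyperTransport range quoted in the cover letter
(FD_0000_0000h - FF_FFFF_FFFFh). This is a simplified illustration, not the
logic of patch 4: crosses_ht_range() and pick_above_4g_start() are
hypothetical names, and the sketch ignores the AMD vendor check, alignment,
SGX/CXL regions and the phys-bits validation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GIB (1ULL << 30)
#define TIB (1ULL << 40)

/* HyperTransport reserved range, as given in the cover letter. */
#define HT_START 0xFD00000000ULL
#define HT_END   0xFFFFFFFFFFULL

/* True if the inclusive range [start, end] overlaps the HT range. */
static bool crosses_ht_range(uint64_t start, uint64_t end)
{
    assert(start <= end);
    return start <= HT_END && end >= HT_START;
}

/*
 * Hypothetical policy: keep RAM above 4G at its default start unless the
 * RAM plus whatever follows it (extra_size, e.g. hotplug memory and the
 * 64-bit PCI hole) would reach into the HT range; in that case start at 1T.
 */
static uint64_t pick_above_4g_start(uint64_t above_4g_size,
                                    uint64_t extra_size)
{
    uint64_t start = 4 * GIB;
    uint64_t total = above_4g_size + extra_size;

    if (total == 0) {
        return start;
    }
    return crosses_ht_range(start, start + total - 1) ? TIB : start;
}
```

With these assumptions, a guest with 768 GiB above 4G and a 64 GiB hole stays
at 4 GiB, while the same guest with a 256 GiB hole would be relocated, which
is the "big BARs" case mentioned above.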

>>> vfio-pci won't generate a fatal error when MMIO mappings fail, so this
>>> scenario can be rather subtle.  NB, it also did not resolve this user's
>>> problem to specify the PCI hole size and activate the relocation, so
>>> this was not necessarily the issue they were fighting, but I noted it
>>> as an apparent gap in this series.  Thanks,  
>>
>> So I take it that even after the user expanded the PCI hole64 size and thus
>> the GPU BARS were placed in a non-reserved range... still saw the MMIO
>> mappings fail?
> 
> No, the mapping failures are resolved if the hole64 size is set, it's
> just that there seem to be remaining issues that a device occasionally
> gets into a bad state that isn't resolved by restarting the VM.

/me nods

> AFAICT, p2p mappings are not being used, so the faults were more of a
> nuisance than actually contributing to the issues this user is working
> through.  Thanks

Ah OK -- thanks for enlightening



* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-20 18:13               ` Joao Martins
@ 2022-06-28 12:38                 ` Igor Mammedov
  2022-06-28 15:27                   ` Joao Martins
  0 siblings, 1 reply; 32+ messages in thread
From: Igor Mammedov @ 2022-06-28 12:38 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit

On Mon, 20 Jun 2022 19:13:46 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/20/22 17:36, Joao Martins wrote:
> > On 6/20/22 15:27, Igor Mammedov wrote:  
> >> On Fri, 17 Jun 2022 14:33:02 +0100
> >> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>> On 6/17/22 13:32, Igor Mammedov wrote:  
> >>>> On Fri, 17 Jun 2022 13:18:38 +0100
> >>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>> On 6/16/22 15:23, Igor Mammedov wrote:    
> >>>>>> On Fri, 20 May 2022 11:45:31 +0100
> >>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>>>> +                                hwaddr above_4g_mem_start,
> >>>>>>> +                                uint64_t pci_hole64_size)
> >>>>>>> +{
> >>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>> +    MachineState *machine = MACHINE(pcms);
> >>>>>>> +    ram_addr_t device_mem_size = 0;
> >>>>>>> +    hwaddr base;
> >>>>>>> +
> >>>>>>> +    if (!x86ms->above_4g_mem_size) {
> >>>>>>> +       /*
> >>>>>>> +        * 32-bit pci hole goes from
> >>>>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >>>>>>> +        */
> >>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;      
> >>>>>>
> >>>>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
> >>>>>> that's located above it.
> >>>>>>       
> >>>>>
> >>>>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
> >>>>> otherwise. We won't hit the 1T hole, hence a nop.    
> >>>>
> >>>> I don't get the reasoning, can you clarify it pls?
> >>>>     
> >>>
> >>> I was trying to say that what lead me here was a couple of qtests failures (from v3->v4).
> >>>
> >>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
> >>> of the test failures, and pci-hole64 sits above what 32-bit can reference.  
> >>
> >> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
> >> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
> >>
> >> and this doesn't look to me as AMD specific issue
> >>
> >> perhaps do a phys-bits check as a separate patch
> >> that will error out if max_used_gpa is above phys-bits limit
> >> (maybe at machine_done time)
> >> (i.e. defining max_gpa and checking if compatible with configured cpu
> >> are 2 different things)
> >>
> >> (it might be possible that tests need to be fixed too to account for it)
> >>  
> > 
> > My old notes (from v3) tell me with such a check these tests were exiting early thanks to
> > that error:
> > 
> >  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
> >   killed by signal 6 SIGABRT
> >  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
> >   killed by signal 6 SIGABRT
> >  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
> >   killed by signal 6 SIGABRT
> > 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
> >   killed by signal 6 SIGABRT
> > 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
> >   killed by signal 6 SIGABRT
> > 
> > But the real reason these fail is not at all related to CPU phys bits,
> > but because we just don't handle the case where no pci_hole64 is supposed to exist (which
> > is what that other check is trying to do) e.g. A VM with -m 1G would
> > observe the same thing i.e. the computations after that conditional are all for the pci
> > hole64, which acounts for SGX/CXL/hotplug or etc which consequently means it's *errousnly*
> > bigger than phys-bits=32 (by definition). So the error_report is just telling me that
> > pc_max_used_gpa() is just incorrect without the !x86ms->above_4g_mem_size check.
> > 
> > If you're not fond of:
> > 
> > +    if (!x86ms->above_4g_mem_size) {
> > +       /*
> > +        * 32-bit pci hole goes from
> > +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> > +         */
> > +        return IO_APIC_DEFAULT_ADDRESS - 1;
> > +    }
> > 
> > Then what should I use instead of the above?
> > 
> > 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
> > also what is used for i440fx/q35 code. I could move it to a macro (e.g.
> > PCI_HOST_HOLE32_SIZE) to make it a bit readable and less hardcoded. Or
> > perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should check
> > in addition for hotplug/CXL/etc existence?
> >   
> >>>>>  Unless we plan on using
> >>>>> pc_max_used_gpa() for something else other than this.    
> >>>>
> >>>> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
> >>>> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
> >>>> large enough on CLI.
> >>>>     
> >>> So hotpluggable memory seems to assume it sits above 4g mem.
> >>>
> >>> pci_hole64 likewise as it uses similar computations as hotplug.
> >>>
> >>> Unless I am misunderstanding something here.
> >>>  
> >>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
> >>>>     
> >>> I think this was what I had before (v3[0]) and did not work.  
> >>
> >> that had been tied to host's phys-bits directly, all in one patch
> >> and duplicating existing pc_pci_hole64_start().
> >>    
> > 
> > Duplicating was sort of my bad attempt in this patch for pc_max_used_gpa()
> > 
> > I was sort of thinking to something like extracting calls to start + size "tuple" into
> > functions -- e.g. for hotplug it is pc_get_device_memory_range() and for CXL it would be
> > maybe pc_get_cxl_range()) -- rather than assuming those values are already initialized on
> > the memory-region @base and its size.
> > 
> > See snippet below. Note I am missing CXL handling, but gives you the idea.
> > 
> > But it is slightly more complex than what I had in this version :( and would require
> > anyone doing changes in pc_memory_init() and pc_pci_hole64_start() to make sure it follows
> > the similar logic.
> >   
> 
> Ignore previous snippet, here's a slightly cleaner version:

let's go with this version

> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 8eaa32ee2106..1d97c77a5eac 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -803,6 +803,43 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
> 
> +static void pc_get_device_memory_range(PCMachineState *pcms,
> +                                       hwaddr *base,
> +                                       hwaddr *device_mem_size)
> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    MachineState *machine = MACHINE(pcms);
> +    hwaddr addr, size;
> +
> +    if (pcmc->has_reserved_memory &&
> +        machine->device_memory && machine->device_memory->base) {
> +        addr = machine->device_memory->base;
> +        size = memory_region_size(&machine->device_memory->mr);
> +        goto out;
> +    }
> +
> +    /* uninitialized memory region */
> +    size = machine->maxram_size - machine->ram_size;
> +
> +    if (pcms->sgx_epc.size != 0) {
> +        addr = sgx_epc_above_4g_end(&pcms->sgx_epc);
> +    } else {
> +        addr = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> +    }
> +
> +    if (pcmc->enforce_aligned_dimm) {
> +        /* size device region assuming 1G page max alignment per slot */
> +        size += (1 * GiB) * machine->ram_slots;
> +    }
> +
> +out:
> +    if (base)
> +        *base = addr;
> +    if (device_mem_size)
> +        *device_mem_size = size;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -864,7 +901,7 @@ void pc_memory_init(PCMachineState *pcms,
>      /* initialize device memory address space */
>      if (pcmc->has_reserved_memory &&
>          (machine->ram_size < machine->maxram_size)) {
> -        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
> +        ram_addr_t device_mem_size;
> 
>          if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
>              error_report("unsupported amount of memory slots: %"PRIu64,
> @@ -879,20 +916,7 @@ void pc_memory_init(PCMachineState *pcms,
>              exit(EXIT_FAILURE);
>          }
> 
> -        if (pcms->sgx_epc.size != 0) {
> -            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
> -        } else {
> -            machine->device_memory->base =
> -                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> -        }
> -
> -        machine->device_memory->base =
> -            ROUND_UP(machine->device_memory->base, 1 * GiB);
> -
> -        if (pcmc->enforce_aligned_dimm) {
> -            /* size device region assuming 1G page max alignment per slot */
> -            device_mem_size += (1 * GiB) * machine->ram_slots;
> -        }
> +        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);
> 
>          if ((machine->device_memory->base + device_mem_size) <
>              device_mem_size) {
> @@ -965,12 +989,13 @@ uint64_t pc_pci_hole64_start(void)
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      MachineState *ms = MACHINE(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
> -    uint64_t hole64_start = 0;
> +    uint64_t hole64_start = 0, size = 0;
> 
> -    if (pcmc->has_reserved_memory && ms->device_memory->base) {
> -        hole64_start = ms->device_memory->base;
> +    if (pcmc->has_reserved_memory &&
> +        (ms->ram_size < ms->maxram_size)) {
> +        pc_get_device_memory_range(pcms, &hole64_start, &size);
>          if (!pcmc->broken_reserved_end) {
> -            hole64_start += memory_region_size(&ms->device_memory->mr);
> +            hole64_start += size;
>          }
>      } else if (pcms->sgx_epc.size != 0) {
>              hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
> 




* Re: [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable
  2022-06-28 12:38                 ` Igor Mammedov
@ 2022-06-28 15:27                   ` Joao Martins
  0 siblings, 0 replies; 32+ messages in thread
From: Joao Martins @ 2022-06-28 15:27 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Daniel Jordan, David Edmondson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Suravee Suthikulpanit



On 6/28/22 13:38, Igor Mammedov wrote:
> On Mon, 20 Jun 2022 19:13:46 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 6/20/22 17:36, Joao Martins wrote:
>>> On 6/20/22 15:27, Igor Mammedov wrote:  
>>>> On Fri, 17 Jun 2022 14:33:02 +0100
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 6/17/22 13:32, Igor Mammedov wrote:  
>>>>>> On Fri, 17 Jun 2022 13:18:38 +0100
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>> On 6/16/22 15:23, Igor Mammedov wrote:    
>>>>>>>> On Fri, 20 May 2022 11:45:31 +0100
>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>>>> +                                hwaddr above_4g_mem_start,
>>>>>>>>> +                                uint64_t pci_hole64_size)
>>>>>>>>> +{
>>>>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>> +    MachineState *machine = MACHINE(pcms);
>>>>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>>>>> +    hwaddr base;
>>>>>>>>> +
>>>>>>>>> +    if (!x86ms->above_4g_mem_size) {
>>>>>>>>> +       /*
>>>>>>>>> +        * 32-bit pci hole goes from
>>>>>>>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>>>>>>> +        */
>>>>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;      
>>>>>>>>
>>>>>>>> lack of above_4g_mem, doesn't mean absence of device_mem_size or anything else
>>>>>>>> that's located above it.
>>>>>>>>       
>>>>>>>
>>>>>>> True. But the intent is to fix 32-bit boundaries as one of the qtests was failing
>>>>>>> otherwise. We won't hit the 1T hole, hence a nop.    
>>>>>>
>>>>>> I don't get the reasoning, can you clarify it pls?
>>>>>>     
>>>>>
>>>>> I was trying to say that what lead me here was a couple of qtests failures (from v3->v4).
>>>>>
>>>>> I was doing this before based on pci_hole64. phys-bits=32 was for example one
>>>>> of the test failures, and pci-hole64 sits above what 32-bit can reference.  
>>>>
>>>> if user sets phys-bits=32, then nothing above 4Gb should work (be usable)
>>>> (including above-4g-ram, hotplug region or pci64 hole or sgx or cxl)
>>>>
>>>> and this doesn't look to me as AMD specific issue
>>>>
>>>> perhaps do a phys-bits check as a separate patch
>>>> that will error out if max_used_gpa is above phys-bits limit
>>>> (maybe at machine_done time)
>>>> (i.e. defining max_gpa and checking if compatible with configured cpu
>>>> are 2 different things)
>>>>
>>>> (it might be possible that tests need to be fixed too to account for it)
>>>>  
>>>
>>> My old notes (from v3) tell me with such a check these tests were exiting early thanks to
>>> that error:
>>>
>>>  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
>>>   killed by signal 6 SIGABRT
>>>  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
>>>   killed by signal 6 SIGABRT
>>>  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
>>>   killed by signal 6 SIGABRT
>>> 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
>>>   killed by signal 6 SIGABRT
>>> 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
>>>   killed by signal 6 SIGABRT
>>>
>>> But the real reason these fail is not at all related to CPU phys bits,
>>> but because we just don't handle the case where no pci_hole64 is supposed to exist (which
>>> is what that other check is trying to do) e.g. A VM with -m 1G would
>>> observe the same thing i.e. the computations after that conditional are all for the pci
>>> hole64, which acounts for SGX/CXL/hotplug or etc which consequently means it's *errousnly*
>>> bigger than phys-bits=32 (by definition). So the error_report is just telling me that
>>> pc_max_used_gpa() is just incorrect without the !x86ms->above_4g_mem_size check.
>>>
>>> If you're not fond of:
>>>
>>> +    if (!x86ms->above_4g_mem_size) {
>>> +       /*
>>> +        * 32-bit pci hole goes from
>>> +        * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>> +         */
>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>>> +    }
>>>
>>> Then what should I use instead of the above?
>>>
>>> 'IO_APIC_DEFAULT_ADDRESS - 1' is the size of the 32-bit PCI hole, which is
>>> also what is used for i440fx/q35 code. I could move it to a macro (e.g.
>>> PCI_HOST_HOLE32_SIZE) to make it a bit readable and less hardcoded. Or
>>> perhaps your problem is on !x86ms->above_4g_mem_size and maybe I should check
>>> in addition for hotplug/CXL/etc existence?
>>>   
>>>>>>>  Unless we plan on using
>>>>>>> pc_max_used_gpa() for something else other than this.    
>>>>>>
>>>>>> Even if '!above_4g_mem_sizem', we can still have hotpluggable memory region
>>>>>> present and that can  hit 1Tb. The same goes for pci64_hole if it's configured
>>>>>> large enough on CLI.
>>>>>>     
>>>>> So hotpluggable memory seems to assume it sits above 4g mem.
>>>>>
>>>>> pci_hole64 likewise as it uses similar computations as hotplug.
>>>>>
>>>>> Unless I am misunderstanding something here.
>>>>>  
>>>>>> Looks like guesstimate we could use is taking pci64_hole_end as max used GPA
>>>>>>     
>>>>> I think this was what I had before (v3[0]) and did not work.  
>>>>
>>>> that had been tied to host's phys-bits directly, all in one patch
>>>> and duplicating existing pc_pci_hole64_start().
>>>>    
>>>
>>> Duplicating was sort of my bad attempt in this patch for pc_max_used_gpa()
>>>
>>> I was sort of thinking to something like extracting calls to start + size "tuple" into
>>> functions -- e.g. for hotplug it is pc_get_device_memory_range() and for CXL it would be
>>> maybe pc_get_cxl_range()) -- rather than assuming those values are already initialized on
>>> the memory-region @base and its size.
>>>
>>> See snippet below. Note I am missing CXL handling, but gives you the idea.
>>>
>>> But it is slightly more complex than what I had in this version :( and would require
>>> anyone doing changes in pc_memory_init() and pc_pci_hole64_start() to make sure it follows
>>> the similar logic.
>>>   
>>
>> Ignore previous snippet, here's a slightly cleaner version:
> 
> let's go with this version
> 

OK. I have split it into 5 new patches:

578f551a41f0 i386/pc: handle unitialized mr in pc_get_cxl_range_end()
49256313cfd9 i386/pc: factor out cxl range start to helper
4bc1836bd588 i386/pc: factor out cxl range end to helper
e83cc9d3081c i386/pc: factor out device_memory base/size to helper
1ccb5064338e i386/pc: factor out above-4g end to an helper

Will re-test and respin the series.

The core of the series (in this patch) doesn't change and just gets simpler.
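
As a rough picture of what the "factor out ... end to helper" patches enable,
the sketch below chains per-region end helpers to derive the highest used
guest physical address without relying on already-initialized memory regions.
It is an illustration under stated assumptions, not the code from the respin:
struct layout and every function name here are hypothetical, CXL and SGX are
left out, and only the 1 GiB rounding of the device memory base is taken from
the diff quoted below.

```c
#include <stdint.h>

#define GIB (1ULL << 30)

/* Hypothetical, flattened view of the regions above 4G; not QEMU's types. */
struct layout {
    uint64_t above_4g_start;   /* 4 GiB by default, 1 TiB if relocated */
    uint64_t above_4g_size;    /* RAM placed above the 4G boundary */
    uint64_t device_mem_size;  /* hotplug region size, 0 if none */
    uint64_t hole64_size;      /* 64-bit PCI hole size */
};

static uint64_t round_up(uint64_t value, uint64_t align)
{
    return (value + align - 1) / align * align;
}

/* End (exclusive) of RAM above 4G. */
static uint64_t above_4g_end(const struct layout *l)
{
    return l->above_4g_start + l->above_4g_size;
}

/* End (exclusive) of the hotplug region, whose base is 1 GiB aligned. */
static uint64_t device_mem_end(const struct layout *l)
{
    uint64_t base = above_4g_end(l);

    if (l->device_mem_size == 0) {
        return base;
    }
    return round_up(base, GIB) + l->device_mem_size;
}

/* Highest address used once the 64-bit PCI hole is stacked on top. */
static uint64_t max_used_gpa(const struct layout *l)
{
    return device_mem_end(l) + l->hole64_size - 1;
}
```

Because each helper computes its range from sizes rather than from a memory
region's @base, the result is the same whether it is called before or after
pc_memory_init() has set the regions up.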

>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 8eaa32ee2106..1d97c77a5eac 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -803,6 +803,43 @@ void xen_load_linux(PCMachineState *pcms)
>>  #define PC_ROM_ALIGN       0x800
>>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>>
>> +static void pc_get_device_memory_range(PCMachineState *pcms,
>> +                                       hwaddr *base,
>> +                                       hwaddr *device_mem_size)
>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    MachineState *machine = MACHINE(pcms);
>> +    hwaddr addr, size;
>> +
>> +    if (pcmc->has_reserved_memory &&
>> +        machine->device_memory && machine->device_memory->base) {
>> +        addr = machine->device_memory->base;
>> +        size = memory_region_size(&machine->device_memory->mr);
>> +        goto out;
>> +    }
>> +
>> +    /* uninitialized memory region */
>> +    size = machine->maxram_size - machine->ram_size;
>> +
>> +    if (pcms->sgx_epc.size != 0) {
>> +        addr = sgx_epc_above_4g_end(&pcms->sgx_epc);
>> +    } else {
>> +        addr = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>> +    }
>> +
>> +    if (pcmc->enforce_aligned_dimm) {
>> +        /* size device region assuming 1G page max alignment per slot */
>> +        size += (1 * GiB) * machine->ram_slots;
>> +    }
>> +
>> +out:
>> +    if (base)
>> +        *base = addr;
>> +    if (device_mem_size)
>> +        *device_mem_size = size;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -864,7 +901,7 @@ void pc_memory_init(PCMachineState *pcms,
>>      /* initialize device memory address space */
>>      if (pcmc->has_reserved_memory &&
>>          (machine->ram_size < machine->maxram_size)) {
>> -        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
>> +        ram_addr_t device_mem_size;
>>
>>          if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
>>              error_report("unsupported amount of memory slots: %"PRIu64,
>> @@ -879,20 +916,7 @@ void pc_memory_init(PCMachineState *pcms,
>>              exit(EXIT_FAILURE);
>>          }
>>
>> -        if (pcms->sgx_epc.size != 0) {
>> -            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
>> -        } else {
>> -            machine->device_memory->base =
>> -                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>> -        }
>> -
>> -        machine->device_memory->base =
>> -            ROUND_UP(machine->device_memory->base, 1 * GiB);
>> -
>> -        if (pcmc->enforce_aligned_dimm) {
>> -            /* size device region assuming 1G page max alignment per slot */
>> -            device_mem_size += (1 * GiB) * machine->ram_slots;
>> -        }
>> +        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);
>>
>>          if ((machine->device_memory->base + device_mem_size) <
>>              device_mem_size) {
>> @@ -965,12 +989,13 @@ uint64_t pc_pci_hole64_start(void)
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      MachineState *ms = MACHINE(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>> -    uint64_t hole64_start = 0;
>> +    uint64_t hole64_start = 0, size = 0;
>>
>> -    if (pcmc->has_reserved_memory && ms->device_memory->base) {
>> -        hole64_start = ms->device_memory->base;
>> +    if (pcmc->has_reserved_memory &&
>> +        (ms->ram_size < ms->maxram_size)) {
>> +        pc_get_device_memory_range(pcms, &hole64_start, &size);
>>          if (!pcmc->broken_reserved_end) {
>> -            hole64_start += memory_region_size(&ms->device_memory->mr);
>> +            hole64_start += size;
>>          }
>>      } else if (pcms->sgx_epc.size != 0) {
>>              hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
>>
> 




Thread overview: 32+ messages
2022-05-20 10:45 [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
2022-05-20 10:45 ` [PATCH v5 1/5] hw/i386: add 4g boundary start to X86MachineState Joao Martins
2022-06-16 13:05   ` Igor Mammedov
2022-06-17 10:57     ` Joao Martins
2022-05-20 10:45 ` [PATCH v5 2/5] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
2022-06-16 13:21   ` Reviewed-by: Igor Mammedov
2022-06-17 11:03     ` Joao Martins
2022-06-20  7:12     ` Mark Cave-Ayland
2022-05-20 10:45 ` [PATCH v5 3/5] i386/pc: pass pci_hole64_size " Joao Martins
2022-06-16 13:30   ` Igor Mammedov
2022-06-16 14:16     ` Michael S. Tsirkin
2022-06-17 11:13     ` Joao Martins
2022-06-17 11:58       ` Igor Mammedov
2022-05-20 10:45 ` [PATCH v5 4/5] i386/pc: relocate 4g start to 1T where applicable Joao Martins
2022-06-16 14:23   ` Igor Mammedov
2022-06-17 12:18     ` Joao Martins
2022-06-17 12:32       ` Igor Mammedov
2022-06-17 13:33         ` Joao Martins
2022-06-20 14:27           ` Igor Mammedov
2022-06-20 16:36             ` Joao Martins
2022-06-20 18:13               ` Joao Martins
2022-06-28 12:38                 ` Igor Mammedov
2022-06-28 15:27                   ` Joao Martins
2022-06-17 16:12       ` Joao Martins
2022-05-20 10:45 ` [PATCH v5 5/5] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
2022-06-16 14:27   ` Igor Mammedov
2022-06-17 13:36     ` Joao Martins
2022-06-08 10:37 ` [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
2022-06-22 22:37 ` Alex Williamson
2022-06-22 23:18   ` Joao Martins
2022-06-23 16:03     ` Alex Williamson
2022-06-23 17:13       ` Joao Martins
