* [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
@ 2022-07-01 16:10 Joao Martins
  2022-07-01 16:10 ` [PATCH v6 01/10] hw/i386: add 4g boundary start to X86MachineState Joao Martins
                   ` (9 more replies)
  0 siblings, 10 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins, Jonathan Cameron

v5[6] -> v6:
* Rebased to latest staging
* Consider @cxl_base setting to also use above_4g_mem_start (Igor Mammedov)
* Use 4 * GiB instead of raw hex (Igor Mammedov)
* Delete @host_type (Igor Mammedov)
* Rename to i440fx_dev to i440fx_host (Igor Mammedov)
* Rebase on top of patch that removes i440fx_state (Mark Cave-Ayland)
* Add Reviewed-by from Igor in patches 1-3 (Igor Mammedov)
* Fix commit messages typos (Igor Mammedov)
* Move IS_AMD_CPU() call into caller i.e. pc_memory_init() (Igor Mammedov)
* Rename x86_max_phys_addr into pc_max_used_gpa (Igor Mammedov)
* Rename x86_update_above_4g_mem_start into pc_set_amd_above_4g_mem_start (Igor Mammedov)
* Rework how we calculate the pc_max_used_gpa() to use pc_pci_hole64_start() instead.
  This led to refactoring a bunch of code into separate helpers that handle the case
  where memory regions aren't yet initialized, while streamlining how calculations
  are done at pc_memory_init() and pc_pci_hole64_start().
  This led to new patches 4-8 in v5 (Igor Mammedov)
  CC'ing Jonathan Cameron on the CXL-related memory init refactoring patches (5-8).
* Always add the HyperTransport range into e820 even when the relocation isn't
  done *and* there's >= 40 phys bits that would put the max physical boundary at 1T
  (Alex Williamson)
  This should allow virtual firmware to avoid the reserved range at the
  1T boundary on VFs with big BARs.

---

This series lets QEMU spawn i386 guests with >= 1010G of RAM with VFIO,
particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA in the DMA_MAP ioctl is valid
and returns -EINVAL when it is not. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with IOMMU have a hole at
the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.

VFIO DMA_MAP calls in this IOVA address range fail this check and hence return
-EINVAL, consequently failing the creation of guests bigger than 1010G. Example
of the failure:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
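
For illustration, the rejection above boils down to a range-overlap test
against the HT reserved window. The sketch below is not the kernel's VFIO
code: the helper name and the macros are made up for this example, and only
the two boundary addresses come from the range quoted from the IOMMU manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* HyperTransport reserved range quoted above (inclusive bounds). */
#define AMD_HT_START 0xfd00000000ULL
#define AMD_HT_END   0xffffffffffULL

/*
 * Hypothetical helper: does [iova, iova + size) touch the HT range?
 * Assumes iova + size does not wrap around 64 bits.
 */
static bool iova_overlaps_ht(uint64_t iova, uint64_t size)
{
    if (size == 0) {
        return false;
    }

    uint64_t last = iova + size - 1;

    return iova <= AMD_HT_END && last >= AMD_HT_START;
}
```

With the values from the failing vfio_dma_map() call above (iova 0x100000000,
size 0xff30000000), the mapping ends a little past 1T, so it covers the whole
reserved range and the helper returns true.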

Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
as documented on the links down below.

This small series tries to address that by dealing with this AMD-specific 1TB hole,
but rather than handling it like the 4G hole, it instead relocates RAM above 4G
to be above 1T if the maximum RAM range crosses the HT reserved range.
It is organized as follows:

patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
         address of the 4G boundary

patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
	     to get access to pci_hole64_size. The actual pci-host
	     initialization is kept as is; only the qdev_new() call moves.

patch 4: Small deduplication cleanup that was spread around pc

patches 5-8: Make pc_pci_hole64_start() be callable before pc_memory_init()
             initializes any memory regions. This way, the returned value
	     is consistent and we don't need to duplicate the same
	     calculations when detecting whether the relocation is needed.

patch 9: Change @above_4g_mem_start to 1TiB if we are on AMD and the max
possible address crosses the HT region. Errors out if the phys-bits is too
low, which is only the case for >=1010G configurations or something that
crosses the HT region.

patch 10: Ensure valid IOVAs only on new machine types, but not older
ones (<= v7.0.0)
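
As a rough sketch of the decision described for patch 9 (not the code in the
patch itself): the function name and parameters below are hypothetical, and
the max-used-address estimate is deliberately simplified to RAM above 4G plus
the 64-bit PCI hole, ignoring device memory, SGX EPC, CXL and alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)
#define TiB (1ULL << 40)
#define AMD_HT_START 0xfd00000000ULL

/*
 * Hypothetical stand-in for the relocation decision: returns the start
 * address to use for RAM above 4G.
 */
static uint64_t pick_above_4g_mem_start(bool amd_vcpu,
                                        uint64_t above_4g_mem_size,
                                        uint64_t pci_hole64_size)
{
    uint64_t start = 4 * GiB;
    /* Simplified estimate of the highest guest physical address in use. */
    uint64_t max_used_gpa = start + above_4g_mem_size + pci_hole64_size - 1;

    /* Only relocate when an AMD vCPU would otherwise reach the HT range. */
    if (amd_vcpu && max_used_gpa >= AMD_HT_START) {
        start = 1 * TiB;
    }

    return start;
}
```

For example, assuming a split that leaves 2G of RAM below 4G, a 1024G guest
has 1022G above 4G, so the sketch returns 1T; a guest with 100G above 4G and
a 32G hole stays at 4G.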

The 'consequence' of this approach is that we may need more than the default
phys-bits, e.g. a guest with >1010G will have most of its RAM after the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40
(TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
pick the right value of phys-bits (regardless of this series), so we warn in
case phys-bits aren't enough. Finally, CMOS loses its meaning for the above-4G
RAM blocks, but it was mentioned in the RFC that CMOS is only useful for very
old SeaBIOS.
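
To make the 40 vs 41 phys-bits arithmetic concrete, here is a small sketch.
The helper is hypothetical (it is not part of the series) and simply counts
how many address bits are needed to reach a given highest guest physical
address.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: minimum number of physical address bits needed
 * to address max_gpa (the highest guest physical address in use).
 */
static unsigned phys_bits_needed(uint64_t max_gpa)
{
    unsigned bits = 0;

    /* Count the position of the highest set bit. */
    while (max_gpa != 0) {
        bits++;
        max_gpa >>= 1;
    }

    return bits;
}
```

Anything that ends below 1T (highest address (1 << 40) - 1) fits in 40 bits,
while RAM relocated to start at 1T has its very first byte at address
1 << 40, which already needs 41.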

Additionally, the reserved region is added to E820 if the relocation is done.

Alternative options considered (in RFC[0]):

a) Dealing with the 1T hole like the 4G hole -- which also closely represents
what hardware does.

Thanks,
	Joao

Older Changelog,

v4[5] -> v5:
* Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
* Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
commit message;

v3[4] -> v4[5]:
(changes in patch 4 and 5 only)
* Rebased to 7.1.0, hence move compat machine attribute to <= 7.0.0 versions
* Check guest vCPU vendor rather than host CPU vendor (Michael Tsirkin)
* Squash previous patch 5 into patch 4 to tie in the phys-bits check
  into the relocate-4g-start logic: We now error out if the phys-bits
  aren't enough on configurations that require above-4g ram relocation. (Michael Tsirkin)
* Make the error message more explicit when phys-bits isn't enough to also
  mention: "cannot avoid AMD HT range"
* Add comments inside x86_update_above_4g_mem_start() explaining the
  logic behind it. (Michael Tsirkin)
* Tested on old guests with Linux 2.6.32/3.10/4.14.35/4.1 based kernels
  alongside Win2008/2K12/2K16/2K19 on configs spanning 1T and 2T (Michael Tsirkin)
  Validated -numa topologies too as well as making sure qtests observe no regressions;

 Notes from v4:

* the machine attribute that enables this new logic (see last patch)
is called ::enforce_valid_iova since the RFC. Let me know if folks think it
is poorly named, and whether something a bit more obvious is preferred
(e.g. ::amd_relocate_1t).

* @mst one of the comments you said was to add "host checks" in vdpa/vfio devices.
In discussion with Alex and you over the last version of the patches it seems
that we weren't keen on making this device-specific or behind any machine
property flags (besides machine-compat). Just to reiterate there, making sure we do
the above-4g relocation requiring properly sized phys-bits and AMD as vCPU
vendor (as this series) already ensures that this is going to be right for
offending configuration with VDPA/VFIO device that might be
configured/hotplugged. Unless you were thinking that somehow vfio/vdpa devices
start poking into machine-specific details when we fail to relocate due to the
lack of phys-bits? Otherwise QEMU just doesn't have enough information to tell
what's a valid IOVA or not, in which case kernel vhost-iotlb/vhost-vdpa is the one
that needs fixing (as VFIO did in v5.4).

RFCv2[3] -> v3[4]:

* Add missing brackets in single line statement, in patch 5 (David)
* Change ranges printf to use PRIx64, in patch 5 (David)
* Move the check to after changing above_4g_mem_start, in patch 5 (David)
* Make the check generic and move it to pc_memory_init rather being specific
to AMD, as the check is useful to capture invalid phys-bits
configs (patch 5, Igor).
* Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor)
* Consider pci_hole64_size in patch 4 (Igor)
* To consider pci_hole64_size in max used addr we need to get it from pci-host,
so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") or
qdev_new("q35") to be before pc_memory_init().
* Consider sgx_epc.size in max used address, in patch 4 (Igor)
* Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor)
* Keep warn_report() in patch 5, as erroring out will break a few x86_64 qtests
due to pci_hole64 accounting surpassing the maxphysaddr that phys-bits allows.

RFC[0] -> RFCv2[3]:

* At Igor's suggestion in one of the patches I reworked the series entirely,
and more or less as he was thinking, it is far simpler to relocate the
ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
and less intrusive. (patch 1 & 2)
* Check phys-bits is big enough prior to relocating (new patch 3)
* Remove the machine property; it's now internal only and set by the new machine
version (Igor, patch 4).
* Clarify whether it's GPA or HPA to make the meaning clearer (Igor, patch 2)
* Add IOMMU SDM in the commit message (Igor, patch 2)

[0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.martins@oracle.com/
[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
[3] https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.martins@oracle.com/T/#u
[4] https://lore.kernel.org/all/20220223184455.9057-1-joao.m.martins@oracle.com/
[5] https://lore.kernel.org/qemu-devel/20220420201138.23854-1-joao.m.martins@oracle.com/
[6] https://lore.kernel.org/qemu-devel/20220520104532.9816-1-joao.m.martins@oracle.com/

Joao Martins (10):
  hw/i386: add 4g boundary start to X86MachineState
  i386/pc: create pci-host qdev prior to pc_memory_init()
  i386/pc: pass pci_hole64_size to pc_memory_init()
  i386/pc: factor out above-4g end to an helper
  i386/pc: factor out cxl range end to helper
  i386/pc: factor out cxl range start to helper
  i386/pc: handle unitialized mr in pc_get_cxl_range_end()
  i386/pc: factor out device_memory base/size to helper
  i386/pc: relocate 4g start to 1T where applicable
  i386/pc: restrict AMD only enforcing of valid IOVAs to new machine
    type

 hw/i386/acpi-build.c         |   2 +-
 hw/i386/pc.c                 | 257 ++++++++++++++++++++++++++++-------
 hw/i386/pc_piix.c            |  14 +-
 hw/i386/pc_q35.c             |  14 +-
 hw/i386/sgx.c                |   2 +-
 hw/i386/x86.c                |   1 +
 hw/pci-host/i440fx.c         |  12 +-
 include/hw/i386/pc.h         |   4 +-
 include/hw/i386/x86.h        |   3 +
 include/hw/pci-host/i440fx.h |   4 +-
 10 files changed, 254 insertions(+), 59 deletions(-)

-- 
2.17.2




* [PATCH v6 01/10] hw/i386: add 4g boundary start to X86MachineState
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-01 16:10 ` [PATCH v6 02/10] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins

Rather than hardcoding the 4G boundary everywhere, introduce a
X86MachineState field @above_4g_mem_start and use it
accordingly.

This is in preparation for relocating ram-above-4g to
dynamically start at 1T on AMD platforms.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
 hw/i386/acpi-build.c  |  2 +-
 hw/i386/pc.c          | 11 ++++++-----
 hw/i386/sgx.c         |  2 +-
 hw/i386/x86.c         |  1 +
 include/hw/i386/x86.h |  3 +++
 5 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index cad6f5ac41e9..0355bd3ddaad 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2024,7 +2024,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
                 build_srat_memory(table_data, mem_base, mem_len, i - 1,
                                   MEM_AFFINITY_ENABLED);
             }
-            mem_base = 1ULL << 32;
+            mem_base = x86ms->above_4g_mem_start;
             mem_len = next_base - x86ms->below_4g_mem_size;
             next_base = mem_base + mem_len;
         }
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 774cb2bf0748..a9d1bf95649a 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -850,9 +850,10 @@ void pc_memory_init(PCMachineState *pcms,
                                  machine->ram,
                                  x86ms->below_4g_mem_size,
                                  x86ms->above_4g_mem_size);
-        memory_region_add_subregion(system_memory, 0x100000000ULL,
+        memory_region_add_subregion(system_memory, x86ms->above_4g_mem_start,
                                     ram_above_4g);
-        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
+        e820_add_entry(x86ms->above_4g_mem_start, x86ms->above_4g_mem_size,
+                       E820_RAM);
     }
 
     if (pcms->sgx_epc.size != 0) {
@@ -893,7 +894,7 @@ void pc_memory_init(PCMachineState *pcms,
             machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
         } else {
             machine->device_memory->base =
-                0x100000000ULL + x86ms->above_4g_mem_size;
+                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
         }
 
         machine->device_memory->base =
@@ -929,7 +930,7 @@ void pc_memory_init(PCMachineState *pcms,
         } else if (pcms->sgx_epc.size != 0) {
             cxl_base = sgx_epc_above_4g_end(&pcms->sgx_epc);
         } else {
-            cxl_base = 0x100000000ULL + x86ms->above_4g_mem_size;
+            cxl_base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
         }
 
         e820_add_entry(cxl_base, cxl_size, E820_RESERVED);
@@ -1037,7 +1038,7 @@ uint64_t pc_pci_hole64_start(void)
     } else if (pcms->sgx_epc.size != 0) {
             hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
     } else {
-        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
+        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
     }
 
     return ROUND_UP(hole64_start, 1 * GiB);
diff --git a/hw/i386/sgx.c b/hw/i386/sgx.c
index a44d66ba2afc..09d9c7c73d9f 100644
--- a/hw/i386/sgx.c
+++ b/hw/i386/sgx.c
@@ -295,7 +295,7 @@ void pc_machine_init_sgx_epc(PCMachineState *pcms)
         return;
     }
 
-    sgx_epc->base = 0x100000000ULL + x86ms->above_4g_mem_size;
+    sgx_epc->base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
 
     memory_region_init(&sgx_epc->mr, OBJECT(pcms), "sgx-epc", UINT64_MAX);
     memory_region_add_subregion(get_system_memory(), sgx_epc->base,
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 6003b4b2dfea..029264c54fe2 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1373,6 +1373,7 @@ static void x86_machine_initfn(Object *obj)
     x86ms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
     x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
     x86ms->bus_lock_ratelimit = 0;
+    x86ms->above_4g_mem_start = 4 * GiB;
 }
 
 static void x86_machine_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index 9089bdd99c3a..df82c5fd4252 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -56,6 +56,9 @@ struct X86MachineState {
     /* RAM information (sizes, addresses, configuration): */
     ram_addr_t below_4g_mem_size, above_4g_mem_size;
 
+    /* Start address of the initial RAM above 4G */
+    uint64_t above_4g_mem_start;
+
     /* CPU and apic information: */
     bool apic_xrupt_override;
     unsigned pci_irq_mask;
-- 
2.17.2




* [PATCH v6 02/10] i386/pc: create pci-host qdev prior to pc_memory_init()
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-07-01 16:10 ` [PATCH v6 01/10] hw/i386: add 4g boundary start to X86MachineState Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-01 16:10 ` [PATCH v6 03/10] i386/pc: pass pci_hole64_size " Joao Martins
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins

At the start of pc_memory_init() we usually pass a range of
0..UINT64_MAX as pci_memory, when really it's 2G (i440fx) or
32G (q35). To get the real user value, we need to read the
pci-host property that holds the default pci_hole64_size. To do
that, create the qdev prior to memory init, so that we can better
estimate the max used/phys addr.

This is in preparation to determine that host-phys-bits are
enough and also for pci-hole64-size to be considered to relocate
ram-above-4g to be at 1T (on AMD platforms).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
 hw/i386/pc_piix.c            | 7 +++++--
 hw/i386/pc_q35.c             | 6 +++---
 hw/pci-host/i440fx.c         | 5 ++---
 include/hw/pci-host/i440fx.h | 3 ++-
 4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index a234989ac363..6186a1473755 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
     MemoryRegion *pci_memory;
     MemoryRegion *rom_memory;
     ram_addr_t lowmem;
+    DeviceState *i440fx_host;
 
     /*
      * Calculate ram split, for memory below and above 4G.  It's a bit
@@ -164,9 +165,11 @@ static void pc_init1(MachineState *machine,
         pci_memory = g_new(MemoryRegion, 1);
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
+        i440fx_host = qdev_new(host_type);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
+        i440fx_host = NULL;
     }
 
     pc_guest_info_init(pcms);
@@ -200,8 +203,8 @@ static void pc_init1(MachineState *machine,
         const char *type = xen_enabled() ? TYPE_PIIX3_XEN_DEVICE
                                          : TYPE_PIIX3_DEVICE;
 
-        pci_bus = i440fx_init(host_type,
-                              pci_type,
+        pci_bus = i440fx_init(pci_type,
+                              i440fx_host,
                               system_memory, system_io, machine->ram_size,
                               x86ms->below_4g_mem_size,
                               x86ms->above_4g_mem_size,
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index f96cbd04e284..46ea89e564de 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -203,12 +203,12 @@ static void pc_q35_init(MachineState *machine)
                             pcms->smbios_entry_point_type);
     }
 
-    /* allocate ram and load rom/bios */
-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
-
     /* create pci host bus */
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
 
+    /* allocate ram and load rom/bios */
+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
+
     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
                              OBJECT(ram_memory), NULL);
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index 1c5ad5f918a2..d5426ef4a53c 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -237,7 +237,8 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
     }
 }
 
-PCIBus *i440fx_init(const char *host_type, const char *pci_type,
+PCIBus *i440fx_init(const char *pci_type,
+                    DeviceState *dev,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
                     ram_addr_t ram_size,
@@ -246,7 +247,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_address_space,
                     MemoryRegion *ram_memory)
 {
-    DeviceState *dev;
     PCIBus *b;
     PCIDevice *d;
     PCIHostState *s;
@@ -254,7 +254,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
     unsigned i;
     I440FXState *i440fx;
 
-    dev = qdev_new(host_type);
     s = PCI_HOST_BRIDGE(dev);
     b = pci_root_bus_new(dev, NULL, pci_address_space,
                          address_space_io, 0, TYPE_PCI_BUS);
diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
index 52518dbf08e6..d02bf1ed6b93 100644
--- a/include/hw/pci-host/i440fx.h
+++ b/include/hw/pci-host/i440fx.h
@@ -35,7 +35,8 @@ struct PCII440FXState {
 
 #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
 
-PCIBus *i440fx_init(const char *host_type, const char *pci_type,
+PCIBus *i440fx_init(const char *pci_type,
+                    DeviceState *dev,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
                     ram_addr_t ram_size,
-- 
2.17.2




* [PATCH v6 03/10] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-07-01 16:10 ` [PATCH v6 01/10] hw/i386: add 4g boundary start to X86MachineState Joao Martins
  2022-07-01 16:10 ` [PATCH v6 02/10] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-09 20:51   ` B
  2022-07-01 16:10 ` [PATCH v6 04/10] i386/pc: factor out above-4g end to an helper Joao Martins
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins

Use the pre-initialized pci-host qdev and pass the
pci-hole64-size into the newly added pc_memory_init() argument.
piix needs a bit of care given all the !pci_enabled() cases
and that the pci_hole64_size is private to i440fx.

This is in preparation to determine that host-phys-bits are
enough and for pci-hole64-size to be considered to relocate
ram-above-4g to be at 1T (on AMD platforms).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
---
 hw/i386/pc.c                 | 3 ++-
 hw/i386/pc_piix.c            | 5 ++++-
 hw/i386/pc_q35.c             | 8 +++++++-
 hw/pci-host/i440fx.c         | 7 +++++++
 include/hw/i386/pc.h         | 3 ++-
 include/hw/pci-host/i440fx.h | 1 +
 6 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index a9d1bf95649a..1bb89a9c17ec 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -817,7 +817,8 @@ void xen_load_linux(PCMachineState *pcms)
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory)
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size)
 {
     int linux_boot, i;
     MemoryRegion *option_rom_mr;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 6186a1473755..f3c726e42400 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
     MemoryRegion *pci_memory;
     MemoryRegion *rom_memory;
     ram_addr_t lowmem;
+    uint64_t hole64_size;
     DeviceState *i440fx_host;
 
     /*
@@ -166,10 +167,12 @@ static void pc_init1(MachineState *machine,
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
         i440fx_host = qdev_new(host_type);
+        hole64_size = i440fx_pci_hole64_size(i440fx_host);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
         i440fx_host = NULL;
+        hole64_size = 0;
     }
 
     pc_guest_info_init(pcms);
@@ -186,7 +189,7 @@ static void pc_init1(MachineState *machine,
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         pc_memory_init(pcms, system_memory,
-                       rom_memory, &ram_memory);
+                       rom_memory, &ram_memory, hole64_size);
     } else {
         pc_system_flash_cleanup_unused(pcms);
         if (machine->kernel_filename != NULL) {
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 46ea89e564de..5a4a737fe203 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -138,6 +138,7 @@ static void pc_q35_init(MachineState *machine)
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     bool acpi_pcihp;
     bool keep_pci_slot_hpc;
+    uint64_t pci_hole64_size = 0;
 
     /* Check whether RAM fits below 4G (leaving 1/2 GByte for IO memory
      * and 256 Mbytes for PCI Express Enhanced Configuration Access Mapping
@@ -206,8 +207,13 @@ static void pc_q35_init(MachineState *machine)
     /* create pci host bus */
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
 
+    if (pcmc->pci_enabled) {
+        pci_hole64_size = q35_host->mch.pci_hole64_size;
+    }
+
     /* allocate ram and load rom/bios */
-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
+                   pci_hole64_size);
 
     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index d5426ef4a53c..15680da7d709 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -237,6 +237,13 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
     }
 }
 
+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
+{
+        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
+
+        return i440fx->pci_hole64_size;
+}
+
 PCIBus *i440fx_init(const char *pci_type,
                     DeviceState *dev,
                     MemoryRegion *address_space_mem,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index b7735dccfc81..568c226d3034 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -159,7 +159,8 @@ void xen_load_linux(PCMachineState *pcms);
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory);
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size);
 uint64_t pc_pci_hole64_start(void);
 DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
 void pc_basic_device_init(struct PCMachineState *pcms,
diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
index d02bf1ed6b93..2234dd5a2a6a 100644
--- a/include/hw/pci-host/i440fx.h
+++ b/include/hw/pci-host/i440fx.h
@@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *pci_type,
                     MemoryRegion *pci_memory,
                     MemoryRegion *ram_memory);
 
+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
 
 #endif
-- 
2.17.2




* [PATCH v6 04/10] i386/pc: factor out above-4g end to an helper
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (2 preceding siblings ...)
  2022-07-01 16:10 ` [PATCH v6 03/10] i386/pc: pass pci_hole64_size " Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-07 12:42   ` Igor Mammedov
  2022-07-01 16:10 ` [PATCH v6 05/10] i386/pc: factor out cxl range end to helper Joao Martins
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins

There's a couple of places that seem to duplicate this calculation
of RAM size above the 4G boundary. Move all those to a helper function.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 1bb89a9c17ec..6c7c49ca5a32 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -814,6 +814,17 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN       0x800
 #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
 
+static hwaddr pc_above_4g_end(PCMachineState *pcms)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+
+    if (pcms->sgx_epc.size != 0) {
+        return sgx_epc_above_4g_end(&pcms->sgx_epc);
+    }
+
+    return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -891,15 +902,8 @@ void pc_memory_init(PCMachineState *pcms,
             exit(EXIT_FAILURE);
         }
 
-        if (pcms->sgx_epc.size != 0) {
-            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
-        } else {
-            machine->device_memory->base =
-                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
-        }
-
         machine->device_memory->base =
-            ROUND_UP(machine->device_memory->base, 1 * GiB);
+            ROUND_UP(pc_above_4g_end(pcms), 1 * GiB);
 
         if (pcmc->enforce_aligned_dimm) {
             /* size device region assuming 1G page max alignment per slot */
@@ -928,10 +932,8 @@ void pc_memory_init(PCMachineState *pcms,
             if (!pcmc->broken_reserved_end) {
                 cxl_base += memory_region_size(&machine->device_memory->mr);
             }
-        } else if (pcms->sgx_epc.size != 0) {
-            cxl_base = sgx_epc_above_4g_end(&pcms->sgx_epc);
         } else {
-            cxl_base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+            cxl_base = pc_above_4g_end(pcms);
         }
 
         e820_add_entry(cxl_base, cxl_size, E820_RESERVED);
@@ -1018,7 +1020,6 @@ uint64_t pc_pci_hole64_start(void)
     PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     MachineState *ms = MACHINE(pcms);
-    X86MachineState *x86ms = X86_MACHINE(pcms);
     uint64_t hole64_start = 0;
 
     if (pcms->cxl_devices_state.host_mr.addr) {
@@ -1036,10 +1037,8 @@ uint64_t pc_pci_hole64_start(void)
         if (!pcmc->broken_reserved_end) {
             hole64_start += memory_region_size(&ms->device_memory->mr);
         }
-    } else if (pcms->sgx_epc.size != 0) {
-            hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
     } else {
-        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
+        hole64_start = pc_above_4g_end(pcms);
     }
 
     return ROUND_UP(hole64_start, 1 * GiB);
-- 
2.17.2




* [PATCH v6 05/10] i386/pc: factor out cxl range end to helper
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (3 preceding siblings ...)
  2022-07-01 16:10 ` [PATCH v6 04/10] i386/pc: factor out above-4g end to an helper Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-07 12:57   ` Igor Mammedov
  2022-07-01 16:10 ` [PATCH v6 06/10] i386/pc: factor out cxl range start " Joao Martins
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins, Jonathan Cameron

Move the calculation of the CXL memory region end into a separate helper,
in preparation for allowing pc_pci_hole64_start() to be called before
any MRs are initialized.

Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 6c7c49ca5a32..0abbf81841a9 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -825,6 +825,25 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
     return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
 }
 
+static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
+{
+    uint64_t start = 0;
+
+    if (pcms->cxl_devices_state.host_mr.addr) {
+        start = pcms->cxl_devices_state.host_mr.addr +
+            memory_region_size(&pcms->cxl_devices_state.host_mr);
+        if (pcms->cxl_devices_state.fixed_windows) {
+            GList *it;
+            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
+                CXLFixedWindow *fw = it->data;
+                start = fw->mr.addr + memory_region_size(&fw->mr);
+            }
+        }
+    }
+
+    return start;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -1022,16 +1041,8 @@ uint64_t pc_pci_hole64_start(void)
     MachineState *ms = MACHINE(pcms);
     uint64_t hole64_start = 0;
 
-    if (pcms->cxl_devices_state.host_mr.addr) {
-        hole64_start = pcms->cxl_devices_state.host_mr.addr +
-            memory_region_size(&pcms->cxl_devices_state.host_mr);
-        if (pcms->cxl_devices_state.fixed_windows) {
-            GList *it;
-            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
-                CXLFixedWindow *fw = it->data;
-                hole64_start = fw->mr.addr + memory_region_size(&fw->mr);
-            }
-        }
+    if (pcms->cxl_devices_state.is_enabled) {
+        hole64_start = pc_get_cxl_range_end(pcms);
     } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
         hole64_start = ms->device_memory->base;
         if (!pcmc->broken_reserved_end) {
-- 
2.17.2




* [PATCH v6 06/10] i386/pc: factor out cxl range start to helper
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (4 preceding siblings ...)
  2022-07-01 16:10 ` [PATCH v6 05/10] i386/pc: factor out cxl range end to helper Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-07 13:00   ` Igor Mammedov
  2022-07-01 16:10 ` [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end() Joao Martins
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins, Jonathan Cameron

Factor out the calculation of the base address of the CXL MR. It will be
used later on by its counterpart, the CXL range end calculation, as well
as by the CXL MR initialization in pc_memory_init(), thus avoiding
duplication.

Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 0abbf81841a9..8655cc3b8894 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -825,6 +825,24 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
     return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
 }
 
+static uint64_t pc_get_cxl_range_start(PCMachineState *pcms)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    MachineState *machine = MACHINE(pcms);
+    hwaddr cxl_base;
+
+    if (pcmc->has_reserved_memory && machine->device_memory->base) {
+        cxl_base = machine->device_memory->base;
+        if (!pcmc->broken_reserved_end) {
+            cxl_base += memory_region_size(&machine->device_memory->mr);
+        }
+    } else {
+        cxl_base = pc_above_4g_end(pcms);
+    }
+
+    return cxl_base;
+}
+
 static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
 {
     uint64_t start = 0;
@@ -946,15 +964,7 @@ void pc_memory_init(PCMachineState *pcms,
         MemoryRegion *mr = &pcms->cxl_devices_state.host_mr;
         hwaddr cxl_size = MiB;
 
-        if (pcmc->has_reserved_memory && machine->device_memory->base) {
-            cxl_base = machine->device_memory->base;
-            if (!pcmc->broken_reserved_end) {
-                cxl_base += memory_region_size(&machine->device_memory->mr);
-            }
-        } else {
-            cxl_base = pc_above_4g_end(pcms);
-        }
-
+        cxl_base = pc_get_cxl_range_start(pcms);
         e820_add_entry(cxl_base, cxl_size, E820_RESERVED);
         memory_region_init(mr, OBJECT(machine), "cxl_host_reg", cxl_size);
         memory_region_add_subregion(system_memory, cxl_base, mr);
-- 
2.17.2




* [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end()
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (5 preceding siblings ...)
  2022-07-01 16:10 ` [PATCH v6 06/10] i386/pc: factor out cxl range start " Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-07 13:05   ` Igor Mammedov
  2022-07-01 16:10 ` [PATCH v6 08/10] i386/pc: factor out device_memory base/size to helper Joao Martins
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins, Jonathan Cameron

In preparation for allowing pc_pci_hole64_start() to be called early
in pc_memory_init(), handle the CXL memory region end when its underlying
memory region isn't yet initialized.

Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
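For illustration, the fallback arithmetic added in the hunk below can be
sketched standalone. This is a minimal sketch, not the QEMU code: the name
cxl_range_end_uninit(), the plain array of window sizes and the local
ROUND_UP/MiB macros are stand-ins assumed for the example. Only the 1 MiB
host register size and the 256 MiB alignment are taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)
/* Round n up to the next multiple of d (d must be a power of two). */
#define ROUND_UP(n, d) (((n) + (d) - 1) & ~((uint64_t)(d) - 1))

/*
 * Hypothetical stand-in for the new "else" branch of
 * pc_get_cxl_range_end(): estimate where the CXL range ends before
 * any memory region has been initialized.
 */
static uint64_t cxl_range_end_uninit(uint64_t cxl_base,
                                     const uint64_t *window_sizes,
                                     size_t nr_windows)
{
    uint64_t end = cxl_base;    /* no fixed windows: end == range start */
    size_t i;

    if (nr_windows > 0) {
        /* Skip the 1 MiB host register block, then align to 256 MiB. */
        end = ROUND_UP(end + 1 * MiB, 256 * MiB);
        for (i = 0; i < nr_windows; i++) {
            end += window_sizes[i];
        }
    }

    return end;
}
```

With a base of 4 GiB and two windows of 4 GiB and 256 MiB, the sketch
yields 0x220000000, i.e. the aligned start plus the sum of the window
sizes.
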
 hw/i386/pc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 8655cc3b8894..d6dff71012ab 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -857,6 +857,19 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
                 start = fw->mr.addr + memory_region_size(&fw->mr);
             }
         }
+    } else {
+        hwaddr cxl_size = MiB;
+
+        start = pc_get_cxl_range_start(pcms);
+        if (pcms->cxl_devices_state.fixed_windows) {
+            GList *it;
+
+            start = ROUND_UP(start + cxl_size, 256 * MiB);
+            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
+                CXLFixedWindow *fw = it->data;
+                start += fw->size;
+            }
+        }
     }
 
     return start;
-- 
2.17.2




* [PATCH v6 08/10] i386/pc: factor out device_memory base/size to helper
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (6 preceding siblings ...)
  2022-07-01 16:10 ` [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end() Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-07 13:15   ` Igor Mammedov
  2022-07-01 16:10 ` [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable Joao Martins
  2022-07-01 16:10 ` [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
  9 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins, Jonathan Cameron

Move obtaining hole64_start from the device_memory MR base/size into a
helper, alongside the corresponding calculations in pc_memory_init() that
apply when the hotplug range is uninitialized.

This is the final step that allows pc_pci_hole64_start() to be callable
at the beginning of pc_memory_init() before any MRs are initialized.

Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
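As a rough illustration of the uninitialized-MR path of the new helper,
the base/size calculation can be modelled outside of QEMU. This is a
sketch under stated assumptions: struct mem_layout and
device_memory_range_uninit() are hypothetical names that flatten the
machine fields the helper reads, and the macros are local stand-ins.

```c
#include <stdbool.h>
#include <stdint.h>

#define GiB (1024ULL * 1024ULL * 1024ULL)
/* Round n up to the next multiple of d (d must be a power of two). */
#define ROUND_UP(n, d) (((n) + (d) - 1) & ~((uint64_t)(d) - 1))

/* Hypothetical, flattened view of the machine state used here. */
struct mem_layout {
    uint64_t above_4g_end;      /* what pc_above_4g_end() would return */
    uint64_t ram_size;
    uint64_t maxram_size;
    uint64_t ram_slots;
    bool enforce_aligned_dimm;
};

/* Mirrors the fallback path: no device_memory MR exists yet. */
static void device_memory_range_uninit(const struct mem_layout *m,
                                       uint64_t *base, uint64_t *size)
{
    *size = m->maxram_size - m->ram_size;
    *base = ROUND_UP(m->above_4g_end, 1 * GiB);

    if (m->enforce_aligned_dimm) {
        /* reserve room for 1G alignment of each slot */
        *size += (1 * GiB) * m->ram_slots;
    }
}
```

For example, RAM ending at 6.5 GiB with 4 GiB of hotpluggable memory and
4 slots gives a base of 7 GiB and a size of 8 GiB in this model.
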
 hw/i386/pc.c | 55 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index d6dff71012ab..a79fa1b6beeb 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -825,16 +825,48 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
     return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
 }
 
+static void pc_get_device_memory_range(PCMachineState *pcms,
+                                       hwaddr *base,
+                                       ram_addr_t *device_mem_size)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    MachineState *machine = MACHINE(pcms);
+    ram_addr_t size;
+    hwaddr addr;
+
+    if (pcmc->has_reserved_memory &&
+        machine->device_memory && machine->device_memory->base) {
+        *base = machine->device_memory->base;
+        *device_mem_size = memory_region_size(&machine->device_memory->mr);
+        return;
+    }
+
+    /* handles uninitialized @device_memory MR */
+    size = machine->maxram_size - machine->ram_size;
+    addr = ROUND_UP(pc_above_4g_end(pcms), 1 * GiB);
+
+    if (pcmc->enforce_aligned_dimm) {
+        /* size device region assuming 1G page max alignment per slot */
+        size += (1 * GiB) * machine->ram_slots;
+    }
+
+    *base = addr;
+    *device_mem_size = size;
+}
+
+
 static uint64_t pc_get_cxl_range_start(PCMachineState *pcms)
 {
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     MachineState *machine = MACHINE(pcms);
     hwaddr cxl_base;
+    ram_addr_t size;
 
-    if (pcmc->has_reserved_memory && machine->device_memory->base) {
-        cxl_base = machine->device_memory->base;
+    if (pcmc->has_reserved_memory &&
+        machine->device_memory && machine->device_memory->base) {
+        pc_get_device_memory_range(pcms, &cxl_base, &size);
         if (!pcmc->broken_reserved_end) {
-            cxl_base += memory_region_size(&machine->device_memory->mr);
+            cxl_base += size;
         }
     } else {
         cxl_base = pc_above_4g_end(pcms);
@@ -937,7 +969,7 @@ void pc_memory_init(PCMachineState *pcms,
     /* initialize device memory address space */
     if (pcmc->has_reserved_memory &&
         (machine->ram_size < machine->maxram_size)) {
-        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
+        ram_addr_t device_mem_size;
 
         if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
             error_report("unsupported amount of memory slots: %"PRIu64,
@@ -952,13 +984,7 @@ void pc_memory_init(PCMachineState *pcms,
             exit(EXIT_FAILURE);
         }
 
-        machine->device_memory->base =
-            ROUND_UP(pc_above_4g_end(pcms), 1 * GiB);
-
-        if (pcmc->enforce_aligned_dimm) {
-            /* size device region assuming 1G page max alignment per slot */
-            device_mem_size += (1 * GiB) * machine->ram_slots;
-        }
+        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);
 
         if ((machine->device_memory->base + device_mem_size) <
             device_mem_size) {
@@ -1063,13 +1089,14 @@ uint64_t pc_pci_hole64_start(void)
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     MachineState *ms = MACHINE(pcms);
     uint64_t hole64_start = 0;
+    ram_addr_t size = 0;
 
     if (pcms->cxl_devices_state.is_enabled) {
         hole64_start = pc_get_cxl_range_end(pcms);
-    } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
-        hole64_start = ms->device_memory->base;
+    } else if (pcmc->has_reserved_memory && (ms->ram_size < ms->maxram_size)) {
+        pc_get_device_memory_range(pcms, &hole64_start, &size);
         if (!pcmc->broken_reserved_end) {
-            hole64_start += memory_region_size(&ms->device_memory->mr);
+            hole64_start += size;
         }
     } else {
         hole64_start = pc_above_4g_end(pcms);
-- 
2.17.2




* [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (7 preceding siblings ...)
  2022-07-01 16:10 ` [PATCH v6 08/10] i386/pc: factor out device_memory base/size to helper Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-07 15:53   ` Joao Martins
  2022-07-11 12:56   ` Igor Mammedov
  2022-07-01 16:10 ` [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
  9 siblings, 2 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins

It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, except for a
tiny region before 4G. Since Linux v5.4, VFIO validates
whether the selected GPA is indeed valid, i.e. not reserved by the
IOMMU on behalf of some specific devices or platform-defined
restrictions, and otherwise fails the VFIO_IOMMU_MAP_DMA ioctl with
-EINVAL.

AMD systems with an IOMMU are examples of such platforms and,
in particular, may only allow these ranges:

	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
	0000010000000000 - ffffffffffffffff (1Tb    .. 16Eb[*])

We already account for the 4G hole, but if the guest is big
enough we will fail to create a guest with >1010G due to the
~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

[*] there is another reserved region unrelated to HT that exists
at the 256T boundary in Fam 17h according to Errata #1286,
also documented in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence
cannot be used for GPAs either. In those cases, rather than
establishing the start of ram-above-4g at 4G, relocate it
to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
Topology", for more information on the underlying restriction of
IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done or the address space covers it, we
also add the HT range to e820 as reserved.

The default phys-bits in QEMU is TCG_PHYS_ADDR_BITS (40), which is only
enough to address 1Tb (0xff ffff ffff). On AMD platforms, if a
ram-above-4g relocation is needed and the CPU wasn't configured
with large enough phys-bits, print an error message to the user
and exit instead of relocating the above-4g region.

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
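The decision made by pc_set_amd_above_4g_mem_start() can be sketched as
plain arithmetic. This is a simplified model, not the patch itself:
choose_above_4g_start() and max_used_gpa() are hypothetical names, the
highest used GPA is assumed to be just RAM above 4G followed by the
64-bit PCI hole (the real code goes through pc_pci_hole64_start()), and
a return value of 0 stands in for the error exit. The AMD_HT_* constants
match the ones introduced below, and phys_bits is assumed to be < 64.

```c
#include <stdint.h>

#define GiB (1024ULL * 1024ULL * 1024ULL)

#define AMD_HT_START         0xfd00000000ULL
#define AMD_HT_END           0xffffffffffULL
#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)

/* Simplified: RAM above 4G immediately followed by the PCI hole64. */
static uint64_t max_used_gpa(uint64_t above_4g_start,
                             uint64_t above_4g_size,
                             uint64_t pci_hole64_size)
{
    return above_4g_start + above_4g_size + pci_hole64_size;
}

/*
 * Returns the start address to use for ram-above-4g, or 0 when the
 * relocation is needed but phys_bits cannot cover the relocated layout.
 */
static uint64_t choose_above_4g_start(uint64_t start,
                                      uint64_t above_4g_size,
                                      uint64_t pci_hole64_size,
                                      unsigned phys_bits)
{
    uint64_t maxphysaddr = (1ULL << phys_bits) - 1;

    /* Layout stays below the HT range: nothing to relocate. */
    if (max_used_gpa(start, above_4g_size, pci_hole64_size) <
        AMD_HT_START) {
        return start;
    }

    /* Relocated layout must still fit in the guest address space. */
    if (maxphysaddr < max_used_gpa(AMD_ABOVE_1TB_START, above_4g_size,
                                   pci_hole64_size)) {
        return 0;
    }

    return AMD_ABOVE_1TB_START;
}
```

In this model a guest with 1022G above 4G cannot be relocated with 40
phys-bits but can with 41, while a small guest keeps its start at 4G.
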
 hw/i386/pc.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index a79fa1b6beeb..07025b510540 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
     return start;
 }
 
+static hwaddr pc_max_used_gpa(PCMachineState *pcms,
+                                hwaddr above_4g_mem_start,
+                                uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+
+    if (!x86ms->above_4g_mem_size) {
+        /*
+         * 32-bit pci hole goes from
+         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+         */
+        return IO_APIC_DEFAULT_ADDRESS - 1;
+    }
+
+    return pc_pci_hole64_start() + pci_hole64_size;
+}
+
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb boundary, made of special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting with Linux v5.4 VFIO validates them, so guests with certain memory
+ * sizes cannot be created on AMD machines. Using those IOVA ranges is also
+ * wrong, as it can lead to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
+                                          uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr start = x86ms->above_4g_mem_start;
+    hwaddr maxphysaddr, maxusedaddr;
+
+    /* Bail out if max possible address does not cross HT range */
+    if (pc_max_used_gpa(pcms, start, pci_hole64_size) < AMD_HT_START) {
+        return;
+    }
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is appropriately sized in order to proceed
+     * with the above-4g-region relocation and thus boot.
+     */
+    start = AMD_ABOVE_1TB_START;
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    maxusedaddr = pc_max_used_gpa(pcms, start, pci_hole64_size);
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u) cannot avoid AMD HT range",
+                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+
+    x86ms->above_4g_mem_start = start;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -922,12 +1003,31 @@ void pc_memory_init(PCMachineState *pcms,
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
     hwaddr cxl_base, cxl_resv_end = 0;
+    X86CPU *cpu = X86_CPU(first_cpu);
 
     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
 
     linux_boot = (machine->kernel_filename != NULL);
 
+    /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(&cpu->env)) {
+        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
+
+        /*
+         * Advertise the HT region if address space covers the reserved
+         * region or if we relocate.
+         */
+        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
+            cpu->phys_bits >= 40) {
+            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+        }
+    }
+
     /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
@@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
                              0, x86ms->below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
+
     if (x86ms->above_4g_mem_size > 0) {
         ram_above_4g = g_malloc(sizeof(*ram_above_4g));
         memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
-- 
2.17.2




* [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (8 preceding siblings ...)
  2022-07-01 16:10 ` [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable Joao Martins
@ 2022-07-01 16:10 ` Joao Martins
  2022-07-04 14:27   ` Dr. David Alan Gilbert
  2022-07-11 13:03   ` Igor Mammedov
  9 siblings, 2 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-01 16:10 UTC (permalink / raw)
  To: qemu-devel
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Joao Martins

The added enforcing is only relevant in the case of AMD, where the
range right before 1TB is restricted and cannot be DMA mapped
by the kernel, consequently leading to IOMMU INVALID_DEVICE_REQUEST
or possibly other kinds of IOMMU events in the AMD IOMMU.

However, there's a case where it may make sense to disable the
IOVA relocation/validation: when migrating from a
non-valid-IOVA-aware qemu to one that supports it.

Relocating RAM regions to after the 1Tb hole has consequences for
guest ABI because we are changing the memory mapping, so make
sure that only new machine types enforce it, but not older ones.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
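The compat pattern used here can be sketched in isolation. This is a
reduced model under stated assumptions: struct pc_class and the function
names below are hypothetical stand-ins for PCMachineClass,
pc_machine_class_init() and the versioned *_machine_options() hooks.
Only the enforce_valid_iova flag and its true/false values per machine
version come from the patch.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for PCMachineClass. */
struct pc_class {
    bool enforce_valid_iova;
};

/* Class default, as set in the class init of this patch. */
static void pc_class_init(struct pc_class *pcmc)
{
    pcmc->enforce_valid_iova = true;
}

/* 7.1 and newer machine types keep the default. */
static void machine_7_1_options(struct pc_class *pcmc)
{
    (void)pcmc;
}

/* 7.0 and older inherit the newer options, then opt out. */
static void machine_7_0_options(struct pc_class *pcmc)
{
    machine_7_1_options(pcmc);
    pcmc->enforce_valid_iova = false;
}

/* Gate used before attempting the ram-above-4g relocation. */
static bool wants_ht_relocation(const struct pc_class *pcmc,
                                bool is_amd_cpu)
{
    return is_amd_cpu && pcmc->enforce_valid_iova;
}
```

The point of the chaining is that a guest started with an older machine
type keeps its original memory map, so it can still be migrated to a
newer QEMU.
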
 hw/i386/pc.c         | 6 ++++--
 hw/i386/pc_piix.c    | 2 ++
 hw/i386/pc_q35.c     | 2 ++
 include/hw/i386/pc.h | 1 +
 4 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 07025b510540..f99e16a5db4b 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1013,9 +1013,10 @@ void pc_memory_init(PCMachineState *pcms,
     /*
      * The HyperTransport range close to the 1T boundary is unique to AMD
      * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
-     * to above 1T to AMD vCPUs only.
+     * to above 1T to AMD vCPUs only. @enforce_valid_iova is only false in
+     * older machine types (<= 7.0) for compatibility purposes.
      */
-    if (IS_AMD_CPU(&cpu->env)) {
+    if (IS_AMD_CPU(&cpu->env) && pcmc->enforce_valid_iova) {
         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
 
         /*
@@ -1950,6 +1951,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
     pcmc->has_reserved_memory = true;
     pcmc->kvmclock_enabled = true;
     pcmc->enforce_aligned_dimm = true;
+    pcmc->enforce_valid_iova = true;
     /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
      * to be used at the moment, 32K should be enough for a while.  */
     pcmc->acpi_data_size = 0x20000 + 0x8000;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index f3c726e42400..504ddd0deece 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -444,9 +444,11 @@ DEFINE_I440FX_MACHINE(v7_1, "pc-i440fx-7.1", NULL,
 
 static void pc_i440fx_7_0_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_i440fx_7_1_machine_options(m);
     m->alias = NULL;
     m->is_default = false;
+    pcmc->enforce_valid_iova = false;
     compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
     compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
 }
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 5a4a737fe203..4b747c59c19a 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -381,8 +381,10 @@ DEFINE_Q35_MACHINE(v7_1, "pc-q35-7.1", NULL,
 
 static void pc_q35_7_0_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_q35_7_1_machine_options(m);
     m->alias = NULL;
+    pcmc->enforce_valid_iova = false;
     compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
     compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
 }
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 568c226d3034..3a873ff69499 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -118,6 +118,7 @@ struct PCMachineClass {
     bool has_reserved_memory;
     bool enforce_aligned_dimm;
     bool broken_reserved_end;
+    bool enforce_valid_iova;
 
     /* generate legacy CPU hotplug AML */
     bool legacy_cpu_hotplug;
-- 
2.17.2




* Re: [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-07-01 16:10 ` [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
@ 2022-07-04 14:27   ` Dr. David Alan Gilbert
  2022-07-05  8:48     ` Joao Martins
  2022-07-11 13:03   ` Igor Mammedov
  1 sibling, 1 reply; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2022-07-04 14:27 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Suravee Suthikulpanit

* Joao Martins (joao.m.martins@oracle.com) wrote:
> The added enforcing is only relevant in the case of AMD where the
> range right before the 1TB is restricted and cannot be DMA mapped
> by the kernel consequently leading to IOMMU INVALID_DEVICE_REQUEST
> or possibly other kinds of IOMMU events in the AMD IOMMU.
> 
> Although, there's a case where it may make sense to disable the
> IOVA relocation/validation when migrating from a
> non-valid-IOVA-aware qemu to one that supports it.
> 
> Relocating RAM regions to after the 1Tb hole has consequences for
> guest ABI because we are changing the memory mapping, so make
> sure that only new machine enforce but not older ones.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Thanks for keeping the migration compatibility, so for migration:

Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  hw/i386/pc.c         | 6 ++++--
>  hw/i386/pc_piix.c    | 2 ++
>  hw/i386/pc_q35.c     | 2 ++
>  include/hw/i386/pc.h | 1 +
>  4 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 07025b510540..f99e16a5db4b 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1013,9 +1013,10 @@ void pc_memory_init(PCMachineState *pcms,
>      /*
>       * The HyperTransport range close to the 1T boundary is unique to AMD
>       * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> -     * to above 1T to AMD vCPUs only.
> +     * to above 1T to AMD vCPUs only. @enforce_valid_iova is only false in
> +     * older machine types (<= 7.0) for compatibility purposes.
>       */
> -    if (IS_AMD_CPU(&cpu->env)) {
> +    if (IS_AMD_CPU(&cpu->env) && pcmc->enforce_valid_iova) {
>          pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>  
>          /*
> @@ -1950,6 +1951,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>      pcmc->has_reserved_memory = true;
>      pcmc->kvmclock_enabled = true;
>      pcmc->enforce_aligned_dimm = true;
> +    pcmc->enforce_valid_iova = true;
>      /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
>       * to be used at the moment, 32K should be enough for a while.  */
>      pcmc->acpi_data_size = 0x20000 + 0x8000;
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index f3c726e42400..504ddd0deece 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -444,9 +444,11 @@ DEFINE_I440FX_MACHINE(v7_1, "pc-i440fx-7.1", NULL,
>  
>  static void pc_i440fx_7_0_machine_options(MachineClass *m)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>      pc_i440fx_7_1_machine_options(m);
>      m->alias = NULL;
>      m->is_default = false;
> +    pcmc->enforce_valid_iova = false;
>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>  }
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 5a4a737fe203..4b747c59c19a 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -381,8 +381,10 @@ DEFINE_Q35_MACHINE(v7_1, "pc-q35-7.1", NULL,
>  
>  static void pc_q35_7_0_machine_options(MachineClass *m)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>      pc_q35_7_1_machine_options(m);
>      m->alias = NULL;
> +    pcmc->enforce_valid_iova = false;
>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>  }
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 568c226d3034..3a873ff69499 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -118,6 +118,7 @@ struct PCMachineClass {
>      bool has_reserved_memory;
>      bool enforce_aligned_dimm;
>      bool broken_reserved_end;
> +    bool enforce_valid_iova;
>  
>      /* generate legacy CPU hotplug AML */
>      bool legacy_cpu_hotplug;
> -- 
> 2.17.2
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-07-04 14:27   ` Dr. David Alan Gilbert
@ 2022-07-05  8:48     ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-05  8:48 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Suravee Suthikulpanit



On 7/4/22 15:27, Dr. David Alan Gilbert wrote:
> * Joao Martins (joao.m.martins@oracle.com) wrote:
>> The added enforcing is only relevant in the case of AMD where the
>> range right before the 1TB is restricted and cannot be DMA mapped
>> by the kernel consequently leading to IOMMU INVALID_DEVICE_REQUEST
>> or possibly other kinds of IOMMU events in the AMD IOMMU.
>>
>> Although, there's a case where it may make sense to disable the
>> IOVA relocation/validation when migrating from a
>> non-valid-IOVA-aware qemu to one that supports it.
>>
>> Relocating RAM regions to after the 1Tb hole has consequences for
>> guest ABI because we are changing the memory mapping, so make
>> sure that only new machine enforce but not older ones.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> Thanks for keeping the migration compatibility, so for migration:
> 
> Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 

Thank you!
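
For readers following the migration-compatibility point, here is a minimal sketch of how a per-machine-type flag keeps the old memory map on older machine types. It is not part of the posted patch: the `*Sketch` type and the `sketch_*` function names are made up for illustration, and setting the default to true in the newest machine options is an assumption (the quoted hunks only show the 7.0 machine types switching the flag off).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for PCMachineClass; only the flag under discussion. */
typedef struct PCMachineClassSketch {
    bool enforce_valid_iova;
} PCMachineClassSketch;

/* Newest machine type: the AMD IOVA enforcement is enabled (assumed default). */
void sketch_machine_options_7_1(PCMachineClassSketch *pcmc)
{
    assert(pcmc != NULL);
    pcmc->enforce_valid_iova = true;
}

/*
 * Older machine type: inherit the newer defaults first, then turn the
 * enforcement off so the guest-visible memory layout does not change.
 */
void sketch_machine_options_7_0(PCMachineClassSketch *pcmc)
{
    sketch_machine_options_7_1(pcmc);
    pcmc->enforce_valid_iova = false;
}
```

A guest started with the older machine type therefore keeps its previous RAM layout, so the relocation cannot change the memory map underneath a migrated guest.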



* Re: [PATCH v6 04/10] i386/pc: factor out above-4g end to an helper
  2022-07-01 16:10 ` [PATCH v6 04/10] i386/pc: factor out above-4g end to an helper Joao Martins
@ 2022-07-07 12:42   ` Igor Mammedov
  2022-07-07 15:14     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-07 12:42 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On Fri,  1 Jul 2022 17:10:08 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> There's a couple of places that seem to duplicate this calculation
> of RAM size above the 4G boundary. Move all those to a helper function.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Igor Mammedov <imammedo@redhat.com>

> ---
>  hw/i386/pc.c | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 1bb89a9c17ec..6c7c49ca5a32 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -814,6 +814,17 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +static hwaddr pc_above_4g_end(PCMachineState *pcms)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +
> +    if (pcms->sgx_epc.size != 0) {
> +        return sgx_epc_above_4g_end(&pcms->sgx_epc);
> +    }
> +
> +    return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -891,15 +902,8 @@ void pc_memory_init(PCMachineState *pcms,
>              exit(EXIT_FAILURE);
>          }
>  
> -        if (pcms->sgx_epc.size != 0) {
> -            machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
> -        } else {
> -            machine->device_memory->base =
> -                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> -        }
> -
>          machine->device_memory->base =
> -            ROUND_UP(machine->device_memory->base, 1 * GiB);
> +            ROUND_UP(pc_above_4g_end(pcms), 1 * GiB);
>  
>          if (pcmc->enforce_aligned_dimm) {
>              /* size device region assuming 1G page max alignment per slot */
> @@ -928,10 +932,8 @@ void pc_memory_init(PCMachineState *pcms,
>              if (!pcmc->broken_reserved_end) {
>                  cxl_base += memory_region_size(&machine->device_memory->mr);
>              }
> -        } else if (pcms->sgx_epc.size != 0) {
> -            cxl_base = sgx_epc_above_4g_end(&pcms->sgx_epc);
>          } else {
> -            cxl_base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> +            cxl_base = pc_above_4g_end(pcms);
>          }
>  
>          e820_add_entry(cxl_base, cxl_size, E820_RESERVED);
> @@ -1018,7 +1020,6 @@ uint64_t pc_pci_hole64_start(void)
>      PCMachineState *pcms = PC_MACHINE(qdev_get_machine());
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      MachineState *ms = MACHINE(pcms);
> -    X86MachineState *x86ms = X86_MACHINE(pcms);
>      uint64_t hole64_start = 0;
>  
>      if (pcms->cxl_devices_state.host_mr.addr) {
> @@ -1036,10 +1037,8 @@ uint64_t pc_pci_hole64_start(void)
>          if (!pcmc->broken_reserved_end) {
>              hole64_start += memory_region_size(&ms->device_memory->mr);
>          }
> -    } else if (pcms->sgx_epc.size != 0) {
> -            hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
>      } else {
> -        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> +        hole64_start = pc_above_4g_end(pcms);
>      }
>  
>      return ROUND_UP(hole64_start, 1 * GiB);
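
As an illustration of what the factored-out helper computes, here is a standalone sketch with plain integers instead of the QEMU state structs. The `MemLayoutSketch` type and the `sketch_*` names are hypothetical, and treating the SGX EPC end as base plus size is an assumption about what sgx_epc_above_4g_end() returns.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_GIB (1ULL << 30)

/* Hypothetical flattened view of the fields the helper reads. */
typedef struct MemLayoutSketch {
    uint64_t above_4g_mem_start;
    uint64_t above_4g_mem_size;
    uint64_t sgx_epc_base;
    uint64_t sgx_epc_size;      /* 0 when no EPC section is configured */
} MemLayoutSketch;

/* Round n up to a power-of-two alignment. */
uint64_t sketch_round_up(uint64_t n, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* End of everything placed above 4G: the EPC end if present, else the RAM end. */
uint64_t sketch_above_4g_end(const MemLayoutSketch *l)
{
    if (l->sgx_epc_size != 0) {
        return l->sgx_epc_base + l->sgx_epc_size;
    }
    return l->above_4g_mem_start + l->above_4g_mem_size;
}

/* The device memory base is the above-4G end rounded up to 1 GiB. */
uint64_t sketch_device_memory_base(const MemLayoutSketch *l)
{
    return sketch_round_up(sketch_above_4g_end(l), SKETCH_GIB);
}
```

With RAM above 4G ending at 6 GiB + 1 MiB and no EPC, the device memory base in this sketch lands at 7 GiB.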




* Re: [PATCH v6 05/10] i386/pc: factor out cxl range end to helper
  2022-07-01 16:10 ` [PATCH v6 05/10] i386/pc: factor out cxl range end to helper Joao Martins
@ 2022-07-07 12:57   ` Igor Mammedov
  2022-07-07 15:17     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-07 12:57 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On Fri,  1 Jul 2022 17:10:09 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Move calculation of CXL memory region end to separate helper in
> preparation to allow pc_pci_hole64_start() to be called before
> any mrs are initialized.
s/mrs/memory regions/



> 
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 6c7c49ca5a32..0abbf81841a9 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -825,6 +825,25 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
>      return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>  }
>  
> +static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
> +{
> +    uint64_t start = 0;
> +
> +    if (pcms->cxl_devices_state.host_mr.addr) {
> +        start = pcms->cxl_devices_state.host_mr.addr +
> +            memory_region_size(&pcms->cxl_devices_state.host_mr);
> +        if (pcms->cxl_devices_state.fixed_windows) {
> +            GList *it;
> +            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
> +                CXLFixedWindow *fw = it->data;
> +                start = fw->mr.addr + memory_region_size(&fw->mr);
> +            }

This block deals with 'initialized' memory regions, so the claim
'before any mrs are initialized' in the commit message is
confusing at best, or outright wrong.

> +        }
> +    }
> +
> +    return start;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -1022,16 +1041,8 @@ uint64_t pc_pci_hole64_start(void)
>      MachineState *ms = MACHINE(pcms);
>      uint64_t hole64_start = 0;
>  
> -    if (pcms->cxl_devices_state.host_mr.addr) {
> -        hole64_start = pcms->cxl_devices_state.host_mr.addr +
> -            memory_region_size(&pcms->cxl_devices_state.host_mr);
> -        if (pcms->cxl_devices_state.fixed_windows) {
> -            GList *it;
> -            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
> -                CXLFixedWindow *fw = it->data;
> -                hole64_start = fw->mr.addr + memory_region_size(&fw->mr);
> -            }
> -        }
> +    if (pcms->cxl_devices_state.is_enabled) {
> +        hole64_start = pc_get_cxl_range_end(pcms);
>      } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
>          hole64_start = ms->device_memory->base;
>          if (!pcmc->broken_reserved_end) {
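
To make the review remark concrete, the sketch below mirrors the quoted helper using plain structs: the result is derived entirely from addresses that only exist once the regions have been placed. The `RegionSketch` and `WindowNodeSketch` types and the function name are hypothetical stand-ins for the MemoryRegion and GList objects in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a placed memory region. */
typedef struct RegionSketch {
    uint64_t addr;   /* 0 means "not placed yet" in this sketch */
    uint64_t size;
} RegionSketch;

/* Hypothetical singly linked list of fixed windows. */
typedef struct WindowNodeSketch {
    RegionSketch mr;
    struct WindowNodeSketch *next;
} WindowNodeSketch;

/*
 * End of the CXL range computed from placed regions only: start from the
 * end of the host register region, then take the end of the last window.
 */
uint64_t sketch_cxl_range_end(const RegionSketch *host_mr,
                              const WindowNodeSketch *windows)
{
    uint64_t end = 0;

    assert(host_mr != NULL);
    if (host_mr->addr) {
        end = host_mr->addr + host_mr->size;
        for (const WindowNodeSketch *it = windows; it; it = it->next) {
            end = it->mr.addr + it->mr.size;
        }
    }
    return end;
}
```

If the host region has not been placed yet, the function returns 0 whatever windows are configured. That is why this form cannot be used before the memory regions are initialized.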




* Re: [PATCH v6 06/10] i386/pc: factor out cxl range start to helper
  2022-07-01 16:10 ` [PATCH v6 06/10] i386/pc: factor out cxl range start " Joao Martins
@ 2022-07-07 13:00   ` Igor Mammedov
  2022-07-07 15:18     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-07 13:00 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On Fri,  1 Jul 2022 17:10:10 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Factor out the calculation of the base address of the MR. It will be
> used later on for the cxl range end counterpart calculation and as
> well in pc_memory_init() CXL mr initialization, thus avoiding
> duplication.
> 
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Needs to be rebased on top of:

[PATCH 2/3] hw/i386/pc: Always place CXL Memory Regions after device_memory

> ---
>  hw/i386/pc.c | 28 +++++++++++++++++++---------
>  1 file changed, 19 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 0abbf81841a9..8655cc3b8894 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -825,6 +825,24 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
>      return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>  }
>  
> +static uint64_t pc_get_cxl_range_start(PCMachineState *pcms)
> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    MachineState *machine = MACHINE(pcms);
> +    hwaddr cxl_base;
> +
> +    if (pcmc->has_reserved_memory && machine->device_memory->base) {
> +        cxl_base = machine->device_memory->base;
> +        if (!pcmc->broken_reserved_end) {
> +            cxl_base += memory_region_size(&machine->device_memory->mr);
> +        }
> +    } else {
> +        cxl_base = pc_above_4g_end(pcms);
> +    }
> +
> +    return cxl_base;
> +}
> +
>  static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>  {
>      uint64_t start = 0;
> @@ -946,15 +964,7 @@ void pc_memory_init(PCMachineState *pcms,
>          MemoryRegion *mr = &pcms->cxl_devices_state.host_mr;
>          hwaddr cxl_size = MiB;
>  
> -        if (pcmc->has_reserved_memory && machine->device_memory->base) {
> -            cxl_base = machine->device_memory->base;
> -            if (!pcmc->broken_reserved_end) {
> -                cxl_base += memory_region_size(&machine->device_memory->mr);
> -            }
> -        } else {
> -            cxl_base = pc_above_4g_end(pcms);
> -        }
> -
> +        cxl_base = pc_get_cxl_range_start(pcms);
>          e820_add_entry(cxl_base, cxl_size, E820_RESERVED);
>          memory_region_init(mr, OBJECT(machine), "cxl_host_reg", cxl_size);
>          memory_region_add_subregion(system_memory, cxl_base, mr);
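
A standalone sketch of the base-address selection factored out in this patch may help when comparing it against the rebase target. The `CxlStartInputsSketch` struct and the function name are hypothetical; the branch structure follows the quoted helper as posted in v6, not whatever the rebased version ends up looking like.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened inputs for the CXL base calculation. */
typedef struct CxlStartInputsSketch {
    bool has_reserved_memory;
    bool broken_reserved_end;
    uint64_t device_memory_base;   /* 0 when device memory is not set up */
    uint64_t device_memory_size;
    uint64_t above_4g_end;
} CxlStartInputsSketch;

/*
 * CXL starts after the device memory region when there is one (unless the
 * legacy broken_reserved_end quirk applies), otherwise right after the
 * memory placed above 4G.
 */
uint64_t sketch_cxl_range_start(const CxlStartInputsSketch *in)
{
    uint64_t cxl_base;

    assert(in != NULL);
    if (in->has_reserved_memory && in->device_memory_base) {
        cxl_base = in->device_memory_base;
        if (!in->broken_reserved_end) {
            cxl_base += in->device_memory_size;
        }
    } else {
        cxl_base = in->above_4g_end;
    }
    return cxl_base;
}
```

With a device memory region at 7 GiB of size 12 GiB, the sketch returns 19 GiB. With the broken_reserved_end quirk set it returns 7 GiB.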




* Re: [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end()
  2022-07-01 16:10 ` [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end() Joao Martins
@ 2022-07-07 13:05   ` Igor Mammedov
  2022-07-07 15:21     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-07 13:05 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On Fri,  1 Jul 2022 17:10:11 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> This in preparation to allow pc_pci_hole64_start() to be called early
> in pc_memory_init(), handle CXL memory region end when its underlying
> memory region isn't yet initialized.
> 
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 8655cc3b8894..d6dff71012ab 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -857,6 +857,19 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>                  start = fw->mr.addr + memory_region_size(&fw->mr);
>              }
>          }
> +    } else {


> +        hwaddr cxl_size = MiB;
> +
> +        start = pc_get_cxl_range_start(pcms);
> +        if (pcms->cxl_devices_state.fixed_windows) {
> +            GList *it;
> +
> +            start = ROUND_UP(start + cxl_size, 256 * MiB);
> +            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
> +                CXLFixedWindow *fw = it->data;
> +                start += fw->size;
> +            }
> +        }

/me wonders if this can replace the block above, which supposedly does
the same thing but only using initialized CXL memory regions?

>      }
>  
>      return start;




* Re: [PATCH v6 08/10] i386/pc: factor out device_memory base/size to helper
  2022-07-01 16:10 ` [PATCH v6 08/10] i386/pc: factor out device_memory base/size to helper Joao Martins
@ 2022-07-07 13:15   ` Igor Mammedov
  2022-07-07 15:23     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-07 13:15 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On Fri,  1 Jul 2022 17:10:12 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> Move obtaining hole64_start from device_memory MR base/size into an helper
> alongside correspondent getters in pc_memory_init() when the hotplug
> range is unitialized.
> 
> This is the final step that allows pc_pci_hole64_start() to be callable
> at the beginning of pc_memory_init() before any MRs are initialized.
> 
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 55 +++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 41 insertions(+), 14 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index d6dff71012ab..a79fa1b6beeb 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -825,16 +825,48 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
>      return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>  }
>  
> +static void pc_get_device_memory_range(PCMachineState *pcms,
> +                                       hwaddr *base,
> +                                       ram_addr_t *device_mem_size)
> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    MachineState *machine = MACHINE(pcms);
> +    ram_addr_t size;
> +    hwaddr addr;
> +

> +    if (pcmc->has_reserved_memory &&
> +        machine->device_memory && machine->device_memory->base) {
> +        *base = machine->device_memory->base;
> +        *device_mem_size = memory_region_size(&machine->device_memory->mr);
> +        return;
> +    }
Is this block really needed?
(i.e. shouldn't the block below always yield the same result
as the block above?)

> +
> +    /* handles uninitialized @device_memory MR */
> +    size = machine->maxram_size - machine->ram_size;
> +    addr = ROUND_UP(pc_above_4g_end(pcms), 1 * GiB);
> +
> +    if (pcmc->enforce_aligned_dimm) {
> +        /* size device region assuming 1G page max alignment per slot */
> +        size += (1 * GiB) * machine->ram_slots;
> +    }
> +
> +    *base = addr;
> +    *device_mem_size = size;
> +}
> +
> +
>  static uint64_t pc_get_cxl_range_start(PCMachineState *pcms)
>  {
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      MachineState *machine = MACHINE(pcms);
>      hwaddr cxl_base;
> +    ram_addr_t size;
>  
> -    if (pcmc->has_reserved_memory && machine->device_memory->base) {
> -        cxl_base = machine->device_memory->base;
> +    if (pcmc->has_reserved_memory &&
> +        machine->device_memory && machine->device_memory->base) {
> +        pc_get_device_memory_range(pcms, &cxl_base, &size);
>          if (!pcmc->broken_reserved_end) {
> -            cxl_base += memory_region_size(&machine->device_memory->mr);
> +            cxl_base += size;
>          }
>      } else {
>          cxl_base = pc_above_4g_end(pcms);
> @@ -937,7 +969,7 @@ void pc_memory_init(PCMachineState *pcms,
>      /* initialize device memory address space */
>      if (pcmc->has_reserved_memory &&
>          (machine->ram_size < machine->maxram_size)) {
> -        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
> +        ram_addr_t device_mem_size;
>  
>          if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
>              error_report("unsupported amount of memory slots: %"PRIu64,
> @@ -952,13 +984,7 @@ void pc_memory_init(PCMachineState *pcms,
>              exit(EXIT_FAILURE);
>          }
>  
> -        machine->device_memory->base =
> -            ROUND_UP(pc_above_4g_end(pcms), 1 * GiB);
> -
> -        if (pcmc->enforce_aligned_dimm) {
> -            /* size device region assuming 1G page max alignment per slot */
> -            device_mem_size += (1 * GiB) * machine->ram_slots;
> -        }
> +        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);
>  
>          if ((machine->device_memory->base + device_mem_size) <
>              device_mem_size) {
> @@ -1063,13 +1089,14 @@ uint64_t pc_pci_hole64_start(void)
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      MachineState *ms = MACHINE(pcms);
>      uint64_t hole64_start = 0;
> +    ram_addr_t size = 0;
>  
>      if (pcms->cxl_devices_state.is_enabled) {
>          hole64_start = pc_get_cxl_range_end(pcms);
> -    } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
> -        hole64_start = ms->device_memory->base;
> +    } else if (pcmc->has_reserved_memory && (ms->ram_size < ms->maxram_size)) {
> +        pc_get_device_memory_range(pcms, &hole64_start, &size);
>          if (!pcmc->broken_reserved_end) {
> -            hole64_start += memory_region_size(&ms->device_memory->mr);
> +            hole64_start += size;
>          }
>      } else {
>          hole64_start = pc_above_4g_end(pcms);
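
For reference, the size-only calculation that the question above is about can be sketched without any QEMU types. The `DeviceMemInputsSketch` struct and the `sketch_*` names are hypothetical; the arithmetic follows the "handles uninitialized @device_memory MR" part of the quoted helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_GIB (1ULL << 30)

/* Hypothetical flattened inputs; no memory region needs to exist yet. */
typedef struct DeviceMemInputsSketch {
    uint64_t above_4g_end;
    uint64_t ram_size;
    uint64_t maxram_size;
    uint64_t ram_slots;
    bool enforce_aligned_dimm;
} DeviceMemInputsSketch;

/* Round n up to a power-of-two alignment. */
uint64_t sketch_round_up(uint64_t n, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Base and size of the hotplug (device memory) range from machine config. */
void sketch_device_memory_range(const DeviceMemInputsSketch *in,
                                uint64_t *base, uint64_t *size)
{
    uint64_t sz;

    assert(in->maxram_size >= in->ram_size);
    sz = in->maxram_size - in->ram_size;
    if (in->enforce_aligned_dimm) {
        /* leave room for 1 GiB alignment of every slot */
        sz += SKETCH_GIB * in->ram_slots;
    }

    *base = sketch_round_up(in->above_4g_end, SKETCH_GIB);
    *size = sz;
}
```

Since these inputs are the same ones pc_memory_init() uses to set up the region, the early-return branch looks redundant as long as nothing else changes the region's base or size afterwards. That last condition is an assumption to check, not something this sketch proves.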




* Re: [PATCH v6 04/10] i386/pc: factor out above-4g end to an helper
  2022-07-07 12:42   ` Igor Mammedov
@ 2022-07-07 15:14     ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-07 15:14 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit



On 7/7/22 13:42, Igor Mammedov wrote:
> On Fri,  1 Jul 2022 17:10:08 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> There's a couple of places that seem to duplicate this calculation
>> of RAM size above the 4G boundary. Move all those to a helper function.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
> 
Thanks!



* Re: [PATCH v6 05/10] i386/pc: factor out cxl range end to helper
  2022-07-07 12:57   ` Igor Mammedov
@ 2022-07-07 15:17     ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-07 15:17 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On 7/7/22 13:57, Igor Mammedov wrote:
> On Fri,  1 Jul 2022 17:10:09 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> Move calculation of CXL memory region end to separate helper in
>> preparation to allow pc_pci_hole64_start() to be called before
>> any mrs are initialized.
> s/mrs/memory regions/
> 
Will fix.

> 
> 
>>
>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c | 31 +++++++++++++++++++++----------
>>  1 file changed, 21 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 6c7c49ca5a32..0abbf81841a9 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -825,6 +825,25 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
>>      return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>>  }
>>  
>> +static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>> +{
>> +    uint64_t start = 0;
>> +
>> +    if (pcms->cxl_devices_state.host_mr.addr) {
>> +        start = pcms->cxl_devices_state.host_mr.addr +
>> +            memory_region_size(&pcms->cxl_devices_state.host_mr);
>> +        if (pcms->cxl_devices_state.fixed_windows) {
>> +            GList *it;
>> +            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
>> +                CXLFixedWindow *fw = it->data;
>> +                start = fw->mr.addr + memory_region_size(&fw->mr);
>> +            }
> 
> This block deals with 'initialized' memory regions,
> so the claim 'before any mrs are initialized' in the commit message is
> confusing at best, or outright wrong.
> 

But the commit message is pretty clear about its purpose:

"Move calculation of CXL memory region end to separate helper"

It then justifies why the helper is being added, i.e. in preparation
for a patch that comes later. I am not implying at all
that this patch deals with uninitialized MRs.

Anyhow, I can drop the part after 'in preparation', or just drop the
mention of uninitialized MRs if it confuses folks.

>> +        }
>> +    }
>> +
>> +    return start;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -1022,16 +1041,8 @@ uint64_t pc_pci_hole64_start(void)
>>      MachineState *ms = MACHINE(pcms);
>>      uint64_t hole64_start = 0;
>>  
>> -    if (pcms->cxl_devices_state.host_mr.addr) {
>> -        hole64_start = pcms->cxl_devices_state.host_mr.addr +
>> -            memory_region_size(&pcms->cxl_devices_state.host_mr);
>> -        if (pcms->cxl_devices_state.fixed_windows) {
>> -            GList *it;
>> -            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
>> -                CXLFixedWindow *fw = it->data;
>> -                hole64_start = fw->mr.addr + memory_region_size(&fw->mr);
>> -            }
>> -        }
>> +    if (pcms->cxl_devices_state.is_enabled) {
>> +        hole64_start = pc_get_cxl_range_end(pcms);
>>      } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
>>          hole64_start = ms->device_memory->base;
>>          if (!pcmc->broken_reserved_end) {
> 



* Re: [PATCH v6 06/10] i386/pc: factor out cxl range start to helper
  2022-07-07 13:00   ` Igor Mammedov
@ 2022-07-07 15:18     ` Joao Martins
  2022-07-11 12:47       ` Igor Mammedov
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-07 15:18 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron



On 7/7/22 14:00, Igor Mammedov wrote:
> On Fri,  1 Jul 2022 17:10:10 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> Factor out the calculation of the base address of the MR. It will be
>> used later on for the cxl range end counterpart calculation and as
>> well in pc_memory_init() CXL mr initialization, thus avoiding
>> duplication.
>>
>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> needs to be rebased on top of 
> 
> 
> [PATCH 2/3] hw/i386/pc: Always place CXL Memory Regions after device_memory
> 
Is Michael merging these or should I just respin v7 with the assumption
that these patches are there?

I can't see anything in his tree yet.

>> ---
>>  hw/i386/pc.c | 28 +++++++++++++++++++---------
>>  1 file changed, 19 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 0abbf81841a9..8655cc3b8894 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -825,6 +825,24 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
>>      return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>>  }
>>  
>> +static uint64_t pc_get_cxl_range_start(PCMachineState *pcms)
>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    MachineState *machine = MACHINE(pcms);
>> +    hwaddr cxl_base;
>> +
>> +    if (pcmc->has_reserved_memory && machine->device_memory->base) {
>> +        cxl_base = machine->device_memory->base;
>> +        if (!pcmc->broken_reserved_end) {
>> +            cxl_base += memory_region_size(&machine->device_memory->mr);
>> +        }
>> +    } else {
>> +        cxl_base = pc_above_4g_end(pcms);
>> +    }
>> +
>> +    return cxl_base;
>> +}
>> +
>>  static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>  {
>>      uint64_t start = 0;
>> @@ -946,15 +964,7 @@ void pc_memory_init(PCMachineState *pcms,
>>          MemoryRegion *mr = &pcms->cxl_devices_state.host_mr;
>>          hwaddr cxl_size = MiB;
>>  
>> -        if (pcmc->has_reserved_memory && machine->device_memory->base) {
>> -            cxl_base = machine->device_memory->base;
>> -            if (!pcmc->broken_reserved_end) {
>> -                cxl_base += memory_region_size(&machine->device_memory->mr);
>> -            }
>> -        } else {
>> -            cxl_base = pc_above_4g_end(pcms);
>> -        }
>> -
>> +        cxl_base = pc_get_cxl_range_start(pcms);
>>          e820_add_entry(cxl_base, cxl_size, E820_RESERVED);
>>          memory_region_init(mr, OBJECT(machine), "cxl_host_reg", cxl_size);
>>          memory_region_add_subregion(system_memory, cxl_base, mr);
> 



* Re: [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end()
  2022-07-07 13:05   ` Igor Mammedov
@ 2022-07-07 15:21     ` Joao Martins
  2022-07-11 12:58       ` Igor Mammedov
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-07 15:21 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron



On 7/7/22 14:05, Igor Mammedov wrote:
> On Fri,  1 Jul 2022 17:10:11 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> This in preparation to allow pc_pci_hole64_start() to be called early
>> in pc_memory_init(), handle CXL memory region end when its underlying
>> memory region isn't yet initialized.
>>
>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 8655cc3b8894..d6dff71012ab 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -857,6 +857,19 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>                  start = fw->mr.addr + memory_region_size(&fw->mr);
>>              }
>>          }
>> +    } else {
> 
> 
>> +        hwaddr cxl_size = MiB;
>> +
>> +        start = pc_get_cxl_range_start(pcms);
>> +        if (pcms->cxl_devices_state.fixed_windows) {
>> +            GList *it;
>> +
>> +            start = ROUND_UP(start + cxl_size, 256 * MiB);
>> +            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
>> +                CXLFixedWindow *fw = it->data;
>> +                start += fw->size;
>> +            }
>> +        }
> 
> /me wonders if this can replace the block above, which supposedly does
> the same thing but only using initialized CXL memory regions?
> 

I was thinking the same thing while writing it.

If the calculation returns the same values, I might as well just replace it
rather than branch into similar logic.

I can do that in v7.
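
As a quick way to check whether the two branches agree, the size-only branch quoted above can be written as a standalone function. The function and macro names are hypothetical, and the window sizes are passed as a plain array instead of the GList used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MIB (1ULL << 20)

/* Round n up to a power-of-two alignment. */
uint64_t sketch_round_up(uint64_t n, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/*
 * End of the CXL range computed from configured sizes only, following the
 * quoted else-branch: skip the 1 MiB host register window, align to
 * 256 MiB, then stack the fixed windows back to back.
 */
uint64_t sketch_cxl_range_end_from_sizes(uint64_t cxl_start,
                                         const uint64_t *window_sizes,
                                         size_t nr_windows)
{
    const uint64_t cxl_size = SKETCH_MIB;
    uint64_t end = cxl_start;

    if (nr_windows > 0) {
        end = sketch_round_up(cxl_start + cxl_size, 256 * SKETCH_MIB);
        for (size_t i = 0; i < nr_windows; i++) {
            end += window_sizes[i];
        }
    }
    return end;
}
```

One thing this makes visible: as quoted, the size-only branch returns the CXL start unchanged when there are no fixed windows, while the initialized branch returns the host register region's address plus its size (1 MiB more). The two would need to agree on that case before one can replace the other.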



* Re: [PATCH v6 08/10] i386/pc: factor out device_memory base/size to helper
  2022-07-07 13:15   ` Igor Mammedov
@ 2022-07-07 15:23     ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-07 15:23 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron



On 7/7/22 14:15, Igor Mammedov wrote:
> On Fri,  1 Jul 2022 17:10:12 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> Move obtaining hole64_start from device_memory MR base/size into an helper
>> alongside correspondent getters in pc_memory_init() when the hotplug
>> range is unitialized.
>>
>> This is the final step that allows pc_pci_hole64_start() to be callable
>> at the beginning of pc_memory_init() before any MRs are initialized.
>>
>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c | 55 +++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 41 insertions(+), 14 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index d6dff71012ab..a79fa1b6beeb 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -825,16 +825,48 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
>>      return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>>  }
>>  
>> +static void pc_get_device_memory_range(PCMachineState *pcms,
>> +                                       hwaddr *base,
>> +                                       ram_addr_t *device_mem_size)
>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    MachineState *machine = MACHINE(pcms);
>> +    ram_addr_t size;
>> +    hwaddr addr;
>> +
> 
>> +    if (pcmc->has_reserved_memory &&
>> +        machine->device_memory && machine->device_memory->base) {
>> +        *base = machine->device_memory->base;
>> +        *device_mem_size = memory_region_size(&machine->device_memory->mr);
>> +        return;
>> +    }
> Is this block really needed?
> (i.e. shouldn't the block below always yield the same result
> as the block above?)
> 
Similar to the earlier patch, I agree with you: it returns the same thing.

Might as well delete this block rather than keep two blocks returning
the same thing. I'll do that in v7.

>> +
>> +    /* handles uninitialized @device_memory MR */
>> +    size = machine->maxram_size - machine->ram_size;
>> +    addr = ROUND_UP(pc_above_4g_end(pcms), 1 * GiB);
>> +
>> +    if (pcmc->enforce_aligned_dimm) {
>> +        /* size device region assuming 1G page max alignment per slot */
>> +        size += (1 * GiB) * machine->ram_slots;
>> +    }
>> +
>> +    *base = addr;
>> +    *device_mem_size = size;
>> +}
>> +
>> +
>>  static uint64_t pc_get_cxl_range_start(PCMachineState *pcms)
>>  {
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      MachineState *machine = MACHINE(pcms);
>>      hwaddr cxl_base;
>> +    ram_addr_t size;
>>  
>> -    if (pcmc->has_reserved_memory && machine->device_memory->base) {
>> -        cxl_base = machine->device_memory->base;
>> +    if (pcmc->has_reserved_memory &&
>> +        machine->device_memory && machine->device_memory->base) {
>> +        pc_get_device_memory_range(pcms, &cxl_base, &size);
>>          if (!pcmc->broken_reserved_end) {
>> -            cxl_base += memory_region_size(&machine->device_memory->mr);
>> +            cxl_base += size;
>>          }
>>      } else {
>>          cxl_base = pc_above_4g_end(pcms);
>> @@ -937,7 +969,7 @@ void pc_memory_init(PCMachineState *pcms,
>>      /* initialize device memory address space */
>>      if (pcmc->has_reserved_memory &&
>>          (machine->ram_size < machine->maxram_size)) {
>> -        ram_addr_t device_mem_size = machine->maxram_size - machine->ram_size;
>> +        ram_addr_t device_mem_size;
>>  
>>          if (machine->ram_slots > ACPI_MAX_RAM_SLOTS) {
>>              error_report("unsupported amount of memory slots: %"PRIu64,
>> @@ -952,13 +984,7 @@ void pc_memory_init(PCMachineState *pcms,
>>              exit(EXIT_FAILURE);
>>          }
>>  
>> -        machine->device_memory->base =
>> -            ROUND_UP(pc_above_4g_end(pcms), 1 * GiB);
>> -
>> -        if (pcmc->enforce_aligned_dimm) {
>> -            /* size device region assuming 1G page max alignment per slot */
>> -            device_mem_size += (1 * GiB) * machine->ram_slots;
>> -        }
>> +        pc_get_device_memory_range(pcms, &machine->device_memory->base, &device_mem_size);
>>  
>>          if ((machine->device_memory->base + device_mem_size) <
>>              device_mem_size) {
>> @@ -1063,13 +1089,14 @@ uint64_t pc_pci_hole64_start(void)
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      MachineState *ms = MACHINE(pcms);
>>      uint64_t hole64_start = 0;
>> +    ram_addr_t size = 0;
>>  
>>      if (pcms->cxl_devices_state.is_enabled) {
>>          hole64_start = pc_get_cxl_range_end(pcms);
>> -    } else if (pcmc->has_reserved_memory && ms->device_memory->base) {
>> -        hole64_start = ms->device_memory->base;
>> +    } else if (pcmc->has_reserved_memory && (ms->ram_size < ms->maxram_size)) {
>> +        pc_get_device_memory_range(pcms, &hole64_start, &size);
>>          if (!pcmc->broken_reserved_end) {
>> -            hole64_start += memory_region_size(&ms->device_memory->mr);
>> +            hole64_start += size;
>>          }
>>      } else {
>>          hole64_start = pc_above_4g_end(pcms);
> 



* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-01 16:10 ` [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable Joao Martins
@ 2022-07-07 15:53   ` Joao Martins
  2022-07-11 12:56   ` Igor Mammedov
  1 sibling, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-07 15:53 UTC (permalink / raw)
  To: Igor Mammedov, Alex Williamson
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Dr. David Alan Gilbert, Suravee Suthikulpanit, qemu-devel

On 7/1/22 17:10, Joao Martins wrote:
> +    /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env)) {
> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> +            cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }
> +

[As part of the discussion with Alex on the previous version, there's this other case
where VMs with less than 1T of memory but with enough GPUs (say each having 40G, as an
example) can have PCI devices placed within the reserved HT region.]

Changing fwcfg 'reserved-memory-end' to 1T (provided that phys_bits is correctly
configured) without triggering the above-4g relocation ... fixes the case above, as
'reserved-memory-end' is ultimately what virtual firmware (SeaBIOS and OVMF) uses for
the hole64 start. Though, I am torn on whether to include this. That is, is this the
VMM working around a fw bug[*] even after e820 is described accurately, or is this the
right thing to do in the VMM?

Part of the reason I haven't done this is that the issue doesn't happen if the VMM user
describes a correct pci-hole64-size in q35/pc that's big enough to cover all VFIO
devices (which is ultimately correct). Thoughts?

[*] as it should look at *all* reserved ranges including those above ram.
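
To make the two conditions above concrete, here is a small standalone sketch.
The AMD_HT_* constants are copied from the patch (with a ULL suffix for
portability); should_reserve_ht_range() and range_overlaps_ht() are
hypothetical helper names used only for illustration, they are not part of
the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define AMD_HT_START        0xfd00000000ULL
#define AMD_HT_END          0xffffffffffULL
#define AMD_ABOVE_1TB_START (AMD_HT_END + 1)
#define AMD_HT_SIZE         (AMD_ABOVE_1TB_START - AMD_HT_START)

/*
 * Mirrors the condition in the quoted hunk: advertise the HT range as
 * reserved when RAM was relocated above 1T, or when the guest address
 * space is wide enough to reach the range at all.
 */
static bool should_reserve_ht_range(uint64_t above_4g_mem_start,
                                    unsigned phys_bits)
{
    assert(phys_bits > 0 && phys_bits <= 64);
    return above_4g_mem_start == AMD_ABOVE_1TB_START || phys_bits >= 40;
}

/*
 * Hypothetical check for the GPU case: does [start, start + size)
 * intersect the HT range? Assumes start + size does not wrap around.
 */
static bool range_overlaps_ht(uint64_t start, uint64_t size)
{
    uint64_t last;

    if (size == 0) {
        return false;
    }
    last = start + size - 1;
    assert(last >= start);
    return start <= AMD_HT_END && last >= AMD_HT_START;
}
```

With these definitions AMD_HT_SIZE works out to 12G, and a 64-bit PCI hole
that starts below 1012G but is large enough to extend past it is exactly the
kind of range the second helper flags.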

>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +

Spurious newline here that I will fix in v7.

>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",



* Re: [PATCH v6 03/10] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-07-01 16:10 ` [PATCH v6 03/10] i386/pc: pass pci_hole64_size " Joao Martins
@ 2022-07-09 20:51   ` B
  2022-07-11 10:01     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: B @ 2022-07-09 20:51 UTC (permalink / raw)
  To: qemu-devel, Joao Martins
  Cc: Igor Mammedov, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit



On 1 July 2022 16:10:07 UTC, Joao Martins <joao.m.martins@oracle.com> wrote:
>Use the pre-initialized pci-host qdev and fetch the
>pci-hole64-size into pc_memory_init() newly added argument.
>piix needs a bit of care given all the !pci_enabled()
>and that the pci_hole64_size is private to i440fx.

It exposes this value as the property PCI_HOST_PROP_PCI_HOLE64_SIZE. Reusing it allows for not touching i440fx in this patch at all.

For code symmetry reasons the analogous property could be used for Q35 as well.

Best regards,
Bernhard

>
>This is in preparation to determine that host-phys-bits are
>enough and for pci-hole64-size to be considered to relocate
>ram-above-4g to be at 1T (on AMD platforms).
>
>Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>Reviewed-by: Igor Mammedov <imammedo@redhat.com>
>---
> hw/i386/pc.c                 | 3 ++-
> hw/i386/pc_piix.c            | 5 ++++-
> hw/i386/pc_q35.c             | 8 +++++++-
> hw/pci-host/i440fx.c         | 7 +++++++
> include/hw/i386/pc.h         | 3 ++-
> include/hw/pci-host/i440fx.h | 1 +
> 6 files changed, 23 insertions(+), 4 deletions(-)
>
>diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>index a9d1bf95649a..1bb89a9c17ec 100644
>--- a/hw/i386/pc.c
>+++ b/hw/i386/pc.c
>@@ -817,7 +817,8 @@ void xen_load_linux(PCMachineState *pcms)
> void pc_memory_init(PCMachineState *pcms,
>                     MemoryRegion *system_memory,
>                     MemoryRegion *rom_memory,
>-                    MemoryRegion **ram_memory)
>+                    MemoryRegion **ram_memory,
>+                    uint64_t pci_hole64_size)
> {
>     int linux_boot, i;
>     MemoryRegion *option_rom_mr;
>diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>index 6186a1473755..f3c726e42400 100644
>--- a/hw/i386/pc_piix.c
>+++ b/hw/i386/pc_piix.c
>@@ -91,6 +91,7 @@ static void pc_init1(MachineState *machine,
>     MemoryRegion *pci_memory;
>     MemoryRegion *rom_memory;
>     ram_addr_t lowmem;
>+    uint64_t hole64_size;
>     DeviceState *i440fx_host;
> 
>     /*
>@@ -166,10 +167,12 @@ static void pc_init1(MachineState *machine,
>         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>         rom_memory = pci_memory;
>         i440fx_host = qdev_new(host_type);
>+        hole64_size = i440fx_pci_hole64_size(i440fx_host);
>     } else {
>         pci_memory = NULL;
>         rom_memory = system_memory;
>         i440fx_host = NULL;
>+        hole64_size = 0;
>     }
> 
>     pc_guest_info_init(pcms);
>@@ -186,7 +189,7 @@ static void pc_init1(MachineState *machine,
>     /* allocate ram and load rom/bios */
>     if (!xen_enabled()) {
>         pc_memory_init(pcms, system_memory,
>-                       rom_memory, &ram_memory);
>+                       rom_memory, &ram_memory, hole64_size);
>     } else {
>         pc_system_flash_cleanup_unused(pcms);
>         if (machine->kernel_filename != NULL) {
>diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>index 46ea89e564de..5a4a737fe203 100644
>--- a/hw/i386/pc_q35.c
>+++ b/hw/i386/pc_q35.c
>@@ -138,6 +138,7 @@ static void pc_q35_init(MachineState *machine)
>     MachineClass *mc = MACHINE_GET_CLASS(machine);
>     bool acpi_pcihp;
>     bool keep_pci_slot_hpc;
>+    uint64_t pci_hole64_size = 0;
> 
>     /* Check whether RAM fits below 4G (leaving 1/2 GByte for IO memory
>      * and 256 Mbytes for PCI Express Enhanced Configuration Access Mapping
>@@ -206,8 +207,13 @@ static void pc_q35_init(MachineState *machine)
>     /* create pci host bus */
>     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
> 
>+    if (pcmc->pci_enabled) {
>+        pci_hole64_size = q35_host->mch.pci_hole64_size;
>+    }
>+
>     /* allocate ram and load rom/bios */
>-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
>+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
>+                   pci_hole64_size);
> 
>     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>index d5426ef4a53c..15680da7d709 100644
>--- a/hw/pci-host/i440fx.c
>+++ b/hw/pci-host/i440fx.c
>@@ -237,6 +237,13 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>     }
> }
> 
>+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
>+{
>+        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
>+
>+        return i440fx->pci_hole64_size;
>+}
>+
> PCIBus *i440fx_init(const char *pci_type,
>                     DeviceState *dev,
>                     MemoryRegion *address_space_mem,
>diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>index b7735dccfc81..568c226d3034 100644
>--- a/include/hw/i386/pc.h
>+++ b/include/hw/i386/pc.h
>@@ -159,7 +159,8 @@ void xen_load_linux(PCMachineState *pcms);
> void pc_memory_init(PCMachineState *pcms,
>                     MemoryRegion *system_memory,
>                     MemoryRegion *rom_memory,
>-                    MemoryRegion **ram_memory);
>+                    MemoryRegion **ram_memory,
>+                    uint64_t pci_hole64_size);
> uint64_t pc_pci_hole64_start(void);
> DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
> void pc_basic_device_init(struct PCMachineState *pcms,
>diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
>index d02bf1ed6b93..2234dd5a2a6a 100644
>--- a/include/hw/pci-host/i440fx.h
>+++ b/include/hw/pci-host/i440fx.h
>@@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *pci_type,
>                     MemoryRegion *pci_memory,
>                     MemoryRegion *ram_memory);
> 
>+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
> 
> #endif



* Re: [PATCH v6 03/10] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-07-09 20:51   ` B
@ 2022-07-11 10:01     ` Joao Martins
  2022-07-11 22:17       ` B
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-11 10:01 UTC (permalink / raw)
  To: B, Igor Mammedov
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Dr. David Alan Gilbert, Suravee Suthikulpanit, qemu-devel

On 7/9/22 21:51, B wrote:
> On 1 July 2022 16:10:07 UTC, Joao Martins <joao.m.martins@oracle.com> wrote:
>> Use the pre-initialized pci-host qdev and fetch the
>> pci-hole64-size into pc_memory_init() newly added argument.
>> piix needs a bit of care given all the !pci_enabled()
>> and that the pci_hole64_size is private to i440fx.
> 
> It exposes this value as the property PCI_HOST_PROP_PCI_HOLE64_SIZE. 

Indeed.

> Reusing it allows for not touching i440fx in this patch at all.
> 
> For code symmetry reasons the analogue property could be used for Q35 as well.
> 
Presumably you want me to change it to the below while deleting i440fx_pci_hole64_size()
from this patch (see snip below). IMHO, I have to say that in q35 the code symmetry
doesn't buy much readability here, although it does remove any need for that other
helper in i440fx.

@Igor let me know if you agree with the change and whether I can keep the Reviewed-by.

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 504ddd0deece..cc0855066d06 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -167,7 +167,9 @@ static void pc_init1(MachineState *machine,
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
         i440fx_host = qdev_new(host_type);
-        hole64_size = i440fx_pci_hole64_size(i440fx_host);
+        hole64_size = object_property_get_uint(OBJECT(i440fx_host),
+                                               PCI_HOST_PROP_PCI_HOLE64_SIZE,
+                                               &error_abort);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 4b747c59c19a..466f3ef3c918 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -208,7 +208,9 @@ static void pc_q35_init(MachineState *machine)
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));

     if (pcmc->pci_enabled) {
-        pci_hole64_size = q35_host->mch.pci_hole64_size;
+        pci_hole64_size = object_property_get_uint(OBJECT(q35_host),
+                                                   PCI_HOST_PROP_PCI_HOLE64_SIZE,
+                                                   &error_abort);
     }

     /* allocate ram and load rom/bios */
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index 15680da7d709..d5426ef4a53c 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -237,13 +237,6 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
     }
 }

-uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
-{
-        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
-
-        return i440fx->pci_hole64_size;
-}
-
 PCIBus *i440fx_init(const char *pci_type,
                     DeviceState *dev,
                     MemoryRegion *address_space_mem,



* Re: [PATCH v6 06/10] i386/pc: factor out cxl range start to helper
  2022-07-07 15:18     ` Joao Martins
@ 2022-07-11 12:47       ` Igor Mammedov
  2022-07-11 14:28         ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-11 12:47 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On Thu, 7 Jul 2022 16:18:43 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/7/22 14:00, Igor Mammedov wrote:
> > On Fri,  1 Jul 2022 17:10:10 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> Factor out the calculation of the base address of the MR. It will be
> >> used later on for the cxl range end counterpart calculation and as
> >> well in pc_memory_init() CXL mr initialization, thus avoiding
> >> duplication.
> >>
> >> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>  
> > 
> > needs to be rebased on top of 
> > 
> > 
> > [PATCH 2/3] hw/i386/pc: Always place CXL Memory Regions after device_memory
> >   
> Is Michael merging these or should I just respin v7 with the assumption
> that these patches are there?

I'd do the latter (just mention the dependency in the cover letter)
 
> I can't see anything in his tree yet.
> 
> >> ---
> >>  hw/i386/pc.c | 28 +++++++++++++++++++---------
> >>  1 file changed, 19 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index 0abbf81841a9..8655cc3b8894 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -825,6 +825,24 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
> >>      return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
> >>  }
> >>  
> >> +static uint64_t pc_get_cxl_range_start(PCMachineState *pcms)
> >> +{
> >> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >> +    MachineState *machine = MACHINE(pcms);
> >> +    hwaddr cxl_base;
> >> +
> >> +    if (pcmc->has_reserved_memory && machine->device_memory->base) {
> >> +        cxl_base = machine->device_memory->base;
> >> +        if (!pcmc->broken_reserved_end) {
> >> +            cxl_base += memory_region_size(&machine->device_memory->mr);
> >> +        }
> >> +    } else {
> >> +        cxl_base = pc_above_4g_end(pcms);
> >> +    }
> >> +
> >> +    return cxl_base;
> >> +}
> >> +
> >>  static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
> >>  {
> >>      uint64_t start = 0;
> >> @@ -946,15 +964,7 @@ void pc_memory_init(PCMachineState *pcms,
> >>          MemoryRegion *mr = &pcms->cxl_devices_state.host_mr;
> >>          hwaddr cxl_size = MiB;
> >>  
> >> -        if (pcmc->has_reserved_memory && machine->device_memory->base) {
> >> -            cxl_base = machine->device_memory->base;
> >> -            if (!pcmc->broken_reserved_end) {
> >> -                cxl_base += memory_region_size(&machine->device_memory->mr);
> >> -            }
> >> -        } else {
> >> -            cxl_base = pc_above_4g_end(pcms);
> >> -        }
> >> -
> >> +        cxl_base = pc_get_cxl_range_start(pcms);
> >>          e820_add_entry(cxl_base, cxl_size, E820_RESERVED);
> >>          memory_region_init(mr, OBJECT(machine), "cxl_host_reg", cxl_size);
> >>          memory_region_add_subregion(system_memory, cxl_base, mr);  
> >   
> 




* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-01 16:10 ` [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable Joao Martins
  2022-07-07 15:53   ` Joao Martins
@ 2022-07-11 12:56   ` Igor Mammedov
  2022-07-11 14:52     ` Joao Martins
  1 sibling, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-11 12:56 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On Fri,  1 Jul 2022 17:10:13 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, except for a
> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> [*] there is another reserved region unrelated to HT that exists
> in the 256T boundary in Fam 17h according to Errata #1286,
> documented also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done or the address space covers it, we
> also add the HT range to e820 as reserved.
> 
> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> ram-above-4g relocation may be desired and the CPU wasn't configured
> with a big enough phys-bits, print an error message to the user
> and do not make the relocation of the above-4g-region if phys-bits
> is too low.
> 
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 101 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index a79fa1b6beeb..07025b510540 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>      return start;
>  }
>  
> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
> +                                hwaddr above_4g_mem_start,
> +                                uint64_t pci_hole64_size)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +

> +    if (!x86ms->above_4g_mem_size) {
> +        /*
> +         * 32-bit pci hole goes from
> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> +          */
> +        return IO_APIC_DEFAULT_ADDRESS - 1;
> +    }
this hunk still bothers me (nothing changed wrt v5 issues around it)
issues recap: (
 1. correctness of it
 2. being limited to AMD only, while it seems pretty generic to me
 3. should be a separate patch
)

> +
> +    return pc_pci_hole64_start() + pci_hole64_size;
> +}
> +
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
> +                                          uint64_t pci_hole64_size)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    hwaddr start = x86ms->above_4g_mem_start;
> +    hwaddr maxphysaddr, maxusedaddr;
> +
> +    /* Bail out if max possible address does not cross HT range */
> +    if (pc_max_used_gpa(pcms, start, pci_hole64_size) < AMD_HT_START) {

move it to the caller?

> +        return;
> +    }
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    start = AMD_ABOVE_1TB_START;
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    maxusedaddr = pc_max_used_gpa(pcms, start, pci_hole64_size);
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    x86ms->above_4g_mem_start = start;
> +}
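
The relocation decision in the function above can be modelled in isolation.
This is a simplified sketch under stated assumptions: max_used_gpa() only
adds up ram-above-4g and the 64-bit PCI hole (the real pc_pci_hole64_start()
also accounts for device memory, CXL and alignment), and enum reloc_result
plus amd_relocate_above_4g() are hypothetical names. Where the patch calls
error_report() and exits, the sketch returns an error code instead.

```c
#include <assert.h>
#include <stdint.h>

#define GiB                     (1ULL << 30)
#define IO_APIC_DEFAULT_ADDRESS 0xfec00000ULL
#define AMD_HT_START            0xfd00000000ULL
#define AMD_ABOVE_1TB_START     0x10000000000ULL

enum reloc_result {
    RELOC_NOT_NEEDED,         /* nothing reaches the HT range */
    RELOC_DONE,               /* ram-above-4g now starts at 1T */
    RELOC_PHYS_BITS_TOO_LOW,  /* relocation needed but does not fit */
};

/* Simplified model: end of ram-above-4g plus the 64-bit PCI hole. */
static uint64_t max_used_gpa(uint64_t above_4g_start, uint64_t above_4g_size,
                             uint64_t pci_hole64_size)
{
    if (above_4g_size == 0) {
        /* only the 32-bit PCI hole below the IOAPIC is in use */
        return IO_APIC_DEFAULT_ADDRESS - 1;
    }
    return above_4g_start + above_4g_size + pci_hole64_size;
}

static enum reloc_result amd_relocate_above_4g(uint64_t *above_4g_start,
                                               uint64_t above_4g_size,
                                               uint64_t pci_hole64_size,
                                               unsigned phys_bits)
{
    uint64_t max_phys, max_used;

    assert(phys_bits >= 32 && phys_bits < 64);

    /* Bail out if the current layout never reaches the HT range. */
    if (max_used_gpa(*above_4g_start, above_4g_size,
                     pci_hole64_size) < AMD_HT_START) {
        return RELOC_NOT_NEEDED;
    }

    /* Check that the relocated layout fits within the address space. */
    max_phys = (1ULL << phys_bits) - 1;
    max_used = max_used_gpa(AMD_ABOVE_1TB_START, above_4g_size,
                            pci_hole64_size);
    if (max_phys < max_used) {
        return RELOC_PHYS_BITS_TOO_LOW;
    }

    *above_4g_start = AMD_ABOVE_1TB_START;
    return RELOC_DONE;
}
```

In this model, a guest with 1022G above 4G and a 32G hole is rejected with
phys-bits=40 and relocated with phys-bits=42; a guest with 16G above 4G is
left alone.
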
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -922,12 +1003,31 @@ void pc_memory_init(PCMachineState *pcms,
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      hwaddr cxl_base, cxl_resv_end = 0;
> +    X86CPU *cpu = X86_CPU(first_cpu);
>  
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
>  
>      linux_boot = (machine->kernel_filename != NULL);
>  
> +    /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env)) {
> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> +            cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }
> +
>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +

stray newline?

>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",




* Re: [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end()
  2022-07-07 15:21     ` Joao Martins
@ 2022-07-11 12:58       ` Igor Mammedov
  2022-07-11 14:32         ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-11 12:58 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On Thu, 7 Jul 2022 16:21:07 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/7/22 14:05, Igor Mammedov wrote:
> > On Fri,  1 Jul 2022 17:10:11 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> This is in preparation to allow pc_pci_hole64_start() to be called early
> >> in pc_memory_init(), handle CXL memory region end when its underlying
> >> memory region isn't yet initialized.
> >>
> >> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  hw/i386/pc.c | 13 +++++++++++++
> >>  1 file changed, 13 insertions(+)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index 8655cc3b8894..d6dff71012ab 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -857,6 +857,19 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
> >>                  start = fw->mr.addr + memory_region_size(&fw->mr);
> >>              }
> >>          }
> >> +    } else {  
> > 
> >   
> >> +        hwaddr cxl_size = MiB;
> >> +
> >> +        start = pc_get_cxl_range_start(pcms);
> >> +        if (pcms->cxl_devices_state.fixed_windows) {
> >> +            GList *it;
> >> +
> >> +            start = ROUND_UP(start + cxl_size, 256 * MiB);
> >> +            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
> >> +                CXLFixedWindow *fw = it->data;
> >> +                start += fw->size;
> >> +            }
> >> +        }  
> > 
> > /me wondering if this can replace block above that supposedly does
> > the same only using initialized cxl memory regions?
> >   
> 
> I was thinking about the same thing as of writing.
> 
> If the calculation returns the same values might as well just replace it
> as opposed to branching out similar logic.

Let's drop the unneeded code, so the reader won't have to wonder why
the same thing is done in 2 different ways.
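
A single size-based calculation along the lines discussed could look like the
sketch below. It is a standalone model only: cxl_range_end() is a
hypothetical name, and the fixed windows are passed as a plain array of sizes
rather than the GList of CXLFixedWindow used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MiB (1ULL << 20)
#define GiB (1ULL << 30)

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/*
 * End of the CXL range computed from sizes alone, following the quoted
 * hunk: with no fixed windows the range start is returned unchanged;
 * otherwise the windows are laid out back to back, starting at the first
 * 256M boundary after the 1M CXL host register region.
 */
static uint64_t cxl_range_end(uint64_t cxl_base,
                              const uint64_t *fw_sizes, size_t n_windows)
{
    const uint64_t cxl_size = MiB;
    uint64_t end = cxl_base;
    size_t i;

    if (n_windows > 0) {
        end = round_up_pow2(cxl_base + cxl_size, 256 * MiB);
        for (i = 0; i < n_windows; i++) {
            end += fw_sizes[i];
        }
    }
    return end;
}
```

Because it never dereferences a memory region, one function like this would
serve both the early caller and the post-initialization caller.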

> 
> I can do that in v7.
> 




* Re: [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-07-01 16:10 ` [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
  2022-07-04 14:27   ` Dr. David Alan Gilbert
@ 2022-07-11 13:03   ` Igor Mammedov
  2022-07-11 14:56     ` Joao Martins
  1 sibling, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-11 13:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On Fri,  1 Jul 2022 17:10:14 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> The added enforcing is only relevant in the case of AMD where the
> range right before the 1TB is restricted and cannot be DMA mapped
> by the kernel consequently leading to IOMMU INVALID_DEVICE_REQUEST
> or possibly other kinds of IOMMU events in the AMD IOMMU.
> 
> Although, there's a case where it may make sense to disable the
> IOVA relocation/validation when migrating from a
> non-valid-IOVA-aware qemu to one that supports it.
> 
> Relocating RAM regions to after the 1Tb hole has consequences for
> guest ABI because we are changing the memory mapping, so make
> sure that only new machine types enforce it, not older ones.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c         | 6 ++++--
>  hw/i386/pc_piix.c    | 2 ++
>  hw/i386/pc_q35.c     | 2 ++
>  include/hw/i386/pc.h | 1 +
>  4 files changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 07025b510540..f99e16a5db4b 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1013,9 +1013,10 @@ void pc_memory_init(PCMachineState *pcms,
>      /*
>       * The HyperTransport range close to the 1T boundary is unique to AMD
>       * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> -     * to above 1T to AMD vCPUs only.
> +     * to above 1T to AMD vCPUs only. @enforce_valid_iova is only false in
> +     * older machine types (<= 7.0) for compatibility purposes.
>       */
> -    if (IS_AMD_CPU(&cpu->env)) {
> +    if (IS_AMD_CPU(&cpu->env) && pcmc->enforce_valid_iova) {
>          pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>  
>          /*
> @@ -1950,6 +1951,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>      pcmc->has_reserved_memory = true;
>      pcmc->kvmclock_enabled = true;
>      pcmc->enforce_aligned_dimm = true;
> +    pcmc->enforce_valid_iova = true;
>      /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
>       * to be used at the moment, 32K should be enough for a while.  */
>      pcmc->acpi_data_size = 0x20000 + 0x8000;
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index f3c726e42400..504ddd0deece 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -444,9 +444,11 @@ DEFINE_I440FX_MACHINE(v7_1, "pc-i440fx-7.1", NULL,
>  
>  static void pc_i440fx_7_0_machine_options(MachineClass *m)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>      pc_i440fx_7_1_machine_options(m);
>      m->alias = NULL;
>      m->is_default = false;
> +    pcmc->enforce_valid_iova = false;
>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>  }
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 5a4a737fe203..4b747c59c19a 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -381,8 +381,10 @@ DEFINE_Q35_MACHINE(v7_1, "pc-q35-7.1", NULL,
>  
>  static void pc_q35_7_0_machine_options(MachineClass *m)
>  {
> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>      pc_q35_7_1_machine_options(m);
>      m->alias = NULL;
> +    pcmc->enforce_valid_iova = false;
>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>  }
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 568c226d3034..3a873ff69499 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -118,6 +118,7 @@ struct PCMachineClass {
>      bool has_reserved_memory;
>      bool enforce_aligned_dimm;
>      bool broken_reserved_end;
> +    bool enforce_valid_iova;

maybe
s/enforce_valid_iova/enforce_amd_1tb_hole/
to be less ambiguous

otherwise looks good to me so
Acked-by: Igor Mammedov <imammedo@redhat.com>

>  
>      /* generate legacy CPU hotplug AML */
>      bool legacy_cpu_hotplug;



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 06/10] i386/pc: factor out cxl range start to helper
  2022-07-11 12:47       ` Igor Mammedov
@ 2022-07-11 14:28         ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-11 14:28 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On 7/11/22 13:47, Igor Mammedov wrote:
> On Thu, 7 Jul 2022 16:18:43 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 7/7/22 14:00, Igor Mammedov wrote:
>>> On Fri,  1 Jul 2022 17:10:10 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> Factor out the calculation of the base address of the MR. It will be
>>>> used later on for the cxl range end counterpart calculation and as
>>>> well in pc_memory_init() CXL mr initialization, thus avoiding
>>>> duplication.
>>>>
>>>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>  
>>>
>>> needs to be rebased on top of 
>>>
>>>
>>> [PATCH 2/3] hw/i386/pc: Always place CXL Memory Regions after device_memory
>>>   
>> Is Michael merging these or should I just respin v7 with the assumption
>> that these patches are there?
> 
> I'd do the latter (just mention the dependency in the cover letter)
>  

Yeap -- Will do.

>> I can't see anything in his tree yet.
>>
>>>> ---
>>>>  hw/i386/pc.c | 28 +++++++++++++++++++---------
>>>>  1 file changed, 19 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>> index 0abbf81841a9..8655cc3b8894 100644
>>>> --- a/hw/i386/pc.c
>>>> +++ b/hw/i386/pc.c
>>>> @@ -825,6 +825,24 @@ static hwaddr pc_above_4g_end(PCMachineState *pcms)
>>>>      return x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>>>>  }
>>>>  
>>>> +static uint64_t pc_get_cxl_range_start(PCMachineState *pcms)
>>>> +{
>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>> +    MachineState *machine = MACHINE(pcms);
>>>> +    hwaddr cxl_base;
>>>> +
>>>> +    if (pcmc->has_reserved_memory && machine->device_memory->base) {
>>>> +        cxl_base = machine->device_memory->base;
>>>> +        if (!pcmc->broken_reserved_end) {
>>>> +            cxl_base += memory_region_size(&machine->device_memory->mr);
>>>> +        }
>>>> +    } else {
>>>> +        cxl_base = pc_above_4g_end(pcms);
>>>> +    }
>>>> +
>>>> +    return cxl_base;
>>>> +}
>>>> +
>>>>  static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>>>  {
>>>>      uint64_t start = 0;
>>>> @@ -946,15 +964,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>>          MemoryRegion *mr = &pcms->cxl_devices_state.host_mr;
>>>>          hwaddr cxl_size = MiB;
>>>>  
>>>> -        if (pcmc->has_reserved_memory && machine->device_memory->base) {
>>>> -            cxl_base = machine->device_memory->base;
>>>> -            if (!pcmc->broken_reserved_end) {
>>>> -                cxl_base += memory_region_size(&machine->device_memory->mr);
>>>> -            }
>>>> -        } else {
>>>> -            cxl_base = pc_above_4g_end(pcms);
>>>> -        }
>>>> -
>>>> +        cxl_base = pc_get_cxl_range_start(pcms);
>>>>          e820_add_entry(cxl_base, cxl_size, E820_RESERVED);
>>>>          memory_region_init(mr, OBJECT(machine), "cxl_host_reg", cxl_size);
>>>>          memory_region_add_subregion(system_memory, cxl_base, mr);  
>>>   
>>
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end()
  2022-07-11 12:58       ` Igor Mammedov
@ 2022-07-11 14:32         ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-11 14:32 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit,
	Jonathan Cameron

On 7/11/22 13:58, Igor Mammedov wrote:
> On Thu, 7 Jul 2022 16:21:07 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 7/7/22 14:05, Igor Mammedov wrote:
>>> On Fri,  1 Jul 2022 17:10:11 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>   
>>>> This in preparation to allow pc_pci_hole64_start() to be called early
>>>> in pc_memory_init(), handle CXL memory region end when its underlying
>>>> memory region isn't yet initialized.
>>>>
>>>> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>  hw/i386/pc.c | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>> index 8655cc3b8894..d6dff71012ab 100644
>>>> --- a/hw/i386/pc.c
>>>> +++ b/hw/i386/pc.c
>>>> @@ -857,6 +857,19 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>>>                  start = fw->mr.addr + memory_region_size(&fw->mr);
>>>>              }
>>>>          }
>>>> +    } else {  
>>>
>>>   
>>>> +        hwaddr cxl_size = MiB;
>>>> +
>>>> +        start = pc_get_cxl_range_start(pcms);
>>>> +        if (pcms->cxl_devices_state.fixed_windows) {
>>>> +            GList *it;
>>>> +
>>>> +            start = ROUND_UP(start + cxl_size, 256 * MiB);
>>>> +            for (it = pcms->cxl_devices_state.fixed_windows; it; it = it->next) {
>>>> +                CXLFixedWindow *fw = it->data;
>>>> +                start += fw->size;
>>>> +            }
>>>> +        }  
>>>
>>> /me wondering if this can replace block above that supposedly does
>>> the same only using initialized cxl memory regions?
>>>   
>>
>> I was thinking the same thing while writing it.
>>
>> If the calculation returns the same values might as well just replace it
>> as opposed to branching out similar logic.
> 
> Let's drop the unneeded code, so the reader won't have to wonder why
> the same thing is done in 2 different ways.
> 
/me nods.

I've removed the old code in this patch and replaced it with the latter block for v7.
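
To make the size-based calculation concrete, here is a minimal standalone
sketch of what the quoted else-branch computes. It is an illustration only,
not the v7 code: cxl_range_end(), round_up_u64() and the fw_sizes array are
made-up stand-ins for pc_get_cxl_range_end(), ROUND_UP() and the
CXLFixedWindow list, and the CXL base is passed in rather than taken from
pc_get_cxl_range_start().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Round n up to the next multiple of d; stand-in for QEMU's ROUND_UP(). */
static uint64_t round_up_u64(uint64_t n, uint64_t d)
{
    assert(d != 0);
    return ((n + d - 1) / d) * d;
}

/*
 * Hypothetical equivalent of the quoted else-branch: derive the end of the
 * CXL area from the base and the window sizes only, so it does not depend
 * on any memory region having been initialized.
 */
static uint64_t cxl_range_end(uint64_t cxl_base,
                              const uint64_t *fw_sizes, size_t nr_fw)
{
    const uint64_t cxl_size = MiB;   /* CXL host register block */
    uint64_t end = cxl_base;
    size_t i;

    if (nr_fw > 0) {
        /* Fixed windows start at the next 256 MiB boundary. */
        end = round_up_u64(cxl_base + cxl_size, 256 * MiB);
        for (i = 0; i < nr_fw; i++) {
            end += fw_sizes[i];
        }
    }

    return end;
}
```

For example, with a base of 5 GiB and two fixed windows of 4 GiB and 256 MiB,
this yields 0x260000000 (9.5 GiB); with no fixed windows it returns the base
unchanged, as in the quoted hunk.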


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-11 12:56   ` Igor Mammedov
@ 2022-07-11 14:52     ` Joao Martins
  2022-07-11 15:31       ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-11 14:52 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/11/22 13:56, Igor Mammedov wrote:
> On Fri,  1 Jul 2022 17:10:13 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index a79fa1b6beeb..07025b510540 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>      return start;
>>  }
>>  
>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>> +                                hwaddr above_4g_mem_start,
>> +                                uint64_t pci_hole64_size)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +
> 
>> +    if (!x86ms->above_4g_mem_size) {
>> +        /*
>> +         * 32-bit pci hole goes from
>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>> +          */
>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>> +    }
> this hunk still bothers me (nothing changed wrt v5 issues around it)
> issues recap: (
>  1. correctness of it
>  2. being limited to AMD only, while it seems pretty generic to me
>  3. should be a separate patch
> )
> 
How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:

if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
    hwaddr start = x86ms->above_4g_mem_start;

    if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
    }
    ...
}

Given that otherwise it is impossible to ever encounter the 1T boundary.

If not ... what other alternative would address your concern?
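
As a rough illustration of why the @above_4g_mem_size check is enough, below
is a standalone sketch of that decision using the AMD_HT_* values from the
patch. needs_amd_relocation() is a hypothetical helper (in the snippet above
the test is open-coded in pc_memory_init()), and the max used GPA is passed
in as a plain argument rather than computed via pc_max_used_gpa().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the patch; ULL keeps the sketch host-independent. */
#define AMD_HT_START        0xfd00000000ULL
#define AMD_HT_END          0xffffffffffULL
#define AMD_ABOVE_1TB_START (AMD_HT_END + 1)
#define AMD_HT_SIZE         (AMD_ABOVE_1TB_START - AMD_HT_START)

/* Hypothetical helper: should ram-above-4g be moved past the HT range? */
static bool needs_amd_relocation(bool is_amd_cpu,
                                 uint64_t above_4g_mem_size,
                                 uint64_t max_used_gpa)
{
    /* Sanity check: the reserved range is 12 GiB ending right at 1 TiB. */
    assert(AMD_HT_SIZE == (12ULL << 30));

    /* No RAM above 4G means nothing can get near the 1T boundary. */
    if (!is_amd_cpu || above_4g_mem_size == 0) {
        return false;
    }

    /* Relocate only if the highest used GPA reaches the HT range. */
    return max_used_gpa >= AMD_HT_START;
}
```

The HT range starts at 1012 GiB (0xfd00000000), so a guest whose highest used
address stays below that is left where it is.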

>> +
>> +    return pc_pci_hole64_start() + pci_hole64_size;
>> +}
>> +
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
>> +                                          uint64_t pci_hole64_size)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    hwaddr start = x86ms->above_4g_mem_start;
>> +    hwaddr maxphysaddr, maxusedaddr;
>> +
>> +    /* Bail out if max possible address does not cross HT range */
>> +    if (pc_max_used_gpa(pcms, start, pci_hole64_size) < AMD_HT_START) {
> 
> move it to the caller?
> 
Will do. I have replaced it with this instead, /in the caller/:

    if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
    }

>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    start = AMD_ABOVE_1TB_START;
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    maxusedaddr = pc_max_used_gpa(pcms, start, pci_hole64_size);
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    x86ms->above_4g_mem_start = start;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -922,12 +1003,31 @@ void pc_memory_init(PCMachineState *pcms,
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      hwaddr cxl_base, cxl_resv_end = 0;
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>  
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>>  
>>      linux_boot = (machine->kernel_filename != NULL);
>>  
>> +    /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (IS_AMD_CPU(&cpu->env)) {
>> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>> +
>> +        /*
>> +         * Advertise the HT region if address space covers the reserved
>> +         * region or if we relocate.
>> +         */
>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>> +            cpu->phys_bits >= 40) {
>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +        }
>> +    }
>> +
>>      /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>> @@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
>>                               0, x86ms->below_4g_mem_size);
>>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>> +
> 
> stray newline?
> 
Yeap. I've already removed it, as per my earlier reply to this patch.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type
  2022-07-11 13:03   ` Igor Mammedov
@ 2022-07-11 14:56     ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-11 14:56 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/11/22 14:03, Igor Mammedov wrote:
> On Fri,  1 Jul 2022 17:10:14 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> The added enforcing is only relevant in the case of AMD where the
>> range right before the 1TB is restricted and cannot be DMA mapped
>> by the kernel consequently leading to IOMMU INVALID_DEVICE_REQUEST
>> or possibly other kinds of IOMMU events in the AMD IOMMU.
>>
>> Although, there's a case where it may make sense to disable the
>> IOVA relocation/validation when migrating from a
>> non-valid-IOVA-aware qemu to one that supports it.
>>
>> Relocating RAM regions to after the 1Tb hole has consequences for
>> guest ABI because we are changing the memory mapping, so make
>> sure that only new machine enforce but not older ones.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c         | 6 ++++--
>>  hw/i386/pc_piix.c    | 2 ++
>>  hw/i386/pc_q35.c     | 2 ++
>>  include/hw/i386/pc.h | 1 +
>>  4 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 07025b510540..f99e16a5db4b 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -1013,9 +1013,10 @@ void pc_memory_init(PCMachineState *pcms,
>>      /*
>>       * The HyperTransport range close to the 1T boundary is unique to AMD
>>       * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> -     * to above 1T to AMD vCPUs only.
>> +     * to above 1T to AMD vCPUs only. @enforce_valid_iova is only false in
>> +     * older machine types (<= 7.0) for compatibility purposes.
>>       */
>> -    if (IS_AMD_CPU(&cpu->env)) {
>> +    if (IS_AMD_CPU(&cpu->env) && pcmc->enforce_valid_iova) {
>>          pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>  
>>          /*
>> @@ -1950,6 +1951,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>>      pcmc->has_reserved_memory = true;
>>      pcmc->kvmclock_enabled = true;
>>      pcmc->enforce_aligned_dimm = true;
>> +    pcmc->enforce_valid_iova = true;
>>      /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
>>       * to be used at the moment, 32K should be enough for a while.  */
>>      pcmc->acpi_data_size = 0x20000 + 0x8000;
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index f3c726e42400..504ddd0deece 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -444,9 +444,11 @@ DEFINE_I440FX_MACHINE(v7_1, "pc-i440fx-7.1", NULL,
>>  
>>  static void pc_i440fx_7_0_machine_options(MachineClass *m)
>>  {
>> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>>      pc_i440fx_7_1_machine_options(m);
>>      m->alias = NULL;
>>      m->is_default = false;
>> +    pcmc->enforce_valid_iova = false;
>>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>>  }
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 5a4a737fe203..4b747c59c19a 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -381,8 +381,10 @@ DEFINE_Q35_MACHINE(v7_1, "pc-q35-7.1", NULL,
>>  
>>  static void pc_q35_7_0_machine_options(MachineClass *m)
>>  {
>> +    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
>>      pc_q35_7_1_machine_options(m);
>>      m->alias = NULL;
>> +    pcmc->enforce_valid_iova = false;
>>      compat_props_add(m->compat_props, hw_compat_7_0, hw_compat_7_0_len);
>>      compat_props_add(m->compat_props, pc_compat_7_0, pc_compat_7_0_len);
>>  }
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 568c226d3034..3a873ff69499 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -118,6 +118,7 @@ struct PCMachineClass {
>>      bool has_reserved_memory;
>>      bool enforce_aligned_dimm;
>>      bool broken_reserved_end;
>> +    bool enforce_valid_iova;
> 
> maybe
> s/enforce_valid_iova/enforce_amd_1tb_hole/
> to be less ambiguous
> 
That's much better, let me rename it to that.

> otherwise looks good to me so
> Acked-by: Igor Mammedov <imammedo@redhat.com>
> 
Thanks!
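
A minimal sketch of how the renamed flag would gate the relocation, for
illustration. The struct and function names are simplified stand-ins (not the
QEMU types); they model only the default set at class init and the override
in the 7.0 machine options, as in the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for PCMachineClass, carrying only the new flag. */
struct pc_class_sketch {
    bool enforce_amd_1tb_hole;
};

/* Default for new machine types: enforce the AMD 1T hole. */
static void class_init_sketch(struct pc_class_sketch *pcmc)
{
    assert(pcmc != NULL);
    pcmc->enforce_amd_1tb_hole = true;
}

/* 7.1 keeps the class default. */
static void machine_7_1_options_sketch(struct pc_class_sketch *pcmc)
{
    (void)pcmc;
}

/* 7.0 and older opt out, so the guest memory map stays unchanged. */
static void machine_7_0_options_sketch(struct pc_class_sketch *pcmc)
{
    machine_7_1_options_sketch(pcmc);
    pcmc->enforce_amd_1tb_hole = false;
}

/* Mirrors the condition in pc_memory_init() with the flag renamed. */
static bool relocation_enabled(const struct pc_class_sketch *pcmc,
                               bool is_amd_cpu)
{
    return is_amd_cpu && pcmc->enforce_amd_1tb_hole;
}
```

The point of the opt-out is that a guest started on a <= 7.0 machine type sees
the same memory layout before and after migration.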


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-11 14:52     ` Joao Martins
@ 2022-07-11 15:31       ` Joao Martins
  2022-07-11 20:03         ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-11 15:31 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/11/22 15:52, Joao Martins wrote:
> On 7/11/22 13:56, Igor Mammedov wrote:
>> On Fri,  1 Jul 2022 17:10:13 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>> index a79fa1b6beeb..07025b510540 100644
>>> --- a/hw/i386/pc.c
>>> +++ b/hw/i386/pc.c
>>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>>      return start;
>>>  }
>>>  
>>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>>> +                                hwaddr above_4g_mem_start,
>>> +                                uint64_t pci_hole64_size)
>>> +{
>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>> +
>>
>>> +    if (!x86ms->above_4g_mem_size) {
>>> +        /*
>>> +         * 32-bit pci hole goes from
>>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>> +          */
>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>>> +    }
>> this hunk still bothers me (nothing changed wrt v5 issues around it)
>> issues recap: (
>>  1. correctness of it
>>  2. being limited to AMD only, while it seems pretty generic to me
>>  3. should be a separate patch
>> )
>>
> How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
> when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:
> 
> if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>     hwaddr start = x86ms->above_4g_mem_start;
> 
>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>     }
>     ...
> }
> 
> Given that otherwise it is impossible to ever encounter the 1T boundary.
> 

And while at it I would also remove any unneeded arguments from pc_max_used_gpa(),
which would turn the function into this:

+static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
+{
+    return pc_pci_hole64_start() + pci_hole64_size;
+}

I would nuke the added helper if it wasn't for having 2 call sites in this patch.
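
For reference, a self-contained sketch of how the trimmed helper would pair
with the phys-bits check from the patch. The names below are illustrative
only; in particular hole64_start is a plain parameter standing in for the
value of pc_pci_hole64_start(), whose real signature is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the simplified pc_max_used_gpa(): end of the 64-bit hole. */
static uint64_t max_used_gpa(uint64_t hole64_start, uint64_t pci_hole64_size)
{
    return hole64_start + pci_hole64_size;
}

/* Highest address reachable with the given number of physical bits. */
static uint64_t max_phys_addr(unsigned int phys_bits)
{
    assert(phys_bits > 0 && phys_bits < 64);
    return ((uint64_t)1 << phys_bits) - 1;
}

/* Same comparison as the patch: fail when maxphysaddr < maxusedaddr. */
static bool phys_bits_sufficient(unsigned int phys_bits, uint64_t maxusedaddr)
{
    return max_phys_addr(phys_bits) >= maxusedaddr;
}
```

With 40 phys-bits the limit is 0xffffffffff, which is exactly the top of the
HT range, so any layout relocated above 1T needs at least 41.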

> If not ... what other alternative would address your concern?
> 
>>> +
>>> +    return pc_pci_hole64_start() + pci_hole64_size;
>>> +}
>>> +
>>> +/*
>>> + * AMD systems with an IOMMU have an additional hole close to the
>>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>>> + * The ranges reserved for Hyper-Transport are:
>>> + *
>>> + * FD_0000_0000h - FF_FFFF_FFFFh
>>> + *
>>> + * The ranges represent the following:
>>> + *
>>> + * Base Address   Top Address  Use
>>> + *
>>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>>> + *
>>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>>> + * Table 3: Special Address Controls (GPA) for more information.
>>> + */
>>> +#define AMD_HT_START         0xfd00000000UL
>>> +#define AMD_HT_END           0xffffffffffUL
>>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>>> +
>>> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
>>> +                                          uint64_t pci_hole64_size)
>>> +{
>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>> +    hwaddr start = x86ms->above_4g_mem_start;
>>> +    hwaddr maxphysaddr, maxusedaddr;
>>> +
>>> +    /* Bail out if max possible address does not cross HT range */
>>> +    if (pc_max_used_gpa(pcms, start, pci_hole64_size) < AMD_HT_START) {
>>
>> move it to the caller?
>>
> Will do. I have replaced it with this instead, /in the caller/:
> 
>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>     }
> 
>>> +        return;
>>> +    }
>>> +
>>> +    /*
>>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>>> +     * So make sure phys-bits is required to be appropriately sized in order
>>> +     * to proceed with the above-4g-region relocation and thus boot.
>>> +     */
>>> +    start = AMD_ABOVE_1TB_START;
>>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>>> +    maxusedaddr = pc_max_used_gpa(pcms, start, pci_hole64_size);
>>> +    if (maxphysaddr < maxusedaddr) {
>>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>>> +        exit(EXIT_FAILURE);
>>> +    }
>>> +
>>> +    x86ms->above_4g_mem_start = start;
>>> +}
>>> +
>>>  void pc_memory_init(PCMachineState *pcms,
>>>                      MemoryRegion *system_memory,
>>>                      MemoryRegion *rom_memory,
>>> @@ -922,12 +1003,31 @@ void pc_memory_init(PCMachineState *pcms,
>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>  
>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>                                  x86ms->above_4g_mem_size);
>>>  
>>>      linux_boot = (machine->kernel_filename != NULL);
>>>  
>>> +    /*
>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>> +     * to above 1T to AMD vCPUs only.
>>> +     */
>>> +    if (IS_AMD_CPU(&cpu->env)) {
>>> +        pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>> +
>>> +        /*
>>> +         * Advertise the HT region if address space covers the reserved
>>> +         * region or if we relocate.
>>> +         */
>>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>>> +            cpu->phys_bits >= 40) {
>>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>>> +        }
>>> +    }
>>> +
>>>      /*
>>>       * Split single memory region and use aliases to address portions of it,
>>>       * done for backwards compatibility with older qemus.
>>> @@ -938,6 +1038,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>                               0, x86ms->below_4g_mem_size);
>>>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>>> +
>>
>> stray newline?
>>
> Yeap. I've already removed it, as per my earlier reply to this patch.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-11 15:31       ` Joao Martins
@ 2022-07-11 20:03         ` Joao Martins
  2022-07-12  9:06           ` Igor Mammedov
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-11 20:03 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/11/22 16:31, Joao Martins wrote:
> On 7/11/22 15:52, Joao Martins wrote:
>> On 7/11/22 13:56, Igor Mammedov wrote:
>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>
>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>> index a79fa1b6beeb..07025b510540 100644
>>>> --- a/hw/i386/pc.c
>>>> +++ b/hw/i386/pc.c
>>>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>>>      return start;
>>>>  }
>>>>  
>>>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>>>> +                                hwaddr above_4g_mem_start,
>>>> +                                uint64_t pci_hole64_size)
>>>> +{
>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>> +
>>>
>>>> +    if (!x86ms->above_4g_mem_size) {
>>>> +        /*
>>>> +         * 32-bit pci hole goes from
>>>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>> +          */
>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>>>> +    }
>>> this hunk still bothers me (nothing changed wrt v5 issues around it)
>>> issues recap: (
>>>  1. correctness of it
>>>  2. being limited to AMD only, while it seems pretty generic to me
>>>  3. should be a separate patch
>>> )
>>>
>> How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
>> when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:
>>
>> if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>>     hwaddr start = x86ms->above_4g_mem_start;
>>
>>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
>>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>     }
>>     ...
>> }
>>
>> Given that otherwise it is impossible to ever encounter the 1T boundary.
>>
> 
> And while at it I would also remove any unneeded arguments from pc_max_used_gpa(),
> which would turn the function into this:
> 
> +static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
> +{
> +    return pc_pci_hole64_start() + pci_hole64_size;
> +}
> 
> I would nuke the added helper if it wasn't for having 2 call sites in this patch.
> 

The full patch diff is further below, after removing pc_max_used_gpa() and making
further cleanups, given that this code can be much simpler with this approach.

>> If not ... what other alternative would address your concern?
>>

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index e178bbc4129c..1ded3faeffe0 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -882,6 +882,62 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
     return start;
 }

+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
+                                          hwaddr maxusedaddr)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr maxphysaddr;
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is required to be appropriately sized in order
+     * to proceed with the above-4g-region relocation and thus boot.
+     */
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u) cannot avoid AMD HT range",
+                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+
+    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
     hwaddr cxl_base, cxl_resv_end = 0;
+    X86CPU *cpu = X86_CPU(first_cpu);

     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
@@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
     linux_boot = (machine->kernel_filename != NULL);

     /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
+        hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;
+
+        /* Bail out if max possible address does not cross HT range */
+        if (maxusedaddr >= AMD_HT_START) {
+            pc_set_amd_above_4g_mem_start(pcms, maxusedaddr);
+        }
+
+        /*
+         * Advertise the HT region if address space covers the reserved
+         * region or if we relocate.
+         */
+        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
+            cpu->phys_bits >= 40) {
+            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+        }
+    }
+
+    /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
      */
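
For readers following the arithmetic behind the AMD_HT_* macros in the diff above, here is a minimal standalone sketch, not part of the patch. The macro values are copied from the diff; the helper names reaches_ht_range() and pick_above_4g_start() are hypothetical and exist only for illustration. It shows that the reserved HyperTransport window starts at 1012 GiB, spans 12 GiB, and that RAM above 4G is rebased to 1 TiB only when the highest used address reaches that window.

```c
#include <stdbool.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Values copied from the AMD_HT_* macros in the diff above. */
#define HT_START        0xfd00000000ULL              /* 1012 GiB */
#define HT_END          0xffffffffffULL              /* 1 TiB - 1 */
#define ABOVE_1TB_START (HT_END + 1)                 /* 1 TiB */
#define HT_SIZE         (ABOVE_1TB_START - HT_START) /* 12 GiB */

/* Hypothetical helper: does the highest used GPA reach the HT window? */
static bool reaches_ht_range(uint64_t max_used_gpa)
{
    return max_used_gpa >= HT_START;
}

/*
 * Hypothetical helper: choose the base of ram-above-4G. It mirrors the
 * "maxusedaddr >= AMD_HT_START" test in the diff, and assumes the default
 * base is 4 GiB when no relocation is needed.
 */
static uint64_t pick_above_4g_start(uint64_t max_used_gpa)
{
    return reaches_ht_range(max_used_gpa) ? ABOVE_1TB_START : 4 * GiB;
}
```

Under these assumptions, a layout whose highest address is 1000 GiB keeps the 4 GiB base, while one reaching 1012 GiB is moved to 1 TiB. The phys-bits validation done by the real helper is deliberately left out here.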


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 03/10] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-07-11 10:01     ` Joao Martins
@ 2022-07-11 22:17       ` B
  2022-07-12  9:27         ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: B @ 2022-07-11 22:17 UTC (permalink / raw)
  To: Joao Martins, Igor Mammedov
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Dr. David Alan Gilbert, Suravee Suthikulpanit, qemu-devel



On 11 July 2022 10:01:49 UTC, Joao Martins <joao.m.martins@oracle.com> wrote:
>On 7/9/22 21:51, B wrote:
>> On 1 July 2022 16:10:07 UTC, Joao Martins <joao.m.martins@oracle.com> wrote:
>>> Use the pre-initialized pci-host qdev and fetch the
>>> pci-hole64-size into pc_memory_init() newly added argument.
>>> piix needs a bit of care given all the !pci_enabled()
>>> and that the pci_hole64_size is private to i440fx.
>> 
>> It exposes this value as the property PCI_HOST_PROP_PCI_HOLE64_SIZE. 
>
>Indeed.
>
>> Reusing it allows for not touching i440fx in this patch at all.
>> 
>> For code symmetry reasons the analogue property could be used for Q35 as well.
>> 
>Presumably you want me to change into below while deleting i440fx_pci_hole64_size()
>from this patch (see snip below).

Yes, exactly.

>IMHO, gotta say that in q35 the code symmetry
>doesn't buy much readability here,

That's true. It communicates, though, that a value is used which was deliberately made public, IOW that the code isn't sneaky. (That's just my interpretation; I'm not sure what the common understanding is.) Feel free to do whichever you prefer.

Best regards,
Bernhard

>albeit it does remove any need for that other
>helper in i440fx.
>
>@Igor let me know if you agree with the change and whether I can keep the Reviewed-by.
>
>diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>index 504ddd0deece..cc0855066d06 100644
>--- a/hw/i386/pc_piix.c
>+++ b/hw/i386/pc_piix.c
>@@ -167,7 +167,9 @@ static void pc_init1(MachineState *machine,
>         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>         rom_memory = pci_memory;
>         i440fx_host = qdev_new(host_type);
>-        hole64_size = i440fx_pci_hole64_size(i440fx_host);
>+        hole64_size = object_property_get_uint(OBJECT(i440fx_host),
>+                                               PCI_HOST_PROP_PCI_HOLE64_SIZE,
>+                                               &error_abort);
>     } else {
>         pci_memory = NULL;
>         rom_memory = system_memory;
>diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>index 4b747c59c19a..466f3ef3c918 100644
>--- a/hw/i386/pc_q35.c
>+++ b/hw/i386/pc_q35.c
>@@ -208,7 +208,9 @@ static void pc_q35_init(MachineState *machine)
>     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>
>     if (pcmc->pci_enabled) {
>-        pci_hole64_size = q35_host->mch.pci_hole64_size;
>+        pci_hole64_size = object_property_get_uint(OBJECT(q35_host),
>+                                                   PCI_HOST_PROP_PCI_HOLE64_SIZE,
>+                                                   &error_abort);
>     }
>
>     /* allocate ram and load rom/bios */
>diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>index 15680da7d709..d5426ef4a53c 100644
>--- a/hw/pci-host/i440fx.c
>+++ b/hw/pci-host/i440fx.c
>@@ -237,13 +237,6 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>     }
> }
>
>-uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
>-{
>-        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
>-
>-        return i440fx->pci_hole64_size;
>-}
>-
> PCIBus *i440fx_init(const char *pci_type,
>                     DeviceState *dev,
>                     MemoryRegion *address_space_mem,


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-11 20:03         ` Joao Martins
@ 2022-07-12  9:06           ` Igor Mammedov
  2022-07-12 10:01             ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-12  9:06 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On Mon, 11 Jul 2022 21:03:28 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/11/22 16:31, Joao Martins wrote:
> > On 7/11/22 15:52, Joao Martins wrote:  
> >> On 7/11/22 13:56, Igor Mammedov wrote:  
> >>> On Fri,  1 Jul 2022 17:10:13 +0100
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>>  
> >>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >>>> index a79fa1b6beeb..07025b510540 100644
> >>>> --- a/hw/i386/pc.c
> >>>> +++ b/hw/i386/pc.c
> >>>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
> >>>>      return start;
> >>>>  }
> >>>>  
> >>>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
> >>>> +                                hwaddr above_4g_mem_start,
> >>>> +                                uint64_t pci_hole64_size)
> >>>> +{
> >>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>> +  
> >>>  
> >>>> +    if (!x86ms->above_4g_mem_size) {
> >>>> +        /*
> >>>> +         * 32-bit pci hole goes from
> >>>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
> >>>> +          */
> >>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
> >>>> +    }  
> >>> this hunk still bothers me (nothing changed wrt v5 issues around it)
> >>> issues recap: (
> >>>  1. correctness of it
> >>>  2. being limited to AMD only, while it seems pretty generic to me
> >>>  3. should be a separate patch
> >>> )
> >>>  
> >> How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
> >> when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:
> >>
> >> if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
> >>     hwaddr start = x86ms->above_4g_mem_start;
> >>
> >>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
> >>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> >>     }
> >>     ...
> >> }
> >>
> >> Given that otherwise it is impossible to ever encounter the 1T boundary.
> >>  
> > 
> > And while at it I would also remove any unneeded arguments from pc_max_used_gpa(),
> > which would turn the function into this:
> > 
> > +static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
> > +{
> > +    return pc_pci_hole64_start() + pci_hole64_size;
> > +}
> > 
> > I would nuke the added helper if it wasn't for having 2 call sites in this patch.
> >   
> 
> Full patch diff further below -- after removing pc_max_used_gpa() and made further
> cleanups given this code can be much simpler after using this approach.
> 
> >> If not ... what other alternative would address your concern?
> >>  
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index e178bbc4129c..1ded3faeffe0 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -882,6 +882,62 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>      return start;
>  }
> 
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
> +                                          hwaddr maxusedaddr)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    hwaddr maxphysaddr;
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      hwaddr cxl_base, cxl_resv_end = 0;
> +    X86CPU *cpu = X86_CPU(first_cpu);
> 
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>      linux_boot = (machine->kernel_filename != NULL);
> 
>      /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {

it has the same issue as pc_max_used_gpa(), i.e. failing the
  x86ms->above_4g_mem_size != 0
check doesn't mean that there isn't any memory above 4G, nor that there isn't
any MMIO there (sgx/cxl/pci64hole); that was the reason we were considering
max_used_gpa

I'd prefer to keep the pc_max_used_gpa() idea, but make it work for the above
cases and be more generic (i.e. not tied to AMD only), since
'pc_max_used_gpa() < physbits' applies equally to AMD and Intel (and to trip it,
one just has to configure small enough physbits or large enough hotpluggable
RAM/CXL/PCI64HOLE)
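
The vendor-neutral check described above could be sketched as follows. This is only an illustration, not QEMU code: max_phys_addr() and layout_fits() are hypothetical names, and the comparison mirrors the "maxphysaddr < maxusedaddr" test from the quoted diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest guest-physical address reachable with phys_bits address bits. */
static uint64_t max_phys_addr(unsigned phys_bits)
{
    /* Shifting a 64-bit value by 64 or more is undefined, so clamp. */
    return phys_bits >= 64 ? UINT64_MAX : ((uint64_t)1 << phys_bits) - 1;
}

/*
 * Hypothetical helper: true when the highest address the machine layout
 * uses still fits in the address space the vCPU advertises. Nothing here
 * depends on the CPU vendor.
 */
static bool layout_fits(uint64_t max_used_gpa, unsigned phys_bits)
{
    return max_used_gpa <= max_phys_addr(phys_bits);
}
```

With 40 phys-bits the limit is 0xffffffffff, so a layout ending at 0xfd00000000 fits while one ending at 0x10000000000 does not, regardless of whether the guest CPU is AMD or Intel.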



> +        hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;
> +
> +        /* Bail out if max possible address does not cross HT range */
> +        if (maxusedaddr >= AMD_HT_START) {
> +            pc_set_amd_above_4g_mem_start(pcms, maxusedaddr);
> +        }
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> +            cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }
> +
> +    /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
>       */
> 



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 03/10] i386/pc: pass pci_hole64_size to pc_memory_init()
  2022-07-11 22:17       ` B
@ 2022-07-12  9:27         ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-12  9:27 UTC (permalink / raw)
  To: B, Igor Mammedov
  Cc: Eduardo Habkost, Michael S. Tsirkin, Richard Henderson,
	Alex Williamson, Paolo Bonzini, Ani Sinha, Marcel Apfelbaum,
	Dr. David Alan Gilbert, Suravee Suthikulpanit, qemu-devel

On 7/11/22 23:17, B wrote:
> On 11 July 2022 10:01:49 UTC, Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 7/9/22 21:51, B wrote:
>>> On 1 July 2022 16:10:07 UTC, Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> Use the pre-initialized pci-host qdev and fetch the
>>>> pci-hole64-size into pc_memory_init() newly added argument.
>>>> piix needs a bit of care given all the !pci_enabled()
>>>> and that the pci_hole64_size is private to i440fx.
>>>
>>> It exposes this value as the property PCI_HOST_PROP_PCI_HOLE64_SIZE. 
>>
>> Indeed.
>>
>>> Reusing it allows for not touching i440fx in this patch at all.
>>>
>>> For code symmetry reasons the analogue property could be used for Q35 as well.
>>>
>> Presumably you want me to change into below while deleting i440fx_pci_hole64_size()
>>from this patch (see snip below).
> 
> Yes, exactly.
> 
>> IMHO, gotta say that in q35 the code symmetry
>> doesn't buy much readability here,
> 
> That's true. It communicates, though, that a value is used which was deliberately made public, IOW that the code isn't sneaky. (That's just my interpretation, not sure what the common understanding is) Feel free to do however you prefer.
> 
I think it's a good improvement, as it avoids duplicating this new helper in the i440fx pcihost,
which also means less code for the same thing.

> Best regards,
> Bernhard
> 
>> albeit it does remove any need for that other
>> helper in i440fx.
>>
>> @Igor let me know if you agree with the change and whether I can keep the Reviewed-by.
>>
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index 504ddd0deece..cc0855066d06 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -167,7 +167,9 @@ static void pc_init1(MachineState *machine,
>>         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>>         rom_memory = pci_memory;
>>         i440fx_host = qdev_new(host_type);
>> -        hole64_size = i440fx_pci_hole64_size(i440fx_host);
>> +        hole64_size = object_property_get_uint(OBJECT(i440fx_host),
>> +                                               PCI_HOST_PROP_PCI_HOLE64_SIZE,
>> +                                               &error_abort);
>>     } else {
>>         pci_memory = NULL;
>>         rom_memory = system_memory;
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 4b747c59c19a..466f3ef3c918 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -208,7 +208,9 @@ static void pc_q35_init(MachineState *machine)
>>     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>>
>>     if (pcmc->pci_enabled) {
>> -        pci_hole64_size = q35_host->mch.pci_hole64_size;
>> +        pci_hole64_size = object_property_get_uint(OBJECT(q35_host),
>> +                                                   PCI_HOST_PROP_PCI_HOLE64_SIZE,
>> +                                                   &error_abort);
>>     }
>>
>>     /* allocate ram and load rom/bios */
>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>> index 15680da7d709..d5426ef4a53c 100644
>> --- a/hw/pci-host/i440fx.c
>> +++ b/hw/pci-host/i440fx.c
>> @@ -237,13 +237,6 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>>     }
>> }
>>
>> -uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
>> -{
>> -        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
>> -
>> -        return i440fx->pci_hole64_size;
>> -}
>> -
>> PCIBus *i440fx_init(const char *pci_type,
>>                     DeviceState *dev,
>>                     MemoryRegion *address_space_mem,


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-12  9:06           ` Igor Mammedov
@ 2022-07-12 10:01             ` Joao Martins
  2022-07-12 10:21               ` Joao Martins
                                 ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-12 10:01 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/12/22 10:06, Igor Mammedov wrote:
> On Mon, 11 Jul 2022 21:03:28 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 7/11/22 16:31, Joao Martins wrote:
>>> On 7/11/22 15:52, Joao Martins wrote:  
>>>> On 7/11/22 13:56, Igor Mammedov wrote:  
>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>>  
>>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>>> index a79fa1b6beeb..07025b510540 100644
>>>>>> --- a/hw/i386/pc.c
>>>>>> +++ b/hw/i386/pc.c
>>>>>> @@ -907,6 +907,87 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>>>>>      return start;
>>>>>>  }
>>>>>>  
>>>>>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>>>>>> +                                hwaddr above_4g_mem_start,
>>>>>> +                                uint64_t pci_hole64_size)
>>>>>> +{
>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>> +  
>>>>>  
>>>>>> +    if (!x86ms->above_4g_mem_size) {
>>>>>> +        /*
>>>>>> +         * 32-bit pci hole goes from
>>>>>> +         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
>>>>>> +          */
>>>>>> +        return IO_APIC_DEFAULT_ADDRESS - 1;
>>>>>> +    }  
>>>>> this hunk still bothers me (nothing changed wrt v5 issues around it)
>>>>> issues recap: (
>>>>>  1. correctness of it
>>>>>  2. being limited to AMD only, while it seems pretty generic to me
>>>>>  3. should be a separate patch
>>>>> )
>>>>>  
>>>> How about I instead delete this hunk, and only call pc_set_amd_above_4g_mem_start()
>>>> when there's @above_4g_mem_size ? Like in pc_memory_init() I would instead:
>>>>
>>>> if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>>>>     hwaddr start = x86ms->above_4g_mem_start;
>>>>
>>>>     if (pc_max_used_gpa(pcms, start, pci_hole64_size) >= AMD_HT_START) {
>>>>         pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>>>     }
>>>>     ...
>>>> }
>>>>
>>>> Given that otherwise it is impossible to ever encounter the 1T boundary.
>>>>  
>>>
>>> And while at it I would also remove any unneeded arguments from pc_max_used_gpa(),
>>> which would turn the function into this:
>>>
>>> +static hwaddr pc_max_used_gpa(uint64_t pci_hole64_size)
>>> +{
>>> +    return pc_pci_hole64_start() + pci_hole64_size;
>>> +}
>>>
>>> I would nuke the added helper if it wasn't for having 2 call sites in this patch.
>>>   
>>
>> Full patch diff further below -- after removing pc_max_used_gpa() and made further
>> cleanups given this code can be much simpler after using this approach.
>>
>>>> If not ... what other alternative would address your concern?
>>>>  
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index e178bbc4129c..1ded3faeffe0 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -882,6 +882,62 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>      return start;
>>  }
>>
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
>> +                                          hwaddr maxusedaddr)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    hwaddr maxphysaddr;
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      hwaddr cxl_base, cxl_resv_end = 0;
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>      linux_boot = (machine->kernel_filename != NULL);
>>
>>      /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
> 
> it has the same issue as pc_max_used_gpa(), i.e.
>   x86ms->above_4g_mem_size != 0
> doesn't mean that there isn't any memory above 4Gb nor that there aren't
> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
> max_used_gpa
> 
Argh yes, you are right. I see it now.

> I'd prefer to keep pc_max_used_gpa(),
> idea but make it work for above cases and be more generic (i.e. not to be
> tied to AMD only) since 'pc_max_used_gpa() < physbits'

Are you also indirectly suggesting here that the check inside
pc_set_amd_above_4g_mem_start() should be moved into pc_memory_init(),
given that it's orthogonal to this issue? ISTR that you suggested this
at some point. If so, then there's probably very little reason to keep
pc_set_amd_above_4g_mem_start() around.

> applies to equally
> to AMD and Intel (and to trip it, one just have to configure small enough
> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
> 
I can reproduce the issue you're thinking of with basic memory hotplug. Let me see
what I can come up with in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.

I would really love to have v7.1.0 with this issue fixed but I am not very
confident it is going to make it :(

Meanwhile, let me know if you have thoughts on this one:

https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/

I am going to assume that, if there are no comments on the above, I'll keep things as is.

And also, whether I can retain your ack with Bernhard's suggestion here:

https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/

>> +        hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;
>> +
>> +        /* Bail out if max possible address does not cross HT range */
>> +        if (maxusedaddr >= AMD_HT_START) {
>> +            pc_set_amd_above_4g_mem_start(pcms, maxusedaddr);
>> +        }
>> +
>> +        /*
>> +         * Advertise the HT region if address space covers the reserved
>> +         * region or if we relocate.
>> +         */
>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>> +            cpu->phys_bits >= 40) {
>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +        }
>> +    }
>> +
>> +    /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>>       */
>>
> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-12 10:01             ` Joao Martins
@ 2022-07-12 10:21               ` Joao Martins
  2022-07-12 11:35               ` Joao Martins
  2022-07-14  9:30               ` Igor Mammedov
  2 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-12 10:21 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/12/22 11:01, Joao Martins wrote:
> On 7/12/22 10:06, Igor Mammedov wrote:
>> On Mon, 11 Jul 2022 21:03:28 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 7/11/22 16:31, Joao Martins wrote:
>>>> On 7/11/22 15:52, Joao Martins wrote:  
>>>>> On 7/11/22 13:56, Igor Mammedov wrote:  
>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>      linux_boot = (machine->kernel_filename != NULL);
>>>
>>>      /*
>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>> +     * to above 1T to AMD vCPUs only.
>>> +     */
>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>>
>> it has the same issue as pc_max_used_gpa(), i.e.
>>   x86ms->above_4g_mem_size != 0
>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
>> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
>> max_used_gpa
>>
> Argh yes, you are right. I see it now.
> 
>> I'd prefer to keep pc_max_used_gpa(),
>> idea but make it work for above cases and be more generic (i.e. not to be
>> tied to AMD only) since 'pc_max_used_gpa() < physbits'
> 
> Are you also indirectly suggesting here that the check inside
> pc_set_amd_above_4g_mem_start() should be moved into pc_memory_init()
> given that it's orthogonal to this issue. ISTR that you suggested this
> at some point. If so, then there's probably very little reason to keep
> pc_set_amd_above_4g_mem_start() around.
> 

Hold on, I take that back, as the check is AMD specific. And I just noticed a
mistake in v6 (other versions didn't have it), specifically around these phys-bits
boundaries. Given how pc_pci_hole64_start() uses x86ms::above_4g_mem_start, the
point of the pc_max_used_gpa() < physbits check inside pc_set_amd_above_4g_mem_start() was
to test the boundaries with AMD_HT_START, not with the typical 4GiB. And
reusing pc_pci_hole64_start() introduced that problem.

So I'll have to temporarily set x86ms::above_4g_mem_start inside
pc_max_used_gpa() based on the passed @above_4g_start value.
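
To make the boundary problem concrete, here is a small worked sketch under a deliberately simplified model. It assumes the only things above 4G are RAM followed directly by the 64-bit PCI hole, and it ignores hotplug memory, SGX, CXL, and alignment. max_used_gpa_model() is a hypothetical name, not a QEMU function.

```c
#include <stdint.h>

#define GiB (1ULL << 30)

/*
 * Simplified model (assumption): the highest used address is the base of
 * ram-above-4G plus its size plus the 64-bit PCI hole size.
 */
static uint64_t max_used_gpa_model(uint64_t above_4g_start,
                                   uint64_t above_4g_mem_size,
                                   uint64_t pci_hole64_size)
{
    return above_4g_start + above_4g_mem_size + pci_hole64_size;
}
```

Take 1000 GiB of RAM above 4G, a 32 GiB hole, and 41 phys-bits (a 2048 GiB address space). Computed from the usual 4 GiB base, the model gives 1036 GiB, which passes the phys-bits check. Computed from the relocated 1 TiB base, it gives 2056 GiB, which does not fit. So the value compared against phys-bits has to be derived from the relocated base, which is how I read the problem described above.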


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-12 10:01             ` Joao Martins
  2022-07-12 10:21               ` Joao Martins
@ 2022-07-12 11:35               ` Joao Martins
  2022-07-14  9:28                 ` Igor Mammedov
  2022-07-14  9:30               ` Igor Mammedov
  2 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-12 11:35 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/12/22 11:01, Joao Martins wrote:
> On 7/12/22 10:06, Igor Mammedov wrote:
>> On Mon, 11 Jul 2022 21:03:28 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 7/11/22 16:31, Joao Martins wrote:
>>>> On 7/11/22 15:52, Joao Martins wrote:  
>>>>> On 7/11/22 13:56, Igor Mammedov wrote:  
>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>  void pc_memory_init(PCMachineState *pcms,
>>>                      MemoryRegion *system_memory,
>>>                      MemoryRegion *rom_memory,
>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>
>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>                                  x86ms->above_4g_mem_size);
>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>      linux_boot = (machine->kernel_filename != NULL);
>>>
>>>      /*
>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>> +     * to above 1T to AMD vCPUs only.
>>> +     */
>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {
>>
>> it has the same issue as pc_max_used_gpa(), i.e.
>>   x86ms->above_4g_mem_size != 0
>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
>> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
>> max_used_gpa
>> I'd prefer to keep pc_max_used_gpa(),
>> idea but make it work for above cases and be more generic (i.e. not to be
>> tied to AMD only) since 'pc_max_used_gpa() < physbits'
>> applies to equally
>> to AMD and Intel (and to trip it, one just have to configure small enough
>> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
>>
> I can reproduce the issue you're thinking with basic memory hotplug. 

I was misled by a bug that only existed in v6, which I have now fixed.
So any potential bug with hotplug, SGX, CXL or pcihole64 is simply covered with:

	pc_pci_hole64_start() + pci_hole64_size;

which is what pc_max_used_gpa() does. This works fine /without/ the above_4g_mem_size != 0
check, even when there is no above_4g_mem_size (e.g. mem=2G,maxmem=1024G).

And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
way.

So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
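
As a rough illustration of the mem=2G,maxmem=1024G case, here is a simplified
standalone model. It is only a sketch: model_hole64_start() and
model_max_used_gpa() are hypothetical names, SGX and CXL are ignored, and the
hotplug area is assumed to sit right after RAM above 4G with the 64-bit hole
start aligned to 1 GiB.

```c
#include <assert.h>
#include <stdint.h>

#define GiB          (1ULL << 30)
#define AMD_HT_START 0xfd00000000ULL   /* from the patch: 1012 GiB */

/* Round x up to the next multiple of align (align must be a power of two). */
static uint64_t round_up(uint64_t x, uint64_t align)
{
    assert(align && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/*
 * Assumption: the 64-bit PCI hole starts after RAM above 4G plus the
 * memory hotplug area, aligned to 1 GiB.
 */
static uint64_t model_hole64_start(uint64_t above_4g_start,
                                   uint64_t above_4g_size,
                                   uint64_t hotplug_size)
{
    return round_up(above_4g_start + above_4g_size + hotplug_size, GiB);
}

/* Same shape as pc_max_used_gpa() in the diff: hole64 start + hole64 size. */
static uint64_t model_max_used_gpa(uint64_t above_4g_start,
                                   uint64_t above_4g_size,
                                   uint64_t hotplug_size,
                                   uint64_t pci_hole64_size)
{
    return model_hole64_start(above_4g_start, above_4g_size, hotplug_size) +
           pci_hole64_size;
}
```

With above_4g_size = 0, a 1022 GiB hotplug area and an assumed 32 GiB hole,
the model yields 1058 GiB, which is past AMD_HT_START even though there is no
RAM above 4G at boot. Without the hotplug area it stays at 36 GiB.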

[*] As reiterated here:

> Let me see
> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
> 

I was over-complicating things here. It turns out nothing else is needed other than in the
context of the 1T hole.

This is because I only need to check address space limits (as a consequence of
pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surpasses HT_START, which
fundamentally requires a value close to 1T, well beyond what 32-bit can cover. So on
32-bit guests this is never true, and thus things don't change behaviour from the current
default for these guests. And thus I won't break qtests, and things fail correctly in the
right places.

Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
max used GPA value, given that I return the pci hole64 end (essentially). Do you still want that
addressed for correctness, even if it doesn't matter much for the 64-bit 1T case?

If so, our only option seems to be to check phys_bits <= 32 and return the max CPU
boundary there? Unless you have something entirely different in mind?

> I would really love to have v7.1.0 with this issue fixed but I am not very
> confident it is going to make it :(
> 
> Meanwhile, let me know if you have thoughts on this one:
> 
> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
> 
> I am going to assume that if no comments on the above that I'll keep things as is.
> 
> And also, whether I can retain your ack with Bernhard's suggestion here:
> 
> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
> 


diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 668e15c8f2a6..45433cc53b5b 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -881,6 +881,67 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
     return start;
 }

+static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
+{
+    return pc_pci_hole64_start() + pci_hole64_size;
+}
+
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
+                                          uint64_t pci_hole64_size)
+{
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr maxphysaddr, maxusedaddr;
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is required to be appropriately sized in order
+     * to proceed with the above-4g-region relocation and thus boot.
+     */
+    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u) cannot avoid AMD HT range",
+                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -896,6 +957,7 @@ void pc_memory_init(PCMachineState *pcms,
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
     hwaddr cxl_base, cxl_resv_end = 0;
+    X86CPU *cpu = X86_CPU(first_cpu);

     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
@@ -903,6 +965,27 @@ void pc_memory_init(PCMachineState *pcms,
     linux_boot = (machine->kernel_filename != NULL);

     /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(&cpu->env)) {
+        /* Bail out if max possible address does not cross HT range */
+        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
+            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
+        }
+
+        /*
+         * Advertise the HT region if address space covers the reserved
+         * region or if we relocate.
+         */
+        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
+            cpu->phys_bits >= 40) {
+            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+        }
+    }
+
+    /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
      */



* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-12 11:35               ` Joao Martins
@ 2022-07-14  9:28                 ` Igor Mammedov
  2022-07-14  9:54                   ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-14  9:28 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On Tue, 12 Jul 2022 12:35:49 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/12/22 11:01, Joao Martins wrote:
> > On 7/12/22 10:06, Igor Mammedov wrote:  
> >> On Mon, 11 Jul 2022 21:03:28 +0100
> >> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>> On 7/11/22 16:31, Joao Martins wrote:  
> >>>> On 7/11/22 15:52, Joao Martins wrote:    
> >>>>> On 7/11/22 13:56, Igor Mammedov wrote:    
> >>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
> >>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>>  void pc_memory_init(PCMachineState *pcms,
> >>>                      MemoryRegion *system_memory,
> >>>                      MemoryRegion *rom_memory,
> >>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
> >>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>      hwaddr cxl_base, cxl_resv_end = 0;
> >>> +    X86CPU *cpu = X86_CPU(first_cpu);
> >>>
> >>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
> >>>                                  x86ms->above_4g_mem_size);
> >>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
> >>>      linux_boot = (machine->kernel_filename != NULL);
> >>>
> >>>      /*
> >>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> >>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> >>> +     * to above 1T to AMD vCPUs only.
> >>> +     */
> >>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {  
> >>
> >> it has the same issue as pc_max_used_gpa(), i.e.
> >>   x86ms->above_4g_mem_size != 0
> >> doesn't mean that there isn't any memory above 4Gb nor that there aren't
> >> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
> >> max_used_gpa
> >> I'd prefer to keep pc_max_used_gpa(),
> >> idea but make it work for above cases and be more generic (i.e. not to be
> >> tied to AMD only) since 'pc_max_used_gpa() < physbits'
> >> applies to equally
> >> to AMD and Intel (and to trip it, one just have to configure small enough
> >> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
> >>  
> > I can reproduce the issue you're thinking with basic memory hotplug.   
> 
> I was mislead by a bug that only existed in v6. Which I fixed now.
> So any bug possibility with hotplug, SGX and CXL, or pcihole64 is simply covered with:
> 
> 	pc_pci_hole64_start() + pci_hole64_size;
> 
> which is what pc_max_used_gpa() does. This works fine /without/ above_4g_mem_size != 0
> check even without above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
> 
> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
> way.
> 
> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
> 
> [*] As reiterated here:
> 
> > Let me see
> > what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
> >   
> 
> I was over-complicating things here. It turns out nothing else is needed aside in the
> context of 1T hole.
> 
> This is because I only need to check address space limits (as consequence of
> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surprasses HT_START. Which
> requires fundamentally a value closer to 1T well beyond what 32-bit can cover. So on
> 32-bit guests this is never true and thus it things don't change behaviour from current
> default for these guests. And thus I won't break qtests and things fail correctly in the
> right places.
> 
> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
> max used GPA value, given that I return pci hole64 end (essentially). Do you still that
> addressed out of correctness even if it doesn't matter much for the 64-bit 1T case?
> 
> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
> boundary there? Unless you have something enterily different in mind?
> 
> > I would really love to have v7.1.0 with this issue fixed but I am not very
> > confident it is going to make it :(
> > 
> > Meanwhile, let me know if you have thoughts on this one:
> > 
> > https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
> > 
> > I am going to assume that if no comments on the above that I'll keep things as is.
> > 
> > And also, whether I can retain your ack with Bernhard's suggestion here:
> > 
> > https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
> >   
> 
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 668e15c8f2a6..45433cc53b5b 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -881,6 +881,67 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>      return start;
>  }
> 
> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
> +{
> +    return pc_pci_hole64_start() + pci_hole64_size;
> +}
> +
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
> +                                          uint64_t pci_hole64_size)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    hwaddr maxphysaddr, maxusedaddr;
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -896,6 +957,7 @@ void pc_memory_init(PCMachineState *pcms,
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      hwaddr cxl_base, cxl_resv_end = 0;
> +    X86CPU *cpu = X86_CPU(first_cpu);
> 
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
> @@ -903,6 +965,27 @@ void pc_memory_init(PCMachineState *pcms,
>      linux_boot = (machine->kernel_filename != NULL);
> 
>      /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env)) {
> +        /* Bail out if max possible address does not cross HT range */
> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);

I'd replace the call with
   x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;

> +        }
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> +            cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }

and then here check that pc_max_used_gpa() fits into phys_bits,
which should cover the AMD case and the case where pci64_hole goes beyond
the supported address range even without the 1TB hole

> +
> +    /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
>       */
> 
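
The flow suggested in the two inline comments above can be sketched roughly
as follows. This is a standalone model under stated assumptions, not QEMU
code: struct layout_model, model_max_used_gpa() and model_memory_init() are
hypothetical, and the layout is reduced to "start + RAM above 4G + 64-bit PCI
hole". The AMD branch only relocates, and the phys-bits check runs afterwards
for every vendor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB                 (1ULL << 30)
#define AMD_HT_START        0xfd00000000ULL
#define AMD_ABOVE_1TB_START (1ULL << 40)

/* Hypothetical stand-in for the machine state used in the patch. */
struct layout_model {
    bool is_amd;
    unsigned phys_bits;
    uint64_t above_4g_mem_start;
    uint64_t above_4g_mem_size;
    uint64_t pci_hole64_size;
};

static uint64_t model_max_used_gpa(const struct layout_model *m)
{
    return m->above_4g_mem_start + m->above_4g_mem_size + m->pci_hole64_size;
}

/* Returns true when the resulting layout fits within phys_bits. */
static bool model_memory_init(struct layout_model *m)
{
    uint64_t maxphysaddr;

    assert(m->phys_bits > 0 && m->phys_bits < 64);
    maxphysaddr = (1ULL << m->phys_bits) - 1;

    /* AMD only: relocate RAM above 4G past the HT range when it would overlap. */
    if (m->is_amd && model_max_used_gpa(m) >= AMD_HT_START) {
        m->above_4g_mem_start = AMD_ABOVE_1TB_START;
    }

    /* Generic: applies with or without the 1TB hole. */
    return !(maxphysaddr < model_max_used_gpa(m));
}
```

In this model an AMD guest with 1017 GiB above 4G and phys-bits=40 is
relocated and then rejected, the same guest with phys-bits=42 is accepted,
and a non-AMD guest with phys-bits=36 and 100 GiB above 4G is rejected
without any relocation.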




* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-12 10:01             ` Joao Martins
  2022-07-12 10:21               ` Joao Martins
  2022-07-12 11:35               ` Joao Martins
@ 2022-07-14  9:30               ` Igor Mammedov
  2 siblings, 0 replies; 48+ messages in thread
From: Igor Mammedov @ 2022-07-14  9:30 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On Tue, 12 Jul 2022 11:01:18 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/12/22 10:06, Igor Mammedov wrote:
> > On Mon, 11 Jul 2022 21:03:28 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> On 7/11/22 16:31, Joao Martins wrote:  
> >>> On 7/11/22 15:52, Joao Martins wrote:    
> >>>> On 7/11/22 13:56, Igor Mammedov wrote:    
> >>>>> On Fri,  1 Jul 2022 17:10:13 +0100
> >>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
[...]
> I would really love to have v7.1.0 with this issue fixed but I am not very
> confident it is going to make it :(

it can still make it into the current release

> 
> Meanwhile, let me know if you have thoughts on this one:
> 
> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
> 
> I am going to assume that if no comments on the above that I'll keep things as is.
> 
> And also, whether I can retain your ack with Bernhard's suggestion here:
> 
> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
> 
> >> +        hwaddr maxusedaddr = pc_pci_hole64_start() + pci_hole64_size;
> >> +
> >> +        /* Bail out if max possible address does not cross HT range */
> >> +        if (maxusedaddr >= AMD_HT_START) {
> >> +            pc_set_amd_above_4g_mem_start(pcms, maxusedaddr);
> >> +        }
> >> +
> >> +        /*
> >> +         * Advertise the HT region if address space covers the reserved
> >> +         * region or if we relocate.
> >> +         */
> >> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> >> +            cpu->phys_bits >= 40) {
> >> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> >> +        }
> >> +    }
> >> +
> >> +    /*
> >>       * Split single memory region and use aliases to address portions of it,
> >>       * done for backwards compatibility with older qemus.
> >>       */
> >>  
> >   
> 




* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-14  9:28                 ` Igor Mammedov
@ 2022-07-14  9:54                   ` Joao Martins
  2022-07-14 10:47                     ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-14  9:54 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/14/22 10:28, Igor Mammedov wrote:
> On Tue, 12 Jul 2022 12:35:49 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 7/12/22 11:01, Joao Martins wrote:
>>> On 7/12/22 10:06, Igor Mammedov wrote:  
>>>> On Mon, 11 Jul 2022 21:03:28 +0100
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 7/11/22 16:31, Joao Martins wrote:  
>>>>>> On 7/11/22 15:52, Joao Martins wrote:    
>>>>>>> On 7/11/22 13:56, Igor Mammedov wrote:    
>>>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>  void pc_memory_init(PCMachineState *pcms,
>>>>>                      MemoryRegion *system_memory,
>>>>>                      MemoryRegion *rom_memory,
>>>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>>>
>>>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>>>                                  x86ms->above_4g_mem_size);
>>>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>      linux_boot = (machine->kernel_filename != NULL);
>>>>>
>>>>>      /*
>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>>>> +     * to above 1T to AMD vCPUs only.
>>>>> +     */
>>>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {  
>>>>
>>>> it has the same issue as pc_max_used_gpa(), i.e.
>>>>   x86ms->above_4g_mem_size != 0
>>>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
>>>> any MMIO (sgx/cxl/pci64hole), that's was the reason we were are considering
>>>> max_used_gpa
>>>> I'd prefer to keep pc_max_used_gpa(),
>>>> idea but make it work for above cases and be more generic (i.e. not to be
>>>> tied to AMD only) since 'pc_max_used_gpa() < physbits'
>>>> applies to equally
>>>> to AMD and Intel (and to trip it, one just have to configure small enough
>>>> physbits or large enough hotpluggable RAM/CXL/PCI64HOLE)
>>>>  
>>> I can reproduce the issue you're thinking with basic memory hotplug.   
>>
>> I was mislead by a bug that only existed in v6. Which I fixed now.
>> So any bug possibility with hotplug, SGX and CXL, or pcihole64 is simply covered with:
>>
>> 	pc_pci_hole64_start() + pci_hole64_size;
>>
>> which is what pc_max_used_gpa() does. This works fine /without/ above_4g_mem_size != 0
>> check even without above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
>>
>> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
>> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
>> way.
>>
>> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
>>
>> [*] As reiterated here:
>>
>>> Let me see
>>> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
>>>   
>>
>> I was over-complicating things here. It turns out nothing else is needed aside in the
>> context of 1T hole.
>>
>> This is because I only need to check address space limits (as consequence of
>> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surprasses HT_START. Which
>> requires fundamentally a value closer to 1T well beyond what 32-bit can cover. So on
>> 32-bit guests this is never true and thus it things don't change behaviour from current
>> default for these guests. And thus I won't break qtests and things fail correctly in the
>> right places.
>>
>> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
>> max used GPA value, given that I return pci hole64 end (essentially). Do you still that
>> addressed out of correctness even if it doesn't matter much for the 64-bit 1T case?
>>
>> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
>> boundary there? Unless you have something enterily different in mind?
>>
>>> I would really love to have v7.1.0 with this issue fixed but I am not very
>>> confident it is going to make it :(
>>>
>>> Meanwhile, let me know if you have thoughts on this one:
>>>
>>> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
>>>
>>> I am going to assume that if no comments on the above that I'll keep things as is.
>>>
>>> And also, whether I can retain your ack with Bernhard's suggestion here:
>>>
>>> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
>>>   
>>
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 668e15c8f2a6..45433cc53b5b 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -881,6 +881,67 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>      return start;
>>  }
>>
>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
>> +{
>> +    return pc_pci_hole64_start() + pci_hole64_size;
>> +}
>> +
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static void pc_set_amd_above_4g_mem_start(PCMachineState *pcms,
>> +                                          uint64_t pci_hole64_size)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    hwaddr maxphysaddr, maxusedaddr;
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>> +    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u) cannot avoid AMD HT range",
>> +                     maxphysaddr, maxusedaddr, X86_CPU(first_cpu)->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -896,6 +957,7 @@ void pc_memory_init(PCMachineState *pcms,
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      hwaddr cxl_base, cxl_resv_end = 0;
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>> @@ -903,6 +965,27 @@ void pc_memory_init(PCMachineState *pcms,
>>      linux_boot = (machine->kernel_filename != NULL);
>>
>>      /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (IS_AMD_CPU(&cpu->env)) {
>> +        /* Bail out if max possible address does not cross HT range */
>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
>> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
> 
> I'd replace call with 
>    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> 
See below.

>> +        }
>> +
>> +        /*
>> +         * Advertise the HT region if address space covers the reserved
>> +         * region or if we relocate.
>> +         */
>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>> +            cpu->phys_bits >= 40) {
>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +        }
>> +    }
> 
> and then here check that pc_max_used_gpa() fits into phys_bits
> which should cover AMD case and case where pci64_hole goes beyond 
> supported address range even without 1TB hole
> 

When you say 'here', do you mean outside IS_AMD_CPU()?

If we put it outside (and thus make it generic), where it was ... it will break qtests,
as pc_max_used_gpa() does not handle the 32-bit case, as mentioned earlier.
Hence why it is inside pc_set_amd_above_4g_mem_start(), or in other words
inside the scope of:

	if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START)

Which means I will for sure have a pci_hole64.
Making it generic, i.e. moving it /outside/ this conditional, requires addressing this
earlier comment I made:

 our only option seems to be to check phys_bits <= 32 and return max CPU
 boundary there?
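
A minimal sketch of what that option could look like, written as a
standalone model rather than against the QEMU tree. model_max_used_gpa() and
its parameters are hypothetical; hole64_start stands in for whatever
pc_pci_hole64_start() would return.

```c
#include <assert.h>
#include <stdint.h>

#define GiB          (1ULL << 30)
#define AMD_HT_START 0xfd00000000ULL

/*
 * Hypothetical variant of pc_max_used_gpa(): guests with phys_bits <= 32
 * cannot address a 64-bit PCI hole, so report the max CPU address instead
 * of the hole64 end.
 */
static uint64_t model_max_used_gpa(unsigned phys_bits,
                                   uint64_t hole64_start,
                                   uint64_t pci_hole64_size)
{
    assert(phys_bits > 0 && phys_bits < 64);

    if (phys_bits <= 32) {
        return (1ULL << phys_bits) - 1;
    }

    return hole64_start + pci_hole64_size;
}
```

With this clamp a 32-bit guest always reports 0xffffffff, which is far below
AMD_HT_START, so neither the relocation nor a generic phys-bits check would
trigger for it.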




* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-14  9:54                   ` Joao Martins
@ 2022-07-14 10:47                     ` Joao Martins
  2022-07-14 11:50                       ` Igor Mammedov
  0 siblings, 1 reply; 48+ messages in thread
From: Joao Martins @ 2022-07-14 10:47 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/14/22 10:54, Joao Martins wrote:
> On 7/14/22 10:28, Igor Mammedov wrote:
>> On Tue, 12 Jul 2022 12:35:49 +0100
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>> On 7/12/22 11:01, Joao Martins wrote:
>>>> On 7/12/22 10:06, Igor Mammedov wrote:  
>>>>> On Mon, 11 Jul 2022 21:03:28 +0100
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>> On 7/11/22 16:31, Joao Martins wrote:  
>>>>>>> On 7/11/22 15:52, Joao Martins wrote:    
>>>>>>>> On 7/11/22 13:56, Igor Mammedov wrote:    
>>>>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>>  void pc_memory_init(PCMachineState *pcms,
>>>>>>                      MemoryRegion *system_memory,
>>>>>>                      MemoryRegion *rom_memory,
>>>>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>>>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>>>>
>>>>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>>>>                                  x86ms->above_4g_mem_size);
>>>>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>>      linux_boot = (machine->kernel_filename != NULL);
>>>>>>
>>>>>>      /*
>>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>>>>> +     * to above 1T to AMD vCPUs only.
>>>>>> +     */
>>>>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {  
>>>>>
>>>>> it has the same issue as pc_max_used_gpa(), i.e.
>>>>>   x86ms->above_4g_mem_size != 0
>>>>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
>>>>> any MMIO (sgx/cxl/pci64hole); that was the reason we were considering
>>>>> max_used_gpa.
>>>>> I'd prefer to keep the pc_max_used_gpa() idea, but make it work for the
>>>>> above cases and be more generic (i.e. not tied to AMD only), since
>>>>> 'pc_max_used_gpa() < physbits' applies equally to AMD and Intel (and to
>>>>> trip it, one just has to configure small enough physbits or large enough
>>>>> hotpluggable RAM/CXL/PCI64HOLE)
>>>>>  
>>>> I can reproduce the issue you're thinking of with basic memory hotplug.
>>>
>>> I was misled by a bug that only existed in v6, which I have now fixed.
>>> So any possible bug with hotplug, SGX, CXL, or pcihole64 is simply covered by:
>>>
>>> 	pc_pci_hole64_start() + pci_hole64_size;
>>>
>>> which is what pc_max_used_gpa() does. This works fine /without/ the above_4g_mem_size != 0
>>> check, even with no above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
>>>
>>> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
>>> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
>>> way.
>>>
>>> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
>>>
>>> [*] As reiterated here:
>>>
>>>> Let me see
>>>> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
>>>>   
>>>
>>> I was over-complicating things here. It turns out nothing else is needed except in the
>>> context of the 1T hole.
>>>
>>> This is because I only need to check address space limits (as a consequence of
>>> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surpasses HT_START, which
>>> fundamentally requires a value close to 1T, well beyond what 32-bit can cover. So on
>>> 32-bit guests this is never true, and thus things don't change behaviour from the current
>>> default for these guests. And thus I won't break qtests, and things fail correctly in the
>>> right places.
>>>
>>> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
>>> max used GPA value, given that I return the pci hole64 end (essentially). Do you still want
>>> that addressed for correctness, even if it doesn't matter much for the 64-bit 1T case?
>>>
>>> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
>>> boundary there? Unless you have something entirely different in mind?
>>>
>>>> I would really love to have v7.1.0 with this issue fixed but I am not very
>>>> confident it is going to make it :(
>>>>
>>>> Meanwhile, let me know if you have thoughts on this one:
>>>>
>>>> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
>>>>
>>>> I am going to assume that, if there are no comments on the above, I'll keep things as is.
>>>>
>>>> And also, whether I can retain your ack with Bernhard's suggestion here:
>>>>
>>>> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
>>>>   
>>>
>>>

[...]

>>>      /*
>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>> +     * to above 1T to AMD vCPUs only.
>>> +     */
>>> +    if (IS_AMD_CPU(&cpu->env)) {
>>> +        /* Bail out if max possible address does not cross HT range */
>>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
>>> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);
>>
>> I'd replace call with 
>>    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>>
> See below.
> 
>>> +        }
>>> +
>>> +        /*
>>> +         * Advertise the HT region if address space covers the reserved
>>> +         * region or if we relocate.
>>> +         */
>>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>>> +            cpu->phys_bits >= 40) {
>>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>>> +        }
>>> +    }
>>
>> and then here check that pc_max_used_gpa() fits into phys_bits
>> which should cover AMD case and case where pci64_hole goes beyond 
>> supported address range even without 1TB hole
>>
> 
> When you say 'here', do you mean outside IS_AMD_CPU()?
> 
> If we put the check outside (and thus make it generic), where it was before, it
> will break qtests, as pc_max_used_gpa() does not handle the 32-bit case, as
> mentioned earlier. That is why it is inside pc_set_amd_above_4g_mem_start(),
> or in other words inside the scope of:
> 
> 	if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START)
> 
> which means I will for sure have a pci_hole64.
> Making it generic, /outside/ this conditional, requires addressing this
> earlier comment I made:
> 
>  our only option seems to be to check phys_bits <= 32 and return max CPU
>  boundary there?
> 

Here's what this patch looks like after your comments and the issue I am
talking about above. The added part is inside pc_max_used_gpa().

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 668e15c8f2a6..2d85c66502d5 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -881,6 +881,51 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
     return start;
 }

+static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
+{
+    X86CPU *cpu = X86_CPU(first_cpu);
+
+    if (cpu->phys_bits <= 32) {
+        return (1ULL << cpu->phys_bits) - 1ULL;
+    }
+
+    return pc_pci_hole64_start() + pci_hole64_size;
+}
+
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -895,7 +940,9 @@ void pc_memory_init(PCMachineState *pcms,
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
     X86MachineState *x86ms = X86_MACHINE(pcms);
+    hwaddr maxphysaddr, maxusedaddr;
     hwaddr cxl_base, cxl_resv_end = 0;
+    X86CPU *cpu = X86_CPU(first_cpu);

     assert(machine->ram_size == x86ms->below_4g_mem_size +
                                 x86ms->above_4g_mem_size);
@@ -903,6 +950,40 @@ void pc_memory_init(PCMachineState *pcms,
     linux_boot = (machine->kernel_filename != NULL);

     /*
+     * The HyperTransport range close to the 1T boundary is unique to AMD
+     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
+     * to above 1T to AMD vCPUs only.
+     */
+    if (IS_AMD_CPU(&cpu->env)) {
+        /* Bail out if max possible address does not cross HT range */
+        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
+            x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+        }
+
+        /*
+         * Advertise the HT region if address space covers the reserved
+         * region or if we relocate.
+         */
+        if (cpu->phys_bits >= 40) {
+            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+        }
+    }
+
+    /*
+     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
+     * So make sure phys-bits is required to be appropriately sized in order
+     * to proceed with the above-4g-region relocation and thus boot.
+     */
+    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
+    maxphysaddr = ((hwaddr)1 << cpu->phys_bits) - 1;
+    if (maxphysaddr < maxusedaddr) {
+        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
+                     " phys-bits too low (%u)",
+                     maxphysaddr, maxusedaddr, cpu->phys_bits);
+        exit(EXIT_FAILURE);
+    }
+
+    /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
      */
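
For anyone following the logic outside the QEMU tree, here is a minimal
standalone sketch of the decision made in the pc_memory_init() hunk above. It
is not the QEMU code: everything prefixed sketch_ is hypothetical, is_amd
stands in for IS_AMD_CPU(), max_used_gpa for the result of pc_max_used_gpa(),
and the 4 GiB default for above_4g_mem_start is an assumption based on the
earlier patches in this series. Only the AMD_* values are taken from the patch
(with ULL suffixes so the sketch also builds on 32-bit hosts).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the patch; the HT range is 12 GiB just below 1 TiB. */
#define AMD_HT_START         0xfd00000000ULL
#define AMD_HT_END           0xffffffffffULL
#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)

/* Hypothetical result type: where RAM above 4G starts, and whether the
 * HyperTransport range should be advertised as reserved in e820. */
struct sketch_layout {
    uint64_t above_4g_mem_start;
    bool     reserve_ht_range;
};

static struct sketch_layout sketch_amd_layout(bool is_amd,
                                              unsigned phys_bits,
                                              uint64_t max_used_gpa)
{
    /* Assumed default: RAM above 4G starts right at the 4 GiB boundary. */
    struct sketch_layout l = {
        .above_4g_mem_start = 4ULL << 30,
        .reserve_ht_range = false,
    };

    assert(phys_bits >= 1 && phys_bits <= 64);

    if (is_amd) {
        /* Relocate only if the highest used address reaches the HT range. */
        if (max_used_gpa >= AMD_HT_START) {
            l.above_4g_mem_start = AMD_ABOVE_1TB_START;
        }
        /* Advertise the HT range whenever the address space covers it. */
        if (phys_bits >= 40) {
            l.reserve_ht_range = true;
        }
    }
    return l;
}
```

Note that in this shape the e820 reservation depends only on phys_bits, so a
small AMD guest with 40 or more phys-bits still gets the reserved entry even
though its RAM is not relocated.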



* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-14 10:47                     ` Joao Martins
@ 2022-07-14 11:50                       ` Igor Mammedov
  2022-07-14 15:39                         ` Joao Martins
  0 siblings, 1 reply; 48+ messages in thread
From: Igor Mammedov @ 2022-07-14 11:50 UTC (permalink / raw)
  To: Joao Martins
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On Thu, 14 Jul 2022 11:47:19 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 7/14/22 10:54, Joao Martins wrote:
> > On 7/14/22 10:28, Igor Mammedov wrote:  
> >> On Tue, 12 Jul 2022 12:35:49 +0100
> >> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>> On 7/12/22 11:01, Joao Martins wrote:  
> >>>> On 7/12/22 10:06, Igor Mammedov wrote:    
> >>>>> On Mon, 11 Jul 2022 21:03:28 +0100
> >>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>>> On 7/11/22 16:31, Joao Martins wrote:    
> >>>>>>> On 7/11/22 15:52, Joao Martins wrote:      
> >>>>>>>> On 7/11/22 13:56, Igor Mammedov wrote:      
> >>>>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
> >>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>>>  void pc_memory_init(PCMachineState *pcms,
> >>>>>>                      MemoryRegion *system_memory,
> >>>>>>                      MemoryRegion *rom_memory,
> >>>>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
> >>>>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>      hwaddr cxl_base, cxl_resv_end = 0;
> >>>>>> +    X86CPU *cpu = X86_CPU(first_cpu);
> >>>>>>
> >>>>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
> >>>>>>                                  x86ms->above_4g_mem_size);
> >>>>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
> >>>>>>      linux_boot = (machine->kernel_filename != NULL);
> >>>>>>
> >>>>>>      /*
> >>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> >>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> >>>>>> +     * to above 1T to AMD vCPUs only.
> >>>>>> +     */
> >>>>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {    
> >>>>>
> >>>>> it has the same issue as pc_max_used_gpa(), i.e.
> >>>>>   x86ms->above_4g_mem_size != 0
> >>>>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
> >>>>> any MMIO (sgx/cxl/pci64hole); that was the reason we were considering
> >>>>> max_used_gpa.
> >>>>> I'd prefer to keep the pc_max_used_gpa() idea, but make it work for the
> >>>>> above cases and be more generic (i.e. not tied to AMD only), since
> >>>>> 'pc_max_used_gpa() < physbits' applies equally to AMD and Intel (and to
> >>>>> trip it, one just has to configure small enough physbits or large enough
> >>>>> hotpluggable RAM/CXL/PCI64HOLE)
> >>>>>    
> >>>> I can reproduce the issue you're thinking of with basic memory hotplug.
> >>>
> >>> I was misled by a bug that only existed in v6, which I have now fixed.
> >>> So any possible bug with hotplug, SGX, CXL, or pcihole64 is simply covered by:
> >>>
> >>> 	pc_pci_hole64_start() + pci_hole64_size;
> >>>
> >>> which is what pc_max_used_gpa() does. This works fine /without/ the above_4g_mem_size != 0
> >>> check, even with no above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
> >>>
> >>> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
> >>> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
> >>> way.
> >>>
> >>> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
> >>>
> >>> [*] As reiterated here:
> >>>  
> >>>> Let me see
> >>>> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
> >>>>     
> >>>
> >>> I was over-complicating things here. It turns out nothing else is needed except in the
> >>> context of the 1T hole.
> >>>
> >>> This is because I only need to check address space limits (as a consequence of
> >>> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surpasses HT_START, which
> >>> fundamentally requires a value close to 1T, well beyond what 32-bit can cover. So on
> >>> 32-bit guests this is never true, and thus things don't change behaviour from the current
> >>> default for these guests. And thus I won't break qtests, and things fail correctly in the
> >>> right places.
> >>>
> >>> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
> >>> max used GPA value, given that I return the pci hole64 end (essentially). Do you still want
> >>> that addressed for correctness, even if it doesn't matter much for the 64-bit 1T case?
> >>>
> >>> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
> >>> boundary there? Unless you have something entirely different in mind?
> >>>  
> >>>> I would really love to have v7.1.0 with this issue fixed but I am not very
> >>>> confident it is going to make it :(
> >>>>
> >>>> Meanwhile, let me know if you have thoughts on this one:
> >>>>
> >>>> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
> >>>>
> >>>> I am going to assume that, if there are no comments on the above, I'll keep things as is.
> >>>>
> >>>> And also, whether I can retain your ack with Bernhard's suggestion here:
> >>>>
> >>>> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
> >>>>     
> >>>
> >>>  
> 
> [...]
> 
> >>>      /*
> >>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> >>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> >>> +     * to above 1T to AMD vCPUs only.
> >>> +     */
> >>> +    if (IS_AMD_CPU(&cpu->env)) {
> >>> +        /* Bail out if max possible address does not cross HT range */
> >>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
> >>> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);  
> >>
> >> I'd replace call with 
> >>    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> >>  
> > See below.
> >   
> >>> +        }
> >>> +
> >>> +        /*
> >>> +         * Advertise the HT region if address space covers the reserved
> >>> +         * region or if we relocate.
> >>> +         */
> >>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
> >>> +            cpu->phys_bits >= 40) {
> >>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> >>> +        }
> >>> +    }  
> >>
> >> and then here check that pc_max_used_gpa() fits into phys_bits
> >> which should cover AMD case and case where pci64_hole goes beyond 
> >> supported address range even without 1TB hole
> >>  
> > 
> > When you say 'here', do you mean outside IS_AMD_CPU()?
> > 
> > If we put the check outside (and thus make it generic), where it was before, it
> > will break qtests, as pc_max_used_gpa() does not handle the 32-bit case, as
> > mentioned earlier. That is why it is inside pc_set_amd_above_4g_mem_start(),
> > or in other words inside the scope of:
> > 
> > 	if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START)
> > 
> > which means I will for sure have a pci_hole64.
> > Making it generic, /outside/ this conditional, requires addressing this
> > earlier comment I made:
> > 
> >  our only option seems to be to check phys_bits <= 32 and return max CPU
> >  boundary there?
> >   
> 
> Here's what this patch looks like after your comments and the issue I am
> talking about above. The added part is inside pc_max_used_gpa().
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 668e15c8f2a6..2d85c66502d5 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -881,6 +881,51 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>      return start;
>  }
> 
> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
> +{
> +    X86CPU *cpu = X86_CPU(first_cpu);
> +
> +    if (cpu->phys_bits <= 32) {

> +        return (1ULL << cpu->phys_bits) - 1ULL;
Add a comment here as to why this value is returned

> +    }
> +
> +    return pc_pci_hole64_start() + pci_hole64_size;
> +}
> +
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -895,7 +940,9 @@ void pc_memory_init(PCMachineState *pcms,
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>      X86MachineState *x86ms = X86_MACHINE(pcms);
> +    hwaddr maxphysaddr, maxusedaddr;
>      hwaddr cxl_base, cxl_resv_end = 0;
> +    X86CPU *cpu = X86_CPU(first_cpu);
> 
>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>                                  x86ms->above_4g_mem_size);
> @@ -903,6 +950,40 @@ void pc_memory_init(PCMachineState *pcms,
>      linux_boot = (machine->kernel_filename != NULL);
> 
>      /*
> +     * The HyperTransport range close to the 1T boundary is unique to AMD
> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
> +     * to above 1T to AMD vCPUs only.
> +     */
> +    if (IS_AMD_CPU(&cpu->env)) {
> +        /* Bail out if max possible address does not cross HT range */
> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
> +            x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +        }
> +
> +        /*
> +         * Advertise the HT region if address space covers the reserved
> +         * region or if we relocate.
> +         */
> +        if (cpu->phys_bits >= 40) {
> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +        }
> +    }
> +
> +    /*
> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
> +     * So make sure phys-bits is required to be appropriately sized in order
> +     * to proceed with the above-4g-region relocation and thus boot.
> +     */
> +    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
> +    maxphysaddr = ((hwaddr)1 << cpu->phys_bits) - 1;
> +    if (maxphysaddr < maxusedaddr) {
> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
> +                     " phys-bits too low (%u)",
> +                     maxphysaddr, maxusedaddr, cpu->phys_bits);
> +        exit(EXIT_FAILURE);
> +    }
> +

it looks fine to me

> +    /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
>       */
> 




* Re: [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable
  2022-07-14 11:50                       ` Igor Mammedov
@ 2022-07-14 15:39                         ` Joao Martins
  0 siblings, 0 replies; 48+ messages in thread
From: Joao Martins @ 2022-07-14 15:39 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, Eduardo Habkost, Michael S. Tsirkin,
	Richard Henderson, Alex Williamson, Paolo Bonzini, Ani Sinha,
	Marcel Apfelbaum, Dr. David Alan Gilbert, Suravee Suthikulpanit

On 7/14/22 12:50, Igor Mammedov wrote:
> On Thu, 14 Jul 2022 11:47:19 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 7/14/22 10:54, Joao Martins wrote:
>>> On 7/14/22 10:28, Igor Mammedov wrote:  
>>>> On Tue, 12 Jul 2022 12:35:49 +0100
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 7/12/22 11:01, Joao Martins wrote:  
>>>>>> On 7/12/22 10:06, Igor Mammedov wrote:    
>>>>>>> On Mon, 11 Jul 2022 21:03:28 +0100
>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>>> On 7/11/22 16:31, Joao Martins wrote:    
>>>>>>>>> On 7/11/22 15:52, Joao Martins wrote:      
>>>>>>>>>> On 7/11/22 13:56, Igor Mammedov wrote:      
>>>>>>>>>>> On Fri,  1 Jul 2022 17:10:13 +0100
>>>>>>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
>>>>>>>>  void pc_memory_init(PCMachineState *pcms,
>>>>>>>>                      MemoryRegion *system_memory,
>>>>>>>>                      MemoryRegion *rom_memory,
>>>>>>>> @@ -897,6 +953,7 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>>>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>>>      hwaddr cxl_base, cxl_resv_end = 0;
>>>>>>>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>>>>>>>
>>>>>>>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>>>>>>>                                  x86ms->above_4g_mem_size);
>>>>>>>> @@ -904,6 +961,29 @@ void pc_memory_init(PCMachineState *pcms,
>>>>>>>>      linux_boot = (machine->kernel_filename != NULL);
>>>>>>>>
>>>>>>>>      /*
>>>>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>>>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>>>>>>> +     * to above 1T to AMD vCPUs only.
>>>>>>>> +     */
>>>>>>>> +    if (IS_AMD_CPU(&cpu->env) && x86ms->above_4g_mem_size) {    
>>>>>>>
>>>>>>> it has the same issue as pc_max_used_gpa(), i.e.
>>>>>>>   x86ms->above_4g_mem_size != 0
>>>>>>> doesn't mean that there isn't any memory above 4Gb nor that there aren't
>>>>>>> any MMIO (sgx/cxl/pci64hole); that was the reason we were considering
>>>>>>> max_used_gpa.
>>>>>>> I'd prefer to keep the pc_max_used_gpa() idea, but make it work for the
>>>>>>> above cases and be more generic (i.e. not tied to AMD only), since
>>>>>>> 'pc_max_used_gpa() < physbits' applies equally to AMD and Intel (and to
>>>>>>> trip it, one just has to configure small enough physbits or large enough
>>>>>>> hotpluggable RAM/CXL/PCI64HOLE)
>>>>>>>    
>>>>>> I can reproduce the issue you're thinking of with basic memory hotplug.
>>>>>
>>>>> I was misled by a bug that only existed in v6, which I have now fixed.
>>>>> So any possible bug with hotplug, SGX, CXL, or pcihole64 is simply covered by:
>>>>>
>>>>> 	pc_pci_hole64_start() + pci_hole64_size;
>>>>>
>>>>> which is what pc_max_used_gpa() does. This works fine /without/ the above_4g_mem_size != 0
>>>>> check, even with no above_4g_mem_size (e.g. mem=2G,maxmem=1024G).
>>>>>
>>>>> And as a reminder: SGX, hotplug, CXL and pci-hole64 *require* memory above 4G[*]. And part
>>>>> of the point of us moving to pc_pci_hole64_start() was to make these all work in a generic
>>>>> way.
>>>>>
>>>>> So I've removed the x86ms->above_4g_mem_size != 0 check. Current patch diff pasted at the end.
>>>>>
>>>>> [*] As reiterated here:
>>>>>  
>>>>>> Let me see
>>>>>> what I can come up in pc_max_used_gpa() to cover this one. I'll respond here with a proposal.
>>>>>>     
>>>>>
>>>>> I was over-complicating things here. It turns out nothing else is needed except in the
>>>>> context of the 1T hole.
>>>>>
>>>>> This is because I only need to check address space limits (as a consequence of
>>>>> pc_set_amd_above_4g_mem_start()) when pc_max_used_gpa() surpasses HT_START, which
>>>>> fundamentally requires a value close to 1T, well beyond what 32-bit can cover. So on
>>>>> 32-bit guests this is never true, and thus things don't change behaviour from the current
>>>>> default for these guests. And thus I won't break qtests, and things fail correctly in the
>>>>> right places.
>>>>>
>>>>> Now I should say that pc_max_used_gpa() is still not returning the accurate 32-bit guest
>>>>> max used GPA value, given that I return the pci hole64 end (essentially). Do you still want
>>>>> that addressed for correctness, even if it doesn't matter much for the 64-bit 1T case?
>>>>>
>>>>> If so, our only option seems to be to check phys_bits <= 32 and return max CPU
>>>>> boundary there? Unless you have something entirely different in mind?
>>>>>  
>>>>>> I would really love to have v7.1.0 with this issue fixed but I am not very
>>>>>> confident it is going to make it :(
>>>>>>
>>>>>> Meanwhile, let me know if you have thoughts on this one:
>>>>>>
>>>>>> https://lore.kernel.org/qemu-devel/1b2fa957-74f6-b5a9-3fc1-65c5d68300ce@oracle.com/
>>>>>>
>>>>>> I am going to assume that, if there are no comments on the above, I'll keep things as is.
>>>>>>
>>>>>> And also, whether I can retain your ack with Bernhard's suggestion here:
>>>>>>
>>>>>> https://lore.kernel.org/qemu-devel/0eefb382-4ac6-4335-ca61-035babb95a88@oracle.com/
>>>>>>     
>>>>>
>>>>>  
>>
>> [...]
>>
>>>>>      /*
>>>>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>>>>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>>>>> +     * to above 1T to AMD vCPUs only.
>>>>> +     */
>>>>> +    if (IS_AMD_CPU(&cpu->env)) {
>>>>> +        /* Bail out if max possible address does not cross HT range */
>>>>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
>>>>> +            pc_set_amd_above_4g_mem_start(pcms, pci_hole64_size);  
>>>>
>>>> I'd replace call with 
>>>>    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>>>>  
>>> See below.
>>>   
>>>>> +        }
>>>>> +
>>>>> +        /*
>>>>> +         * Advertise the HT region if address space covers the reserved
>>>>> +         * region or if we relocate.
>>>>> +         */
>>>>> +        if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START ||
>>>>> +            cpu->phys_bits >= 40) {
>>>>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>>>>> +        }
>>>>> +    }  
>>>>
>>>> and then here check that pc_max_used_gpa() fits into phys_bits
>>>> which should cover AMD case and case where pci64_hole goes beyond 
>>>> supported address range even without 1TB hole
>>>>  
>>>
>>> When you say 'here', do you mean outside IS_AMD_CPU()?
>>>
>>> If we put the check outside (and thus make it generic), where it was before, it
>>> will break qtests, as pc_max_used_gpa() does not handle the 32-bit case, as
>>> mentioned earlier. That is why it is inside pc_set_amd_above_4g_mem_start(),
>>> or in other words inside the scope of:
>>>
>>> 	if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START)
>>>
>>> which means I will for sure have a pci_hole64.
>>> Making it generic, /outside/ this conditional, requires addressing this
>>> earlier comment I made:
>>>
>>>  our only option seems to be to check phys_bits <= 32 and return max CPU
>>>  boundary there?
>>>   
>>
>> Here's what this patch looks like after your comments and the issue I am
>> talking about above. The added part is inside pc_max_used_gpa().
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 668e15c8f2a6..2d85c66502d5 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -881,6 +881,51 @@ static uint64_t pc_get_cxl_range_end(PCMachineState *pcms)
>>      return start;
>>  }
>>
>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t pci_hole64_size)
>> +{
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>> +
>> +    if (cpu->phys_bits <= 32) {
> 
>> +        return (1ULL << cpu->phys_bits) - 1ULL;
> Add a comment here as to why this value is returned
> 

I have added this so far:

+    /* 32-bit systems don't have hole64 thus return max phys address */

>> +    }
>> +
>> +    return pc_pci_hole64_start() + pci_hole64_size;
>> +}
>> +

I have also added a - 1 to the calculation above, as it was off by one.
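
To illustrate why the - 1 matters, here is a small standalone sketch, not the
QEMU code. The sketch_ names are hypothetical, and hole64_start is passed in
as a plain parameter in place of the pc_pci_hole64_start() lookup, so no
machine state is needed. The point is that start + size is one past the end of
the hole, while the address space limit computed from phys-bits is inclusive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pc_max_used_gpa(): returns the highest
 * guest-physical address in use (inclusive). */
static uint64_t sketch_max_used_gpa(unsigned phys_bits,
                                    uint64_t hole64_start,
                                    uint64_t hole64_size)
{
    assert(phys_bits >= 1 && phys_bits <= 63);
    assert(hole64_size >= 1);

    /* 32-bit systems don't have hole64 thus return max phys address */
    if (phys_bits <= 32) {
        return (UINT64_C(1) << phys_bits) - 1;
    }

    /* start + size is one past the end; - 1 gives the last used byte. */
    return hole64_start + hole64_size - 1;
}

/* True when the highest used address fits in a phys_bits address space. */
static bool sketch_fits_phys_bits(unsigned phys_bits, uint64_t max_used_gpa)
{
    assert(phys_bits >= 1 && phys_bits <= 63);
    return max_used_gpa <= (UINT64_C(1) << phys_bits) - 1;
}
```

With a 64 GiB hole starting at 0xf000000000 and 40 phys-bits, the hole ends
exactly at the address space limit: the inclusive value 0xffffffffff passes
the check, whereas start + size (0x10000000000) would be rejected even though
every byte of the hole is addressable.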

>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1TB boundary, made of special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting with Linux v5.4 we validate it, and can't create guests on AMD
>> + * machines with certain memory sizes. It's also wrong to use those IOVA
>> + * ranges, as doing so can lead to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -895,7 +940,9 @@ void pc_memory_init(PCMachineState *pcms,
>>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>>      PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    hwaddr maxphysaddr, maxusedaddr;
>>      hwaddr cxl_base, cxl_resv_end = 0;
>> +    X86CPU *cpu = X86_CPU(first_cpu);
>>
>>      assert(machine->ram_size == x86ms->below_4g_mem_size +
>>                                  x86ms->above_4g_mem_size);
>> @@ -903,6 +950,40 @@ void pc_memory_init(PCMachineState *pcms,
>>      linux_boot = (machine->kernel_filename != NULL);
>>
>>      /*
>> +     * The HyperTransport range close to the 1T boundary is unique to AMD
>> +     * hosts with IOMMUs enabled. Restrict the ram-above-4g relocation
>> +     * to above 1T to AMD vCPUs only.
>> +     */
>> +    if (IS_AMD_CPU(&cpu->env)) {
>> +        /* Relocate ram-above-4G if the max used address reaches the HT range */
>> +        if (pc_max_used_gpa(pcms, pci_hole64_size) >= AMD_HT_START) {
>> +            x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>> +        }
>> +
>> +        /*
>> +         * Advertise the HT region if address space covers the reserved
>> +         * region or if we relocate.
>> +         */
>> +        if (cpu->phys_bits >= 40) {
>> +            e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> +     * So make sure phys-bits is required to be appropriately sized in order
>> +     * to proceed with the above-4g-region relocation and thus boot.
>> +     */
>> +    maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
>> +    maxphysaddr = ((hwaddr)1 << cpu->phys_bits) - 1;
>> +    if (maxphysaddr < maxusedaddr) {
>> +        error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> +                     " phys-bits too low (%u)",
>> +                     maxphysaddr, maxusedaddr, cpu->phys_bits);
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
> 
> it looks fine to me
> 

Cool, let me respin v7 today/tomorrow.

>> +    /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>>       */
>>
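
For reference, a small sketch of the arithmetic behind the two checks in the
hunk above. The AMD_HT_* values are copied from the quoted patch; the two
helper functions are hypothetical and only illustrate the comparisons, they
are not part of the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the quoted patch (ULL suffix used for portability). */
#define AMD_HT_START        0xfd00000000ULL
#define AMD_HT_END          0xffffffffffULL
#define AMD_ABOVE_1TB_START (AMD_HT_END + 1)
#define AMD_HT_SIZE         (AMD_ABOVE_1TB_START - AMD_HT_START)

/* Hypothetical helper: true when the highest used GPA reaches the HT range. */
static bool sketch_needs_relocation(uint64_t max_used_gpa)
{
    return max_used_gpa >= AMD_HT_START;
}

/* Hypothetical helper: true when phys_bits can address max_used_gpa. */
static bool sketch_fits_phys_bits(unsigned phys_bits, uint64_t max_used_gpa)
{
    uint64_t max_phys_addr = (phys_bits >= 64) ? UINT64_MAX
                                               : (1ULL << phys_bits) - 1;

    return max_phys_addr >= max_used_gpa;
}
```

The reserved range is 12 GiB, starting at 1012 GiB. Once ram-above-4G starts
at AMD_ABOVE_1TB_START, 40 phys-bits can no longer cover it, which is what
the error_report() path catches.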
> 



end of thread, other threads:[~2022-07-14 15:42 UTC | newest]

Thread overview: 48+ messages
2022-07-01 16:10 [PATCH v6 00/10] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
2022-07-01 16:10 ` [PATCH v6 01/10] hw/i386: add 4g boundary start to X86MachineState Joao Martins
2022-07-01 16:10 ` [PATCH v6 02/10] i386/pc: create pci-host qdev prior to pc_memory_init() Joao Martins
2022-07-01 16:10 ` [PATCH v6 03/10] i386/pc: pass pci_hole64_size " Joao Martins
2022-07-09 20:51   ` B
2022-07-11 10:01     ` Joao Martins
2022-07-11 22:17       ` B
2022-07-12  9:27         ` Joao Martins
2022-07-01 16:10 ` [PATCH v6 04/10] i386/pc: factor out above-4g end to an helper Joao Martins
2022-07-07 12:42   ` Igor Mammedov
2022-07-07 15:14     ` Joao Martins
2022-07-01 16:10 ` [PATCH v6 05/10] i386/pc: factor out cxl range end to helper Joao Martins
2022-07-07 12:57   ` Igor Mammedov
2022-07-07 15:17     ` Joao Martins
2022-07-01 16:10 ` [PATCH v6 06/10] i386/pc: factor out cxl range start " Joao Martins
2022-07-07 13:00   ` Igor Mammedov
2022-07-07 15:18     ` Joao Martins
2022-07-11 12:47       ` Igor Mammedov
2022-07-11 14:28         ` Joao Martins
2022-07-01 16:10 ` [PATCH v6 07/10] i386/pc: handle unitialized mr in pc_get_cxl_range_end() Joao Martins
2022-07-07 13:05   ` Igor Mammedov
2022-07-07 15:21     ` Joao Martins
2022-07-11 12:58       ` Igor Mammedov
2022-07-11 14:32         ` Joao Martins
2022-07-01 16:10 ` [PATCH v6 08/10] i386/pc: factor out device_memory base/size to helper Joao Martins
2022-07-07 13:15   ` Igor Mammedov
2022-07-07 15:23     ` Joao Martins
2022-07-01 16:10 ` [PATCH v6 09/10] i386/pc: relocate 4g start to 1T where applicable Joao Martins
2022-07-07 15:53   ` Joao Martins
2022-07-11 12:56   ` Igor Mammedov
2022-07-11 14:52     ` Joao Martins
2022-07-11 15:31       ` Joao Martins
2022-07-11 20:03         ` Joao Martins
2022-07-12  9:06           ` Igor Mammedov
2022-07-12 10:01             ` Joao Martins
2022-07-12 10:21               ` Joao Martins
2022-07-12 11:35               ` Joao Martins
2022-07-14  9:28                 ` Igor Mammedov
2022-07-14  9:54                   ` Joao Martins
2022-07-14 10:47                     ` Joao Martins
2022-07-14 11:50                       ` Igor Mammedov
2022-07-14 15:39                         ` Joao Martins
2022-07-14  9:30               ` Igor Mammedov
2022-07-01 16:10 ` [PATCH v6 10/10] i386/pc: restrict AMD only enforcing of valid IOVAs to new machine type Joao Martins
2022-07-04 14:27   ` Dr. David Alan Gilbert
2022-07-05  8:48     ` Joao Martins
2022-07-11 13:03   ` Igor Mammedov
2022-07-11 14:56     ` Joao Martins
