All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFCv2 0/4] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
@ 2022-02-07 20:24 Joao Martins
  2022-02-07 20:24 ` [PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState Joao Martins
                   ` (3 more replies)
  0 siblings, 4 replies; 37+ messages in thread
From: Joao Martins @ 2022-02-07 20:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

RFC[0] -> RFCv2:

* At Igor's suggestion in one of the patches I reworked the series enterily,
and more or less as he was thinking it is far simpler to relocate the
ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
and less intrusive. (patch 1 & 2)
* Check phys-bits is big enough prior to relocating (new patch 3)
* Remove the machine property, and it's only internal and set by new machine
version (Igor, patch 4).
* Clarify whether it's GPA or HPA as a more clear meaning (Igor, patch 2)
* Add IOMMU SDM in the commit message (Igor, patch 2)

Note: It still makes me a tiny bit unconfortable to just remove memory from
[4G  - 1010G] range, but it's a little baseless. It's definitely a lot
better to maintain this set given its simplicity. For long term ideas proposed
here, perhaps a Igor's pc-dimm based model idea or equivalent's Alex's
suggestion of an option to control reserved address ranges could enable
adjusting the 1Tb hole to be closer to baremetal. 

The one downside of this approach is CMOS loosing its meaning of the above 4G
ram blocks, but it was mentioned over RFC that CMOS is only useful for very
old seabios. If so, either I leave it as is, or perhaps folks prefer that
I just set the ram above 4G in CMOS as 0.

[0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.martins@oracle.com/

---

This series lets Qemu properly spawn i386 guests with >= 1010G with VFIO,
particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with IOMMU have a hole in
the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.

VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
 -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
of the failure:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)

Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
as documented on the links down below.

This small series tries to address that by dealing with this AMD-specific 1Tb hole,
but rather than dealing like the 4G hole, it instead relocates RAM above 4G
to be above the 1T if the maximum RAM range crosses the HT reserved range.
It is organized as following:

patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
address of the 4G boundary

patch 2: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
possible address acrosses the HT region.

patch 3: Warns user if phys-bits is too low

patch 4: Ensure valid IOVAs only on new machine types, but not older
ones (<= v6.2.0)

The 'consequence' of this approach is that we may need more than the default
phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40
(TCG_PHYS_BITS). Today there's already a precedent to depend on the user to
pick the right value of phys-bits (regardless of this series), so we warn in
case phys-bits aren't enough.

Additionally, the reserved region is added to E820 if the relocation is done.

Alternative options considered (RFCv1):

a) Dealing with the 1T hole like the 4G hole -- which also represents what
hardware closely does.

Thanks,
	Joao

[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Joao Martins (4):
  hw/i386: add 4g boundary start to X86MachineState
  i386/pc: relocate 4g start to 1T where applicable
  i386/pc: warn if phys-bits is too low
  i386/pc: Restrict AMD-only enforcing of valid IOVAs to new machine
    type

 hw/i386/acpi-build.c  |  2 +-
 hw/i386/pc.c          | 87 +++++++++++++++++++++++++++++++++++++++++--
 hw/i386/pc_piix.c     |  2 +
 hw/i386/pc_q35.c      |  2 +
 hw/i386/sgx.c         |  2 +-
 hw/i386/x86.c         |  1 +
 include/hw/i386/pc.h  |  1 +
 include/hw/i386/x86.h |  3 ++
 target/i386/cpu.h     |  4 ++
 9 files changed, 98 insertions(+), 6 deletions(-)

-- 
2.17.2



^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState
  2022-02-07 20:24 [PATCH RFCv2 0/4] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
@ 2022-02-07 20:24 ` Joao Martins
  2022-02-14 13:19   ` Igor Mammedov
  2022-02-07 20:24 ` [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable Joao Martins
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-07 20:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

Rather than hardcoding the 4G boundary everywhere, introduce a
X86MachineState property @above_4g_mem_start and use it
accordingly.

This is in preparation for relocating ram-above-4g to be
dynamically start at 1T on AMD platforms.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/acpi-build.c  | 2 +-
 hw/i386/pc.c          | 9 +++++----
 hw/i386/sgx.c         | 2 +-
 hw/i386/x86.c         | 1 +
 include/hw/i386/x86.h | 3 +++
 5 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index ebd47aa26fd8..4bf54ccdab91 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2063,7 +2063,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
                 build_srat_memory(table_data, mem_base, mem_len, i - 1,
                                   MEM_AFFINITY_ENABLED);
             }
-            mem_base = 1ULL << 32;
+            mem_base = x86ms->above_4g_mem_start;
             mem_len = next_base - x86ms->below_4g_mem_size;
             next_base = mem_base + mem_len;
         }
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index c8696ac01e85..7de0e87f4a3f 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -837,9 +837,10 @@ void pc_memory_init(PCMachineState *pcms,
                                  machine->ram,
                                  x86ms->below_4g_mem_size,
                                  x86ms->above_4g_mem_size);
-        memory_region_add_subregion(system_memory, 0x100000000ULL,
+        memory_region_add_subregion(system_memory, x86ms->above_4g_mem_start,
                                     ram_above_4g);
-        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
+        e820_add_entry(x86ms->above_4g_mem_start, x86ms->above_4g_mem_size,
+                       E820_RAM);
     }
 
     if (pcms->sgx_epc.size != 0) {
@@ -880,7 +881,7 @@ void pc_memory_init(PCMachineState *pcms,
             machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
         } else {
             machine->device_memory->base =
-                0x100000000ULL + x86ms->above_4g_mem_size;
+                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
         }
 
         machine->device_memory->base =
@@ -972,7 +973,7 @@ uint64_t pc_pci_hole64_start(void)
     } else if (pcms->sgx_epc.size != 0) {
             hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
     } else {
-        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
+        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
     }
 
     return ROUND_UP(hole64_start, 1 * GiB);
diff --git a/hw/i386/sgx.c b/hw/i386/sgx.c
index a2b318dd9387..164ee1ddb8de 100644
--- a/hw/i386/sgx.c
+++ b/hw/i386/sgx.c
@@ -295,7 +295,7 @@ void pc_machine_init_sgx_epc(PCMachineState *pcms)
         return;
     }
 
-    sgx_epc->base = 0x100000000ULL + x86ms->above_4g_mem_size;
+    sgx_epc->base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
 
     memory_region_init(&sgx_epc->mr, OBJECT(pcms), "sgx-epc", UINT64_MAX);
     memory_region_add_subregion(get_system_memory(), sgx_epc->base,
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index b84840a1bb99..912e96718ee8 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1319,6 +1319,7 @@ static void x86_machine_initfn(Object *obj)
     x86ms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
     x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
     x86ms->bus_lock_ratelimit = 0;
+    x86ms->above_4g_mem_start = 0x100000000ULL;
 }
 
 static void x86_machine_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index a145a303703f..2de7ec046b75 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -58,6 +58,9 @@ struct X86MachineState {
     /* RAM information (sizes, addresses, configuration): */
     ram_addr_t below_4g_mem_size, above_4g_mem_size;
 
+    /* RAM information when there's a hole in 1Tb */
+    ram_addr_t above_4g_mem_start;
+
     /* CPU and apic information: */
     bool apic_xrupt_override;
     unsigned pci_irq_mask;
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-07 20:24 [PATCH RFCv2 0/4] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-02-07 20:24 ` [PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState Joao Martins
@ 2022-02-07 20:24 ` Joao Martins
  2022-02-14 14:53   ` Igor Mammedov
  2022-02-07 20:24 ` [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low Joao Martins
  2022-02-07 20:24 ` [PATCH RFCv2 4/4] i386/pc: Restrict AMD-only enforcing of valid IOVAs to new machine type Joao Martins
  3 siblings, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-07 20:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, expect for a
tiny region before the 4G. Since Linux v5.4, VFIO validates
whether the selected GPA is indeed valid i.e. not reserved by
IOMMU on behalf of some specific devices or platform-defined
restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
 -EINVAL.

AMD systems with an IOMMU are examples of such platforms and
particularly may only have these ranges as allowed:

	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])

We already account for the 4G hole, albeit if the guest is big
enough we will fail to allocate a guest with  >1010G due to the
~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

[*] there is another reserved region unrelated to HT that exists
in the 256T boundaru in Fam 17h according to Errata #1286,
documeted also in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence it
cannot be used either as GPAs. On those cases rather than
establishing the start of ram-above-4g to be 4G, relocate instead
to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
Topology", for more information on the underlying restriction of
IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done, we also add the the reserved HT
e820 range as reserved.

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++
 target/i386/cpu.h |  4 +++
 2 files changed, 70 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 7de0e87f4a3f..b060aedd38f3 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN       0x800
 #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
 
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START         0xfd00000000UL
+#define AMD_HT_END           0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static void relocate_4g(MachineState *machine, PCMachineState *pcms)
+{
+    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+    X86MachineState *x86ms = X86_MACHINE(pcms);
+    ram_addr_t device_mem_size = 0;
+    uint32_t eax, vendor[3];
+
+    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
+    if (!IS_AMD_VENDOR(vendor)) {
+        return;
+    }
+
+    if (pcmc->has_reserved_memory &&
+       (machine->ram_size < machine->maxram_size)) {
+        device_mem_size = machine->maxram_size - machine->ram_size;
+    }
+
+    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
+         device_mem_size) < AMD_HT_START) {
+        return;
+    }
+
+    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+}
+
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
@@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms,
 
     linux_boot = (machine->kernel_filename != NULL);
 
+    relocate_4g(machine, pcms);
+
     /*
      * Split single memory region and use aliases to address portions of it,
      * done for backwards compatibility with older qemus.
@@ -831,6 +892,11 @@ void pc_memory_init(PCMachineState *pcms,
                              0, x86ms->below_4g_mem_size);
     memory_region_add_subregion(system_memory, 0, ram_below_4g);
     e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
+
+    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
+        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
+    }
+
     if (x86ms->above_4g_mem_size > 0) {
         ram_above_4g = g_malloc(sizeof(*ram_above_4g));
         memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 9911d7c8711b..1acebc569b02 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
 #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
                          (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
                          (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
+#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
+                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
+                         (vendor[2]) == CPUID_VENDOR_AMD_3)
+
 
 #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
 #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-07 20:24 [PATCH RFCv2 0/4] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
  2022-02-07 20:24 ` [PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState Joao Martins
  2022-02-07 20:24 ` [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable Joao Martins
@ 2022-02-07 20:24 ` Joao Martins
  2022-02-14 13:15   ` David Edmondson
  2022-02-14 15:03   ` Igor Mammedov
  2022-02-07 20:24 ` [PATCH RFCv2 4/4] i386/pc: Restrict AMD-only enforcing of valid IOVAs to new machine type Joao Martins
  3 siblings, 2 replies; 37+ messages in thread
From: Joao Martins @ 2022-02-07 20:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
to address 1Tb (0xff ffff ffff). On AMD platforms, if a
ram-above-4g relocation happens and the CPU wasn't configured
with a big enough phys-bits, warn the user. There isn't a
catastrophic failure exactly, the guest will still boot, but
most likely won't be able to use more than ~4G of RAM.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index b060aedd38f3..f8712eb8427e 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -842,6 +842,7 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
     X86MachineState *x86ms = X86_MACHINE(pcms);
     ram_addr_t device_mem_size = 0;
     uint32_t eax, vendor[3];
+    hwaddr maxphysaddr;
 
     host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
     if (!IS_AMD_VENDOR(vendor)) {
@@ -858,6 +859,12 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
         return;
     }
 
+    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
+    if (maxphysaddr < AMD_ABOVE_1TB_START)
+        warn_report("Relocated RAM above 4G to start at %lu "
+                    "phys-bits too low (%u)",
+                    AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits);
+
     x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
 }
 
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH RFCv2 4/4] i386/pc: Restrict AMD-only enforcing of valid IOVAs to new machine type
  2022-02-07 20:24 [PATCH RFCv2 0/4] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
                   ` (2 preceding siblings ...)
  2022-02-07 20:24 ` [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low Joao Martins
@ 2022-02-07 20:24 ` Joao Martins
  3 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-02-07 20:24 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	Daniel Jordan, David Edmondson, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Joao Martins, Suravee Suthikulpanit

The added enforcing is only relevant in the case of AMD where the
range right before the 1TB is restricted and cannot be DMA mapped
by the kernel consequently leading to IOMMU INVALID_DEVICE_REQUEST
or possibly other kinds of IOMMU events in the AMD IOMMU.

Although, there's a case where it may make sense to disable the
IOVA relocation/validation when migrating from a
non-valid-IOVA-aware qemu to one that supports it.

Relocating RAM regions to after the 1Tb hole has consequences for
guest ABI because we are changing the memory mapping, so make
sure that only new machine enforce but not older ones.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 hw/i386/pc.c         | 5 +++++
 hw/i386/pc_piix.c    | 2 ++
 hw/i386/pc_q35.c     | 2 ++
 include/hw/i386/pc.h | 1 +
 4 files changed, 10 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index f8712eb8427e..e62d446b28c7 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -844,6 +844,10 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
     uint32_t eax, vendor[3];
     hwaddr maxphysaddr;
 
+    if (!pcmc->enforce_valid_iova) {
+        return;
+    }
+
     host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
     if (!IS_AMD_VENDOR(vendor)) {
         return;
@@ -1787,6 +1791,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
     pcmc->has_reserved_memory = true;
     pcmc->kvmclock_enabled = true;
     pcmc->enforce_aligned_dimm = true;
+    pcmc->enforce_valid_iova = true;
     /* BIOS ACPI tables: 128K. Other BIOS datastructures: less than 4K reported
      * to be used at the moment, 32K should be enough for a while.  */
     pcmc->acpi_data_size = 0x20000 + 0x8000;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index d9b344248dac..ccf8b6d9895f 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -429,9 +429,11 @@ DEFINE_I440FX_MACHINE(v7_0, "pc-i440fx-7.0", NULL,
 
 static void pc_i440fx_6_2_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_i440fx_7_0_machine_options(m);
     m->alias = NULL;
     m->is_default = false;
+    pcmc->enforce_valid_iova = false;
     compat_props_add(m->compat_props, hw_compat_6_2, hw_compat_6_2_len);
     compat_props_add(m->compat_props, pc_compat_6_2, pc_compat_6_2_len);
 }
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 1780f79bc127..1022abf4953d 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -373,8 +373,10 @@ DEFINE_Q35_MACHINE(v7_0, "pc-q35-7.0", NULL,
 
 static void pc_q35_6_2_machine_options(MachineClass *m)
 {
+    PCMachineClass *pcmc = PC_MACHINE_CLASS(m);
     pc_q35_7_0_machine_options(m);
     m->alias = NULL;
+    pcmc->enforce_valid_iova = false;
     compat_props_add(m->compat_props, hw_compat_6_2, hw_compat_6_2_len);
     compat_props_add(m->compat_props, pc_compat_6_2, pc_compat_6_2_len);
 }
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 9c9f4ac74810..10dba9767861 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -117,6 +117,7 @@ struct PCMachineClass {
     bool has_reserved_memory;
     bool enforce_aligned_dimm;
     bool broken_reserved_end;
+    bool enforce_valid_iova;
 
     /* generate legacy CPU hotplug AML */
     bool legacy_cpu_hotplug;
-- 
2.17.2



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-07 20:24 ` [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low Joao Martins
@ 2022-02-14 13:15   ` David Edmondson
  2022-02-14 13:18     ` Joao Martins
  2022-02-14 15:03   ` Igor Mammedov
  1 sibling, 1 reply; 37+ messages in thread
From: David Edmondson @ 2022-02-14 13:15 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On Monday, 2022-02-07 at 20:24:21 GMT, Joao Martins wrote:

> Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> ram-above-4g relocation happens and the CPU wasn't configured
> with a big enough phys-bits, warn the user. There isn't a
> catastrophic failure exactly, the guest will still boot, but
> most likely won't be able to use more than ~4G of RAM.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index b060aedd38f3..f8712eb8427e 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -842,6 +842,7 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      ram_addr_t device_mem_size = 0;
>      uint32_t eax, vendor[3];
> +    hwaddr maxphysaddr;
>
>      host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>      if (!IS_AMD_VENDOR(vendor)) {
> @@ -858,6 +859,12 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>          return;
>      }
>
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    if (maxphysaddr < AMD_ABOVE_1TB_START)

Braces around the block are required, I believe.

> +        warn_report("Relocated RAM above 4G to start at %lu "

Should use PRIu64?

> +                    "phys-bits too low (%u)",
> +                    AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits);
> +
>      x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;

And a real nit - until above_4g_mem_start is modified, the number of
phys_bits is fine, so I would have put the warning after the assignment.

>  }

dme.
-- 
Tonight I'm gonna bury that horse in the ground.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-14 13:15   ` David Edmondson
@ 2022-02-14 13:18     ` Joao Martins
  0 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-02-14 13:18 UTC (permalink / raw)
  To: David Edmondson
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, Alex Williamson, Paolo Bonzini,
	Ani Sinha, Igor Mammedov, Suravee Suthikulpanit

On 2/14/22 13:15, David Edmondson wrote:
> On Monday, 2022-02-07 at 20:24:21 GMT, Joao Martins wrote:
> 
>> Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
>> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
>> ram-above-4g relocation happens and the CPU wasn't configured
>> with a big enough phys-bits, warn the user. There isn't a
>> catastrophic failure exactly, the guest will still boot, but
>> most likely won't be able to use more than ~4G of RAM.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index b060aedd38f3..f8712eb8427e 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -842,6 +842,7 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      ram_addr_t device_mem_size = 0;
>>      uint32_t eax, vendor[3];
>> +    hwaddr maxphysaddr;
>>
>>      host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>      if (!IS_AMD_VENDOR(vendor)) {
>> @@ -858,6 +859,12 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>>          return;
>>      }
>>
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    if (maxphysaddr < AMD_ABOVE_1TB_START)
> 
> Braces around the block are required, I believe.
> 
Hmmm, my distration sorry about that -- checkpatch.pl wasn't so keen on warn me.

>> +        warn_report("Relocated RAM above 4G to start at %lu "
> 
> Should use PRIu64?
> 
Yeap, will do.

>> +                    "phys-bits too low (%u)",
>> +                    AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits);
>> +
>>      x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> 
> And a real nit - until above_4g_mem_start is modified, the number of
> phys_bits is fine, so I would have put the warning after the assignment.
> 
Good point. I'll reverse the order.

Thanks!
	Joao


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState
  2022-02-07 20:24 ` [PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState Joao Martins
@ 2022-02-14 13:19   ` Igor Mammedov
  2022-02-14 13:21     ` Joao Martins
  0 siblings, 1 reply; 37+ messages in thread
From: Igor Mammedov @ 2022-02-14 13:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On Mon,  7 Feb 2022 20:24:19 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> Rather than hardcoding the 4G boundary everywhere, introduce a
> X86MachineState property @above_4g_mem_start and use it
> accordingly.
> 
> This is in preparation for relocating ram-above-4g to be
> dynamically start at 1T on AMD platforms.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/acpi-build.c  | 2 +-
>  hw/i386/pc.c          | 9 +++++----
>  hw/i386/sgx.c         | 2 +-
>  hw/i386/x86.c         | 1 +
>  include/hw/i386/x86.h | 3 +++
>  5 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index ebd47aa26fd8..4bf54ccdab91 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2063,7 +2063,7 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>                  build_srat_memory(table_data, mem_base, mem_len, i - 1,
>                                    MEM_AFFINITY_ENABLED);
>              }
> -            mem_base = 1ULL << 32;
> +            mem_base = x86ms->above_4g_mem_start;
>              mem_len = next_base - x86ms->below_4g_mem_size;
>              next_base = mem_base + mem_len;
>          }
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index c8696ac01e85..7de0e87f4a3f 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -837,9 +837,10 @@ void pc_memory_init(PCMachineState *pcms,
>                                   machine->ram,
>                                   x86ms->below_4g_mem_size,
>                                   x86ms->above_4g_mem_size);
> -        memory_region_add_subregion(system_memory, 0x100000000ULL,
> +        memory_region_add_subregion(system_memory, x86ms->above_4g_mem_start,
>                                      ram_above_4g);
> -        e820_add_entry(0x100000000ULL, x86ms->above_4g_mem_size, E820_RAM);
> +        e820_add_entry(x86ms->above_4g_mem_start, x86ms->above_4g_mem_size,
> +                       E820_RAM);
>      }
>  
>      if (pcms->sgx_epc.size != 0) {
> @@ -880,7 +881,7 @@ void pc_memory_init(PCMachineState *pcms,
>              machine->device_memory->base = sgx_epc_above_4g_end(&pcms->sgx_epc);
>          } else {
>              machine->device_memory->base =
> -                0x100000000ULL + x86ms->above_4g_mem_size;
> +                x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>          }
>  
>          machine->device_memory->base =
> @@ -972,7 +973,7 @@ uint64_t pc_pci_hole64_start(void)
>      } else if (pcms->sgx_epc.size != 0) {
>              hole64_start = sgx_epc_above_4g_end(&pcms->sgx_epc);
>      } else {
> -        hole64_start = 0x100000000ULL + x86ms->above_4g_mem_size;
> +        hole64_start = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>      }
>  
>      return ROUND_UP(hole64_start, 1 * GiB);
> diff --git a/hw/i386/sgx.c b/hw/i386/sgx.c
> index a2b318dd9387..164ee1ddb8de 100644
> --- a/hw/i386/sgx.c
> +++ b/hw/i386/sgx.c
> @@ -295,7 +295,7 @@ void pc_machine_init_sgx_epc(PCMachineState *pcms)
>          return;
>      }
>  
> -    sgx_epc->base = 0x100000000ULL + x86ms->above_4g_mem_size;
> +    sgx_epc->base = x86ms->above_4g_mem_start + x86ms->above_4g_mem_size;
>  
>      memory_region_init(&sgx_epc->mr, OBJECT(pcms), "sgx-epc", UINT64_MAX);
>      memory_region_add_subregion(get_system_memory(), sgx_epc->base,
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index b84840a1bb99..912e96718ee8 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -1319,6 +1319,7 @@ static void x86_machine_initfn(Object *obj)
>      x86ms->oem_id = g_strndup(ACPI_BUILD_APPNAME6, 6);
>      x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
>      x86ms->bus_lock_ratelimit = 0;
> +    x86ms->above_4g_mem_start = 0x100000000ULL;
>  }
>  
>  static void x86_machine_class_init(ObjectClass *oc, void *data)
> diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
> index a145a303703f..2de7ec046b75 100644
> --- a/include/hw/i386/x86.h
> +++ b/include/hw/i386/x86.h
> @@ -58,6 +58,9 @@ struct X86MachineState {
>      /* RAM information (sizes, addresses, configuration): */
>      ram_addr_t below_4g_mem_size, above_4g_mem_size;
>  
> +    /* RAM information when there's a hole in 1Tb */

s/^^^/GPA of the part of initial RAM above 4G/

or something like that, it doesn't have anything to do with hole at 1Tb
on some hosts.

> +    ram_addr_t above_4g_mem_start;
> +
>      /* CPU and apic information: */
>      bool apic_xrupt_override;
>      unsigned pci_irq_mask;



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState
  2022-02-14 13:19   ` Igor Mammedov
@ 2022-02-14 13:21     ` Joao Martins
  0 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-02-14 13:21 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On 2/14/22 13:19, Igor Mammedov wrote:
> On Mon,  7 Feb 2022 20:24:19 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>>  static void x86_machine_class_init(ObjectClass *oc, void *data)
>> diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
>> index a145a303703f..2de7ec046b75 100644
>> --- a/include/hw/i386/x86.h
>> +++ b/include/hw/i386/x86.h
>> @@ -58,6 +58,9 @@ struct X86MachineState {
>>      /* RAM information (sizes, addresses, configuration): */
>>      ram_addr_t below_4g_mem_size, above_4g_mem_size;
>>  
>> +    /* RAM information when there's a hole in 1Tb */
> 
> s/^^^/GPA of the part of initial RAM above 4G/
> 
> or something like that, it doesn't have anything to do with hole at 1Tb
> on some hosts.
> 
Yeap, will do.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-07 20:24 ` [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable Joao Martins
@ 2022-02-14 14:53   ` Igor Mammedov
  2022-02-14 15:05     ` Joao Martins
  0 siblings, 1 reply; 37+ messages in thread
From: Igor Mammedov @ 2022-02-14 14:53 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On Mon,  7 Feb 2022 20:24:20 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, expect for a
> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> [*] there is another reserved region unrelated to HT that exists
> in the 256T boundaru in Fam 17h according to Errata #1286,
> documeted also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done, we also add the the reserved HT
> e820 range as reserved.
> 
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++
>  target/i386/cpu.h |  4 +++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 7de0e87f4a3f..b060aedd38f3 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN       0x800
>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START         0xfd00000000UL
> +#define AMD_HT_END           0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static void relocate_4g(MachineState *machine, PCMachineState *pcms)

perhaps rename it to x86_update_above_4g_mem_start() ?


> +{
> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> +    ram_addr_t device_mem_size = 0;
> +    uint32_t eax, vendor[3];
> +
> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> +    if (!IS_AMD_VENDOR(vendor)) {
> +        return;
> +    }
> +
> +    if (pcmc->has_reserved_memory &&
> +       (machine->ram_size < machine->maxram_size)) {
> +        device_mem_size = machine->maxram_size - machine->ram_size;
> +    }
> +
> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> +         device_mem_size) < AMD_HT_START) {
should it account for sgx as well?

what if above sum ends up right before AMD_HT_START,
and exit without adjusting above_4g_mem_start, but
pci64 hole eventually will fall into HT range?
Is it expected behaviour?

> +        return;
> +    }
> +
> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> +}
> +
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> @@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms,
>  
>      linux_boot = (machine->kernel_filename != NULL);
>  
> +    relocate_4g(machine, pcms);
> +
>      /*
>       * Split single memory region and use aliases to address portions of it,
>       * done for backwards compatibility with older qemus.
> @@ -831,6 +892,11 @@ void pc_memory_init(PCMachineState *pcms,
>                               0, x86ms->below_4g_mem_size);
>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> +
> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> +    }

btw: do we have to add reservation record for HT zone, why?

>      if (x86ms->above_4g_mem_size > 0) {
>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 9911d7c8711b..1acebc569b02 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
> +
>  
>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-07 20:24 ` [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low Joao Martins
  2022-02-14 13:15   ` David Edmondson
@ 2022-02-14 15:03   ` Igor Mammedov
  2022-02-14 15:18     ` Joao Martins
  1 sibling, 1 reply; 37+ messages in thread
From: Igor Mammedov @ 2022-02-14 15:03 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On Mon,  7 Feb 2022 20:24:21 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> ram-above-4g relocation happens and the CPU wasn't configured
> with a big enough phys-bits, warn the user. There isn't a
> catastrophic failure exactly, the guest will still boot, but
> most likely won't be able to use more than ~4G of RAM.

how 'unable to use" would manifest?
It might be better to prevent QEMU startup with broken setup (CLI)
rather then letting guest run and trying to figure out what's
going wrong when users start to complain. 

> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  hw/i386/pc.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index b060aedd38f3..f8712eb8427e 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -842,6 +842,7 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>      X86MachineState *x86ms = X86_MACHINE(pcms);
>      ram_addr_t device_mem_size = 0;
>      uint32_t eax, vendor[3];
> +    hwaddr maxphysaddr;
>  
>      host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>      if (!IS_AMD_VENDOR(vendor)) {
> @@ -858,6 +859,12 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>          return;
>      }
>  
> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> +    if (maxphysaddr < AMD_ABOVE_1TB_START)
> +        warn_report("Relocated RAM above 4G to start at %lu "
> +                    "phys-bits too low (%u)",
> +                    AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits);

perhaps this hunk belongs to the end of pc_memory_init(),
it's not HT fixup specific at all?

Also I'm not sure but there are host_phys_bits/host_phys_bits_limit properties,
perhaps they need to be checked/verified as well

>      x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>  }
>  



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-14 14:53   ` Igor Mammedov
@ 2022-02-14 15:05     ` Joao Martins
  2022-02-14 15:31       ` Igor Mammedov
  0 siblings, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-14 15:05 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini



On 2/14/22 14:53, Igor Mammedov wrote:
> On Mon,  7 Feb 2022 20:24:20 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> It is assumed that the whole GPA space is available to be DMA
>> addressable, within a given address space limit, expect for a
>> tiny region before the 4G. Since Linux v5.4, VFIO validates
>> whether the selected GPA is indeed valid i.e. not reserved by
>> IOMMU on behalf of some specific devices or platform-defined
>> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>>  -EINVAL.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may only have these ranges as allowed:
>>
>> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
>>
>> We already account for the 4G hole, albeit if the guest is big
>> enough we will fail to allocate a guest with  >1010G due to the
>> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
>>
>> [*] there is another reserved region unrelated to HT that exists
>> in the 256T boundaru in Fam 17h according to Errata #1286,
>> documeted also in "Open-Source Register Reference for AMD Family
>> 17h Processors (PUB)"
>>
>> When creating the region above 4G, take into account that on AMD
>> platforms the HyperTransport range is reserved and hence it
>> cannot be used either as GPAs. On those cases rather than
>> establishing the start of ram-above-4g to be 4G, relocate instead
>> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
>> Topology", for more information on the underlying restriction of
>> IOVAs.
>>
>> After accounting for the 1Tb hole on AMD hosts, mtree should
>> look like:
>>
>> 0000000000000000-000000007fffffff (prio 0, i/o):
>> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
>> 0000010000000000-000001ff7fffffff (prio 0, i/o):
>> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
>>
>> If the relocation is done, we also add the the reserved HT
>> e820 range as reserved.
>>
>> Suggested-by: Igor Mammedov <imammedo@redhat.com>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++
>>  target/i386/cpu.h |  4 +++
>>  2 files changed, 70 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index 7de0e87f4a3f..b060aedd38f3 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms)
>>  #define PC_ROM_ALIGN       0x800
>>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
>>  
>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START         0xfd00000000UL
>> +#define AMD_HT_END           0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static void relocate_4g(MachineState *machine, PCMachineState *pcms)
> 
> perhaps rename it to x86_update_above_4g_mem_start() ?
> 
Yeap.

>> +{
>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>> +    ram_addr_t device_mem_size = 0;
>> +    uint32_t eax, vendor[3];
>> +
>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>> +    if (!IS_AMD_VENDOR(vendor)) {
>> +        return;
>> +    }
>> +
>> +    if (pcmc->has_reserved_memory &&
>> +       (machine->ram_size < machine->maxram_size)) {
>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>> +    }
>> +
>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>> +         device_mem_size) < AMD_HT_START) {
> should it account for sgx as well?
> 
Yes, I missed that one.

> what if above sum ends up right before AMD_HT_START,
> and exit without adjusting above_4g_mem_start, but
> pci64 hole eventually will fall into HT range?
> Is it expected behaviour?
> 
No -- it should not be any reserved range really.

And I was at two minds on this one, whether to advertise *always*
the 1T hole, regardless of relocation. Or account the size
we advertise for the pci64 hole and make that part of the equation
above. Although that has the flaw that the firmware at admin request
may pick some ludricous number (limited by maxphysaddr).

>> +        return;
>> +    }
>> +
>> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
>> +}
>> +
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> @@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms,
>>  
>>      linux_boot = (machine->kernel_filename != NULL);
>>  
>> +    relocate_4g(machine, pcms);
>> +
>>      /*
>>       * Split single memory region and use aliases to address portions of it,
>>       * done for backwards compatibility with older qemus.
>> @@ -831,6 +892,11 @@ void pc_memory_init(PCMachineState *pcms,
>>                               0, x86ms->below_4g_mem_size);
>>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
>>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
>> +
>> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
>> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
>> +    }
> 
> btw: do we have to add reservation record for HT zone, why?
> 

This is what real hardware does fwiw :D

I understand that we you do the relocation, firmware /should/
pick the first address after RAM as start of pci64 hole, and hence
past the HT range.

But if felt that for correctness we would tell the guest that this
range cannot be used regardless. I take that perhaps you're thinking
that you omit the E820_RESERVED just like the 4G hole?

>>      if (x86ms->above_4g_mem_size > 0) {
>>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
>>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index 9911d7c8711b..1acebc569b02 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
>>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
>>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
>>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
>> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
>> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
>> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
>> +
>>  
>>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
>>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-14 15:03   ` Igor Mammedov
@ 2022-02-14 15:18     ` Joao Martins
  2022-02-14 15:41       ` Igor Mammedov
  2022-02-23 17:18       ` Joao Martins
  0 siblings, 2 replies; 37+ messages in thread
From: Joao Martins @ 2022-02-14 15:18 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On 2/14/22 15:03, Igor Mammedov wrote:
> On Mon,  7 Feb 2022 20:24:21 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
>> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
>> ram-above-4g relocation happens and the CPU wasn't configured
>> with a big enough phys-bits, warn the user. There isn't a
>> catastrophic failure exactly, the guest will still boot, but
>> most likely won't be able to use more than ~4G of RAM.
> 
> how 'unable to use" would manifest?
> It might be better to prevent QEMU startup with broken setup (CLI)
> rather then letting guest run and trying to figure out what's
> going wrong when users start to complain. 
> 
Sounds better to be conservative here.

I will change from warn_report() to error_report()
and exit.

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  hw/i386/pc.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index b060aedd38f3..f8712eb8427e 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -842,6 +842,7 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>      ram_addr_t device_mem_size = 0;
>>      uint32_t eax, vendor[3];
>> +    hwaddr maxphysaddr;
>>  
>>      host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>      if (!IS_AMD_VENDOR(vendor)) {
>> @@ -858,6 +859,12 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>>          return;
>>      }
>>  
>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>> +    if (maxphysaddr < AMD_ABOVE_1TB_START)
>> +        warn_report("Relocated RAM above 4G to start at %lu "
>> +                    "phys-bits too low (%u)",
>> +                    AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits);
> 
> perhaps this hunk belongs to the end of pc_memory_init(),
> it's not HT fixup specific at all?
> 
It is HT fixup related. Because we are relocating the whole above-4g-ram,
on what used to be enough with just 40 phys bits (default).

> Also I'm not sure but there are host_phys_bits/host_phys_bits_limit properties,
> perhaps they need to be checked/verified as well

When booted with +host-phys-bits and/or with a host-phys-bits-limit=X, the @phys_bits
value will be either set to host, and ultimately bound to a maximum of
host_phys_bits_limit (if at all set).

So essentially the selected phys_bits that we're checking above is the only thing
we need to care about IIUC.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-14 15:05     ` Joao Martins
@ 2022-02-14 15:31       ` Igor Mammedov
  2022-02-15  9:53         ` Gerd Hoffmann
  2022-02-18 17:12         ` Joao Martins
  0 siblings, 2 replies; 37+ messages in thread
From: Igor Mammedov @ 2022-02-14 15:31 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On Mon, 14 Feb 2022 15:05:00 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/14/22 14:53, Igor Mammedov wrote:
> > On Mon,  7 Feb 2022 20:24:20 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> It is assumed that the whole GPA space is available to be DMA
> >> addressable, within a given address space limit, expect for a
> >> tiny region before the 4G. Since Linux v5.4, VFIO validates
> >> whether the selected GPA is indeed valid i.e. not reserved by
> >> IOMMU on behalf of some specific devices or platform-defined
> >> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
> >>  -EINVAL.
> >>
> >> AMD systems with an IOMMU are examples of such platforms and
> >> particularly may only have these ranges as allowed:
> >>
> >> 	0000000000000000 - 00000000fedfffff (0      .. 3.982G)
> >> 	00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
> >> 	0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> >>
> >> We already account for the 4G hole, albeit if the guest is big
> >> enough we will fail to allocate a guest with  >1010G due to the
> >> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> >>
> >> [*] there is another reserved region unrelated to HT that exists
> >> in the 256T boundaru in Fam 17h according to Errata #1286,
> >> documeted also in "Open-Source Register Reference for AMD Family
> >> 17h Processors (PUB)"
> >>
> >> When creating the region above 4G, take into account that on AMD
> >> platforms the HyperTransport range is reserved and hence it
> >> cannot be used either as GPAs. On those cases rather than
> >> establishing the start of ram-above-4g to be 4G, relocate instead
> >> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> >> Topology", for more information on the underlying restriction of
> >> IOVAs.
> >>
> >> After accounting for the 1Tb hole on AMD hosts, mtree should
> >> look like:
> >>
> >> 0000000000000000-000000007fffffff (prio 0, i/o):
> >> 	 alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> >> 0000010000000000-000001ff7fffffff (prio 0, i/o):
> >> 	alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> >>
> >> If the relocation is done, we also add the the reserved HT
> >> e820 range as reserved.
> >>
> >> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  hw/i386/pc.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  target/i386/cpu.h |  4 +++
> >>  2 files changed, 70 insertions(+)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index 7de0e87f4a3f..b060aedd38f3 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms)
> >>  #define PC_ROM_ALIGN       0x800
> >>  #define PC_ROM_SIZE        (PC_ROM_MAX - PC_ROM_MIN_VGA)
> >>  
> >> +/*
> >> + * AMD systems with an IOMMU have an additional hole close to the
> >> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> >> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> >> + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
> >> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> >> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> >> + * The ranges reserved for Hyper-Transport are:
> >> + *
> >> + * FD_0000_0000h - FF_FFFF_FFFFh
> >> + *
> >> + * The ranges represent the following:
> >> + *
> >> + * Base Address   Top Address  Use
> >> + *
> >> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> >> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> >> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> >> + * FD_F910_0000h FD_F91F_FFFFh System Management
> >> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> >> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> >> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> >> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> >> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> >> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> >> + *
> >> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> >> + * Table 3: Special Address Controls (GPA) for more information.
> >> + */
> >> +#define AMD_HT_START         0xfd00000000UL
> >> +#define AMD_HT_END           0xffffffffffUL
> >> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> >> +#define AMD_HT_SIZE          (AMD_ABOVE_1TB_START - AMD_HT_START)
> >> +
> >> +static void relocate_4g(MachineState *machine, PCMachineState *pcms)  
> > 
> > perhaps rename it to x86_update_above_4g_mem_start() ?
> >   
> Yeap.
> 
> >> +{
> >> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >> +    ram_addr_t device_mem_size = 0;
> >> +    uint32_t eax, vendor[3];
> >> +
> >> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >> +    if (!IS_AMD_VENDOR(vendor)) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (pcmc->has_reserved_memory &&
> >> +       (machine->ram_size < machine->maxram_size)) {
> >> +        device_mem_size = machine->maxram_size - machine->ram_size;
> >> +    }
> >> +
> >> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> >> +         device_mem_size) < AMD_HT_START) {  
> > should it account for sgx as well?
> >   
> Yes, I missed that one.
> 
> > what if above sum ends up right before AMD_HT_START,
> > and exit without adjusting above_4g_mem_start, but
> > pci64 hole eventually will fall into HT range?
> > Is it expected behaviour?
> >   
> No -- it should not be any reserved range really.
> 
> And I was at two minds on this one, whether to advertise *always*
> the 1T hole, regardless of relocation. Or account the size
> we advertise for the pci64 hole and make that part of the equation
> above. Although that has the flaw that the firmware at admin request
> may pick some ludricous number (limited by maxphysaddr).

it this point we have only pci64 hole size (machine property),
so I'd include that in equation to make firmware assign
pci64 aperture above HT range.

as for checking maxphysaddr, we can only check 'default' PCI hole
range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
and hard error on it.

I don't know what behavior should be if firmware tries to program
PCI64 hole beyond supported phys-bits.

Michael, Gerd
  perhaps you can suggest something here

 
> >> +        return;
> >> +    }
> >> +
> >> +    x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
> >> +}
> >> +
> >>  void pc_memory_init(PCMachineState *pcms,
> >>                      MemoryRegion *system_memory,
> >>                      MemoryRegion *rom_memory,
> >> @@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms,
> >>  
> >>      linux_boot = (machine->kernel_filename != NULL);
> >>  
> >> +    relocate_4g(machine, pcms);
> >> +
> >>      /*
> >>       * Split single memory region and use aliases to address portions of it,
> >>       * done for backwards compatibility with older qemus.
> >> @@ -831,6 +892,11 @@ void pc_memory_init(PCMachineState *pcms,
> >>                               0, x86ms->below_4g_mem_size);
> >>      memory_region_add_subregion(system_memory, 0, ram_below_4g);
> >>      e820_add_entry(0, x86ms->below_4g_mem_size, E820_RAM);
> >> +
> >> +    if (x86ms->above_4g_mem_start == AMD_ABOVE_1TB_START) {
> >> +        e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);
> >> +    }  
> > 
> > btw: do we have to add reservation record for HT zone, why?
> >   
> 
> This is what real hardware does fwiw :D
> 
> I understand that we you do the relocation, firmware /should/
> pick the first address after RAM as start of pci64 hole, and hence
> past the HT range.
> 
> But if felt that for correctness we would tell the guest that this
> range cannot be used regardless. I take that perhaps you're thinking
> that you omit the E820_RESERVED just like the 4G hole?

It's fine to advertise region as reserved, for completeness wit real HW
and in case if we ever allow guest OS to reconfigure PCI64
hole aperture it should prevent guest picking reserved range
(theoretically, whether it actually would work or not I'm not sure)

> 
> >>      if (x86ms->above_4g_mem_size > 0) {
> >>          ram_above_4g = g_malloc(sizeof(*ram_above_4g));
> >>          memory_region_init_alias(ram_above_4g, NULL, "ram-above-4g",
> >> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> >> index 9911d7c8711b..1acebc569b02 100644
> >> --- a/target/i386/cpu.h
> >> +++ b/target/i386/cpu.h
> >> @@ -906,6 +906,10 @@ typedef uint64_t FeatureWordArray[FEATURE_WORDS];
> >>  #define IS_AMD_CPU(env) ((env)->cpuid_vendor1 == CPUID_VENDOR_AMD_1 && \
> >>                           (env)->cpuid_vendor2 == CPUID_VENDOR_AMD_2 && \
> >>                           (env)->cpuid_vendor3 == CPUID_VENDOR_AMD_3)
> >> +#define IS_AMD_VENDOR(vendor) ((vendor[0]) == CPUID_VENDOR_AMD_1 && \
> >> +                         (vendor[1]) == CPUID_VENDOR_AMD_2 && \
> >> +                         (vendor[2]) == CPUID_VENDOR_AMD_3)
> >> +
> >>  
> >>  #define CPUID_MWAIT_IBE     (1U << 1) /* Interrupts can exit capability */
> >>  #define CPUID_MWAIT_EMX     (1U << 0) /* enumeration supported */  
> >   
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-14 15:18     ` Joao Martins
@ 2022-02-14 15:41       ` Igor Mammedov
  2022-02-14 15:48         ` Joao Martins
  2022-02-23 17:18       ` Joao Martins
  1 sibling, 1 reply; 37+ messages in thread
From: Igor Mammedov @ 2022-02-14 15:41 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On Mon, 14 Feb 2022 15:18:43 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/14/22 15:03, Igor Mammedov wrote:
> > On Mon,  7 Feb 2022 20:24:21 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:
> >   
> >> Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
> >> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> >> ram-above-4g relocation happens and the CPU wasn't configured
> >> with a big enough phys-bits, warn the user. There isn't a
> >> catastrophic failure exactly, the guest will still boot, but
> >> most likely won't be able to use more than ~4G of RAM.  
> > 
> > how 'unable to use" would manifest?
> > It might be better to prevent QEMU startup with broken setup (CLI)
> > rather then letting guest run and trying to figure out what's
> > going wrong when users start to complain. 
> >   
> Sounds better to be conservative here.
> 
> I will change from warn_report() to error_report()
> and exit.
> 
> >>
> >> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> >> ---
> >>  hw/i386/pc.c | 7 +++++++
> >>  1 file changed, 7 insertions(+)
> >>
> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> >> index b060aedd38f3..f8712eb8427e 100644
> >> --- a/hw/i386/pc.c
> >> +++ b/hw/i386/pc.c
> >> @@ -842,6 +842,7 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
> >>      X86MachineState *x86ms = X86_MACHINE(pcms);
> >>      ram_addr_t device_mem_size = 0;
> >>      uint32_t eax, vendor[3];
> >> +    hwaddr maxphysaddr;
> >>  
> >>      host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>      if (!IS_AMD_VENDOR(vendor)) {
> >> @@ -858,6 +859,12 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
> >>          return;
> >>      }
> >>  
> >> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
> >> +    if (maxphysaddr < AMD_ABOVE_1TB_START)
> >> +        warn_report("Relocated RAM above 4G to start at %lu "
> >> +                    "phys-bits too low (%u)",
> >> +                    AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits);  
> > 
> > perhaps this hunk belongs to the end of pc_memory_init(),
> > it's not HT fixup specific at all?
> >   
> It is HT fixup related. Because we are relocating the whole above-4g-ram,
> on what used to be enough with just 40 phys bits (default).

what if user starts QEMU with amount of RAM that won't fit into
configured phys bits (whether it's default one or one that comes from host)
and not on AMD host at that so no relocation happens but we still have
broken configuration. What matters here is the end address that might
be used by guest (pci64_bit hole end) is reachable.

That's why I suggested to make it generic check instead of AMD specific one. 
 
> > Also I'm not sure but there are host_phys_bits/host_phys_bits_limit properties,
> > perhaps they need to be checked/verified as well  
> 
> When booted with +host-phys-bits and/or with a host-phys-bits-limit=X, the @phys_bits
> value will be either set to host, and ultimately bound to a maximum of
> host_phys_bits_limit (if at all set).
> 
> So essentially the selected phys_bits that we're checking above is the only thing
> we need to care about IIUC.
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-14 15:41       ` Igor Mammedov
@ 2022-02-14 15:48         ` Joao Martins
  0 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-02-14 15:48 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On 2/14/22 15:41, Igor Mammedov wrote:
> On Mon, 14 Feb 2022 15:18:43 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 2/14/22 15:03, Igor Mammedov wrote:
>>> On Mon,  7 Feb 2022 20:24:21 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>> index b060aedd38f3..f8712eb8427e 100644
>>>> --- a/hw/i386/pc.c
>>>> +++ b/hw/i386/pc.c
>>>> @@ -842,6 +842,7 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>>>>      X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>      ram_addr_t device_mem_size = 0;
>>>>      uint32_t eax, vendor[3];
>>>> +    hwaddr maxphysaddr;
>>>>  
>>>>      host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>      if (!IS_AMD_VENDOR(vendor)) {
>>>> @@ -858,6 +859,12 @@ static void relocate_4g(MachineState *machine, PCMachineState *pcms)
>>>>          return;
>>>>      }
>>>>  
>>>> +    maxphysaddr = ((hwaddr)1 << X86_CPU(first_cpu)->phys_bits) - 1;
>>>> +    if (maxphysaddr < AMD_ABOVE_1TB_START)
>>>> +        warn_report("Relocated RAM above 4G to start at %lu "
>>>> +                    "phys-bits too low (%u)",
>>>> +                    AMD_ABOVE_1TB_START, X86_CPU(first_cpu)->phys_bits);  
>>>
>>> perhaps this hunk belongs to the end of pc_memory_init(),
>>> it's not HT fixup specific at all?
>>>   
>> It is HT fixup related. Because we are relocating the whole above-4g-ram,
>> on what used to be enough with just 40 phys bits (default).
> 
> what if user starts QEMU with amount of RAM that won't fit into
> configured phys bits (whether it's default one or one that comes from host)
> and not on AMD host at that so no relocation happens but we still have
> broken configuration. What matters here is the end address that might
> be used by guest (pci64_bit hole end) is reachable.
> 
> That's why I suggested to make it generic check instead of AMD specific one. 
>  
OK, I see.

If I'm being dense, that would require replacing AMD_ABOVE_1TB_START in the
comparison to something that computes the max used addr -- Let me try that then.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-14 15:31       ` Igor Mammedov
@ 2022-02-15  9:53         ` Gerd Hoffmann
  2022-02-15 19:37           ` Joao Martins
  2022-02-16  9:51           ` Daniel P. Berrangé
  2022-02-18 17:12         ` Joao Martins
  1 sibling, 2 replies; 37+ messages in thread
From: Gerd Hoffmann @ 2022-02-15  9:53 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini, Joao Martins

  Hi,

> I don't know what behavior should be if firmware tries to program
> PCI64 hole beyond supported phys-bits.

Well, you are basically f*cked.

Unfortunately there is no reliable way to figure what phys-bits actually
is.  Because of that the firmware (both seabios and edk2) tries to place
the pci64 hole as low as possible.

The long version:

qemu advertises phys-bits=40 to the guest by default.  Probably because
this is what the first amd opteron processors had, assuming that it
would be a safe default.  Then intel came, releasing processors with
phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
Boom.

End result is that edk2 uses a 32G pci64 window by default, which is
placed at the first 32G border beyond normal ram.  So for virtual
machines with up to ~ 30G ram (including reservations for memory
hotplug) the pci64 hole covers 32G -> 64G in guest physical address
space, which is low enough that it works on hardware with phys-bits=36.

If your VM has more than 32G of memory the pci64 hole will move and
phys-bits=36 isn't enough any more, but given that you probably only do
that on more beefy hosts which can take >= 64G of RAM and have a larger
physical address space this heuristic works good enough in practice.

Changing phys-bits behavior has been discussed on and off since years.
It's tricky to change for live migration compatibility reasons.

We got the host-phys-bits and host-phys-bits-limit properties, which
solve some of the phys-bits problems.

 * host-phys-bits=on makes sure the phys-bits advertised to the guest
   actually works.  It's off by default though for backward
   compatibility reasons (except microvm).  Also because turning it on
   breaks live migration of machines between hosts with different
   phys-bits.

 * host-phys-bits-limit can be used to tweak phys-bits to
   be lower than what the host supports.  Which can be used for
   live migration compatibility, i.e. if you have a pool of machines
   where some have 36 and some 39 you can limit phys-bits to 36 so
   live migration from 39 hosts to 36 hosts works.

What is missing:

 * Some way for the firmware to get a phys-bits value it can actually
   use.  One possible way would be to have a paravirtual bit somewhere
   telling whenever host-phys-bits is enabled or not.

If edk2 could figure what the usable (guest) physical address space
actually is it could:

  (a) make sure it never crosses that limit, and
  (b) pick better defaults, for example make the pci64 hole larger
      than 32G in case the available address space allows that.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-15  9:53         ` Gerd Hoffmann
@ 2022-02-15 19:37           ` Joao Martins
  2022-02-16  8:19             ` Gerd Hoffmann
  2022-02-16  9:51           ` Daniel P. Berrangé
  1 sibling, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-15 19:37 UTC (permalink / raw)
  To: Gerd Hoffmann, Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On 2/15/22 09:53, Gerd Hoffmann wrote:
> What is missing:
> 
>  * Some way for the firmware to get a phys-bits value it can actually
>    use.  One possible way would be to have a paravirtual bit somewhere
>    telling whenever host-phys-bits is enabled or not.
> 
If we are not talking about *very old* processors... isn't what already
advertised in CPUID.80000008 EAX enough? That's the maxphysaddr width
on x86, which on qemu we do set it with the phys-bits value;

	Joao


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-15 19:37           ` Joao Martins
@ 2022-02-16  8:19             ` Gerd Hoffmann
  2022-02-16 11:54               ` Joao Martins
  0 siblings, 1 reply; 37+ messages in thread
From: Gerd Hoffmann @ 2022-02-16  8:19 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Paolo Bonzini, Ani Sinha, Igor Mammedov

On Tue, Feb 15, 2022 at 07:37:40PM +0000, Joao Martins wrote:
> On 2/15/22 09:53, Gerd Hoffmann wrote:
> > What is missing:
> > 
> >  * Some way for the firmware to get a phys-bits value it can actually
> >    use.  One possible way would be to have a paravirtual bit somewhere
> >    telling whenever host-phys-bits is enabled or not.
> > 
> If we are not talking about *very old* processors... isn't what already
> advertised in CPUID.80000008 EAX enough? That's the maxphysaddr width
> on x86, which on qemu we do set it with the phys-bits value;

Sigh.  Nope.  Did you read the complete reply?

Problem is this is not reliable.  With host-phys-bits=off (default) qemu
allows to set phys-bits to whatever value you want, including values
larger than what the host actually supports.  Which renders
CPUID.80000008.EAX unusable.  To make things even worse:  The default
value (phys-bits=40) is larger than what many intel boxes support.

host-phys-bits=on fixes that.  It makes guest-phys-bits == host-phys-bits
by default, and also enforces guest-phys-bits <= host-phys-bits.
So with host-phys-bits=on the guest can actually use CPUID.80000008.EAX
to figure how big the guest physical address space is.

Problem is the guest doesn't know whenever host-phys-bits is enabled or
not ...

So the options to fix that are:

  (1) Make the host-phys-bits option visible to the guest (as suggested
      above), or
  (2) Advertise a _reliable_ phys-bits value to the guest somehow.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-15  9:53         ` Gerd Hoffmann
  2022-02-15 19:37           ` Joao Martins
@ 2022-02-16  9:51           ` Daniel P. Berrangé
  2022-02-21 13:15             ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 37+ messages in thread
From: Daniel P. Berrangé @ 2022-02-16  9:51 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Paolo Bonzini, Ani Sinha, Igor Mammedov,
	Joao Martins

On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote:
>   Hi,
> 
> > I don't know what behavior should be if firmware tries to program
> > PCI64 hole beyond supported phys-bits.
> 
> Well, you are basically f*cked.
> 
> Unfortunately there is no reliable way to figure what phys-bits actually
> is.  Because of that the firmware (both seabios and edk2) tries to place
> the pci64 hole as low as possible.
> 
> The long version:
> 
> qemu advertises phys-bits=40 to the guest by default.  Probably because
> this is what the first amd opteron processors had, assuming that it
> would be a safe default.  Then intel came, releasing processors with
> phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
> Boom.
> 
> End result is that edk2 uses a 32G pci64 window by default, which is
> placed at the first 32G border beyond normal ram.  So for virtual
> machines with up to ~ 30G ram (including reservations for memory
> hotplug) the pci64 hole covers 32G -> 64G in guest physical address
> space, which is low enough that it works on hardware with phys-bits=36.
> 
> If your VM has more than 32G of memory the pci64 hole will move and
> phys-bits=36 isn't enough any more, but given that you probably only do
> that on more beefy hosts which can take >= 64G of RAM and have a larger
> physical address space this heuristic works good enough in practice.
> 
> Changing phys-bits behavior has been discussed on and off since years.
> It's tricky to change for live migration compatibility reasons.
> 
> We got the host-phys-bits and host-phys-bits-limit properties, which
> solve some of the phys-bits problems.
> 
>  * host-phys-bits=on makes sure the phys-bits advertised to the guest
>    actually works.  It's off by default though for backward
>    compatibility reasons (except microvm).  Also because turning it on
>    breaks live migration of machines between hosts with different
>    phys-bits.

RHEL has shipped with host-phys-bits=on in its machine types
sinec RHEL-7. If it is good enough for RHEL machine types
for 8 years, IMHO, it is a sign that its reasonable to do the
same with upstream for new machine types.


>  * host-phys-bits-limit can be used to tweak phys-bits to
>    be lower than what the host supports.  Which can be used for
>    live migration compatibility, i.e. if you have a pool of machines
>    where some have 36 and some 39 you can limit phys-bits to 36 so
>    live migration from 39 hosts to 36 hosts works.

RHEL machine types have set this to host-phys-bits-limit=48
since RHEL-8 days, to avoid accidentally enabling 5-level
paging in guests without explicit user opt-in.

> What is missing:
> 
>  * Some way for the firmware to get a phys-bits value it can actually
>    use.  One possible way would be to have a paravirtual bit somewhere
>    telling whenever host-phys-bits is enabled or not.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-16  8:19             ` Gerd Hoffmann
@ 2022-02-16 11:54               ` Joao Martins
  2022-02-16 12:32                 ` Gerd Hoffmann
  0 siblings, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-16 11:54 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Paolo Bonzini, Ani Sinha, Igor Mammedov

On 2/16/22 08:19, Gerd Hoffmann wrote:
> On Tue, Feb 15, 2022 at 07:37:40PM +0000, Joao Martins wrote:
>> On 2/15/22 09:53, Gerd Hoffmann wrote:
>>> What is missing:
>>>
>>>  * Some way for the firmware to get a phys-bits value it can actually
>>>    use.  One possible way would be to have a paravirtual bit somewhere
>>>    telling whenever host-phys-bits is enabled or not.
>>>
>> If we are not talking about *very old* processors... isn't what already
>> advertised in CPUID.80000008 EAX enough? That's the maxphysaddr width
>> on x86, which on qemu we do set it with the phys-bits value;
> 
> Sigh.  Nope.  Did you read the complete reply?
> 
Yes, I did.

What I overlooked was the emphasis you had on desktops (qemu default bigger than
host-advertised), where I am thinking mostly in servers.

> Problem is this is not reliable.  With host-phys-bits=off (default) qemu
> allows to set phys-bits to whatever value you want, including values
> larger than what the host actually supports.  Which renders
> CPUID.80000008.EAX unusable. 

I am seeing from another angle, which the way to convey the phys-bits is
via this CPUID leaf is *maybe* enough (like real hardware). But we are setting
with a bigger value than we should have (or in other words ... not honoring
our physical boundary).

> To make things even worse:  The default
> value (phys-bits=40) is larger than what many intel boxes support.
> 
> host-phys-bits=on fixes that.  It makes guest-phys-bits == host-phys-bits
> by default, and also enforces guest-phys-bits <= host-phys-bits.
> So with host-phys-bits=on the guest can actually use CPUID.80000008.EAX
> to figure how big the guest physical address space is.
> 
Your 2 paragraphs sound like it's two different things, but +host-phys-bits
just sets CPUID.80000008.EAX with the host CPUID equivalent leaf/register
value. Which yes, it makes it reliable, but the way to convey is still the
same. That is regardless, of phys-bits=bogus-bigger-than-host-number,
host-phys-bits=on or host-phys-bits-limit=N.

> Problem is the guest doesn't know whenever host-phys-bits is enabled or
> not ...
> 
> So the options to fix that are:
> 
>   (1) Make the host-phys-bits option visible to the guest (as suggested
>       above), or
>   (2) Advertise a _reliable_ phys-bits value to the guest somehow.

What I am not enterily sure from (1) is the value on having a 'guest phys-bits'
and a 'host phys-bits' *exposed to the guest* when it seems we are picking the wrong
value for guests. It seems unnecessary redirection (compared to real hw) unless
there's a use-case in keeping both that I am probably missing.

	Joao


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-16 11:54               ` Joao Martins
@ 2022-02-16 12:32                 ` Gerd Hoffmann
  0 siblings, 0 replies; 37+ messages in thread
From: Gerd Hoffmann @ 2022-02-16 12:32 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Igor Mammedov, Ani Sinha, Paolo Bonzini

  Hi,

> What I overlooked was the emphasis you had on desktops (qemu default bigger than
> host-advertised), where I am thinking mostly in servers.

Yep, on servers you have the reverse problem that phys-bits=40 might be
too small for very large guests.

> > To make things even worse:  The default
> > value (phys-bits=40) is larger than what many intel boxes support.
> > 
> > host-phys-bits=on fixes that.  It makes guest-phys-bits == host-phys-bits
> > by default, and also enforces guest-phys-bits <= host-phys-bits.
> > So with host-phys-bits=on the guest can actually use CPUID.80000008.EAX
> > to figure how big the guest physical address space is.
> > 
> Your 2 paragraphs sound like it's two different things, but +host-phys-bits
> just sets CPUID.80000008.EAX with the host CPUID equivalent leaf/register
> value. Which yes, it makes it reliable, but the way to convey is still the
> same. That is regardless, of phys-bits=bogus-bigger-than-host-number,
> host-phys-bits=on or host-phys-bits-limit=N.

Yep, it's CPUID.80000008.EAX in all cases.

> > Problem is the guest doesn't know whenever host-phys-bits is enabled or
> > not ...
> > 
> > So the options to fix that are:
> > 
> >   (1) Make the host-phys-bits option visible to the guest (as suggested
> >       above), or
> >   (2) Advertise a _reliable_ phys-bits value to the guest somehow.
> 
> What I am not enterily sure from (1) is the value on having a 'guest phys-bits'
> and a 'host phys-bits' *exposed to the guest* when it seems we are picking the wrong
> value for guests.

Well, the guest can read CPUID.80000008.EAX but as of today the guest
just doesn't know whenever it's bogus or not.

> It seems unnecessary redirection (compared to real hw) unless
> there's a use-case in keeping both that I am probably missing.

Well, the use case is backward compatibility.  If we want make better
use of the guest physical address space on newer qemu without breaking
compatibility with older qemu versions the firmware needs to do
something along the lines of:

    if (i-can-trust-phys-bits) {
        // new qemu
        read CPUID.80000008.EAX and go with that
    } else {
        // old qemu
        use current heuristic, placing stuff as low as possible.
    }

And for that to actually work new qemu needs to expose something to the
guest which can be used to evaluate the "i-can-trust-phys-bits"
expression.  A guest-readable bit somewhere which is 1 for
host-phys-bits=on and 0 otherwise would do the trick, but there are
surely other ways to solve the problem.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-14 15:31       ` Igor Mammedov
  2022-02-15  9:53         ` Gerd Hoffmann
@ 2022-02-18 17:12         ` Joao Martins
  2022-02-21  6:58           ` Igor Mammedov
  1 sibling, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-18 17:12 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On 2/14/22 15:31, Igor Mammedov wrote:
> On Mon, 14 Feb 2022 15:05:00 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 2/14/22 14:53, Igor Mammedov wrote:
>>> On Mon,  7 Feb 2022 20:24:20 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> +{
>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>> +    ram_addr_t device_mem_size = 0;
>>>> +    uint32_t eax, vendor[3];
>>>> +
>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (pcmc->has_reserved_memory &&
>>>> +       (machine->ram_size < machine->maxram_size)) {
>>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>>>> +    }
>>>> +
>>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>>>> +         device_mem_size) < AMD_HT_START) {  
>>>   
>> And I was at two minds on this one, whether to advertise *always*
>> the 1T hole, regardless of relocation. Or account the size
>> we advertise for the pci64 hole and make that part of the equation
>> above. Although that has the flaw that the firmware at admin request
>> may pick some ludricous number (limited by maxphysaddr).
> 
> it this point we have only pci64 hole size (machine property),
> so I'd include that in equation to make firmware assign
> pci64 aperture above HT range.
> 
> as for checking maxphysaddr, we can only check 'default' PCI hole
> range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
> and hard error on it.
> 

Igor, in the context of your comment above, I'll be introducing another
preparatory patch that adds up pci_hole64_size to pc_memory_init() such
that all used/max physaddr space checks are consolidated in pc_memory_init().

To that end, the changes involve mainly moves the the pcihost qdev creation
to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
needs slightly more of a dance to extract that from i440fx_init() and also
because most i440fx state is private (hence the new helper for size). But
the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
it is just to extra pci_hole64_size from the object + user passed args (-global etc).

Raw staging changes below the scissors mark so far.

-->8--

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index b2e43eba1106..902977081350 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms)
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory)
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size)
 {
     int linux_boot, i;
     MemoryRegion *option_rom_mr;
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index d9b344248dac..5a608e30e28f 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine,
     MemoryRegion *pci_memory;
     MemoryRegion *rom_memory;
     ram_addr_t lowmem;
+    uint64_t hole64_size;
+    DeviceState *i440fx_dev;

     /*
      * Calculate ram split, for memory below and above 4G.  It's a bit
@@ -164,9 +166,13 @@ static void pc_init1(MachineState *machine,
         pci_memory = g_new(MemoryRegion, 1);
         memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
         rom_memory = pci_memory;
+        i440fx_dev = qdev_new(host_type);
+        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
     } else {
         pci_memory = NULL;
         rom_memory = system_memory;
+        i440fx_dev = NULL;
+        hole64_size = 0;
     }

     pc_guest_info_init(pcms);
@@ -183,7 +189,7 @@ static void pc_init1(MachineState *machine,
     /* allocate ram and load rom/bios */
     if (!xen_enabled()) {
         pc_memory_init(pcms, system_memory,
-                       rom_memory, &ram_memory);
+                       rom_memory, &ram_memory, hole64_size);
     } else {
         pc_system_flash_cleanup_unused(pcms);
         if (machine->kernel_filename != NULL) {
@@ -199,7 +205,7 @@ static void pc_init1(MachineState *machine,

         pci_bus = i440fx_init(host_type,
                               pci_type,
-                              &i440fx_state,
+                              i440fx_dev, &i440fx_state,
                               system_memory, system_io, machine->ram_size,
                               x86ms->below_4g_mem_size,
                               x86ms->above_4g_mem_size,
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 1780f79bc127..b7cf44d4755e 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -203,12 +203,13 @@ static void pc_q35_init(MachineState *machine)
                             pcms->smbios_entry_point_type);
     }

-    /* allocate ram and load rom/bios */
-    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
-
     /* create pci host bus */
     q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));

+    /* allocate ram and load rom/bios */
+    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
+                   q35_host->mch.pci_hole64_size);
+
     object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
     object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
                              OBJECT(ram_memory), NULL);
diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index e08716142b6e..c5cc28250d5c 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -237,7 +237,15 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
     }
 }

+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
+{
+        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
+
+        return i440fx->pci_hole64_size;
+}
+
 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
+                    DeviceState *dev,
                     PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
@@ -247,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_address_space,
                     MemoryRegion *ram_memory)
 {
-    DeviceState *dev;
     PCIBus *b;
     PCIDevice *d;
     PCIHostState *s;
@@ -255,7 +262,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
     unsigned i;
     I440FXState *i440fx;

-    dev = qdev_new(host_type);
     s = PCI_HOST_BRIDGE(dev);
     b = pci_root_bus_new(dev, NULL, pci_address_space,
                          address_space_io, 0, TYPE_PCI_BUS);
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 9c9f4ac74810..d8b9c4ebd748 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
 void pc_memory_init(PCMachineState *pcms,
                     MemoryRegion *system_memory,
                     MemoryRegion *rom_memory,
-                    MemoryRegion **ram_memory);
+                    MemoryRegion **ram_memory,
+                    uint64_t pci_hole64_size);
 uint64_t pc_pci_hole64_start(void);
 DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
 void pc_basic_device_init(struct PCMachineState *pcms,
diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
index f068aaba8fda..1299d6a2b0e4 100644
--- a/include/hw/pci-host/i440fx.h
+++ b/include/hw/pci-host/i440fx.h
@@ -36,7 +36,7 @@ struct PCII440FXState {
 #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"

 PCIBus *i440fx_init(const char *host_type, const char *pci_type,
-                    PCII440FXState **pi440fx_state,
+                    DeviceState *dev, PCII440FXState **pi440fx_state,
                     MemoryRegion *address_space_mem,
                     MemoryRegion *address_space_io,
                     ram_addr_t ram_size,
@@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
                     MemoryRegion *pci_memory,
                     MemoryRegion *ram_memory);

+uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);

 #endif


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-18 17:12         ` Joao Martins
@ 2022-02-21  6:58           ` Igor Mammedov
  2022-02-21 15:28             ` Joao Martins
  0 siblings, 1 reply; 37+ messages in thread
From: Igor Mammedov @ 2022-02-21  6:58 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On Fri, 18 Feb 2022 17:12:21 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/14/22 15:31, Igor Mammedov wrote:
> > On Mon, 14 Feb 2022 15:05:00 +0000
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> On 2/14/22 14:53, Igor Mammedov wrote:  
> >>> On Mon,  7 Feb 2022 20:24:20 +0000
> >>> Joao Martins <joao.m.martins@oracle.com> wrote:  
> >>>> +{
> >>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>> +    ram_addr_t device_mem_size = 0;
> >>>> +    uint32_t eax, vendor[3];
> >>>> +
> >>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    if (pcmc->has_reserved_memory &&
> >>>> +       (machine->ram_size < machine->maxram_size)) {
> >>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
> >>>> +    }
> >>>> +
> >>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> >>>> +         device_mem_size) < AMD_HT_START) {    
> >>>     
> >> And I was at two minds on this one, whether to advertise *always*
> >> the 1T hole, regardless of relocation. Or account the size
> >> we advertise for the pci64 hole and make that part of the equation
> >> above. Although that has the flaw that the firmware at admin request
> >> may pick some ludricous number (limited by maxphysaddr).  
> > 
> > it this point we have only pci64 hole size (machine property),
> > so I'd include that in equation to make firmware assign
> > pci64 aperture above HT range.
> > 
> > as for checking maxphysaddr, we can only check 'default' PCI hole
> > range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
> > and hard error on it.
> >   
> 
> Igor, in the context of your comment above, I'll be introducing another
> preparatory patch that adds up pci_hole64_size to pc_memory_init() such
> that all used/max physaddr space checks are consolidated in pc_memory_init().
>
> To that end, the changes involve mainly moves the the pcihost qdev creation
> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
> needs slightly more of a dance to extract that from i440fx_init() and also
> because most i440fx state is private (hence the new helper for size). But
> the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
> it is just to extra pci_hole64_size from the object + user passed args (-global etc).

Shuffling init order is looks too intrusive and in practice
quite risky.
How about moving maxphysaddr check to pc_machine_done() instead?
(this way you won't have to move pcihost around)


> Raw staging changes below the scissors mark so far.
> 
> -->8--  
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index b2e43eba1106..902977081350 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms)
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> -                    MemoryRegion **ram_memory)
> +                    MemoryRegion **ram_memory,
> +                    uint64_t pci_hole64_size)
>  {
>      int linux_boot, i;
>      MemoryRegion *option_rom_mr;
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index d9b344248dac..5a608e30e28f 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine,
>      MemoryRegion *pci_memory;
>      MemoryRegion *rom_memory;
>      ram_addr_t lowmem;
> +    uint64_t hole64_size;
> +    DeviceState *i440fx_dev;
> 
>      /*
>       * Calculate ram split, for memory below and above 4G.  It's a bit
> @@ -164,9 +166,13 @@ static void pc_init1(MachineState *machine,
>          pci_memory = g_new(MemoryRegion, 1);
>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>          rom_memory = pci_memory;
> +        i440fx_dev = qdev_new(host_type);
> +        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
>      } else {
>          pci_memory = NULL;
>          rom_memory = system_memory;
> +        i440fx_dev = NULL;
> +        hole64_size = 0;
>      }
> 
>      pc_guest_info_init(pcms);
> @@ -183,7 +189,7 @@ static void pc_init1(MachineState *machine,
>      /* allocate ram and load rom/bios */
>      if (!xen_enabled()) {
>          pc_memory_init(pcms, system_memory,
> -                       rom_memory, &ram_memory);
> +                       rom_memory, &ram_memory, hole64_size);
>      } else {
>          pc_system_flash_cleanup_unused(pcms);
>          if (machine->kernel_filename != NULL) {
> @@ -199,7 +205,7 @@ static void pc_init1(MachineState *machine,
> 
>          pci_bus = i440fx_init(host_type,
>                                pci_type,
> -                              &i440fx_state,
> +                              i440fx_dev, &i440fx_state,
>                                system_memory, system_io, machine->ram_size,
>                                x86ms->below_4g_mem_size,
>                                x86ms->above_4g_mem_size,
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 1780f79bc127..b7cf44d4755e 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -203,12 +203,13 @@ static void pc_q35_init(MachineState *machine)
>                              pcms->smbios_entry_point_type);
>      }
> 
> -    /* allocate ram and load rom/bios */
> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
> -
>      /* create pci host bus */
>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
> 
> +    /* allocate ram and load rom/bios */
> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
> +                   q35_host->mch.pci_hole64_size);
> +
>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>                               OBJECT(ram_memory), NULL);
> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
> index e08716142b6e..c5cc28250d5c 100644
> --- a/hw/pci-host/i440fx.c
> +++ b/hw/pci-host/i440fx.c
> @@ -237,7 +237,15 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>      }
>  }
> 
> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
> +{
> +        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
> +
> +        return i440fx->pci_hole64_size;
> +}
> +
>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> +                    DeviceState *dev,
>                      PCII440FXState **pi440fx_state,
>                      MemoryRegion *address_space_mem,
>                      MemoryRegion *address_space_io,
> @@ -247,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>                      MemoryRegion *pci_address_space,
>                      MemoryRegion *ram_memory)
>  {
> -    DeviceState *dev;
>      PCIBus *b;
>      PCIDevice *d;
>      PCIHostState *s;
> @@ -255,7 +262,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>      unsigned i;
>      I440FXState *i440fx;
> 
> -    dev = qdev_new(host_type);
>      s = PCI_HOST_BRIDGE(dev);
>      b = pci_root_bus_new(dev, NULL, pci_address_space,
>                           address_space_io, 0, TYPE_PCI_BUS);
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 9c9f4ac74810..d8b9c4ebd748 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
>  void pc_memory_init(PCMachineState *pcms,
>                      MemoryRegion *system_memory,
>                      MemoryRegion *rom_memory,
> -                    MemoryRegion **ram_memory);
> +                    MemoryRegion **ram_memory,
> +                    uint64_t pci_hole64_size);
>  uint64_t pc_pci_hole64_start(void);
>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>  void pc_basic_device_init(struct PCMachineState *pcms,
> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
> index f068aaba8fda..1299d6a2b0e4 100644
> --- a/include/hw/pci-host/i440fx.h
> +++ b/include/hw/pci-host/i440fx.h
> @@ -36,7 +36,7 @@ struct PCII440FXState {
>  #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
> 
>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
> -                    PCII440FXState **pi440fx_state,
> +                    DeviceState *dev, PCII440FXState **pi440fx_state,
>                      MemoryRegion *address_space_mem,
>                      MemoryRegion *address_space_io,
>                      ram_addr_t ram_size,
> @@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>                      MemoryRegion *pci_memory,
>                      MemoryRegion *ram_memory);
> 
> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
> 
>  #endif
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-16  9:51           ` Daniel P. Berrangé
@ 2022-02-21 13:15             ` Dr. David Alan Gilbert
  2022-02-22  8:46               ` Igor Mammedov
  0 siblings, 1 reply; 37+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-21 13:15 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Igor Mammedov, Ani Sinha,
	Paolo Bonzini, Joao Martins

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote:
> >   Hi,
> > 
> > > I don't know what behavior should be if firmware tries to program
> > > PCI64 hole beyond supported phys-bits.
> > 
> > Well, you are basically f*cked.
> > 
> > Unfortunately there is no reliable way to figure what phys-bits actually
> > is.  Because of that the firmware (both seabios and edk2) tries to place
> > the pci64 hole as low as possible.
> > 
> > The long version:
> > 
> > qemu advertises phys-bits=40 to the guest by default.  Probably because
> > this is what the first amd opteron processors had, assuming that it
> > would be a safe default.  Then intel came, releasing processors with
> > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
> > Boom.
> > 
> > End result is that edk2 uses a 32G pci64 window by default, which is
> > placed at the first 32G border beyond normal ram.  So for virtual
> > machines with up to ~ 30G ram (including reservations for memory
> > hotplug) the pci64 hole covers 32G -> 64G in guest physical address
> > space, which is low enough that it works on hardware with phys-bits=36.
> > 
> > If your VM has more than 32G of memory the pci64 hole will move and
> > phys-bits=36 isn't enough any more, but given that you probably only do
> > that on more beefy hosts which can take >= 64G of RAM and have a larger
> > physical address space this heuristic works good enough in practice.
> > 
> > Changing phys-bits behavior has been discussed on and off since years.
> > It's tricky to change for live migration compatibility reasons.
> > 
> > We got the host-phys-bits and host-phys-bits-limit properties, which
> > solve some of the phys-bits problems.
> > 
> >  * host-phys-bits=on makes sure the phys-bits advertised to the guest
> >    actually works.  It's off by default though for backward
> >    compatibility reasons (except microvm).  Also because turning it on
> >    breaks live migration of machines between hosts with different
> >    phys-bits.
> 
> RHEL has shipped with host-phys-bits=on in its machine types
> sinec RHEL-7. If it is good enough for RHEL machine types
> for 8 years, IMHO, it is a sign that its reasonable to do the
> same with upstream for new machine types.

And the upstream code is now pretty much identical except for the
default;  note that for TCG you do need to keep to 40 I think.

Dave
> 
> >  * host-phys-bits-limit can be used to tweak phys-bits to
> >    be lower than what the host supports.  Which can be used for
> >    live migration compatibility, i.e. if you have a pool of machines
> >    where some have 36 and some 39 you can limit phys-bits to 36 so
> >    live migration from 39 hosts to 36 hosts works.
> 
> RHEL machine types have set this to host-phys-bits-limit=48
> since RHEL-8 days, to avoid accidentally enabling 5-level
> paging in guests without explicit user opt-in.
> 
> > What is missing:
> > 
> >  * Some way for the firmware to get a phys-bits value it can actually
> >    use.  One possible way would be to have a paravirtual bit somewhere
> >    telling whenever host-phys-bits is enabled or not.
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-21  6:58           ` Igor Mammedov
@ 2022-02-21 15:28             ` Joao Martins
  2022-02-22 11:00               ` Joao Martins
  0 siblings, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-21 15:28 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini



On 2/21/22 06:58, Igor Mammedov wrote:
> On Fri, 18 Feb 2022 17:12:21 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 2/14/22 15:31, Igor Mammedov wrote:
>>> On Mon, 14 Feb 2022 15:05:00 +0000
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> On 2/14/22 14:53, Igor Mammedov wrote:  
>>>>> On Mon,  7 Feb 2022 20:24:20 +0000
>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>> +{
>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>> +    uint32_t eax, vendor[3];
>>>>>> +
>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (pcmc->has_reserved_memory &&
>>>>>> +       (machine->ram_size < machine->maxram_size)) {
>>>>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>>>>>> +    }
>>>>>> +
>>>>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>>>>>> +         device_mem_size) < AMD_HT_START) {    
>>>>>     
>>>> And I was at two minds on this one, whether to advertise *always*
>>>> the 1T hole, regardless of relocation. Or account the size
>>>> we advertise for the pci64 hole and make that part of the equation
>>>> above. Although that has the flaw that the firmware at admin request
>>>> may pick some ludricous number (limited by maxphysaddr).  
>>>
>>> it this point we have only pci64 hole size (machine property),
>>> so I'd include that in equation to make firmware assign
>>> pci64 aperture above HT range.
>>>
>>> as for checking maxphysaddr, we can only check 'default' PCI hole
>>> range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
>>> and hard error on it.
>>>   
>>
>> Igor, in the context of your comment above, I'll be introducing another
>> preparatory patch that adds up pci_hole64_size to pc_memory_init() such
>> that all used/max physaddr space checks are consolidated in pc_memory_init().
>>
>> To that end, the changes involve mainly moves the the pcihost qdev creation
>> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
>> needs slightly more of a dance to extract that from i440fx_init() and also
>> because most i440fx state is private (hence the new helper for size). But
>> the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
>> it is just to extra pci_hole64_size from the object + user passed args (-global etc).
> 
> Shuffling init order is looks too intrusive and in practice
> quite risky.

Yeah, it is an intrusive change sadly. Although, why would you consider it
risky (curious)? We are "only" moving this:

	qdev_new(host_type);

... located at the very top of i440fx_init() and called at the top of q35_host
initilization to be instead before pc_memory_init(). And that means that an instance of an
object gets made and its properties initialized i.e. @instance_init of q35 and i440fx and
its properties. I don't see a particular dependence in PC code to tell that this
would affect its surroundings parts.

The actual pcihost-related initialization is still kept entirely unchanged.

> How about moving maxphysaddr check to pc_machine_done() instead?
> (this way you won't have to move pcihost around)
> 
I can move it. But be there will be a slight disconnect between what pc_memory_init()
checks against "max used address"  between ... dictating if the 4G mem start should change
to 1T or not ...  and when the phys-bits check is actually made which includes the pci hole64.

For example, we create a guest with maxram 1009G (which 4G mem start would get at
unchanged) and then the pci_hole64 goes likely assigned across the rest until 1023G (i.e.
across the HT region). Here it would need an extra check and fail if pci_hole64 crosses
the HT region. Whereby if it was added in pc_memory_init() then we could just relocate to
1T and the guest creation would still proceed.

> 
>> Raw staging changes below the scissors mark so far.
>>
>> -->8--  
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index b2e43eba1106..902977081350 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms)
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> -                    MemoryRegion **ram_memory)
>> +                    MemoryRegion **ram_memory,
>> +                    uint64_t pci_hole64_size)
>>  {
>>      int linux_boot, i;
>>      MemoryRegion *option_rom_mr;
>> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
>> index d9b344248dac..5a608e30e28f 100644
>> --- a/hw/i386/pc_piix.c
>> +++ b/hw/i386/pc_piix.c
>> @@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine,
>>      MemoryRegion *pci_memory;
>>      MemoryRegion *rom_memory;
>>      ram_addr_t lowmem;
>> +    uint64_t hole64_size;
>> +    DeviceState *i440fx_dev;
>>
>>      /*
>>       * Calculate ram split, for memory below and above 4G.  It's a bit
>> @@ -164,9 +166,13 @@ static void pc_init1(MachineState *machine,
>>          pci_memory = g_new(MemoryRegion, 1);
>>          memory_region_init(pci_memory, NULL, "pci", UINT64_MAX);
>>          rom_memory = pci_memory;
>> +        i440fx_dev = qdev_new(host_type);
>> +        hole64_size = i440fx_pci_hole64_size(i440fx_dev);
>>      } else {
>>          pci_memory = NULL;
>>          rom_memory = system_memory;
>> +        i440fx_dev = NULL;
>> +        hole64_size = 0;
>>      }
>>
>>      pc_guest_info_init(pcms);
>> @@ -183,7 +189,7 @@ static void pc_init1(MachineState *machine,
>>      /* allocate ram and load rom/bios */
>>      if (!xen_enabled()) {
>>          pc_memory_init(pcms, system_memory,
>> -                       rom_memory, &ram_memory);
>> +                       rom_memory, &ram_memory, hole64_size);
>>      } else {
>>          pc_system_flash_cleanup_unused(pcms);
>>          if (machine->kernel_filename != NULL) {
>> @@ -199,7 +205,7 @@ static void pc_init1(MachineState *machine,
>>
>>          pci_bus = i440fx_init(host_type,
>>                                pci_type,
>> -                              &i440fx_state,
>> +                              i440fx_dev, &i440fx_state,
>>                                system_memory, system_io, machine->ram_size,
>>                                x86ms->below_4g_mem_size,
>>                                x86ms->above_4g_mem_size,
>> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
>> index 1780f79bc127..b7cf44d4755e 100644
>> --- a/hw/i386/pc_q35.c
>> +++ b/hw/i386/pc_q35.c
>> @@ -203,12 +203,13 @@ static void pc_q35_init(MachineState *machine)
>>                              pcms->smbios_entry_point_type);
>>      }
>>
>> -    /* allocate ram and load rom/bios */
>> -    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
>> -
>>      /* create pci host bus */
>>      q35_host = Q35_HOST_DEVICE(qdev_new(TYPE_Q35_HOST_DEVICE));
>>
>> +    /* allocate ram and load rom/bios */
>> +    pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory,
>> +                   q35_host->mch.pci_hole64_size);
>> +
>>      object_property_add_child(qdev_get_machine(), "q35", OBJECT(q35_host));
>>      object_property_set_link(OBJECT(q35_host), MCH_HOST_PROP_RAM_MEM,
>>                               OBJECT(ram_memory), NULL);
>> diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
>> index e08716142b6e..c5cc28250d5c 100644
>> --- a/hw/pci-host/i440fx.c
>> +++ b/hw/pci-host/i440fx.c
>> @@ -237,7 +237,15 @@ static void i440fx_realize(PCIDevice *dev, Error **errp)
>>      }
>>  }
>>
>> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev)
>> +{
>> +        I440FXState *i440fx = I440FX_PCI_HOST_BRIDGE(i440fx_dev);
>> +
>> +        return i440fx->pci_hole64_size;
>> +}
>> +
>>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>> +                    DeviceState *dev,
>>                      PCII440FXState **pi440fx_state,
>>                      MemoryRegion *address_space_mem,
>>                      MemoryRegion *address_space_io,
>> @@ -247,7 +255,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>                      MemoryRegion *pci_address_space,
>>                      MemoryRegion *ram_memory)
>>  {
>> -    DeviceState *dev;
>>      PCIBus *b;
>>      PCIDevice *d;
>>      PCIHostState *s;
>> @@ -255,7 +262,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>      unsigned i;
>>      I440FXState *i440fx;
>>
>> -    dev = qdev_new(host_type);
>>      s = PCI_HOST_BRIDGE(dev);
>>      b = pci_root_bus_new(dev, NULL, pci_address_space,
>>                           address_space_io, 0, TYPE_PCI_BUS);
>> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
>> index 9c9f4ac74810..d8b9c4ebd748 100644
>> --- a/include/hw/i386/pc.h
>> +++ b/include/hw/i386/pc.h
>> @@ -158,7 +158,8 @@ void xen_load_linux(PCMachineState *pcms);
>>  void pc_memory_init(PCMachineState *pcms,
>>                      MemoryRegion *system_memory,
>>                      MemoryRegion *rom_memory,
>> -                    MemoryRegion **ram_memory);
>> +                    MemoryRegion **ram_memory,
>> +                    uint64_t pci_hole64_size);
>>  uint64_t pc_pci_hole64_start(void);
>>  DeviceState *pc_vga_init(ISABus *isa_bus, PCIBus *pci_bus);
>>  void pc_basic_device_init(struct PCMachineState *pcms,
>> diff --git a/include/hw/pci-host/i440fx.h b/include/hw/pci-host/i440fx.h
>> index f068aaba8fda..1299d6a2b0e4 100644
>> --- a/include/hw/pci-host/i440fx.h
>> +++ b/include/hw/pci-host/i440fx.h
>> @@ -36,7 +36,7 @@ struct PCII440FXState {
>>  #define TYPE_IGD_PASSTHROUGH_I440FX_PCI_DEVICE "igd-passthrough-i440FX"
>>
>>  PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>> -                    PCII440FXState **pi440fx_state,
>> +                    DeviceState *dev, PCII440FXState **pi440fx_state,
>>                      MemoryRegion *address_space_mem,
>>                      MemoryRegion *address_space_io,
>>                      ram_addr_t ram_size,
>> @@ -45,5 +45,6 @@ PCIBus *i440fx_init(const char *host_type, const char *pci_type,
>>                      MemoryRegion *pci_memory,
>>                      MemoryRegion *ram_memory);
>>
>> +uint64_t i440fx_pci_hole64_size(DeviceState *i440fx_dev);
>>
>>  #endif
>>
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-21 13:15             ` Dr. David Alan Gilbert
@ 2022-02-22  8:46               ` Igor Mammedov
  2022-02-22  9:30                 ` Dr. David Alan Gilbert
  2022-02-22  9:42                 ` Gerd Hoffmann
  0 siblings, 2 replies; 37+ messages in thread
From: Igor Mammedov @ 2022-02-22  8:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini,
	Joao Martins

On Mon, 21 Feb 2022 13:15:40 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote:  
> > >   Hi,
> > >   
> > > > I don't know what behavior should be if firmware tries to program
> > > > PCI64 hole beyond supported phys-bits.  
> > > 
> > > Well, you are basically f*cked.
> > > 
> > > Unfortunately there is no reliable way to figure what phys-bits actually
> > > is.  Because of that the firmware (both seabios and edk2) tries to place
> > > the pci64 hole as low as possible.
> > > 
> > > The long version:
> > > 
> > > qemu advertises phys-bits=40 to the guest by default.  Probably because
> > > this is what the first amd opteron processors had, assuming that it
> > > would be a safe default.  Then intel came, releasing processors with
> > > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
> > > Boom.
> > > 
> > > End result is that edk2 uses a 32G pci64 window by default, which is
> > > placed at the first 32G border beyond normal ram.  So for virtual
> > > machines with up to ~ 30G ram (including reservations for memory
> > > hotplug) the pci64 hole covers 32G -> 64G in guest physical address
> > > space, which is low enough that it works on hardware with phys-bits=36.
> > > 
> > > If your VM has more than 32G of memory the pci64 hole will move and
> > > phys-bits=36 isn't enough any more, but given that you probably only do
> > > that on more beefy hosts which can take >= 64G of RAM and have a larger
> > > physical address space this heuristic works good enough in practice.
> > > 
> > > Changing phys-bits behavior has been discussed on and off since years.
> > > It's tricky to change for live migration compatibility reasons.
> > > 
> > > We got the host-phys-bits and host-phys-bits-limit properties, which
> > > solve some of the phys-bits problems.
> > > 
> > >  * host-phys-bits=on makes sure the phys-bits advertised to the guest
> > >    actually works.  It's off by default though for backward
> > >    compatibility reasons (except microvm).  Also because turning it on
> > >    breaks live migration of machines between hosts with different
> > >    phys-bits.  
> > 
> > RHEL has shipped with host-phys-bits=on in its machine types
> > sinec RHEL-7. If it is good enough for RHEL machine types
> > for 8 years, IMHO, it is a sign that its reasonable to do the
> > same with upstream for new machine types.  
> 
> And the upstream code is now pretty much identical except for the
> default;  note that for TCG you do need to keep to 40 I think.

will TCG work with 40bits on host that supports less than that?

Also quick look at host-phys-bits shows that it affects only 'host'
cpu model and is NOP for all other models.
If it's so than we probably need to expand it's scope to other cpu
models to cap them at actually supported range.

> 
> Dave
> >   
> > >  * host-phys-bits-limit can be used to tweak phys-bits to
> > >    be lower than what the host supports.  Which can be used for
> > >    live migration compatibility, i.e. if you have a pool of machines
> > >    where some have 36 and some 39 you can limit phys-bits to 36 so
> > >    live migration from 39 hosts to 36 hosts works.  
> > 
> > RHEL machine types have set this to host-phys-bits-limit=48
> > since RHEL-8 days, to avoid accidentally enabling 5-level
> > paging in guests without explicit user opt-in.
> >   
> > > What is missing:
> > > 
> > >  * Some way for the firmware to get a phys-bits value it can actually
> > >    use.  One possible way would be to have a paravirtual bit somewhere
> > >    telling whenever host-phys-bits is enabled or not.  
> > 
> > 
> > Regards,
> > Daniel
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > 
> >   



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-22  8:46               ` Igor Mammedov
@ 2022-02-22  9:30                 ` Dr. David Alan Gilbert
  2022-02-22  9:42                 ` Gerd Hoffmann
  1 sibling, 0 replies; 37+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-22  9:30 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini,
	Joao Martins

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Mon, 21 Feb 2022 13:15:40 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote:  
> > > >   Hi,
> > > >   
> > > > > I don't know what behavior should be if firmware tries to program
> > > > > PCI64 hole beyond supported phys-bits.  
> > > > 
> > > > Well, you are basically f*cked.
> > > > 
> > > > Unfortunately there is no reliable way to figure what phys-bits actually
> > > > is.  Because of that the firmware (both seabios and edk2) tries to place
> > > > the pci64 hole as low as possible.
> > > > 
> > > > The long version:
> > > > 
> > > > qemu advertises phys-bits=40 to the guest by default.  Probably because
> > > > this is what the first amd opteron processors had, assuming that it
> > > > would be a safe default.  Then intel came, releasing processors with
> > > > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39.
> > > > Boom.
> > > > 
> > > > End result is that edk2 uses a 32G pci64 window by default, which is
> > > > placed at the first 32G border beyond normal ram.  So for virtual
> > > > machines with up to ~ 30G ram (including reservations for memory
> > > > hotplug) the pci64 hole covers 32G -> 64G in guest physical address
> > > > space, which is low enough that it works on hardware with phys-bits=36.
> > > > 
> > > > If your VM has more than 32G of memory the pci64 hole will move and
> > > > phys-bits=36 isn't enough any more, but given that you probably only do
> > > > that on more beefy hosts which can take >= 64G of RAM and have a larger
> > > > physical address space this heuristic works good enough in practice.
> > > > 
> > > > Changing phys-bits behavior has been discussed on and off since years.
> > > > It's tricky to change for live migration compatibility reasons.
> > > > 
> > > > We got the host-phys-bits and host-phys-bits-limit properties, which
> > > > solve some of the phys-bits problems.
> > > > 
> > > >  * host-phys-bits=on makes sure the phys-bits advertised to the guest
> > > >    actually works.  It's off by default though for backward
> > > >    compatibility reasons (except microvm).  Also because turning it on
> > > >    breaks live migration of machines between hosts with different
> > > >    phys-bits.  
> > > 
> > > RHEL has shipped with host-phys-bits=on in its machine types
> > > sinec RHEL-7. If it is good enough for RHEL machine types
> > > for 8 years, IMHO, it is a sign that its reasonable to do the
> > > same with upstream for new machine types.  
> > 
> > And the upstream code is now pretty much identical except for the
> > default;  note that for TCG you do need to keep to 40 I think.
> 
> will TCG work with 40bits on host that supports less than that?
> 
> Also quick look at host-phys-bits shows that it affects only 'host'
> cpu model and is NOP for all other models.
> If it's so than we probably need to expand it's scope to other cpu
> models to cap them at actually supported range.

(We shouldn't really bring TCG oddities into this series!)

As I remember it effectively gets it from the accelerator, and TCG being
portable, there's no portable way of reading the phys-bits.

Whether it would work, hmm.  I'm assuming the host OS would stop you
allocating a huge ram block, so it shouldn't break from that.
But then the guest address translation is done in software, not using
the host MMU, so I think the guests view of addressing should be able
to be larger than the host. (Unless you try things like vfio/iommu on
tcg, which I'm told does work in some combos).

Dave


> > 
> > Dave
> > >   
> > > >  * host-phys-bits-limit can be used to tweak phys-bits to
> > > >    be lower than what the host supports.  Which can be used for
> > > >    live migration compatibility, i.e. if you have a pool of machines
> > > >    where some have 36 and some 39 you can limit phys-bits to 36 so
> > > >    live migration from 39 hosts to 36 hosts works.  
> > > 
> > > RHEL machine types have set this to host-phys-bits-limit=48
> > > since RHEL-8 days, to avoid accidentally enabling 5-level
> > > paging in guests without explicit user opt-in.
> > >   
> > > > What is missing:
> > > > 
> > > >  * Some way for the firmware to get a phys-bits value it can actually
> > > >    use.  One possible way would be to have a paravirtual bit somewhere
> > > >    telling whenever host-phys-bits is enabled or not.  
> > > 
> > > 
> > > Regards,
> > > Daniel
> > > -- 
> > > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > > 
> > >   
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-22  8:46               ` Igor Mammedov
  2022-02-22  9:30                 ` Dr. David Alan Gilbert
@ 2022-02-22  9:42                 ` Gerd Hoffmann
  2022-02-23  8:43                   ` Igor Mammedov
  1 sibling, 1 reply; 37+ messages in thread
From: Gerd Hoffmann @ 2022-02-22  9:42 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Dr. David Alan Gilbert, David Edmondson,
	Alex Williamson, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini,
	Joao Martins, Daniel Jordan

  Hi,

> > And the upstream code is now pretty much identical except for the
> > default;  note that for TCG you do need to keep to 40 I think.
> 
> will TCG work with 40bits on host that supports less than that?

When I understand things correctly the problem is that the phys-bits
limit applies to the npt/ept tables too, effectively restricting guest
physical address space to host physical address space.

TCG is not affected by that and should work just fine.

Not sure what happens if you turn off npt/ept and run on softmmu.
Possibly that works fine too.

> Also quick look at host-phys-bits shows that it affects only 'host'
> cpu model and is NOP for all other models.

I don't think so.  microvm forces host-phys-bits=on and that works with
all cpu models.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-21 15:28             ` Joao Martins
@ 2022-02-22 11:00               ` Joao Martins
  2022-02-23  8:38                 ` Igor Mammedov
  0 siblings, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-22 11:00 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On 2/21/22 15:28, Joao Martins wrote:
> On 2/21/22 06:58, Igor Mammedov wrote:
>> On Fri, 18 Feb 2022 17:12:21 +0000
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> On 2/14/22 15:31, Igor Mammedov wrote:
>>>> On Mon, 14 Feb 2022 15:05:00 +0000
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>> On 2/14/22 14:53, Igor Mammedov wrote:  
>>>>>> On Mon,  7 Feb 2022 20:24:20 +0000
>>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>>>>> +{
>>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
>>>>>>> +    ram_addr_t device_mem_size = 0;
>>>>>>> +    uint32_t eax, vendor[3];
>>>>>>> +
>>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if (pcmc->has_reserved_memory &&
>>>>>>> +       (machine->ram_size < machine->maxram_size)) {
>>>>>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>>>>>>> +         device_mem_size) < AMD_HT_START) {    
>>>>>>     
>>>>> And I was at two minds on this one, whether to advertise *always*
>>>>> the 1T hole, regardless of relocation. Or account the size
>>>>> we advertise for the pci64 hole and make that part of the equation
>>>>> above. Although that has the flaw that the firmware at admin request
>>>>> may pick some ludricous number (limited by maxphysaddr).  
>>>>
>>>> it this point we have only pci64 hole size (machine property),
>>>> so I'd include that in equation to make firmware assign
>>>> pci64 aperture above HT range.
>>>>
>>>> as for checking maxphysaddr, we can only check 'default' PCI hole
>>>> range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
>>>> and hard error on it.
>>>>   
>>>
>>> Igor, in the context of your comment above, I'll be introducing another
>>> preparatory patch that adds up pci_hole64_size to pc_memory_init() such
>>> that all used/max physaddr space checks are consolidated in pc_memory_init().
>>>
>>> To that end, the changes involve mainly moves the the pcihost qdev creation
>>> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
>>> needs slightly more of a dance to extract that from i440fx_init() and also
>>> because most i440fx state is private (hence the new helper for size). But
>>> the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
>>> it is just to extra pci_hole64_size from the object + user passed args (-global etc).
>>
>> Shuffling init order is looks too intrusive and in practice
>> quite risky.
> 
> Yeah, it is an intrusive change sadly. Although, why would you consider it
> risky (curious)? We are "only" moving this:
> 
> 	qdev_new(host_type);
> 
> ... located at the very top of i440fx_init() and called at the top of q35_host
> initilization to be instead before pc_memory_init(). And that means that an instance of an
> object gets made and its properties initialized i.e. @instance_init of q35 and i440fx and
> its properties. I don't see a particular dependence in PC code to tell that this
> would affect its surroundings parts.
> 
> The actual pcihost-related initialization is still kept entirely unchanged.
> 
>> How about moving maxphysaddr check to pc_machine_done() instead?
>> (this way you won't have to move pcihost around)
>>
> I can move it. But be there will be a slight disconnect between what pc_memory_init()
> checks against "max used address"  between ... dictating if the 4G mem start should change
> to 1T or not ...  and when the phys-bits check is actually made which includes the pci hole64.
> 
> For example, we create a guest with maxram 1009G (which 4G mem start would get at
> unchanged) and then the pci_hole64 goes likely assigned across the rest until 1023G (i.e.
> across the HT region). Here it would need an extra check and fail if pci_hole64 crosses
> the HT region. Whereby if it was added in pc_memory_init() then we could just relocate to
> 1T and the guest creation would still proceed.
> 
Actually, on a second thought, not having the pci_hole64_size
to pc_memory_init() to instead introduce it in pc_machine_done() to
include pci_hole64_size looks like a half-step :( because otherwise the user
needs to play games on how much it should include as -m|-object-memory*
number to force it to relocate to 1T and avoid the guest creation
failure.

So either we:

1) consider pci_hole64_size in pc_memory_init() and move default
pci-hole64-size (sort of what I was proposing in this last exchange)

2) keep the maxphysaddr check as is (but generic in pc_memory_init()
and disregarding pci-hole64-size) and advertise the actual 1T reserved hole
(regardless of above-4G relocation) letting BIOS consider reserved regions
when picking hole64-start.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-22 11:00               ` Joao Martins
@ 2022-02-23  8:38                 ` Igor Mammedov
  0 siblings, 0 replies; 37+ messages in thread
From: Igor Mammedov @ 2022-02-23  8:38 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On Tue, 22 Feb 2022 11:00:55 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/21/22 15:28, Joao Martins wrote:
> > On 2/21/22 06:58, Igor Mammedov wrote:  
> >> On Fri, 18 Feb 2022 17:12:21 +0000
> >> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>  
> >>> On 2/14/22 15:31, Igor Mammedov wrote:  
> >>>> On Mon, 14 Feb 2022 15:05:00 +0000
> >>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>> On 2/14/22 14:53, Igor Mammedov wrote:    
> >>>>>> On Mon,  7 Feb 2022 20:24:20 +0000
> >>>>>> Joao Martins <joao.m.martins@oracle.com> wrote:    
> >>>>>>> +{
> >>>>>>> +    PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> >>>>>>> +    X86MachineState *x86ms = X86_MACHINE(pcms);
> >>>>>>> +    ram_addr_t device_mem_size = 0;
> >>>>>>> +    uint32_t eax, vendor[3];
> >>>>>>> +
> >>>>>>> +    host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>>>>>> +    if (!IS_AMD_VENDOR(vendor)) {
> >>>>>>> +        return;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    if (pcmc->has_reserved_memory &&
> >>>>>>> +       (machine->ram_size < machine->maxram_size)) {
> >>>>>>> +        device_mem_size = machine->maxram_size - machine->ram_size;
> >>>>>>> +    }
> >>>>>>> +
> >>>>>>> +    if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> >>>>>>> +         device_mem_size) < AMD_HT_START) {      
> >>>>>>       
> >>>>> And I was at two minds on this one, whether to advertise *always*
> >>>>> the 1T hole, regardless of relocation. Or account the size
> >>>>> we advertise for the pci64 hole and make that part of the equation
> >>>>> above. Although that has the flaw that the firmware at admin request
> >>>>> may pick some ludricous number (limited by maxphysaddr).    
> >>>>
> >>>> it this point we have only pci64 hole size (machine property),
> >>>> so I'd include that in equation to make firmware assign
> >>>> pci64 aperture above HT range.
> >>>>
> >>>> as for checking maxphysaddr, we can only check 'default' PCI hole
> >>>> range at this stage (i.e. 1Gb aligned hole size after all possible RAM)
> >>>> and hard error on it.
> >>>>     
> >>>
> >>> Igor, in the context of your comment above, I'll be introducing another
> >>> preparatory patch that adds up pci_hole64_size to pc_memory_init() such
> >>> that all used/max physaddr space checks are consolidated in pc_memory_init().
> >>>
> >>> To that end, the changes involve mainly moves the the pcihost qdev creation
> >>> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx
> >>> needs slightly more of a dance to extract that from i440fx_init() and also
> >>> because most i440fx state is private (hence the new helper for size). But
> >>> the actual initialization of I440fx/q35 pci host is still after pc_memory_init(),
> >>> it is just to extra pci_hole64_size from the object + user passed args (-global etc).  
> >>
> >> Shuffling init order is looks too intrusive and in practice
> >> quite risky.  
> > 
> > Yeah, it is an intrusive change sadly. Although, why would you consider it
> > risky (curious)? We are "only" moving this:
> > 
> > 	qdev_new(host_type);
> > 
> > ... located at the very top of i440fx_init() and called at the top of q35_host
> > initilization to be instead before pc_memory_init(). And that means that an instance of an
> > object gets made and its properties initialized i.e. @instance_init of q35 and i440fx and
> > its properties. I don't see a particular dependence in PC code to tell that this
> > would affect its surroundings parts.

I don't see anything wrong her as well
(but I'm probably overcautious since more often than not
changing initialization order, has broken things in non obvious ways)

> > 
> > The actual pcihost-related initialization is still kept entirely unchanged.
> >   
> >> How about moving maxphysaddr check to pc_machine_done() instead?
> >> (this way you won't have to move pcihost around)
> >>  
> > I can move it. But be there will be a slight disconnect between what pc_memory_init()
> > checks against "max used address"  between ... dictating if the 4G mem start should change
> > to 1T or not ...  and when the phys-bits check is actually made which includes the pci hole64.
> > 
> > For example, we create a guest with maxram 1009G (which 4G mem start would get at
> > unchanged) and then the pci_hole64 goes likely assigned across the rest until 1023G (i.e.
> > across the HT region). Here it would need an extra check and fail if pci_hole64 crosses
> > the HT region. Whereby if it was added in pc_memory_init() then we could just relocate to
> > 1T and the guest creation would still proceed.
> >   
> Actually, on a second thought, not having the pci_hole64_size
> to pc_memory_init() to instead introduce it in pc_machine_done() to
> include pci_hole64_size looks like a half-step :( because otherwise the user
> needs to play games on how much it should include as -m|-object-memory*
> number to force it to relocate to 1T and avoid the guest creation
> failure.
> 
> So either we:
> 
> 1) consider pci_hole64_size in pc_memory_init() and move default
> pci-hole64-size (sort of what I was proposing in this last exchange)

ok, lets go with this approach 

> 
> 2) keep the maxphysaddr check as is (but generic in pc_memory_init()
> and disregarding pci-hole64-size) and advertise the actual 1T reserved hole
> (regardless of above-4G relocation) letting BIOS consider reserved regions
> when picking hole64-start.
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-22  9:42                 ` Gerd Hoffmann
@ 2022-02-23  8:43                   ` Igor Mammedov
  2022-02-23  9:16                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 37+ messages in thread
From: Igor Mammedov @ 2022-02-23  8:43 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Dr. David Alan Gilbert, David Edmondson,
	Alex Williamson, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini,
	Joao Martins, Daniel Jordan

On Tue, 22 Feb 2022 10:42:55 +0100
Gerd Hoffmann <kraxel@redhat.com> wrote:

>   Hi,
> 
> > > And the upstream code is now pretty much identical except for the
> > > default;  note that for TCG you do need to keep to 40 I think.  
> > 
> > will TCG work with 40bits on host that supports less than that?  
> 
> When I understand things correctly the problem is that the phys-bits
> limit applies to the npt/ept tables too, effectively restricting guest
> physical address space to host physical address space.
> 
> TCG is not affected by that and should work just fine.
> 
> Not sure what happens if you turn off npt/ept and run on softmmu.
> Possibly that works fine too.
> 
> > Also quick look at host-phys-bits shows that it affects only 'host'
> > cpu model and is NOP for all other models.  
> 
> I don't think so.  microvm forces host-phys-bits=on and that works with
> all cpu models.

I just don't see how host-phys-bits can work for other than 'host' cpu model.
It's true that property is available for all cpu models, but the field it sets
is only used in target/i386/host-cpu.c, the same applies to host-phys-bits-limit.
Am I missing something?

> 
> take care,
>   Gerd
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-23  8:43                   ` Igor Mammedov
@ 2022-02-23  9:16                     ` Dr. David Alan Gilbert
  2022-02-23  9:31                       ` Igor Mammedov
  0 siblings, 1 reply; 37+ messages in thread
From: Dr. David Alan Gilbert @ 2022-02-23  9:16 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini,
	Joao Martins

* Igor Mammedov (imammedo@redhat.com) wrote:
> On Tue, 22 Feb 2022 10:42:55 +0100
> Gerd Hoffmann <kraxel@redhat.com> wrote:
> 
> >   Hi,
> > 
> > > > And the upstream code is now pretty much identical except for the
> > > > default;  note that for TCG you do need to keep to 40 I think.  
> > > 
> > > will TCG work with 40bits on host that supports less than that?  
> > 
> > When I understand things correctly the problem is that the phys-bits
> > limit applies to the npt/ept tables too, effectively restricting guest
> > physical address space to host physical address space.
> > 
> > TCG is not affected by that and should work just fine.
> > 
> > Not sure what happens if you turn off npt/ept and run on softmmu.
> > Possibly that works fine too.
> > 
> > > Also quick look at host-phys-bits shows that it affects only 'host'
> > > cpu model and is NOP for all other models.  
> > 
> > I don't think so.  microvm forces host-phys-bits=on and that works with
> > all cpu models.
> 
> I just don't see how host-phys-bits can work for other than 'host' cpu model.
> It's true that property is available for all cpu models, but the field it sets
> is only used in target/i386/host-cpu.c, the same applies to host-phys-bits-limit.
> Am I missing something?

The hook in kvm/kvm-cpu.c kvm_cpu_realizefn:

    /*
     * The realize order is important, since x86_cpu_realize() checks if
     * nothing else has been set by the user (or by accelerators) in
     * cpu->ucode_rev and cpu->phys_bits, and updates the CPUID results in
     * mwait.ecx.
     * This accel realization code also assumes cpu features are already expanded.
     *
     * realize order:
     *
     * x86_cpu_realize():
     *  -> x86_cpu_expand_features()
     *  -> cpu_exec_realizefn():
     *            -> accel_cpu_realizefn()
     *               kvm_cpu_realizefn() -> host_cpu_realizefn()
     *  -> check/update ucode_rev, phys_bits, mwait
     */

Dave

> > 
> > take care,
> >   Gerd
> > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
  2022-02-23  9:16                     ` Dr. David Alan Gilbert
@ 2022-02-23  9:31                       ` Igor Mammedov
  0 siblings, 0 replies; 37+ messages in thread
From: Igor Mammedov @ 2022-02-23  9:31 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrangé,
	Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Gerd Hoffmann, Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini,
	Joao Martins

On Wed, 23 Feb 2022 09:16:51 +0000
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Igor Mammedov (imammedo@redhat.com) wrote:
> > On Tue, 22 Feb 2022 10:42:55 +0100
> > Gerd Hoffmann <kraxel@redhat.com> wrote:
> >   
> > >   Hi,
> > >   
> > > > > And the upstream code is now pretty much identical except for the
> > > > > default;  note that for TCG you do need to keep to 40 I think.    
> > > > 
> > > > will TCG work with 40bits on host that supports less than that?    
> > > 
> > > When I understand things correctly the problem is that the phys-bits
> > > limit applies to the npt/ept tables too, effectively restricting guest
> > > physical address space to host physical address space.
> > > 
> > > TCG is not affected by that and should work just fine.
> > > 
> > > Not sure what happens if you turn off npt/ept and run on softmmu.
> > > Possibly that works fine too.
> > >   
> > > > Also quick look at host-phys-bits shows that it affects only 'host'
> > > > cpu model and is NOP for all other models.    
> > > 
> > > I don't think so.  microvm forces host-phys-bits=on and that works with
> > > all cpu models.  
> > 
> > I just don't see how host-phys-bits can work for other than 'host' cpu model.
> > It's true that property is available for all cpu models, but the field it sets
> > is only used in target/i386/host-cpu.c, the same applies to host-phys-bits-limit.
> > Am I missing something?  
> 
> The hook in kvm/kvm-cpu.c kvm_cpu_realizefn:
> 
>     /*
>      * The realize order is important, since x86_cpu_realize() checks if
>      * nothing else has been set by the user (or by accelerators) in
>      * cpu->ucode_rev and cpu->phys_bits, and updates the CPUID results in
>      * mwait.ecx.
>      * This accel realization code also assumes cpu features are already expanded.
>      *
>      * realize order:
>      *
>      * x86_cpu_realize():
>      *  -> x86_cpu_expand_features()
>      *  -> cpu_exec_realizefn():
>      *            -> accel_cpu_realizefn()
>      *               kvm_cpu_realizefn() -> host_cpu_realizefn()
>      *  -> check/update ucode_rev, phys_bits, mwait
>      */

Thanks,
I didn't expect host_cpu_realizefn being called from elsewhere
beside of cpu model it belongs to or models inherited from it.

> 
> Dave
> 
> > > 
> > > take care,
> > >   Gerd
> > >   
> >   



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-14 15:18     ` Joao Martins
  2022-02-14 15:41       ` Igor Mammedov
@ 2022-02-23 17:18       ` Joao Martins
  2022-02-24  9:01         ` Igor Mammedov
  1 sibling, 1 reply; 37+ messages in thread
From: Joao Martins @ 2022-02-23 17:18 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On 2/14/22 15:18, Joao Martins wrote:
> On 2/14/22 15:03, Igor Mammedov wrote:
>> On Mon,  7 Feb 2022 20:24:21 +0000
>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>>> Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
>>> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
>>> ram-above-4g relocation happens and the CPU wasn't configured
>>> with a big enough phys-bits, warn the user. There isn't a
>>> catastrophic failure exactly, the guest will still boot, but
>>> most likely won't be able to use more than ~4G of RAM.
>>
>> how 'unable to use" would manifest?
>> It might be better to prevent QEMU startup with broken setup (CLI)
>> rather then letting guest run and trying to figure out what's
>> going wrong when users start to complain. 
>>
> Sounds better to be conservative here.
> 
> I will change from warn_report() to error_report()
> and exit.
> 

I was running through x86_64 qtests prior to submission
and it seems that the inclusion of a pci_hole64_size in
the check added by this patch would break tests if we were
to error out. So far, I'm keeping it as a warning over
compatibility concerns, not limited these 5 test failures
below. Let me know otherwise if you disagree, or if you
prefer another way.

Summary of Failures:

 1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
  killed by signal 6 SIGABRT
 4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
  killed by signal 6 SIGABRT
 7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
  killed by signal 6 SIGABRT
44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
  killed by signal 6 SIGABRT
45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
  killed by signal 6 SIGABRT


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-23 17:18       ` Joao Martins
@ 2022-02-24  9:01         ` Igor Mammedov
  2022-02-24  9:27           ` Joao Martins
  0 siblings, 1 reply; 37+ messages in thread
From: Igor Mammedov @ 2022-02-24  9:01 UTC (permalink / raw)
  To: Joao Martins
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On Wed, 23 Feb 2022 17:18:55 +0000
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 2/14/22 15:18, Joao Martins wrote:
> > On 2/14/22 15:03, Igor Mammedov wrote:  
> >> On Mon,  7 Feb 2022 20:24:21 +0000
> >> Joao Martins <joao.m.martins@oracle.com> wrote:
> >>  
> >>> Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
> >>> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
> >>> ram-above-4g relocation happens and the CPU wasn't configured
> >>> with a big enough phys-bits, warn the user. There isn't a
> >>> catastrophic failure exactly, the guest will still boot, but
> >>> most likely won't be able to use more than ~4G of RAM.  
> >>
> >> how 'unable to use" would manifest?
> >> It might be better to prevent QEMU startup with broken setup (CLI)
> >> rather then letting guest run and trying to figure out what's
> >> going wrong when users start to complain. 
> >>  
> > Sounds better to be conservative here.
> > 
> > I will change from warn_report() to error_report()
> > and exit.
> >   
> 
> I was running through x86_64 qtests prior to submission
> and it seems that the inclusion of a pci_hole64_size in
> the check added by this patch would break tests if we were
> to error out. So far, I'm keeping it as a warning over
> compatibility concerns, not limited these 5 test failures
> below. Let me know otherwise if you disagree, or if you
> prefer another way.

can you check what exactly breaks tests?

> Summary of Failures:
> 
>  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
>   killed by signal 6 SIGABRT
>  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
>   killed by signal 6 SIGABRT
>  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
>   killed by signal 6 SIGABRT
> 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
>   killed by signal 6 SIGABRT
> 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
>   killed by signal 6 SIGABRT
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low
  2022-02-24  9:01         ` Igor Mammedov
@ 2022-02-24  9:27           ` Joao Martins
  0 siblings, 0 replies; 37+ messages in thread
From: Joao Martins @ 2022-02-24  9:27 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Eduardo Habkost, Michael S . Tsirkin, Richard Henderson,
	qemu-devel, Daniel Jordan, David Edmondson, Alex Williamson,
	Suravee Suthikulpanit, Ani Sinha, Paolo Bonzini

On 2/24/22 09:01, Igor Mammedov wrote:
> On Wed, 23 Feb 2022 17:18:55 +0000
> Joao Martins <joao.m.martins@oracle.com> wrote:
> 
>> On 2/14/22 15:18, Joao Martins wrote:
>>> On 2/14/22 15:03, Igor Mammedov wrote:  
>>>> On Mon,  7 Feb 2022 20:24:21 +0000
>>>> Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>  
>>>>> Default phys-bits on Qemu is TCG_PHYS_BITS (40) which is enough
>>>>> to address 1Tb (0xff ffff ffff). On AMD platforms, if a
>>>>> ram-above-4g relocation happens and the CPU wasn't configured
>>>>> with a big enough phys-bits, warn the user. There isn't a
>>>>> catastrophic failure exactly, the guest will still boot, but
>>>>> most likely won't be able to use more than ~4G of RAM.  
>>>>
>>>> how 'unable to use" would manifest?
>>>> It might be better to prevent QEMU startup with broken setup (CLI)
>>>> rather then letting guest run and trying to figure out what's
>>>> going wrong when users start to complain. 
>>>>  
>>> Sounds better to be conservative here.
>>>
>>> I will change from warn_report() to error_report()
>>> and exit.
>>>   
>>
>> I was running through x86_64 qtests prior to submission
>> and it seems that the inclusion of a pci_hole64_size in
>> the check added by this patch would break tests if we were
>> to error out. So far, I'm keeping it as a warning over
>> compatibility concerns, not limited these 5 test failures
>> below. Let me know otherwise if you disagree, or if you
>> prefer another way.
> 
> can you check what exactly breaks tests?
> 
The test prematuralely fails with the above check that.

And on a second look, the problem is obvious. Essentially, I
am not handling the 32-bit case correctly :(

I will revert back what I submitted in v3 to be an error_report()
and exit() and will restrict this 64-bit only (i.e. for memory above-4g)

qemu-system-x86_64: Address space above 4G at 100000000-100000000 phys-bits too low (32)
qemu-system-x86_64: Address space above 4G at 100000000-100000000 phys-bits too low (32)
qemu-system-x86_64: Address space above 4G at 100000000-100000000 phys-bits too low (32)
# child process (/x86/cpuid/parsing-plus-minus/subprocess [188634]) stderr:
"qemu-system-x86_64: warning: Ambiguous CPU model string. Don't mix both \"-mce\" and
\"mce=on\"\nqemu-system-x86_64: warning: Ambiguous CPU model string. Don't mix both
\"+cx8\" and \"cx8=off\"\nqemu-system-x86_64: warning: Compatibility of ambiguous CPU
model strings won't be kept on future QEMU versions\nqemu-system-x86_64: Address space
above 4G at 100000000-180000000 phys-bits too low (32)\nBroken pipe\n"
qemu-system-x86_64: Address space above 4G at 100000000-180000000 phys-bits too low (32)

>> Summary of Failures:
>>
>>  1/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/qom-test               ERROR           0.07s
>>   killed by signal 6 SIGABRT
>>  4/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-hmp               ERROR           0.07s
>>   killed by signal 6 SIGABRT
>>  7/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/boot-serial-test       ERROR           0.07s
>>   killed by signal 6 SIGABRT
>> 44/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/test-x86-cpuid-compat  ERROR           0.09s
>>   killed by signal 6 SIGABRT
>> 45/56 qemu:qtest+qtest-x86_64 / qtest-x86_64/numa-test              ERROR           0.17s
>>   killed by signal 6 SIGABRT
>>
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2022-02-24  9:31 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-07 20:24 [PATCH RFCv2 0/4] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU Joao Martins
2022-02-07 20:24 ` [PATCH RFCv2 1/4] hw/i386: add 4g boundary start to X86MachineState Joao Martins
2022-02-14 13:19   ` Igor Mammedov
2022-02-14 13:21     ` Joao Martins
2022-02-07 20:24 ` [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable Joao Martins
2022-02-14 14:53   ` Igor Mammedov
2022-02-14 15:05     ` Joao Martins
2022-02-14 15:31       ` Igor Mammedov
2022-02-15  9:53         ` Gerd Hoffmann
2022-02-15 19:37           ` Joao Martins
2022-02-16  8:19             ` Gerd Hoffmann
2022-02-16 11:54               ` Joao Martins
2022-02-16 12:32                 ` Gerd Hoffmann
2022-02-16  9:51           ` Daniel P. Berrangé
2022-02-21 13:15             ` Dr. David Alan Gilbert
2022-02-22  8:46               ` Igor Mammedov
2022-02-22  9:30                 ` Dr. David Alan Gilbert
2022-02-22  9:42                 ` Gerd Hoffmann
2022-02-23  8:43                   ` Igor Mammedov
2022-02-23  9:16                     ` Dr. David Alan Gilbert
2022-02-23  9:31                       ` Igor Mammedov
2022-02-18 17:12         ` Joao Martins
2022-02-21  6:58           ` Igor Mammedov
2022-02-21 15:28             ` Joao Martins
2022-02-22 11:00               ` Joao Martins
2022-02-23  8:38                 ` Igor Mammedov
2022-02-07 20:24 ` [PATCH RFCv2 3/4] i386/pc: warn if phys-bits is too low Joao Martins
2022-02-14 13:15   ` David Edmondson
2022-02-14 13:18     ` Joao Martins
2022-02-14 15:03   ` Igor Mammedov
2022-02-14 15:18     ` Joao Martins
2022-02-14 15:41       ` Igor Mammedov
2022-02-14 15:48         ` Joao Martins
2022-02-23 17:18       ` Joao Martins
2022-02-24  9:01         ` Igor Mammedov
2022-02-24  9:27           ` Joao Martins
2022-02-07 20:24 ` [PATCH RFCv2 4/4] i386/pc: Restrict AMD-only enforcing of valid IOVAs to new machine type Joao Martins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.