* [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT)
From: Tao Xu @ 2019-05-08  6:17 UTC
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

This series of patches builds the ACPI Heterogeneous Memory Attribute Table (HMAT)
according to the command line. The ACPI HMAT describes the memory attributes,
such as memory side cache attributes and bandwidth and latency details,
related to the System Physical Address (SPA) Memory Ranges.
The software is expected to use this information as a hint for optimization.

OSPM evaluates HMAT only during system initialization. Any changes to the HMAT
state at runtime or information regarding HMAT for hot plug are communicated
using the _HMA method.
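
As a rough sketch of the building style the v4 changelog below refers to
(emitting each table field with build_append_int_noprefix() rather than
packed structs), here is an illustrative Memory Subsystem Address Range
Structure, the first HMAT structure type in ACPI 6.2. This is not the
hw/acpi/hmat.c added by the series; the function name and field values
are placeholders:

/*
 * Illustrative only -- not the hw/acpi/hmat.c from this series.
 * One ACPI 6.2 HMAT Memory Subsystem Address Range Structure (40 bytes),
 * emitted field by field; flag and domain values are placeholders.
 */
#include "qemu/osdep.h"
#include "hw/acpi/aml-build.h"

static void hmat_spa_range_sketch(GArray *table_data, uint32_t node,
                                  uint64_t base, uint64_t length)
{
    build_append_int_noprefix(table_data, 0, 2);      /* Type: 0 */
    build_append_int_noprefix(table_data, 0, 2);      /* Reserved */
    build_append_int_noprefix(table_data, 40, 4);     /* Length */
    build_append_int_noprefix(table_data, 0, 2);      /* Flags */
    build_append_int_noprefix(table_data, 0, 2);      /* Reserved */
    build_append_int_noprefix(table_data, node, 4);   /* Processor proximity domain */
    build_append_int_noprefix(table_data, node, 4);   /* Memory proximity domain */
    build_append_int_noprefix(table_data, 0, 4);      /* Reserved */
    build_append_int_noprefix(table_data, base, 8);   /* SPA range base */
    build_append_int_noprefix(table_data, length, 8); /* SPA range length */
}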

The V3 patches link:
https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg08076.html
The V2 patches link:
https://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg02276.html
The V1 RESEND patches link:
https://lists.nongnu.org/archive/html/qemu-devel/2018-06/msg05368.html

Changelog:
v4:
    - send the patch of "move numa global variables into MachineState"
    together with HMAT patches.
    https://lists.gnu.org/archive/html/qemu-devel/2019-04/msg03662.html
    - split the 1/8 patch of v3 into two patches: 4/11 introduces
    build_mem_ranges() and 5/11 builds HMAT (Igor)
    - use build_append_int_noprefix() to build parts of ACPI table in
    all patches (Igor)
    - split the 8/8 patch of v3 into two parts: 10/11 introduces the NFIT
    generalization (build_acpi_aml_common), and 11/11 uses it to
    simplify hmat_build_aml (Igor)
    - use MachineState instead of PCMachineState to make the HMAT build
    more generic (Igor)
    - fold the 7/8 patch of v3 into the preceding patches
    - update the version tag from 4.0 to 4.1
v3:
    - rebase the fix-up patch onto Jingqi's patches (Eric)
    - update the version tag from 3.10 to 4.0 (Eric)
v2:
  Per Igor and Eric's comments, fix some coding style and small issues:
    - update the version number in qapi/misc.json
    - include the expansion of the acronym HMAT in qapi/misc.json
    - correct spelling mistakes in qapi/misc.json and qemu-options.hx
    - fix the comment style in hw/i386/acpi-build.c
    and hw/acpi/hmat.h
    - remove some unnecessary header files in hw/acpi/hmat.c
    - use hardcoded numbers from the spec to generate the
    Memory Subsystem Address Range Structure in hw/acpi/hmat.c
    - drop the structs AcpiHmat and AcpiHmatSpaRange
    in hw/acpi/hmat.h
    - rewrite the NFIT code to build the _HMA method

Liu Jingqi (6):
  hmat acpi: Build Memory Subsystem Address Range Structure(s) in ACPI
    HMAT
  hmat acpi: Build System Locality Latency and Bandwidth Information
    Structure(s) in ACPI HMAT
  hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI
    HMAT
  numa: Extend the command-line to provide memory latency and bandwidth
    information
  numa: Extend the command-line to provide memory side cache information
  hmat acpi: Implement _HMA method to update HMAT at runtime

Tao Xu (5):
  numa: move numa global variable nb_numa_nodes into MachineState
  numa: move numa global variable have_numa_distance into MachineState
  numa: move numa global variable numa_info into MachineState
  acpi: introduce AcpiDeviceIfClass.build_mem_ranges hook
  acpi: introduce build_acpi_aml_common for NFIT generalizations

 exec.c                               |   5 +-
 hw/acpi/Kconfig                      |   5 +
 hw/acpi/Makefile.objs                |   1 +
 hw/acpi/aml-build.c                  |   9 +-
 hw/acpi/hmat.c                       | 574 +++++++++++++++++++++++++++
 hw/acpi/hmat.h                       | 179 +++++++++
 hw/acpi/nvdimm.c                     |  49 ++-
 hw/acpi/piix4.c                      |   1 +
 hw/arm/boot.c                        |   4 +-
 hw/arm/virt-acpi-build.c             |  17 +-
 hw/arm/virt.c                        |   8 +-
 hw/core/machine.c                    |  24 +-
 hw/i386/acpi-build.c                 | 125 +++---
 hw/i386/pc.c                         |  14 +-
 hw/i386/pc_piix.c                    |   4 +
 hw/i386/pc_q35.c                     |   4 +
 hw/isa/lpc_ich9.c                    |   1 +
 hw/mem/pc-dimm.c                     |   2 +
 hw/pci-bridge/pci_expander_bridge.c  |   2 +
 hw/ppc/spapr.c                       |  20 +-
 hw/ppc/spapr_pci.c                   |   2 +
 include/hw/acpi/acpi_dev_interface.h |   3 +
 include/hw/acpi/aml-build.h          |   2 +-
 include/hw/boards.h                  |  43 ++
 include/hw/i386/pc.h                 |   1 +
 include/hw/mem/nvdimm.h              |   6 +
 include/qemu/typedefs.h              |   3 +
 include/sysemu/numa.h                |  13 +-
 include/sysemu/sysemu.h              |  30 ++
 monitor.c                            |   4 +-
 numa.c                               | 282 +++++++++++--
 qapi/misc.json                       | 162 +++++++-
 qemu-options.hx                      |  28 +-
 stubs/Makefile.objs                  |   1 +
 stubs/pc_build_mem_ranges.c          |   6 +
 35 files changed, 1501 insertions(+), 133 deletions(-)
 create mode 100644 hw/acpi/hmat.c
 create mode 100644 hw/acpi/hmat.h
 create mode 100644 stubs/pc_build_mem_ranges.c

-- 
2.17.1




* [Qemu-devel] [PATCH v4 01/11] numa: move numa global variable nb_numa_nodes into MachineState
From: Tao Xu @ 2019-05-08  6:17 UTC
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

The aim of this patch is to add a struct NumaState to MachineState and
move the existing NUMA global nb_numa_nodes (renamed to "num_nodes")
into NumaState. It also adds a numa_supported flag to MachineClass so
that each machine type can declare whether it supports NUMA.
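
Usage sketch, distilled from the hunks below (the example_* function
names are illustrative only): a board opts in via MachineClass, and
callers query the node count through the NULL-safe helper instead of
the old global.

#include "qemu/osdep.h"
#include "hw/boards.h"

static void example_machine_class_init(ObjectClass *oc, void *data)
{
    MachineClass *mc = MACHINE_CLASS(oc);

    /* machine_initfn() allocates ms->numa_state only when this is set */
    mc->numa_supported = true;
}

static void example_walk_nodes(MachineState *ms)
{
    /* returns 0 when ms->numa_state is NULL, i.e. NUMA is unsupported */
    int nb_numa_nodes = machine_num_numa_nodes(ms);
    int i;

    for (i = 0; i < nb_numa_nodes; i++) {
        /* per-node work */
    }
}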

Reviewed-by: Liu Jingqi <jingqi.liu@intel.com>
Suggested-by: Igor Mammedov <imammedo@redhat.com>
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - send the patch together with the HMAT patches

Changes in v2 -> v3:
    - rename "NumaState::nb_numa_nodes" to "NumaState::num_nodes"
    (Eduardo)
    - use machine_num_numa_nodes(MachineState *ms) to check whether
    ms->numa_state is NULL before using NumaState::num_nodes (Eduardo)
    - check whether ms->numa_state == NULL in set_numa_options to reject
    -numa on machine types that don't support NUMA

Changes in v2:
    - fix the mistake in numa_complete_configuration in numa.c
    - pass MachineState into some functions to avoid using
    qdev_get_machine
    - add some if expressions to avoid dereferencing a NULL NumaState
---
 exec.c                              |  5 ++-
 hw/acpi/aml-build.c                 |  3 +-
 hw/arm/boot.c                       |  2 ++
 hw/arm/virt-acpi-build.c            |  8 +++--
 hw/arm/virt.c                       |  5 ++-
 hw/core/machine.c                   | 21 ++++++++---
 hw/i386/acpi-build.c                |  2 +-
 hw/i386/pc.c                        |  7 +++-
 hw/mem/pc-dimm.c                    |  2 ++
 hw/pci-bridge/pci_expander_bridge.c |  2 ++
 hw/ppc/spapr.c                      | 12 ++++++-
 include/hw/acpi/aml-build.h         |  2 +-
 include/hw/boards.h                 | 10 ++++++
 include/sysemu/numa.h               |  3 +-
 monitor.c                           |  4 ++-
 numa.c                              | 54 ++++++++++++++++++-----------
 16 files changed, 105 insertions(+), 37 deletions(-)

diff --git a/exec.c b/exec.c
index 4e734770c2..c7eb4af42d 100644
--- a/exec.c
+++ b/exec.c
@@ -1733,6 +1733,7 @@ long qemu_minrampagesize(void)
     long hpsize = LONG_MAX;
     long mainrampagesize;
     Object *memdev_root;
+    MachineState *ms = MACHINE(qdev_get_machine());
 
     mainrampagesize = qemu_mempath_getpagesize(mem_path);
 
@@ -1760,7 +1761,9 @@ long qemu_minrampagesize(void)
      * so if its page size is smaller we have got to report that size instead.
      */
     if (hpsize > mainrampagesize &&
-        (nb_numa_nodes == 0 || numa_info[0].node_memdev == NULL)) {
+        (ms->numa_state == NULL ||
+         ms->numa_state->num_nodes == 0 ||
+         numa_info[0].node_memdev == NULL)) {
         static bool warned;
         if (!warned) {
             error_report("Huge page support disabled (n/a for main memory).");
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index 555c24f21d..c67f4561a4 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1726,10 +1726,11 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
  * ACPI spec 5.2.17 System Locality Distance Information Table
  * (Revision 2.0 or later)
  */
-void build_slit(GArray *table_data, BIOSLinker *linker)
+void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms)
 {
     int slit_start, i, j;
     slit_start = table_data->len;
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     acpi_data_push(table_data, sizeof(AcpiTableHeader));
 
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index a830655e1a..8ff08814fd 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -532,6 +532,8 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
     hwaddr mem_base, mem_len;
     char **node_path;
     Error *err = NULL;
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     if (binfo->dtb_filename) {
         char *filename;
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index bf9c0bc2f4..6805b4de51 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -516,7 +516,9 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     int i, srat_start;
     uint64_t mem_base;
     MachineClass *mc = MACHINE_GET_CLASS(vms);
-    const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(MACHINE(vms));
+    MachineState *ms = MACHINE(vms);
+    const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(ms);
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     srat_start = table_data->len;
     srat = acpi_data_push(table_data, sizeof(*srat));
@@ -780,6 +782,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
     GArray *table_offsets;
     unsigned dsdt, xsdt;
     GArray *tables_blob = tables->table_data;
+    MachineState *ms = MACHINE(vms);
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     table_offsets = g_array_new(false, true /* clear */,
                                         sizeof(uint32_t));
@@ -813,7 +817,7 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
         build_srat(tables_blob, tables->linker, vms);
         if (have_numa_distance) {
             acpi_add_table(table_offsets, tables_blob);
-            build_slit(tables_blob, tables->linker);
+            build_slit(tables_blob, tables->linker, ms);
         }
     }
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 16ba67f7a7..70954b658d 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -195,6 +195,8 @@ static bool cpu_type_valid(const char *cpu)
 
 static void create_fdt(VirtMachineState *vms)
 {
+    MachineState *ms = MACHINE(vms);
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
     void *fdt = create_device_tree(&vms->fdt_size);
 
     if (!fdt) {
@@ -1780,7 +1782,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
 
 static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
 {
-    return idx % nb_numa_nodes;
+    return idx % machine_num_numa_nodes(ms);
 }
 
 static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
@@ -1886,6 +1888,7 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
     mc->kvm_type = virt_kvm_type;
     assert(!mc->get_hotplug_handler);
     mc->get_hotplug_handler = virt_machine_get_hotplug_handler;
+    mc->numa_supported = true;
     hc->plug = virt_machine_device_plug_cb;
 }
 
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 5d046a43e3..90bebb8d3a 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -857,6 +857,11 @@ static void machine_initfn(Object *obj)
                                         NULL);
     }
 
+    if (mc->numa_supported) {
+        ms->numa_state = g_new0(NumaState, 1);
+    } else {
+        ms->numa_state = NULL;
+    }
 
     /* Register notifier when init is done for sysbus sanity checks */
     ms->sysbus_notifier.notify = machine_init_notify;
@@ -877,6 +882,7 @@ static void machine_finalize(Object *obj)
     g_free(ms->firmware);
     g_free(ms->device_memory);
     g_free(ms->nvdimms_state);
+    g_free(ms->numa_state);
 }
 
 bool machine_usb(MachineState *machine)
@@ -919,6 +925,11 @@ bool machine_mem_merge(MachineState *machine)
     return machine->mem_merge;
 }
 
+int machine_num_numa_nodes(const MachineState *machine)
+{
+    return machine->numa_state ? machine->numa_state->num_nodes : 0;
+}
+
 static char *cpu_slot_to_string(const CPUArchId *cpu)
 {
     GString *s = g_string_new(NULL);
@@ -948,7 +959,7 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
 
-    assert(nb_numa_nodes);
+    assert(machine_num_numa_nodes(machine));
     for (i = 0; i < possible_cpus->len; i++) {
         if (possible_cpus->cpus[i].props.has_node_id) {
             break;
@@ -994,9 +1005,11 @@ void machine_run_board_init(MachineState *machine)
 {
     MachineClass *machine_class = MACHINE_GET_CLASS(machine);
 
-    numa_complete_configuration(machine);
-    if (nb_numa_nodes) {
-        machine_numa_finish_cpu_init(machine);
+    if (machine_class->numa_supported) {
+        numa_complete_configuration(machine);
+        if (machine->numa_state->num_nodes) {
+            machine_numa_finish_cpu_init(machine);
+        }
     }
 
     /* If the machine supports the valid_cpu_types check and the user
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 416da318ae..7d9bc88ac9 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2687,7 +2687,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
         build_srat(tables_blob, tables->linker, machine);
         if (have_numa_distance) {
             acpi_add_table(table_offsets, tables_blob);
-            build_slit(tables_blob, tables->linker);
+            build_slit(tables_blob, tables->linker, machine);
         }
     }
     if (acpi_get_mcfg(&mcfg)) {
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index d98b737b8f..6404ae508e 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -999,6 +999,8 @@ static FWCfgState *bochs_bios_init(AddressSpace *as, PCMachineState *pcms)
     int i;
     const CPUArchIdList *cpus;
     MachineClass *mc = MACHINE_GET_CLASS(pcms);
+    MachineState *ms = MACHINE(pcms);
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     fw_cfg = fw_cfg_init_io_dma(FW_CFG_IO_BASE, FW_CFG_IO_BASE + 4, as);
     fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, pcms->boot_cpus);
@@ -1675,6 +1677,8 @@ void pc_machine_done(Notifier *notifier, void *data)
 void pc_guest_info_init(PCMachineState *pcms)
 {
     int i;
+    MachineState *ms = MACHINE(pcms);
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     pcms->apic_xrupt_override = kvm_allows_irq0_override();
     pcms->numa_nodes = nb_numa_nodes;
@@ -2658,7 +2662,7 @@ static int64_t pc_get_default_cpu_node_id(const MachineState *ms, int idx)
    assert(idx < ms->possible_cpus->len);
    x86_topo_ids_from_apicid(ms->possible_cpus->cpus[idx].arch_id,
                             smp_cores, smp_threads, &topo);
-   return topo.pkg_id % nb_numa_nodes;
+   return topo.pkg_id % machine_num_numa_nodes(ms);
 }
 
 static const CPUArchIdList *pc_possible_cpu_arch_ids(MachineState *ms)
@@ -2752,6 +2756,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
     nc->nmi_monitor_handler = x86_nmi;
     mc->default_cpu_type = TARGET_DEFAULT_CPU_TYPE;
     mc->nvdimm_supported = true;
+    mc->numa_supported = true;
 
     object_class_property_add(oc, PC_MACHINE_DEVMEM_REGION_SIZE, "int",
         pc_machine_get_device_memory_region_size, NULL,
diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
index 152400b1fc..48cbd53e6b 100644
--- a/hw/mem/pc-dimm.c
+++ b/hw/mem/pc-dimm.c
@@ -160,6 +160,8 @@ static void pc_dimm_realize(DeviceState *dev, Error **errp)
 {
     PCDIMMDevice *dimm = PC_DIMM(dev);
     PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     if (!dimm->hostmem) {
         error_setg(errp, "'" PC_DIMM_MEMDEV_PROP "' property is not set");
diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
index e62de4218f..d0590c0973 100644
--- a/hw/pci-bridge/pci_expander_bridge.c
+++ b/hw/pci-bridge/pci_expander_bridge.c
@@ -217,6 +217,8 @@ static void pxb_dev_realize_common(PCIDevice *dev, bool pcie, Error **errp)
     PCIBus *bus;
     const char *dev_name = NULL;
     Error *local_err = NULL;
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     if (pxb->numa_node != NUMA_NODE_UNASSIGNED &&
         pxb->numa_node >= nb_numa_nodes) {
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 2ef3ce4362..4f0a8d4e2e 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -290,6 +290,8 @@ static int spapr_fixup_cpu_dt(void *fdt, SpaprMachineState *spapr)
     CPUState *cs;
     char cpu_model[32];
     uint32_t pft_size_prop[] = {0, cpu_to_be32(spapr->htab_shift)};
+    MachineState *ms = MACHINE(spapr);
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
@@ -344,6 +346,7 @@ static int spapr_fixup_cpu_dt(void *fdt, SpaprMachineState *spapr)
 
 static hwaddr spapr_node0_size(MachineState *machine)
 {
+    int nb_numa_nodes = machine_num_numa_nodes(machine);
     if (nb_numa_nodes) {
         int i;
         for (i = 0; i < nb_numa_nodes; ++i) {
@@ -390,6 +393,7 @@ static int spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
 static int spapr_populate_memory(SpaprMachineState *spapr, void *fdt)
 {
     MachineState *machine = MACHINE(spapr);
+    int nb_numa_nodes = machine_num_numa_nodes(machine);
     hwaddr mem_start, node_size;
     int i, nb_nodes = nb_numa_nodes;
     NodeInfo *nodes = numa_info;
@@ -444,6 +448,8 @@ static void spapr_populate_cpu_dt(CPUState *cs, void *fdt, int offset,
     PowerPCCPU *cpu = POWERPC_CPU(cs);
     CPUPPCState *env = &cpu->env;
     PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cs);
+    MachineState *ms = MACHINE(spapr);
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
     int index = spapr_get_vcpu_id(cpu);
     uint32_t segs[] = {cpu_to_be32(28), cpu_to_be32(40),
                        0xffffffff, 0xffffffff};
@@ -849,6 +855,7 @@ static int spapr_populate_drmem_v1(SpaprMachineState *spapr, void *fdt,
 static int spapr_populate_drconf_memory(SpaprMachineState *spapr, void *fdt)
 {
     MachineState *machine = MACHINE(spapr);
+    int nb_numa_nodes = machine_num_numa_nodes(machine);
     int ret, i, offset;
     uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
     uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
@@ -1693,6 +1700,7 @@ static void spapr_machine_reset(void)
 {
     MachineState *machine = MACHINE(qdev_get_machine());
     SpaprMachineState *spapr = SPAPR_MACHINE(machine);
+    int nb_numa_nodes = machine_num_numa_nodes(machine);
     PowerPCCPU *first_ppc_cpu;
     uint32_t rtas_limit;
     hwaddr rtas_addr, fdt_addr;
@@ -2509,6 +2517,7 @@ static void spapr_create_lmb_dr_connectors(SpaprMachineState *spapr)
 static void spapr_validate_node_memory(MachineState *machine, Error **errp)
 {
     int i;
+    int nb_numa_nodes = machine_num_numa_nodes(machine);
 
     if (machine->ram_size % SPAPR_MEMORY_BLOCK_SIZE) {
         error_setg(errp, "Memory size 0x" RAM_ADDR_FMT
@@ -4111,7 +4120,7 @@ spapr_cpu_index_to_props(MachineState *machine, unsigned cpu_index)
 
 static int64_t spapr_get_default_cpu_node_id(const MachineState *ms, int idx)
 {
-    return idx / smp_cores % nb_numa_nodes;
+    return idx / smp_cores % machine_num_numa_nodes(ms);
 }
 
 static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
@@ -4315,6 +4324,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
     smc->update_dt_enabled = true;
     mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power9_v2.0");
     mc->has_hotpluggable_cpus = true;
+    mc->numa_supported = true;
     smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;
     fwc->get_dev_path = spapr_get_fw_dev_path;
     nc->nmi_monitor_handler = spapr_nmi;
diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
index 1a563ad756..991cf05134 100644
--- a/include/hw/acpi/aml-build.h
+++ b/include/hw/acpi/aml-build.h
@@ -414,7 +414,7 @@ build_append_gas_from_struct(GArray *table, const struct AcpiGenericAddress *s)
 void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
                        uint64_t len, int node, MemoryAffinityFlags flags);
 
-void build_slit(GArray *table_data, BIOSLinker *linker);
+void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms);
 
 void build_fadt(GArray *tbl, BIOSLinker *linker, const AcpiFadtData *f,
                 const char *oem_id, const char *oem_table_id);
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 6f7916f88f..5f102e3075 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -5,6 +5,7 @@
 
 #include "sysemu/blockdev.h"
 #include "sysemu/accel.h"
+#include "sysemu/sysemu.h"
 #include "hw/qdev.h"
 #include "qom/object.h"
 #include "qom/cpu.h"
@@ -68,6 +69,7 @@ int machine_kvm_shadow_mem(MachineState *machine);
 int machine_phandle_start(MachineState *machine);
 bool machine_dump_guest_core(MachineState *machine);
 bool machine_mem_merge(MachineState *machine);
+int machine_num_numa_nodes(const MachineState *machine);
 HotpluggableCPUList *machine_query_hotpluggable_cpus(MachineState *machine);
 void machine_set_cpu_numa_node(MachineState *machine,
                                const CpuInstanceProperties *props,
@@ -210,6 +212,7 @@ struct MachineClass {
     bool ignore_boot_device_suffixes;
     bool smbus_no_migration_support;
     bool nvdimm_supported;
+    bool numa_supported;
 
     HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
                                            DeviceState *dev);
@@ -230,6 +233,12 @@ typedef struct DeviceMemoryState {
     MemoryRegion mr;
 } DeviceMemoryState;
 
+typedef struct NumaState {
+    /* Number of NUMA nodes */
+    int num_nodes;
+
+} NumaState;
+
 /**
  * MachineState:
  */
@@ -273,6 +282,7 @@ struct MachineState {
     AccelState *accelerator;
     CPUArchIdList *possible_cpus;
     struct NVDIMMState *nvdimms_state;
+    NumaState *numa_state;
 };
 
 #define DEFINE_MACHINE(namestr, machine_initfn) \
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index b6ac7de43e..a55e2be563 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -6,7 +6,6 @@
 #include "sysemu/hostmem.h"
 #include "hw/boards.h"
 
-extern int nb_numa_nodes;   /* Number of NUMA nodes */
 extern bool have_numa_distance;
 
 struct NodeInfo {
@@ -24,7 +23,7 @@ struct NumaNodeMem {
 extern NodeInfo numa_info[MAX_NODES];
 void parse_numa_opts(MachineState *ms);
 void numa_complete_configuration(MachineState *ms);
-void query_numa_node_mem(NumaNodeMem node_mem[]);
+void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms);
 extern QemuOptsList qemu_numa_opts;
 void numa_legacy_auto_assign_ram(MachineClass *mc, NodeInfo *nodes,
                                  int nb_nodes, ram_addr_t size);
diff --git a/monitor.c b/monitor.c
index bb48997913..28ea45a731 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1926,11 +1926,13 @@ static void hmp_info_numa(Monitor *mon, const QDict *qdict)
     int i;
     NumaNodeMem *node_mem;
     CpuInfoList *cpu_list, *cpu;
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     cpu_list = qmp_query_cpus(&error_abort);
     node_mem = g_new0(NumaNodeMem, nb_numa_nodes);
 
-    query_numa_node_mem(node_mem);
+    query_numa_node_mem(node_mem, ms);
     monitor_printf(mon, "%d nodes\n", nb_numa_nodes);
     for (i = 0; i < nb_numa_nodes; i++) {
         monitor_printf(mon, "node %d cpus:", i);
diff --git a/numa.c b/numa.c
index 3875e1efda..343fcaf13f 100644
--- a/numa.c
+++ b/numa.c
@@ -52,7 +52,6 @@ static int have_memdevs = -1;
 static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
                              * For all nodes, nodeid < max_numa_nodeid
                              */
-int nb_numa_nodes;
 bool have_numa_distance;
 NodeInfo numa_info[MAX_NODES];
 
@@ -68,7 +67,7 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
     if (node->has_nodeid) {
         nodenr = node->nodeid;
     } else {
-        nodenr = nb_numa_nodes;
+        nodenr = machine_num_numa_nodes(ms);
     }
 
     if (nodenr >= MAX_NODES) {
@@ -136,10 +135,11 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
     }
     numa_info[nodenr].present = true;
     max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
-    nb_numa_nodes++;
+    ms->numa_state->num_nodes++;
 }
 
-static void parse_numa_distance(NumaDistOptions *dist, Error **errp)
+static
+void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
 {
     uint16_t src = dist->src;
     uint16_t dst = dist->dst;
@@ -179,6 +179,11 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
 {
     Error *err = NULL;
 
+    if (ms->numa_state == NULL) {
+        error_setg(errp, "NUMA is not supported by this machine-type");
+        goto end;
+    }
+
     switch (object->type) {
     case NUMA_OPTIONS_TYPE_NODE:
         parse_numa_node(ms, &object->u.node, &err);
@@ -187,7 +192,7 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
         }
         break;
     case NUMA_OPTIONS_TYPE_DIST:
-        parse_numa_distance(&object->u.dist, &err);
+        parse_numa_distance(ms, &object->u.dist, &err);
         if (err) {
             goto end;
         }
@@ -252,10 +257,11 @@ end:
  * distance from a node to itself is always NUMA_DISTANCE_MIN,
  * so providing it is never necessary.
  */
-static void validate_numa_distance(void)
+static void validate_numa_distance(MachineState *ms)
 {
     int src, dst;
     bool is_asymmetrical = false;
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     for (src = 0; src < nb_numa_nodes; src++) {
         for (dst = src; dst < nb_numa_nodes; dst++) {
@@ -293,9 +299,10 @@ static void validate_numa_distance(void)
     }
 }
 
-static void complete_init_numa_distance(void)
+static void complete_init_numa_distance(MachineState *ms)
 {
     int src, dst;
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     /* Fixup NUMA distance by symmetric policy because if it is an
      * asymmetric distance table, it should be a complete table and
@@ -369,7 +376,7 @@ void numa_complete_configuration(MachineState *ms)
      *
      * Enable NUMA implicitly by adding a new NUMA node automatically.
      */
-    if (ms->ram_slots > 0 && nb_numa_nodes == 0 &&
+    if (ms->ram_slots > 0 && ms->numa_state->num_nodes == 0 &&
         mc->auto_enable_numa_with_memhp) {
             NumaNodeOptions node = { };
             parse_numa_node(ms, &node, &error_abort);
@@ -387,30 +394,33 @@ void numa_complete_configuration(MachineState *ms)
     }
 
     /* This must be always true if all nodes are present: */
-    assert(nb_numa_nodes == max_numa_nodeid);
+    assert(ms->numa_state->num_nodes == max_numa_nodeid);
 
-    if (nb_numa_nodes > 0) {
+    if (ms->numa_state->num_nodes > 0) {
         uint64_t numa_total;
 
-        if (nb_numa_nodes > MAX_NODES) {
-            nb_numa_nodes = MAX_NODES;
+        if (ms->numa_state->num_nodes > MAX_NODES) {
+            ms->numa_state->num_nodes = MAX_NODES;
         }
 
         /* If no memory size is given for any node, assume the default case
          * and distribute the available memory equally across all nodes
          */
-        for (i = 0; i < nb_numa_nodes; i++) {
+        for (i = 0; i < ms->numa_state->num_nodes; i++) {
             if (numa_info[i].node_mem != 0) {
                 break;
             }
         }
-        if (i == nb_numa_nodes) {
+        if (i == ms->numa_state->num_nodes) {
             assert(mc->numa_auto_assign_ram);
-            mc->numa_auto_assign_ram(mc, numa_info, nb_numa_nodes, ram_size);
+            mc->numa_auto_assign_ram(mc,
+                                     numa_info,
+                                     ms->numa_state->num_nodes,
+                                     ram_size);
         }
 
         numa_total = 0;
-        for (i = 0; i < nb_numa_nodes; i++) {
+        for (i = 0; i < ms->numa_state->num_nodes; i++) {
             numa_total += numa_info[i].node_mem;
         }
         if (numa_total != ram_size) {
@@ -434,10 +444,10 @@ void numa_complete_configuration(MachineState *ms)
          */
         if (have_numa_distance) {
             /* Validate enough NUMA distance information was provided. */
-            validate_numa_distance();
+            validate_numa_distance(ms);
 
             /* Validation succeeded, now fill in any missing distances. */
-            complete_init_numa_distance();
+            complete_init_numa_distance(ms);
         }
     }
 }
@@ -513,6 +523,8 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
 {
     uint64_t addr = 0;
     int i;
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int nb_numa_nodes = machine_num_numa_nodes(ms);
 
     if (nb_numa_nodes == 0 || !have_memdevs) {
         allocate_system_memory_nonnuma(mr, owner, name, ram_size);
@@ -578,16 +590,16 @@ static void numa_stat_memory_devices(NumaNodeMem node_mem[])
     qapi_free_MemoryDeviceInfoList(info_list);
 }
 
-void query_numa_node_mem(NumaNodeMem node_mem[])
+void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms)
 {
     int i;
 
-    if (nb_numa_nodes <= 0) {
+    if (ms->numa_state == NULL || ms->numa_state->num_nodes <= 0) {
         return;
     }
 
     numa_stat_memory_devices(node_mem);
-    for (i = 0; i < nb_numa_nodes; i++) {
+    for (i = 0; i < ms->numa_state->num_nodes; i++) {
         node_mem[i].node_mem += numa_info[i].node_mem;
     }
 }
-- 
2.17.1




* [Qemu-devel] [PATCH v4 02/11] numa: move numa global variable have_numa_distance into MachineState
From: Tao Xu @ 2019-05-08  6:17 UTC
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

The aim of this patch is to move the existing NUMA global
have_numa_distance into NumaState.

Reviewed-by: Liu Jingqi <jingqi.liu@intel.com>
Suggested-by: Igor Mammedov <imammedo@redhat.com>
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - send the patch together with the HMAT patches
---
 hw/arm/virt-acpi-build.c | 2 +-
 hw/arm/virt.c            | 2 +-
 hw/i386/acpi-build.c     | 2 +-
 include/hw/boards.h      | 2 ++
 include/sysemu/numa.h    | 2 --
 numa.c                   | 5 ++---
 6 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 6805b4de51..65f070843c 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -815,7 +815,7 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
     if (nb_numa_nodes > 0) {
         acpi_add_table(table_offsets, tables_blob);
         build_srat(tables_blob, tables->linker, vms);
-        if (have_numa_distance) {
+        if (ms->numa_state->have_numa_distance) {
             acpi_add_table(table_offsets, tables_blob);
             build_slit(tables_blob, tables->linker, ms);
         }
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 70954b658d..f0818ef597 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -228,7 +228,7 @@ static void create_fdt(VirtMachineState *vms)
                                 "clk24mhz");
     qemu_fdt_setprop_cell(fdt, "/apb-pclk", "phandle", vms->clock_phandle);
 
-    if (have_numa_distance) {
+    if (nb_numa_nodes > 0 && ms->numa_state->have_numa_distance) {
         int size = nb_numa_nodes * nb_numa_nodes * 3 * sizeof(uint32_t);
         uint32_t *matrix = g_malloc0(size);
         int idx, i, j;
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 7d9bc88ac9..43a807c483 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2685,7 +2685,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
     if (pcms->numa_nodes) {
         acpi_add_table(table_offsets, tables_blob);
         build_srat(tables_blob, tables->linker, machine);
-        if (have_numa_distance) {
+        if (machine->numa_state->have_numa_distance) {
             acpi_add_table(table_offsets, tables_blob);
             build_slit(tables_blob, tables->linker, machine);
         }
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 5f102e3075..c3c678b7ff 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -237,6 +237,8 @@ typedef struct NumaState {
     /* Number of NUMA nodes */
     int num_nodes;
 
+    /* Allow setting NUMA distance for different NUMA nodes */
+    bool have_numa_distance;
 } NumaState;
 
 /**
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index a55e2be563..1a29408db9 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -6,8 +6,6 @@
 #include "sysemu/hostmem.h"
 #include "hw/boards.h"
 
-extern bool have_numa_distance;
-
 struct NodeInfo {
     uint64_t node_mem;
     struct HostMemoryBackend *node_memdev;
diff --git a/numa.c b/numa.c
index 343fcaf13f..d4f5ff5193 100644
--- a/numa.c
+++ b/numa.c
@@ -52,7 +52,6 @@ static int have_memdevs = -1;
 static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
                              * For all nodes, nodeid < max_numa_nodeid
                              */
-bool have_numa_distance;
 NodeInfo numa_info[MAX_NODES];
 
 
@@ -171,7 +170,7 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
     }
 
     numa_info[src].distance[dst] = val;
-    have_numa_distance = true;
+    ms->numa_state->have_numa_distance = true;
 }
 
 static
@@ -442,7 +441,7 @@ void numa_complete_configuration(MachineState *ms)
          * asymmetric. In this case, the distances for both directions
          * of all node pairs are required.
          */
-        if (have_numa_distance) {
+        if (ms->numa_state->have_numa_distance) {
             /* Validate enough NUMA distance information was provided. */
             validate_numa_distance(ms);
 
-- 
2.17.1




* [Qemu-devel] [PATCH v4 03/11] numa: move numa global variable numa_info into MachineState
From: Tao Xu @ 2019-05-08  6:17 UTC
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

The aim of this patch is to move the existing NUMA global numa_info
(renamed to "nodes") into NumaState.
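
The hunks below mostly follow one pattern, sketched here with an
illustrative function name: take a local NodeInfo alias of the array
now owned by MachineState so the existing numa_info[i] accesses keep
their shape.

#include "qemu/osdep.h"
#include "hw/boards.h"

static uint64_t example_node_mem(MachineState *ms, int i)
{
    /* assumes ms->numa_state != NULL, i.e. the machine type supports NUMA */
    NodeInfo *numa_info = ms->numa_state->nodes;

    return numa_info[i].node_mem;
}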

Reviewed-by: Liu Jingqi <jingqi.liu@intel.com>
Suggested-by: Igor Mammedov <imammedo@redhat.com>
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - send the patch together with the HMAT patches

Changes in v2 -> v3:
    - rename "NumaState::numa_info" to "NumaState::nodes" (Eduardo)
---
 exec.c                   |  2 +-
 hw/acpi/aml-build.c      |  6 ++++--
 hw/arm/boot.c            |  2 +-
 hw/arm/virt-acpi-build.c |  7 ++++---
 hw/arm/virt.c            |  1 +
 hw/i386/pc.c             |  4 ++--
 hw/ppc/spapr.c           |  8 +++++++-
 hw/ppc/spapr_pci.c       |  2 ++
 include/hw/boards.h      | 10 ++++++++++
 include/sysemu/numa.h    |  8 --------
 numa.c                   | 15 +++++++++------
 11 files changed, 41 insertions(+), 24 deletions(-)

diff --git a/exec.c b/exec.c
index c7eb4af42d..0e30926588 100644
--- a/exec.c
+++ b/exec.c
@@ -1763,7 +1763,7 @@ long qemu_minrampagesize(void)
     if (hpsize > mainrampagesize &&
         (ms->numa_state == NULL ||
          ms->numa_state->num_nodes == 0 ||
-         numa_info[0].node_memdev == NULL)) {
+         ms->numa_state->nodes[0].node_memdev == NULL)) {
         static bool warned;
         if (!warned) {
             error_report("Huge page support disabled (n/a for main memory).");
diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
index c67f4561a4..b53a55cb56 100644
--- a/hw/acpi/aml-build.c
+++ b/hw/acpi/aml-build.c
@@ -1737,8 +1737,10 @@ void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms)
     build_append_int_noprefix(table_data, nb_numa_nodes, 8);
     for (i = 0; i < nb_numa_nodes; i++) {
         for (j = 0; j < nb_numa_nodes; j++) {
-            assert(numa_info[i].distance[j]);
-            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
+            assert(ms->numa_state->nodes[i].distance[j]);
+            build_append_int_noprefix(table_data,
+                                      ms->numa_state->nodes[i].distance[j],
+                                      1);
         }
     }
 
diff --git a/hw/arm/boot.c b/hw/arm/boot.c
index 8ff08814fd..845b737ab9 100644
--- a/hw/arm/boot.c
+++ b/hw/arm/boot.c
@@ -602,7 +602,7 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
     if (nb_numa_nodes > 0) {
         mem_base = binfo->loader_start;
         for (i = 0; i < nb_numa_nodes; i++) {
-            mem_len = numa_info[i].node_mem;
+            mem_len = ms->numa_state->nodes[i].node_mem;
             rc = fdt_add_memory_node(fdt, acells, mem_base,
                                      scells, mem_len, i);
             if (rc < 0) {
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 65f070843c..b22c3d27ad 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -535,11 +535,12 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 
     mem_base = vms->memmap[VIRT_MEM].base;
     for (i = 0; i < nb_numa_nodes; ++i) {
-        if (numa_info[i].node_mem > 0) {
+        if (ms->numa_state->nodes[i].node_mem > 0) {
             numamem = acpi_data_push(table_data, sizeof(*numamem));
-            build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i,
+            build_srat_memory(numamem, mem_base,
+                              ms->numa_state->nodes[i].node_mem, i,
                               MEM_AFFINITY_ENABLED);
-            mem_base += numa_info[i].node_mem;
+            mem_base += ms->numa_state->nodes[i].node_mem;
         }
     }
 
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index f0818ef597..853caf606f 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -232,6 +232,7 @@ static void create_fdt(VirtMachineState *vms)
         int size = nb_numa_nodes * nb_numa_nodes * 3 * sizeof(uint32_t);
         uint32_t *matrix = g_malloc0(size);
         int idx, i, j;
+        NodeInfo *numa_info = ms->numa_state->nodes;
 
         for (i = 0; i < nb_numa_nodes; i++) {
             for (j = 0; j < nb_numa_nodes; j++) {
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 6404ae508e..1c7b2a97bc 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1043,7 +1043,7 @@ static FWCfgState *bochs_bios_init(AddressSpace *as, PCMachineState *pcms)
     }
     for (i = 0; i < nb_numa_nodes; i++) {
         numa_fw_cfg[pcms->apic_id_limit + 1 + i] =
-            cpu_to_le64(numa_info[i].node_mem);
+            cpu_to_le64(ms->numa_state->nodes[i].node_mem);
     }
     fw_cfg_add_bytes(fw_cfg, FW_CFG_NUMA, numa_fw_cfg,
                      (1 + pcms->apic_id_limit + nb_numa_nodes) *
@@ -1685,7 +1685,7 @@ void pc_guest_info_init(PCMachineState *pcms)
     pcms->node_mem = g_malloc0(pcms->numa_nodes *
                                     sizeof *pcms->node_mem);
     for (i = 0; i < nb_numa_nodes; i++) {
-        pcms->node_mem[i] = numa_info[i].node_mem;
+        pcms->node_mem[i] = ms->numa_state->nodes[i].node_mem;
     }
 
     pcms->machine_done.notify = pc_machine_done;
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 4f0a8d4e2e..d577c2025e 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -349,6 +349,7 @@ static hwaddr spapr_node0_size(MachineState *machine)
     int nb_numa_nodes = machine_num_numa_nodes(machine);
     if (nb_numa_nodes) {
         int i;
+        NodeInfo *numa_info = machine->numa_state->nodes;
         for (i = 0; i < nb_numa_nodes; ++i) {
             if (numa_info[i].node_mem) {
                 return MIN(pow2floor(numa_info[i].node_mem),
@@ -396,7 +397,9 @@ static int spapr_populate_memory(SpaprMachineState *spapr, void *fdt)
     int nb_numa_nodes = machine_num_numa_nodes(machine);
     hwaddr mem_start, node_size;
     int i, nb_nodes = nb_numa_nodes;
-    NodeInfo *nodes = numa_info;
+    NodeInfo *nodes = machine->numa_state ?
+                      machine->numa_state->nodes :
+                      NULL;
     NodeInfo ramnode;
 
     /* No NUMA nodes, assume there is just one node with whole RAM */
@@ -2518,6 +2521,9 @@ static void spapr_validate_node_memory(MachineState *machine, Error **errp)
 {
     int i;
     int nb_numa_nodes = machine_num_numa_nodes(machine);
+    NodeInfo *numa_info = machine->numa_state ?
+                          machine->numa_state->nodes :
+                          NULL;
 
     if (machine->ram_size % SPAPR_MEMORY_BLOCK_SIZE) {
         error_setg(errp, "Memory size 0x" RAM_ADDR_FMT
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 97961b0128..f4e5c0f5b2 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1660,6 +1660,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     SysBusDevice *s = SYS_BUS_DEVICE(dev);
     SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
     PCIHostState *phb = PCI_HOST_BRIDGE(s);
+    MachineState *ms = MACHINE(spapr);
+    NodeInfo *numa_info = ms->numa_state ? ms->numa_state->nodes : NULL;
     char *namebuf;
     int i;
     PCIBus *bus;
diff --git a/include/hw/boards.h b/include/hw/boards.h
index c3c678b7ff..777eed4dd9 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -233,12 +233,22 @@ typedef struct DeviceMemoryState {
     MemoryRegion mr;
 } DeviceMemoryState;
 
+struct NodeInfo {
+    uint64_t node_mem;
+    struct HostMemoryBackend *node_memdev;
+    bool present;
+    uint8_t distance[MAX_NODES];
+};
+
 typedef struct NumaState {
     /* Number of NUMA nodes */
     int num_nodes;
 
     /* Allow setting NUMA distance for different NUMA nodes */
     bool have_numa_distance;
+
+    /* NUMA nodes information */
+    NodeInfo nodes[MAX_NODES];
 } NumaState;
 
 /**
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index 1a29408db9..7b8011f9ea 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -6,19 +6,11 @@
 #include "sysemu/hostmem.h"
 #include "hw/boards.h"
 
-struct NodeInfo {
-    uint64_t node_mem;
-    struct HostMemoryBackend *node_memdev;
-    bool present;
-    uint8_t distance[MAX_NODES];
-};
-
 struct NumaNodeMem {
     uint64_t node_mem;
     uint64_t node_plugged_mem;
 };
 
-extern NodeInfo numa_info[MAX_NODES];
 void parse_numa_opts(MachineState *ms);
 void numa_complete_configuration(MachineState *ms);
 void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms);
diff --git a/numa.c b/numa.c
index d4f5ff5193..ddea376d72 100644
--- a/numa.c
+++ b/numa.c
@@ -52,8 +52,6 @@ static int have_memdevs = -1;
 static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
                              * For all nodes, nodeid < max_numa_nodeid
                              */
-NodeInfo numa_info[MAX_NODES];
-
 
 static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
                             Error **errp)
@@ -62,6 +60,7 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
     uint16_t nodenr;
     uint16List *cpus = NULL;
     MachineClass *mc = MACHINE_GET_CLASS(ms);
+    NodeInfo *numa_info = ms->numa_state->nodes;
 
     if (node->has_nodeid) {
         nodenr = node->nodeid;
@@ -143,6 +142,7 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
     uint16_t src = dist->src;
     uint16_t dst = dist->dst;
     uint8_t val = dist->val;
+    NodeInfo *numa_info = ms->numa_state->nodes;
 
     if (src >= MAX_NODES || dst >= MAX_NODES) {
         error_setg(errp, "Parameter '%s' expects an integer between 0 and %d",
@@ -201,7 +201,7 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
             error_setg(&err, "Missing mandatory node-id property");
             goto end;
         }
-        if (!numa_info[object->u.cpu.node_id].present) {
+        if (!ms->numa_state->nodes[object->u.cpu.node_id].present) {
             error_setg(&err, "Invalid node-id=%" PRId64 ", NUMA node must be "
                 "defined with -numa node,nodeid=ID before it's used with "
                 "-numa cpu,node-id=ID", object->u.cpu.node_id);
@@ -261,6 +261,7 @@ static void validate_numa_distance(MachineState *ms)
     int src, dst;
     bool is_asymmetrical = false;
     int nb_numa_nodes = machine_num_numa_nodes(ms);
+    NodeInfo *numa_info = ms->numa_state->nodes;
 
     for (src = 0; src < nb_numa_nodes; src++) {
         for (dst = src; dst < nb_numa_nodes; dst++) {
@@ -302,6 +303,7 @@ static void complete_init_numa_distance(MachineState *ms)
 {
     int src, dst;
     int nb_numa_nodes = machine_num_numa_nodes(ms);
+    NodeInfo *numa_info = ms->numa_state->nodes;
 
     /* Fixup NUMA distance by symmetric policy because if it is an
      * asymmetric distance table, it should be a complete table and
@@ -361,6 +363,7 @@ void numa_complete_configuration(MachineState *ms)
 {
     int i;
     MachineClass *mc = MACHINE_GET_CLASS(ms);
+    NodeInfo *numa_info = ms->numa_state->nodes;
 
     /*
      * If memory hotplug is enabled (slots > 0) but without '-numa'
@@ -532,8 +535,8 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
 
     memory_region_init(mr, owner, name, ram_size);
     for (i = 0; i < nb_numa_nodes; i++) {
-        uint64_t size = numa_info[i].node_mem;
-        HostMemoryBackend *backend = numa_info[i].node_memdev;
+        uint64_t size = ms->numa_state->nodes[i].node_mem;
+        HostMemoryBackend *backend = ms->numa_state->nodes[i].node_memdev;
         if (!backend) {
             continue;
         }
@@ -599,7 +602,7 @@ void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms)
 
     numa_stat_memory_devices(node_mem);
     for (i = 0; i < ms->numa_state->num_nodes; i++) {
-        node_mem[i].node_mem += numa_info[i].node_mem;
+        node_mem[i].node_mem += ms->numa_state->nodes[i].node_mem;
     }
 }
 
-- 
2.17.1




* [Qemu-devel] [PATCH v4 04/11] acpi: introduce AcpiDeviceIfClass.build_mem_ranges hook
From: Tao Xu @ 2019-05-08  6:17 UTC
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

Add a build_mem_ranges callback to AcpiDeviceIfClass and use it to
generate the NUMA memory ranges for SRAT and HMAT.
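
Condensed usage sketch, distilled from the hunks below (the example_*
names are illustrative; the hole-splitting itself lives in
pc_build_mem_ranges()): the ACPI device class registers the hook, and a
table builder fills the per-node ranges once and then walks them.

#include "qemu/osdep.h"
#include "hw/boards.h"
#include "hw/i386/pc.h"
#include "hw/acpi/acpi_dev_interface.h"

/* registered in the ACPI device's class_init, as piix4/lpc_ich9 do below */
static void example_register_hook(AcpiDeviceIfClass *adevc)
{
    adevc->build_mem_ranges = pc_build_mem_ranges;
}

static void example_consume_ranges(MachineState *machine, PCMachineState *pcms)
{
    AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
    uint32_t i;

    if (pcms->numa_nodes && !machine->numa_state->mem_ranges_num) {
        adevc->build_mem_ranges(ACPI_DEVICE_IF(pcms->acpi_dev), machine);
    }
    for (i = 0; i < machine->numa_state->mem_ranges_num; i++) {
        /* emit one SRAT/HMAT entry for
         * machine->numa_state->mem_ranges[i].{base, length, node}
         */
    }
}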

Suggested-by: Igor Mammedov <imammedo@redhat.com>
Co-developed-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - split the 1/8 patch of v3 into two patches: 4/11 introduces
    build_mem_ranges() and adds it to the ACPI interface, 5/11 builds
    HMAT (Igor)
---
 hw/acpi/piix4.c                      |   1 +
 hw/i386/acpi-build.c                 | 116 ++++++++++++++++-----------
 hw/isa/lpc_ich9.c                    |   1 +
 include/hw/acpi/acpi_dev_interface.h |   3 +
 include/hw/boards.h                  |  12 +++
 include/hw/i386/pc.h                 |   1 +
 stubs/Makefile.objs                  |   1 +
 stubs/pc_build_mem_ranges.c          |   6 ++
 8 files changed, 96 insertions(+), 45 deletions(-)
 create mode 100644 stubs/pc_build_mem_ranges.c

diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
index 9c079d6834..7c320a49b2 100644
--- a/hw/acpi/piix4.c
+++ b/hw/acpi/piix4.c
@@ -723,6 +723,7 @@ static void piix4_pm_class_init(ObjectClass *klass, void *data)
     adevc->ospm_status = piix4_ospm_status;
     adevc->send_event = piix4_send_gpe;
     adevc->madt_cpu = pc_madt_cpu_entry;
+    adevc->build_mem_ranges = pc_build_mem_ranges;
 }
 
 static const TypeInfo piix4_pm_info = {
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 43a807c483..5598e7f780 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -2271,6 +2271,65 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog)
 #define HOLE_640K_START  (640 * KiB)
 #define HOLE_640K_END   (1 * MiB)
 
+void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *ms)
+{
+    uint64_t mem_len, mem_base, next_base;
+    int i;
+    PCMachineState *pcms = PC_MACHINE(ms);
+    /*
+     * the memory map is a bit tricky, it contains at least one hole
+     * from 640k-1M and possibly another one from 3.5G-4G.
+     */
+    NumaMemRange *mem_ranges = ms->numa_state->mem_ranges;
+    ms->numa_state->mem_ranges_num = 0;
+    next_base = 0;
+
+    for (i = 0; i < pcms->numa_nodes; ++i) {
+        mem_base = next_base;
+        mem_len = pcms->node_mem[i];
+        next_base = mem_base + mem_len;
+
+        /* Cut out the 640K hole */
+        if (mem_base <= HOLE_640K_START &&
+            next_base > HOLE_640K_START) {
+            mem_len -= next_base - HOLE_640K_START;
+            if (mem_len > 0) {
+                mem_ranges[ms->numa_state->mem_ranges_num].base = mem_base;
+                mem_ranges[ms->numa_state->mem_ranges_num].length = mem_len;
+                mem_ranges[ms->numa_state->mem_ranges_num].node = i;
+                ms->numa_state->mem_ranges_num++;
+            }
+
+            /* Check for the rare case: 640K < RAM < 1M */
+            if (next_base <= HOLE_640K_END) {
+                next_base = HOLE_640K_END;
+                continue;
+            }
+            mem_base = HOLE_640K_END;
+            mem_len = next_base - HOLE_640K_END;
+        }
+
+        /* Cut out the ACPI_PCI hole */
+        if (mem_base <= pcms->below_4g_mem_size &&
+            next_base > pcms->below_4g_mem_size) {
+            mem_len -= next_base - pcms->below_4g_mem_size;
+            if (mem_len > 0) {
+                mem_ranges[ms->numa_state->mem_ranges_num].base = mem_base;
+                mem_ranges[ms->numa_state->mem_ranges_num].length = mem_len;
+                mem_ranges[ms->numa_state->mem_ranges_num].node = i;
+                ms->numa_state->mem_ranges_num++;
+            }
+            mem_base = 1ULL << 32;
+            mem_len = next_base - pcms->below_4g_mem_size;
+            next_base = mem_base + mem_len;
+        }
+        mem_ranges[ms->numa_state->mem_ranges_num].base = mem_base;
+        mem_ranges[ms->numa_state->mem_ranges_num].length = mem_len;
+        mem_ranges[ms->numa_state->mem_ranges_num].node = i;
+        ms->numa_state->mem_ranges_num++;
+    }
+}
+
 static void
 build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 {
@@ -2279,10 +2338,13 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 
     int i;
     int srat_start, numa_start, slots;
-    uint64_t mem_len, mem_base, next_base;
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     const CPUArchIdList *apic_ids = mc->possible_cpu_arch_ids(machine);
     PCMachineState *pcms = PC_MACHINE(machine);
+    AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
+    AcpiDeviceIf *adev = ACPI_DEVICE_IF(pcms->acpi_dev);
+    uint32_t mem_ranges_num = machine->numa_state->mem_ranges_num;
+    NumaMemRange *mem_ranges = machine->numa_state->mem_ranges;
     ram_addr_t hotplugabble_address_space_size =
         object_property_get_int(OBJECT(pcms), PC_MACHINE_DEVMEM_REGION_SIZE,
                                 NULL);
@@ -2319,57 +2381,21 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
         }
     }
 
+    if (pcms->numa_nodes && !mem_ranges_num) {
+        adevc->build_mem_ranges(adev, machine);
+    }
 
-    /* the memory map is a bit tricky, it contains at least one hole
-     * from 640k-1M and possibly another one from 3.5G-4G.
-     */
-    next_base = 0;
     numa_start = table_data->len;
 
-    for (i = 1; i < pcms->numa_nodes + 1; ++i) {
-        mem_base = next_base;
-        mem_len = pcms->node_mem[i - 1];
-        next_base = mem_base + mem_len;
-
-        /* Cut out the 640K hole */
-        if (mem_base <= HOLE_640K_START &&
-            next_base > HOLE_640K_START) {
-            mem_len -= next_base - HOLE_640K_START;
-            if (mem_len > 0) {
+    for (i = 0; i < mem_ranges_num; i++) {
+        if (mem_ranges[i].length > 0) {
                 numamem = acpi_data_push(table_data, sizeof *numamem);
-                build_srat_memory(numamem, mem_base, mem_len, i - 1,
+            build_srat_memory(numamem, mem_ranges[i].base,
+                              mem_ranges[i].length,
+                              mem_ranges[i].node,
                                   MEM_AFFINITY_ENABLED);
             }
-
-            /* Check for the rare case: 640K < RAM < 1M */
-            if (next_base <= HOLE_640K_END) {
-                next_base = HOLE_640K_END;
-                continue;
             }
-            mem_base = HOLE_640K_END;
-            mem_len = next_base - HOLE_640K_END;
-        }
-
-        /* Cut out the ACPI_PCI hole */
-        if (mem_base <= pcms->below_4g_mem_size &&
-            next_base > pcms->below_4g_mem_size) {
-            mem_len -= next_base - pcms->below_4g_mem_size;
-            if (mem_len > 0) {
-                numamem = acpi_data_push(table_data, sizeof *numamem);
-                build_srat_memory(numamem, mem_base, mem_len, i - 1,
-                                  MEM_AFFINITY_ENABLED);
-            }
-            mem_base = 1ULL << 32;
-            mem_len = next_base - pcms->below_4g_mem_size;
-            next_base = mem_base + mem_len;
-        }
-
-        if (mem_len > 0) {
-            numamem = acpi_data_push(table_data, sizeof *numamem);
-            build_srat_memory(numamem, mem_base, mem_len, i - 1,
-                              MEM_AFFINITY_ENABLED);
-        }
-    }
     slots = (table_data->len - numa_start) / sizeof *numamem;
     for (; slots < pcms->numa_nodes + 2; slots++) {
         numamem = acpi_data_push(table_data, sizeof *numamem);
diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
index ac44aa53be..4ae64846ba 100644
--- a/hw/isa/lpc_ich9.c
+++ b/hw/isa/lpc_ich9.c
@@ -812,6 +812,7 @@ static void ich9_lpc_class_init(ObjectClass *klass, void *data)
     adevc->ospm_status = ich9_pm_ospm_status;
     adevc->send_event = ich9_send_gpe;
     adevc->madt_cpu = pc_madt_cpu_entry;
+    adevc->build_mem_ranges = pc_build_mem_ranges;
 }
 
 static const TypeInfo ich9_lpc_info = {
diff --git a/include/hw/acpi/acpi_dev_interface.h b/include/hw/acpi/acpi_dev_interface.h
index 43ff119179..d8634ac1ed 100644
--- a/include/hw/acpi/acpi_dev_interface.h
+++ b/include/hw/acpi/acpi_dev_interface.h
@@ -39,6 +39,7 @@ void acpi_send_event(DeviceState *dev, AcpiEventStatusBits event);
  *           for CPU indexed by @uid in @apic_ids array,
  *           returned structure types are:
  *           0 - Local APIC, 9 - Local x2APIC, 0xB - GICC
+ * build_mem_ranges: build memory ranges of ACPI SRAT and HMAT
  *
  * Interface is designed for providing unified interface
  * to generic ACPI functionality that could be used without
@@ -54,5 +55,7 @@ typedef struct AcpiDeviceIfClass {
     void (*send_event)(AcpiDeviceIf *adev, AcpiEventStatusBits ev);
     void (*madt_cpu)(AcpiDeviceIf *adev, int uid,
                      const CPUArchIdList *apic_ids, GArray *entry);
+    void (*build_mem_ranges)(AcpiDeviceIf *adev, MachineState *ms);
+
 } AcpiDeviceIfClass;
 #endif
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 777eed4dd9..9fbf921ecf 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -240,6 +240,12 @@ struct NodeInfo {
     uint8_t distance[MAX_NODES];
 };
 
+typedef struct NumaMemRange {
+    uint64_t base;
+    uint64_t length;
+    uint32_t node;
+} NumaMemRange;
+
 typedef struct NumaState {
     /* Number of NUMA nodes */
     int num_nodes;
@@ -249,6 +255,12 @@ typedef struct NumaState {
 
     /* NUMA nodes information */
     NodeInfo nodes[MAX_NODES];
+
+    /* Number of NUMA memory ranges */
+    uint32_t mem_ranges_num;
+
+    /* NUMA memory ranges */
+    NumaMemRange mem_ranges[MAX_NODES + 2];
 } NumaState;
 
 /**
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 43df7230a2..1e4ee404ae 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -281,6 +281,7 @@ void pc_system_firmware_init(PCMachineState *pcms, MemoryRegion *rom_memory);
 /* acpi-build.c */
 void pc_madt_cpu_entry(AcpiDeviceIf *adev, int uid,
                        const CPUArchIdList *apic_ids, GArray *entry);
+void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *ms);
 
 /* e820 types */
 #define E820_RAM        1
diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
index 269dfa5832..7e0a962815 100644
--- a/stubs/Makefile.objs
+++ b/stubs/Makefile.objs
@@ -33,6 +33,7 @@ stub-obj-y += qmp_memory_device.o
 stub-obj-y += target-monitor-defs.o
 stub-obj-y += target-get-monitor-def.o
 stub-obj-y += pc_madt_cpu_entry.o
+stub-obj-y += pc_build_mem_ranges.o
 stub-obj-y += vmgenid.o
 stub-obj-y += xen-common.o
 stub-obj-y += xen-hvm.o
diff --git a/stubs/pc_build_mem_ranges.c b/stubs/pc_build_mem_ranges.c
new file mode 100644
index 0000000000..0f104ba79d
--- /dev/null
+++ b/stubs/pc_build_mem_ranges.c
@@ -0,0 +1,6 @@
+#include "qemu/osdep.h"
+#include "hw/i386/pc.h"
+
+void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *machine)
+{
+}
-- 
2.17.1




* [Qemu-devel] [PATCH v4 05/11] hmat acpi: Build Memory Subsystem Address Range Structure(s) in ACPI HMAT
  2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
                   ` (3 preceding siblings ...)
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 04/11] acpi: introduce AcpiDeviceIfClass.build_mem_ranges hook Tao Xu
@ 2019-05-08  6:17 ` Tao Xu
  2019-05-24 14:16   ` Igor Mammedov
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 06/11] hmat acpi: Build System Locality Latency and Bandwidth Information " Tao Xu
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-05-08  6:17 UTC (permalink / raw)
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

From: Liu Jingqi <jingqi.liu@intel.com>

HMAT is defined in ACPI 6.2: 5.2.27 Heterogeneous Memory Attribute Table (HMAT).
The specification can be found at the link below:
http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

It describes the memory attributes, such as memory side cache
attributes and bandwidth and latency details, related to the
System Physical Address (SPA) Memory Ranges. The software is
expected to use this information as a hint for optimization.

This structure describes the System Physical Address (SPA) range
occupied by the memory subsystem and its associativity with the
processor proximity domain, as well as hints for memory usage.
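
For reference, the 40-byte layout appended by build_hmat_spa() below can
be pictured as the following packed struct. This is illustrative only;
the patch itself emits the fields with build_append_int_noprefix() and
does not define such a struct:

    #include <stdint.h>

    struct hmat_spa_range {        /* ACPI 6.2 HMAT structure type 0, 40 bytes */
        uint16_t type;             /* 0: Memory Subsystem Address Range */
        uint16_t reserved0;
        uint32_t length;           /* 40 */
        uint16_t flags;            /* HMAT_SPA_PROC_VALID | HMAT_SPA_MEM_VALID */
        uint16_t reserved1;
        uint32_t processor_pxm;    /* initiator proximity domain */
        uint32_t memory_pxm;       /* target proximity domain */
        uint32_t reserved2;
        uint64_t spa_base;         /* System Physical Address Range Base */
        uint64_t spa_length;       /* System Physical Address Range Length */
    } __attribute__((packed));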

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - split the 1/8 patch of v3 into two patches, 4/11 introduces
    build_mem_ranges() and adds it to the ACPI interface, 5/11 builds
    HMAT (Igor)
    - use MachineState instead of PCMachineState to build HMAT more
    generically (Igor)
    - move hmat_build_spa() inside of hmat_build_hma() (Igor)
---
 hw/acpi/Kconfig       |   5 ++
 hw/acpi/Makefile.objs |   1 +
 hw/acpi/hmat.c        | 135 ++++++++++++++++++++++++++++++++++++++++++
 hw/acpi/hmat.h        |  43 ++++++++++++++
 hw/i386/acpi-build.c  |  11 ++--
 include/hw/boards.h   |   2 +
 numa.c                |   6 ++
 7 files changed, 199 insertions(+), 4 deletions(-)
 create mode 100644 hw/acpi/hmat.c
 create mode 100644 hw/acpi/hmat.h

diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
index eca3beed75..074dbd5a42 100644
--- a/hw/acpi/Kconfig
+++ b/hw/acpi/Kconfig
@@ -7,6 +7,7 @@ config ACPI_X86
     select ACPI_NVDIMM
     select ACPI_CPU_HOTPLUG
     select ACPI_MEMORY_HOTPLUG
+    select ACPI_HMAT
 
 config ACPI_X86_ICH
     bool
@@ -27,3 +28,7 @@ config ACPI_VMGENID
     bool
     default y
     depends on PC
+
+config ACPI_HMAT
+    bool
+    depends on ACPI
diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
index 2d46e3789a..932ba42d13 100644
--- a/hw/acpi/Makefile.objs
+++ b/hw/acpi/Makefile.objs
@@ -6,6 +6,7 @@ common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
 common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu.o
 common-obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
 common-obj-$(CONFIG_ACPI_VMGENID) += vmgenid.o
+common-obj-$(CONFIG_ACPI_HMAT) += hmat.o
 common-obj-$(call lnot,$(CONFIG_ACPI_X86)) += acpi-stub.o
 
 common-obj-y += acpi_interface.o
diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
new file mode 100644
index 0000000000..bffe453280
--- /dev/null
+++ b/hw/acpi/hmat.c
@@ -0,0 +1,135 @@
+/*
+ * HMAT ACPI Implementation
+ *
+ * Copyright(C) 2019 Intel Corporation.
+ *
+ * Author:
+ *  Liu jingqi <jingqi.liu@linux.intel.com>
+ *  Tao Xu <tao3.xu@intel.com>
+ *
+ * HMAT is defined in ACPI 6.2.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+
+#include "qemu/osdep.h"
+#include "sysemu/numa.h"
+#include "hw/i386/pc.h"
+#include "hw/acpi/hmat.h"
+#include "hw/nvram/fw_cfg.h"
+
+/* Build Memory Subsystem Address Range Structure */
+static void build_hmat_spa(GArray *table_data, MachineState *ms,
+                           uint64_t base, uint64_t length, int node)
+{
+    uint16_t flags = 0;
+
+    if (ms->numa_state->nodes[node].is_initiator) {
+        flags |= HMAT_SPA_PROC_VALID;
+    }
+    if (ms->numa_state->nodes[node].is_target) {
+        flags |= HMAT_SPA_MEM_VALID;
+    }
+
+    /* Memory Subsystem Address Range Structure */
+    /* Type */
+    build_append_int_noprefix(table_data, 0, 2);
+    /* Reserved */
+    build_append_int_noprefix(table_data, 0, 2);
+    /* Length */
+    build_append_int_noprefix(table_data, 40, 4);
+    /* Flags */
+    build_append_int_noprefix(table_data, flags, 2);
+    /* Reserved */
+    build_append_int_noprefix(table_data, 0, 2);
+    /* Processor Proximity Domain */
+    build_append_int_noprefix(table_data, node, 4);
+    /* Memory Proximity Domain */
+    build_append_int_noprefix(table_data, node, 4);
+    /* Reserved */
+    build_append_int_noprefix(table_data, 0, 4);
+    /* System Physical Address Range Base */
+    build_append_int_noprefix(table_data, base, 8);
+    /* System Physical Address Range Length */
+    build_append_int_noprefix(table_data, length, 8);
+}
+
+static int pc_dimm_device_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
+        *list = g_slist_append(*list, DEVICE(obj));
+    }
+
+    object_child_foreach(obj, pc_dimm_device_list, opaque);
+    return 0;
+}
+
+/*
+ * The Proximity Domain of System Physical Address ranges defined
+ * in the HMAT, NFIT and SRAT tables shall match each other.
+ */
+static void hmat_build_hma(GArray *table_data, MachineState *ms)
+{
+    GSList *device_list = NULL;
+    uint64_t mem_base, mem_len;
+    int i;
+    uint32_t mem_ranges_num = ms->numa_state->mem_ranges_num;
+    NumaMemRange *mem_ranges = ms->numa_state->mem_ranges;
+
+    PCMachineState *pcms = PC_MACHINE(ms);
+    AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
+    AcpiDeviceIf *adev = ACPI_DEVICE_IF(pcms->acpi_dev);
+
+    /* Build HMAT Memory Subsystem Address Range. */
+    if (pcms->numa_nodes && !mem_ranges_num) {
+        adevc->build_mem_ranges(adev, ms);
+    }
+
+    for (i = 0; i < mem_ranges_num; i++) {
+        build_hmat_spa(table_data, ms, mem_ranges[i].base,
+                       mem_ranges[i].length,
+                       mem_ranges[i].node);
+    }
+
+    /* Build HMAT SPA structures for PC-DIMM devices. */
+    object_child_foreach(qdev_get_machine(),
+                         pc_dimm_device_list, &device_list);
+
+    for (; device_list; device_list = device_list->next) {
+        PCDIMMDevice *dimm = device_list->data;
+        mem_base = object_property_get_uint(OBJECT(dimm), PC_DIMM_ADDR_PROP,
+                                            NULL);
+        mem_len = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
+                                           NULL);
+        i = object_property_get_uint(OBJECT(dimm), PC_DIMM_NODE_PROP, NULL);
+        build_hmat_spa(table_data, ms, mem_base, mem_len, i);
+    }
+}
+
+void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
+{
+    uint64_t hmat_start, hmat_len;
+
+    hmat_start = table_data->len;
+    acpi_data_push(table_data, 40);
+
+    hmat_build_hma(table_data, ms);
+    hmat_len = table_data->len - hmat_start;
+
+    build_header(linker, table_data,
+                 (void *)(table_data->data + hmat_start),
+                 "HMAT", hmat_len, 1, NULL, NULL);
+}
diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
new file mode 100644
index 0000000000..4f480c1e43
--- /dev/null
+++ b/hw/acpi/hmat.h
@@ -0,0 +1,43 @@
+/*
+ * HMAT ACPI Implementation Header
+ *
+ * Copyright(C) 2019 Intel Corporation.
+ *
+ * Author:
+ *  Liu jingqi <jingqi.liu@linux.intel.com>
+ *  Tao Xu <tao3.xu@intel.com>
+ *
+ * HMAT is defined in ACPI 6.2.
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>
+ */
+
+#ifndef HMAT_H
+#define HMAT_H
+
+#include "hw/acpi/acpi-defs.h"
+#include "hw/acpi/acpi.h"
+#include "hw/acpi/bios-linker-loader.h"
+#include "hw/acpi/aml-build.h"
+
+/* the values of AcpiHmatSpaRange flag */
+enum {
+    HMAT_SPA_PROC_VALID       = 0x1,
+    HMAT_SPA_MEM_VALID        = 0x2,
+    HMAT_SPA_RESERVATION_HINT = 0x4,
+};
+
+void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms);
+
+#endif
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 5598e7f780..d3d8c93631 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -64,6 +64,7 @@
 #include "hw/i386/intel_iommu.h"
 
 #include "hw/acpi/ipmi.h"
+#include "hw/acpi/hmat.h"
 
 /* These are used to size the ACPI tables for -M pc-i440fx-1.7 and
  * -M pc-i440fx-2.0.  Even if the actual amount of AML generated grows
@@ -2389,13 +2390,13 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
 
     for (i = 0; i < mem_ranges_num; i++) {
         if (mem_ranges[i].length > 0) {
-                numamem = acpi_data_push(table_data, sizeof *numamem);
+            numamem = acpi_data_push(table_data, sizeof *numamem);
             build_srat_memory(numamem, mem_ranges[i].base,
                               mem_ranges[i].length,
                               mem_ranges[i].node,
-                                  MEM_AFFINITY_ENABLED);
-            }
-            }
+                              MEM_AFFINITY_ENABLED);
+        }
+    }
     slots = (table_data->len - numa_start) / sizeof *numamem;
     for (; slots < pcms->numa_nodes + 2; slots++) {
         numamem = acpi_data_push(table_data, sizeof *numamem);
@@ -2715,6 +2716,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
             acpi_add_table(table_offsets, tables_blob);
             build_slit(tables_blob, tables->linker, machine);
         }
+        acpi_add_table(table_offsets, tables_blob);
+        hmat_build_acpi(tables_blob, tables->linker, machine);
     }
     if (acpi_get_mcfg(&mcfg)) {
         acpi_add_table(table_offsets, tables_blob);
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 9fbf921ecf..d392634e08 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -237,6 +237,8 @@ struct NodeInfo {
     uint64_t node_mem;
     struct HostMemoryBackend *node_memdev;
     bool present;
+    bool is_initiator;
+    bool is_target;
     uint8_t distance[MAX_NODES];
 };
 
diff --git a/numa.c b/numa.c
index ddea376d72..71b0aee02a 100644
--- a/numa.c
+++ b/numa.c
@@ -102,6 +102,10 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
         }
     }
 
+    if (node->cpus) {
+        numa_info[nodenr].is_initiator = true;
+    }
+
     if (node->has_mem && node->has_memdev) {
         error_setg(errp, "cannot specify both mem= and memdev=");
         return;
@@ -118,6 +122,7 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
 
     if (node->has_mem) {
         numa_info[nodenr].node_mem = node->mem;
+        numa_info[nodenr].is_target = true;
     }
     if (node->has_memdev) {
         Object *o;
@@ -130,6 +135,7 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
         object_ref(o);
         numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
         numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
+        numa_info[nodenr].is_target = true;
     }
     numa_info[nodenr].present = true;
     max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
-- 
2.17.1




* [Qemu-devel] [PATCH v4 06/11] hmat acpi: Build System Locality Latency and Bandwidth Information Structure(s) in ACPI HMAT
  2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
                   ` (4 preceding siblings ...)
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 05/11] hmat acpi: Build Memory Subsystem Address Range Structure(s) in ACPI HMAT Tao Xu
@ 2019-05-08  6:17 ` Tao Xu
  2019-06-04 14:43   ` Igor Mammedov
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache " Tao Xu
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-05-08  6:17 UTC (permalink / raw)
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

From: Liu Jingqi <jingqi.liu@intel.com>

This structure describes the memory access latency and bandwidth
information from various memory access initiator proximity domains.
The latency and bandwidth numbers represented in this structure
correspond to rated latency and bandwidth for the platform.
The software could use this information as a hint for optimization.
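
As a reading aid (not part of the patch): each 2-byte matrix entry in
this structure is scaled by the 8-byte Entry Base Unit, so decoding one
cell amounts to the sketch below. Function and variable names are made
up for illustration:

    #include <stdint.h>

    /* Latency in nanoseconds or bandwidth in MB/s, depending on the
     * structure's Data Type field. */
    static uint64_t hmat_lb_decode(uint64_t entry_base_unit, uint16_t entry)
    {
        return entry_base_unit * entry;
    }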

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - use build_append_int_noprefix() to build the System Locality
    Latency and Bandwidth Information Structure(s) (Igor)
    - move globals (hmat_lb_info) into MachineState (Igor)
    - move hmat_build_lb() inside of hmat_build_hma() (Igor)
---
 hw/acpi/hmat.c          | 97 ++++++++++++++++++++++++++++++++++++++++-
 hw/acpi/hmat.h          | 39 +++++++++++++++++
 include/hw/boards.h     |  3 ++
 include/qemu/typedefs.h |  1 +
 include/sysemu/sysemu.h | 22 ++++++++++
 5 files changed, 161 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index bffe453280..54aabf77eb 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -29,6 +29,9 @@
 #include "hw/acpi/hmat.h"
 #include "hw/nvram/fw_cfg.h"
 
+static uint32_t initiator_pxm[MAX_NODES], target_pxm[MAX_NODES];
+static uint32_t num_initiator, num_target;
+
 /* Build Memory Subsystem Address Range Structure */
 static void build_hmat_spa(GArray *table_data, MachineState *ms,
                            uint64_t base, uint64_t length, int node)
@@ -77,6 +80,20 @@ static int pc_dimm_device_list(Object *obj, void *opaque)
     return 0;
 }
 
+static void classify_proximity_domains(MachineState *ms)
+{
+    int node;
+
+    for (node = 0; node < ms->numa_state->num_nodes; node++) {
+        if (ms->numa_state->nodes[node].is_initiator) {
+            initiator_pxm[num_initiator++] = node;
+        }
+        if (ms->numa_state->nodes[node].is_target) {
+            target_pxm[num_target++] = node;
+        }
+    }
+}
+
 /*
  * The Proximity Domain of System Physical Address ranges defined
  * in the HMAT, NFIT and SRAT tables shall match each other.
@@ -85,9 +102,10 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
 {
     GSList *device_list = NULL;
     uint64_t mem_base, mem_len;
-    int i;
+    int i, j, hrchy, type;
     uint32_t mem_ranges_num = ms->numa_state->mem_ranges_num;
     NumaMemRange *mem_ranges = ms->numa_state->mem_ranges;
+    HMAT_LB_Info *numa_hmat_lb;
 
     PCMachineState *pcms = PC_MACHINE(ms);
     AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
@@ -117,6 +135,83 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
         i = object_property_get_uint(OBJECT(dimm), PC_DIMM_NODE_PROP, NULL);
         build_hmat_spa(table_data, ms, mem_base, mem_len, i);
     }
+
+    if (!num_initiator && !num_target) {
+        classify_proximity_domains(ms);
+    }
+
+    /* Build HMAT System Locality Latency and Bandwidth Information. */
+    for (hrchy = HMAT_LB_MEM_MEMORY;
+         hrchy <= HMAT_LB_MEM_CACHE_3RD_LEVEL; hrchy++) {
+        for (type = HMAT_LB_DATA_ACCESS_LATENCY;
+             type <= HMAT_LB_DATA_WRITE_BANDWIDTH; type++) {
+            numa_hmat_lb = ms->numa_state->hmat_lb[hrchy][type];
+
+            if (numa_hmat_lb) {
+                uint32_t s = num_initiator;
+                uint32_t t = num_target;
+                uint8_t m, n;
+
+                /* Type */
+                build_append_int_noprefix(table_data, 1, 2);
+                /* Reserved */
+                build_append_int_noprefix(table_data, 0, 2);
+                /* Length */
+                build_append_int_noprefix(table_data,
+                                          32 + 4 * s + 4 * t + 2 * s * t, 4);
+                /* Flags */
+                build_append_int_noprefix(table_data,
+                                          numa_hmat_lb->hierarchy, 1);
+                /* Data Type */
+                build_append_int_noprefix(table_data,
+                                          numa_hmat_lb->data_type, 1);
+                /* Reserved */
+                build_append_int_noprefix(table_data, 0, 2);
+                /* Number of Initiator Proximity Domains (s) */
+                build_append_int_noprefix(table_data, s, 4);
+                /* Number of Target Proximity Domains (t) */
+                build_append_int_noprefix(table_data, t, 4);
+                /* Reserved */
+                build_append_int_noprefix(table_data, 0, 4);
+
+                /* Entry Base Unit */
+                if (type <= HMAT_LB_DATA_WRITE_LATENCY) {
+                    build_append_int_noprefix(table_data,
+                                              numa_hmat_lb->base_lat, 8);
+                } else {
+                    build_append_int_noprefix(table_data,
+                                              numa_hmat_lb->base_bw, 8);
+                }
+
+                /* Initiator Proximity Domain List */
+                for (i = 0; i < s; i++) {
+                    build_append_int_noprefix(table_data, initiator_pxm[i], 4);
+                }
+
+                /* Target Proximity Domain List */
+                for (i = 0; i < t; i++) {
+                    build_append_int_noprefix(table_data, target_pxm[i], 4);
+                }
+
+                /* Latency or Bandwidth Entries */
+                for (i = 0; i < s; i++) {
+                    m = initiator_pxm[i];
+                    for (j = 0; j < t; j++) {
+                        n = target_pxm[j];
+                        uint16_t entry;
+
+                        if (type <= HMAT_LB_DATA_WRITE_LATENCY) {
+                            entry = numa_hmat_lb->latency[m][n];
+                        } else {
+                            entry = numa_hmat_lb->bandwidth[m][n];
+                        }
+
+                        build_append_int_noprefix(table_data, entry, 2);
+                    }
+                }
+            }
+        }
+    }
 }
 
 void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
index 4f480c1e43..f37e30e533 100644
--- a/hw/acpi/hmat.h
+++ b/hw/acpi/hmat.h
@@ -38,6 +38,45 @@ enum {
     HMAT_SPA_RESERVATION_HINT = 0x4,
 };
 
+struct HMAT_LB_Info {
+    /*
+     * Indicates total number of Proximity Domains
+     * that can initiate memory access requests.
+     */
+    uint32_t    num_initiator;
+    /*
+     * Indicates total number of Proximity Domains
+     * that can act as target.
+     */
+    uint32_t    num_target;
+    /*
+     * Indicates it's memory or
+     * the specified level memory side cache.
+     */
+    uint8_t     hierarchy;
+    /*
+     * Present the type of data,
+     * access/read/write latency or bandwidth.
+     */
+    uint8_t     data_type;
+    /* The base unit for latency in nanoseconds. */
+    uint64_t    base_lat;
+    /* The base unit for bandwidth in megabytes per second(MB/s). */
+    uint64_t    base_bw;
+    /*
+     * latency[i][j]:
+     * Indicates the latency based on base_lat
+     * from Initiator Proximity Domain i to Target Proximity Domain j.
+     */
+    uint16_t    latency[MAX_NODES][MAX_NODES];
+    /*
+     * bandwidth[i][j]:
+     * Indicates the bandwidth based on base_bw
+     * from Initiator Proximity Domain i to Target Proximity Domain j.
+     */
+    uint16_t    bandwidth[MAX_NODES][MAX_NODES];
+};
+
 void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms);
 
 #endif
diff --git a/include/hw/boards.h b/include/hw/boards.h
index d392634e08..e0169b0a64 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -263,6 +263,9 @@ typedef struct NumaState {
 
     /* NUMA memory ranges */
     NumaMemRange mem_ranges[MAX_NODES + 2];
+
+    /* NUMA nodes HMAT Locality Latency and Bandwidth Information */
+    HMAT_LB_Info *hmat_lb[HMAT_LB_LEVELS][HMAT_LB_TYPES];
 } NumaState;
 
 /**
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index fcdaae58c4..c0257e936b 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -33,6 +33,7 @@ typedef struct FWCfgEntry FWCfgEntry;
 typedef struct FWCfgIoState FWCfgIoState;
 typedef struct FWCfgMemState FWCfgMemState;
 typedef struct FWCfgState FWCfgState;
+typedef struct HMAT_LB_Info HMAT_LB_Info;
 typedef struct HVFX86EmulatorState HVFX86EmulatorState;
 typedef struct I2CBus I2CBus;
 typedef struct I2SCodec I2SCodec;
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 5f133cae83..da51a9bc26 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -124,6 +124,28 @@ extern int mem_prealloc;
 #define NUMA_DISTANCE_MAX         254
 #define NUMA_DISTANCE_UNREACHABLE 255
 
+/* the value of AcpiHmatLBInfo flags */
+enum {
+    HMAT_LB_MEM_MEMORY           = 0,
+    HMAT_LB_MEM_CACHE_LAST_LEVEL = 1,
+    HMAT_LB_MEM_CACHE_1ST_LEVEL  = 2,
+    HMAT_LB_MEM_CACHE_2ND_LEVEL  = 3,
+    HMAT_LB_MEM_CACHE_3RD_LEVEL  = 4,
+};
+
+/* the value of AcpiHmatLBInfo data type */
+enum {
+    HMAT_LB_DATA_ACCESS_LATENCY   = 0,
+    HMAT_LB_DATA_READ_LATENCY     = 1,
+    HMAT_LB_DATA_WRITE_LATENCY    = 2,
+    HMAT_LB_DATA_ACCESS_BANDWIDTH = 3,
+    HMAT_LB_DATA_READ_BANDWIDTH   = 4,
+    HMAT_LB_DATA_WRITE_BANDWIDTH  = 5,
+};
+
+#define HMAT_LB_LEVELS    (HMAT_LB_MEM_CACHE_3RD_LEVEL + 1)
+#define HMAT_LB_TYPES     (HMAT_LB_DATA_WRITE_BANDWIDTH + 1)
+
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
     const char *name;
-- 
2.17.1




* [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI HMAT
  2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
                   ` (5 preceding siblings ...)
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 06/11] hmat acpi: Build System Locality Latency and Bandwidth Information " Tao Xu
@ 2019-05-08  6:17 ` Tao Xu
  2019-06-04 15:04   ` Igor Mammedov
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information Tao Xu
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-05-08  6:17 UTC (permalink / raw)
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

From: Liu Jingqi <jingqi.liu@intel.com>

This structure describes memory side cache information for memory
proximity domains if the memory side cache is present and the
physical device (SMBIOS handle) forms the memory side cache.
The software could use this information to effectively place
data in memory so as to maximize the performance of the system
memory that uses the memory side cache.
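
The 32-bit Cache Attributes field assembled below packs its sub-fields
as in this sketch, which simply mirrors the HMAT_CACHE_* macros added by
this patch (variable names are illustrative):

    /* cache_attr bit layout:
     *   [3:0]   total cache levels for the proximity domain
     *   [7:4]   cache level described by this structure
     *   [11:8]  associativity (none/direct-mapped/complex)
     *   [15:12] write policy (none/write-back/write-through)
     *   [31:16] cache line size in bytes
     */
    uint32_t cache_attr = HMAT_CACHE_TOTAL_LEVEL(total_levels) |
                          HMAT_CACHE_CURRENT_LEVEL(level) |
                          HMAT_CACHE_ASSOC(associativity) |
                          HMAT_CACHE_WRITE_POLICY(write_policy) |
                          HMAT_CACHE_LINE_SIZE(line_size);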

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - use build_append_int_noprefix() to build the Memory Side Cache
    Information Structure(s) (Igor)
    - move globals (hmat_cache_info) into MachineState (Igor)
    - move hmat_build_cache() inside of hmat_build_hma() (Igor)
---
 hw/acpi/hmat.c          | 50 ++++++++++++++++++++++++++++++++++++++++-
 hw/acpi/hmat.h          | 25 +++++++++++++++++++++
 include/hw/boards.h     |  3 +++
 include/qemu/typedefs.h |  1 +
 include/sysemu/sysemu.h |  8 +++++++
 5 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index 54aabf77eb..3a8c41162d 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -102,10 +102,11 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
 {
     GSList *device_list = NULL;
     uint64_t mem_base, mem_len;
-    int i, j, hrchy, type;
+    int i, j, hrchy, type, level;
     uint32_t mem_ranges_num = ms->numa_state->mem_ranges_num;
     NumaMemRange *mem_ranges = ms->numa_state->mem_ranges;
     HMAT_LB_Info *numa_hmat_lb;
+    HMAT_Cache_Info *numa_hmat_cache = NULL;
 
     PCMachineState *pcms = PC_MACHINE(ms);
     AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
@@ -212,6 +213,53 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
             }
         }
     }
+
+    /* Build HMAT Memory Side Cache Information. */
+    for (i = 0; i < ms->numa_state->num_nodes; i++) {
+        for (level = 0; level <= MAX_HMAT_CACHE_LEVEL; level++) {
+            numa_hmat_cache = ms->numa_state->hmat_cache[i][level];
+            if (numa_hmat_cache) {
+                uint16_t n = numa_hmat_cache->num_smbios_handles;
+                uint32_t cache_attr = HMAT_CACHE_TOTAL_LEVEL(
+                                      numa_hmat_cache->total_levels);
+                cache_attr |= HMAT_CACHE_CURRENT_LEVEL(
+                              numa_hmat_cache->level);
+                cache_attr |= HMAT_CACHE_ASSOC(
+                                          numa_hmat_cache->associativity);
+                cache_attr |= HMAT_CACHE_WRITE_POLICY(
+                                          numa_hmat_cache->write_policy);
+                cache_attr |= HMAT_CACHE_LINE_SIZE(
+                                          numa_hmat_cache->line_size);
+                cache_attr = cpu_to_le32(cache_attr);
+
+                /* Memory Side Cache Information Structure */
+                /* Type */
+                build_append_int_noprefix(table_data, 2, 2);
+                /* Reserved */
+                build_append_int_noprefix(table_data, 0, 2);
+                /* Length */
+                build_append_int_noprefix(table_data, 32 + 2 * n, 4);
+                /* Proximity Domain for the Memory */
+                build_append_int_noprefix(table_data,
+                                          numa_hmat_cache->mem_proximity, 4);
+                /* Reserved */
+                build_append_int_noprefix(table_data, 0, 4);
+                /* Memory Side Cache Size */
+                build_append_int_noprefix(table_data,
+                                          numa_hmat_cache->size, 8);
+                /* Cache Attributes */
+                build_append_int_noprefix(table_data, cache_attr, 4);
+                /* Reserved */
+                build_append_int_noprefix(table_data, 0, 2);
+                /* Number of SMBIOS handles (n) */
+                build_append_int_noprefix(table_data, n, 2);
+
+                /* SMBIOS Handles */
+                /* TBD: set smbios handles */
+                build_append_int_noprefix(table_data, 0, 2 * n);
+            }
+        }
+    }
 }
 
 void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
index f37e30e533..8f563f19dd 100644
--- a/hw/acpi/hmat.h
+++ b/hw/acpi/hmat.h
@@ -77,6 +77,31 @@ struct HMAT_LB_Info {
     uint16_t    bandwidth[MAX_NODES][MAX_NODES];
 };
 
+struct HMAT_Cache_Info {
+    /* The memory proximity domain to which the memory belongs. */
+    uint32_t    mem_proximity;
+    /* Size of memory side cache in bytes. */
+    uint64_t    size;
+    /*
+     * Total cache levels for this memory
+     * proximity domain.
+     */
+    uint8_t     total_levels;
+    /* Cache level described in this structure. */
+    uint8_t     level;
+    /* Cache Associativity: None/Direct Mapped/Complex Cache Indexing */
+    uint8_t     associativity;
+    /* Write Policy: None/Write Back(WB)/Write Through(WT) */
+    uint8_t     write_policy;
+    /* Cache Line size in bytes. */
+    uint16_t    line_size;
+    /*
+     * Number of SMBIOS handles that contribute to
+     * the memory side cache physical devices.
+     */
+    uint16_t    num_smbios_handles;
+};
+
 void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms);
 
 #endif
diff --git a/include/hw/boards.h b/include/hw/boards.h
index e0169b0a64..8609f923d9 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -266,6 +266,9 @@ typedef struct NumaState {
 
     /* NUMA nodes HMAT Locality Latency and Bandwidth Information */
     HMAT_LB_Info *hmat_lb[HMAT_LB_LEVELS][HMAT_LB_TYPES];
+
+    /* Memory Side Cache Information Structure */
+    HMAT_Cache_Info *hmat_cache[MAX_NODES][MAX_HMAT_CACHE_LEVEL + 1];
 } NumaState;
 
 /**
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index c0257e936b..d971f5109e 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -33,6 +33,7 @@ typedef struct FWCfgEntry FWCfgEntry;
 typedef struct FWCfgIoState FWCfgIoState;
 typedef struct FWCfgMemState FWCfgMemState;
 typedef struct FWCfgState FWCfgState;
+typedef struct HMAT_Cache_Info HMAT_Cache_Info;
 typedef struct HMAT_LB_Info HMAT_LB_Info;
 typedef struct HVFX86EmulatorState HVFX86EmulatorState;
 typedef struct I2CBus I2CBus;
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index da51a9bc26..0cfb387887 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -143,9 +143,17 @@ enum {
     HMAT_LB_DATA_WRITE_BANDWIDTH  = 5,
 };
 
+#define MAX_HMAT_CACHE_LEVEL        3
+
 #define HMAT_LB_LEVELS    (HMAT_LB_MEM_CACHE_3RD_LEVEL + 1)
 #define HMAT_LB_TYPES     (HMAT_LB_DATA_WRITE_BANDWIDTH + 1)
 
+#define HMAT_CACHE_TOTAL_LEVEL(level)      (level & 0xF)
+#define HMAT_CACHE_CURRENT_LEVEL(level)    ((level & 0xF) << 4)
+#define HMAT_CACHE_ASSOC(assoc)            ((assoc & 0xF) << 8)
+#define HMAT_CACHE_WRITE_POLICY(policy)    ((policy & 0xF) << 12)
+#define HMAT_CACHE_LINE_SIZE(size)         ((size & 0xFFFF) << 16)
+
 #define MAX_OPTION_ROMS 16
 typedef struct QEMUOptionRom {
     const char *name;
-- 
2.17.1




* [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information
  2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
                   ` (6 preceding siblings ...)
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache " Tao Xu
@ 2019-05-08  6:17 ` Tao Xu
  2019-06-05 14:40   ` Igor Mammedov
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 09/11] numa: Extend the command-line to provide memory side cache information Tao Xu
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-05-08  6:17 UTC (permalink / raw)
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

From: Liu Jingqi <jingqi.liu@intel.com>

Add -numa hmat-lb option to provide System Locality Latency and
Bandwidth Information. These memory attributes help to build
System Locality Latency and Bandwidth Information Structure(s)
in ACPI Heterogeneous Memory Attribute Table (HMAT).
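
Using the option added below, for example:

    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,base-lat=10,latency=10

the access latency reported from node 0 to node 1 is latency multiplied
by base-lat, i.e. 100 nanoseconds.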

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - update the version tag from 4.0 to 4.1
---
 numa.c          | 127 ++++++++++++++++++++++++++++++++++++++++++++++++
 qapi/misc.json  |  94 ++++++++++++++++++++++++++++++++++-
 qemu-options.hx |  28 ++++++++++-
 3 files changed, 246 insertions(+), 3 deletions(-)

diff --git a/numa.c b/numa.c
index 71b0aee02a..1aecb7a2e9 100644
--- a/numa.c
+++ b/numa.c
@@ -40,6 +40,7 @@
 #include "qemu/option.h"
 #include "qemu/config-file.h"
 #include "qemu/cutils.h"
+#include "hw/acpi/hmat.h"
 
 QemuOptsList qemu_numa_opts = {
     .name = "numa",
@@ -179,6 +180,126 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
     ms->numa_state->have_numa_distance = true;
 }
 
+static void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
+                               Error **errp)
+{
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    NodeInfo *numa_info = ms->numa_state->nodes;
+    HMAT_LB_Info *hmat_lb = NULL;
+
+    if (node->data_type <= HMATLB_DATA_TYPE_WRITE_LATENCY) {
+        if (!node->has_latency) {
+            error_setg(errp, "Missing 'latency' option.");
+            return;
+        }
+        if (node->has_bandwidth) {
+            error_setg(errp, "Invalid option 'bandwidth' since "
+                       "the data type is latency.");
+            return;
+        }
+        if (node->has_base_bw) {
+            error_setg(errp, "Invalid option 'base_bw' since "
+                       "the data type is latency.");
+            return;
+        }
+    }
+
+    if (node->data_type >= HMATLB_DATA_TYPE_ACCESS_BANDWIDTH) {
+        if (!node->has_bandwidth) {
+            error_setg(errp, "Missing 'bandwidth' option.");
+            return;
+        }
+        if (node->has_latency) {
+            error_setg(errp, "Invalid option 'latency' since "
+                       "the data type is bandwidth.");
+            return;
+        }
+        if (node->has_base_lat) {
+            error_setg(errp, "Invalid option 'base_lat' since "
+                       "the data type is bandwidth.");
+            return;
+        }
+    }
+
+    if (node->initiator >= nb_numa_nodes) {
+        error_setg(errp, "Invalid initiator=%"
+                   PRIu16 ", it should be less than %d.",
+                   node->initiator, nb_numa_nodes);
+        return;
+    }
+    if (!numa_info[node->initiator].is_initiator) {
+        error_setg(errp, "Invalid initiator=%"
+                   PRIu16 ", it isn't an initiator proximity domain.",
+                   node->initiator);
+        return;
+    }
+
+    if (node->target >= nb_numa_nodes) {
+        error_setg(errp, "Invalid target=%"
+                   PRIu16 ", it should be less than %d.",
+                   node->target, nb_numa_nodes);
+        return;
+    }
+    if (!numa_info[node->target].is_target) {
+        error_setg(errp, "Invalid target=%"
+                   PRIu16 ", it isn't a target proximity domain.",
+                   node->target);
+        return;
+    }
+
+    if (node->has_latency) {
+        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
+
+        if (!hmat_lb) {
+            hmat_lb = g_malloc0(sizeof(*hmat_lb));
+            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
+        } else if (hmat_lb->latency[node->initiator][node->target]) {
+            error_setg(errp, "Duplicate configuration of the latency for "
+                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
+                       node->initiator, node->target);
+            return;
+        }
+
+        /* Only the first time of setting the base unit is valid. */
+        if ((hmat_lb->base_lat == 0) && (node->has_base_lat)) {
+            hmat_lb->base_lat = node->base_lat;
+        }
+
+        hmat_lb->latency[node->initiator][node->target] = node->latency;
+    }
+
+    if (node->has_bandwidth) {
+        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
+
+        if (!hmat_lb) {
+            hmat_lb = g_malloc0(sizeof(*hmat_lb));
+            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
+        } else if (hmat_lb->bandwidth[node->initiator][node->target]) {
+            error_setg(errp, "Duplicate configuration of the bandwidth for "
+                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
+                       node->initiator, node->target);
+            return;
+        }
+
+        /* Only the first time of setting the base unit is valid. */
+        if (hmat_lb->base_bw == 0) {
+            if (!node->has_base_bw) {
+                error_setg(errp, "Missing 'base-bw' option");
+                return;
+            } else {
+                hmat_lb->base_bw = node->base_bw;
+            }
+        }
+
+        hmat_lb->bandwidth[node->initiator][node->target] = node->bandwidth;
+    }
+
+    if (hmat_lb) {
+        hmat_lb->hierarchy = node->hierarchy;
+        hmat_lb->data_type = node->data_type;
+    }
+}
+
 static
 void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
 {
@@ -217,6 +338,12 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
         machine_set_cpu_numa_node(ms, qapi_NumaCpuOptions_base(&object->u.cpu),
                                   &err);
         break;
+    case NUMA_OPTIONS_TYPE_HMAT_LB:
+        parse_numa_hmat_lb(ms, &object->u.hmat_lb, &err);
+        if (err) {
+            goto end;
+        }
+        break;
     default:
         abort();
     }
diff --git a/qapi/misc.json b/qapi/misc.json
index 8b3ca4fdd3..d7fce75702 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -2539,10 +2539,12 @@
 #
 # @cpu: property based CPU(s) to node mapping (Since: 2.10)
 #
+# @hmat-lb: memory latency and bandwidth information (Since: 4.1)
+#
 # Since: 2.1
 ##
 { 'enum': 'NumaOptionsType',
-  'data': [ 'node', 'dist', 'cpu' ] }
+  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
 
 ##
 # @NumaOptions:
@@ -2557,7 +2559,8 @@
   'data': {
     'node': 'NumaNodeOptions',
     'dist': 'NumaDistOptions',
-    'cpu': 'NumaCpuOptions' }}
+    'cpu': 'NumaCpuOptions',
+    'hmat-lb': 'NumaHmatLBOptions' }}
 
 ##
 # @NumaNodeOptions:
@@ -2620,6 +2623,93 @@
    'base': 'CpuInstanceProperties',
    'data' : {} }
 
+##
+# @HmatLBMemoryHierarchy:
+#
+# The memory hierarchy in the System Locality Latency
+# and Bandwidth Information Structure of HMAT (Heterogeneous
+# Memory Attribute Table)
+#
+# @memory: the structure represents the memory performance
+#
+# @last-level: last level memory of memory side cached memory
+#
+# @first-level: first level memory of memory side cached memory
+#
+# @second-level: second level memory of memory side cached memory
+#
+# @third-level: third level memory of memory side cached memory
+#
+# Since: 4.1
+##
+{ 'enum': 'HmatLBMemoryHierarchy',
+  'data': [ 'memory', 'last-level', 'first-level',
+            'second-level', 'third-level' ] }
+
+##
+# @HmatLBDataType:
+#
+# Data type in the System Locality Latency
+# and Bandwidth Information Structure of HMAT (Heterogeneous
+# Memory Attribute Table)
+#
+# @access-latency: access latency (nanoseconds)
+#
+# @read-latency: read latency (nanoseconds)
+#
+# @write-latency: write latency (nanoseconds)
+#
+# @access-bandwidth: access bandwidth (MB/s)
+#
+# @read-bandwidth: read bandwidth (MB/s)
+#
+# @write-bandwidth: write bandwidth (MB/s)
+#
+# Since: 4.1
+##
+{ 'enum': 'HmatLBDataType',
+  'data': [ 'access-latency', 'read-latency', 'write-latency',
+            'access-bandwidth', 'read-bandwidth', 'write-bandwidth' ] }
+
+##
+# @NumaHmatLBOptions:
+#
+# Set the system locality latency and bandwidth information
+# between Initiator and Target proximity Domains.
+#
+# @initiator: the Initiator Proximity Domain.
+#
+# @target: the Target Proximity Domain.
+#
+# @hierarchy: the Memory Hierarchy. Indicates the performance
+#             of memory or side cache.
+#
+# @data-type: presents the type of data, access/read/write
+#             latency or bandwidth.
+#
+# @base-lat: the base unit for latency in nanoseconds.
+#
+# @base-bw: the base unit for bandwidth in megabytes per second(MB/s).
+#
+# @latency: the value of latency based on Base Unit from @initiator
+#           to @target proximity domain.
+#
+# @bandwidth: the value of bandwidth based on Base Unit between
+#             @initiator and @target proximity domain.
+#
+# Since: 4.1
+##
+{ 'struct': 'NumaHmatLBOptions',
+  'data': {
+   'initiator': 'uint16',
+   'target': 'uint16',
+   'hierarchy': 'HmatLBMemoryHierarchy',
+   'data-type': 'HmatLBDataType',
+   '*base-lat': 'uint64',
+   '*base-bw': 'uint64',
+   '*latency': 'uint16',
+   '*bandwidth': 'uint16' }}
+
 ##
 # @HostMemPolicy:
 #
diff --git a/qemu-options.hx b/qemu-options.hx
index 51802cbb26..5351b0e453 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -163,16 +163,19 @@ DEF("numa", HAS_ARG, QEMU_OPTION_numa,
     "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
     "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
     "-numa dist,src=source,dst=destination,val=distance\n"
-    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
+    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n"
+    "-numa hmat-lb,initiator=node,target=node,hierarchy=memory|last-level,data-type=access-latency|read-latency|write-latency[,base-lat=blat][,base-bw=bbw][,latency=lat][,bandwidth=bw]\n",
     QEMU_ARCH_ALL)
 STEXI
 @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
 @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
 @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
 @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
+@itemx -numa hmat-lb,initiator=@var{node},target=@var{node},hierarchy=@var{str},data-type=@var{str}[,base-lat=@var{blat}][,base-bw=@var{bbw}][,latency=@var{lat}][,bandwidth=@var{bw}]
 @findex -numa
 Define a NUMA node and assign RAM and VCPUs to it.
 Set the NUMA distance from a source node to a destination node.
+Set the ACPI Heterogeneous Memory Attribute for the given nodes.
 
 Legacy VCPU assignment uses @samp{cpus} option where
 @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
@@ -230,6 +233,29 @@ specified resources, it just assigns existing resources to NUMA
 nodes. This means that one still has to use the @option{-m},
 @option{-smp} options to allocate RAM and VCPUs respectively.
 
+Use 'hmat-lb' to set System Locality Latency and Bandwidth Information
+between an initiator NUMA node and a target NUMA node to build the ACPI Heterogeneous Memory Attribute Table (HMAT).
+An initiator NUMA node can initiate memory access requests and usually contains one or more processors.
+A target NUMA node contains addressable memory.
+
+For example:
+@example
+-m 2G \
+-smp 3,sockets=2,maxcpus=3 \
+-numa node,cpus=0-1,nodeid=0 \
+-numa node,mem=1G,cpus=2,nodeid=1 \
+-numa node,mem=1G,nodeid=2 \
+-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,base-lat=10,base-bw=20,latency=10,bandwidth=10 \
+-numa hmat-lb,initiator=1,target=2,hierarchy=first-level,data-type=access-latency,base-bw=10,bandwidth=20
+@end example
+
+When the processors in NUMA node 0 access memory in NUMA node 1,
+the first line containing 'hmat-lb' sets the latency and bandwidth information.
+The latency is @var{lat} multiplied by @var{blat} and the bandwidth is @var{bw} multiplied by @var{bbw}.
+
+When the processors in NUMA node 1 access memory in NUMA node 2 that acts as a first-level memory side cache,
+the second line containing 'hmat-lb' sets the access hit bandwidth information.
+
 ETEXI
 
 DEF("add-fd", HAS_ARG, QEMU_OPTION_add_fd,
-- 
2.17.1




* [Qemu-devel] [PATCH v4 09/11] numa: Extend the command-line to provide memory side cache information
  2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
                   ` (7 preceding siblings ...)
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information Tao Xu
@ 2019-05-08  6:17 ` Tao Xu
  2019-06-16 19:52   ` Igor Mammedov
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 10/11] acpi: introduce build_acpi_aml_common for NFIT generalizations Tao Xu
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-05-08  6:17 UTC (permalink / raw)
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

From: Liu Jingqi <jingqi.liu@intel.com>

Add -numa hmat-cache option to provide Memory Side Cache Information.
These memory attributes help to build Memory Side Cache Information
Structure(s) in ACPI Heterogeneous Memory Attribute Table (HMAT).
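
A plausible invocation, assuming the option keys follow the QAPI member
names introduced below (this revision does not extend qemu-options.hx
for hmat-cache, so the exact spelling is an assumption):

    -numa node,mem=1G,nodeid=0 \
    -numa hmat-cache,node-id=0,size=10K,total=1,level=1,assoc=direct,policy=write-back,line=8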

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - update the version tag from 4.0 to 4.1
---
 numa.c         | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++
 qapi/misc.json | 72 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 145 insertions(+), 2 deletions(-)

diff --git a/numa.c b/numa.c
index 1aecb7a2e9..4866736fc8 100644
--- a/numa.c
+++ b/numa.c
@@ -300,6 +300,75 @@ static void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
     }
 }
 
+static
+void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node,
+                            Error **errp)
+{
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+    HMAT_Cache_Info *hmat_cache = NULL;
+
+    if (node->node_id >= nb_numa_nodes) {
+        error_setg(errp, "Invalid node-id=%" PRIu32
+                   ", it should be less than %d.",
+                   node->node_id, nb_numa_nodes);
+        return;
+    }
+    if (!ms->numa_state->nodes[node->node_id].is_target) {
+        error_setg(errp, "Invalid node-id=%" PRIu32
+                   ", it isn't a target proximity domain.",
+                   node->node_id);
+        return;
+    }
+
+    if (node->total > MAX_HMAT_CACHE_LEVEL) {
+        error_setg(errp, "Invalid total=%" PRIu8
+                   ", it should be less than or equal to %d.",
+                   node->total, MAX_HMAT_CACHE_LEVEL);
+        return;
+    }
+    if (node->level > node->total) {
+        error_setg(errp, "Invalid level=%" PRIu8
+                   ", it should be less than or equal to"
+                   " total=%" PRIu8 ".",
+                   node->level, node->total);
+        return;
+    }
+    if (ms->numa_state->hmat_cache[node->node_id][node->level]) {
+        error_setg(errp, "Duplicate configuration of the side cache for "
+                   "node-id=%" PRIu32 " and level=%" PRIu8 ".",
+                   node->node_id, node->level);
+        return;
+    }
+
+    if ((node->level > 1) &&
+        ms->numa_state->hmat_cache[node->node_id][node->level - 1] &&
+        (node->size >=
+            ms->numa_state->hmat_cache[node->node_id][node->level - 1]->size)) {
+        error_setg(errp, "Invalid size=0x%" PRIx64
+                   ", the size of level=%" PRIu8
+                   " should be less than the size(0x%" PRIx64
+                   ") of level=%" PRIu8 ".",
+                   node->size, node->level,
+                   ms->numa_state->hmat_cache[node->node_id]
+                                             [node->level - 1]->size,
+                   node->level - 1);
+        return;
+    }
+
+    hmat_cache = g_malloc0(sizeof(*hmat_cache));
+
+    hmat_cache->mem_proximity = node->node_id;
+    hmat_cache->size = node->size;
+    hmat_cache->total_levels = node->total;
+    hmat_cache->level = node->level;
+    hmat_cache->associativity = node->assoc;
+    hmat_cache->write_policy = node->policy;
+    hmat_cache->line_size = node->line;
+    hmat_cache->num_smbios_handles = 0;
+
+    ms->numa_state->hmat_cache[node->node_id][node->level] = hmat_cache;
+}
+
 static
 void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
 {
@@ -344,6 +413,12 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
             goto end;
         }
         break;
+    case NUMA_OPTIONS_TYPE_HMAT_CACHE:
+        parse_numa_hmat_cache(ms, &object->u.hmat_cache, &err);
+        if (err) {
+            goto end;
+        }
+        break;
     default:
         abort();
     }
diff --git a/qapi/misc.json b/qapi/misc.json
index d7fce75702..2b7e34b469 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -2541,10 +2541,12 @@
 #
 # @hmat-lb: memory latency and bandwidth information (Since: 4.1)
 #
+# @hmat-cache: memory side cache information (Since: 4.1)
+#
 # Since: 2.1
 ##
 { 'enum': 'NumaOptionsType',
-  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
+  'data': [ 'node', 'dist', 'cpu', 'hmat-lb', 'hmat-cache' ] }
 
 ##
 # @NumaOptions:
@@ -2560,7 +2562,8 @@
     'node': 'NumaNodeOptions',
     'dist': 'NumaDistOptions',
     'cpu': 'NumaCpuOptions',
-    'hmat-lb': 'NumaHmatLBOptions' }}
+    'hmat-lb': 'NumaHmatLBOptions',
+    'hmat-cache': 'NumaHmatCacheOptions' }}
 
 ##
 # @NumaNodeOptions:
@@ -2710,6 +2713,71 @@
    '*latency': 'uint16',
    '*bandwidth': 'uint16' }}
 
+##
+# @HmatCacheAssociativity:
+#
+# Cache associativity in the Memory Side Cache
+# Information Structure of HMAT
+#
+# @none: None
+#
+# @direct: Direct Mapped
+#
+# @complex: Complex Cache Indexing (implementation specific)
+#
+# Since: 4.1
+##
+{ 'enum': 'HmatCacheAssociativity',
+  'data': [ 'none', 'direct', 'complex' ] }
+
+##
+# @HmatCacheWritePolicy:
+#
+# Cache write policy in the Memory Side Cache
+# Information Structure of HMAT
+#
+# @none: None
+#
+# @write-back: Write Back (WB)
+#
+# @write-through: Write Through (WT)
+#
+# Since: 4.1
+##
+{ 'enum': 'HmatCacheWritePolicy',
+  'data': [ 'none', 'write-back', 'write-through' ] }
+
+##
+# @NumaHmatCacheOptions:
+#
+# Set the memory side cache information for a given memory domain.
+#
+# @node-id: the memory proximity domain to which the memory belongs.
+#
+# @size: the size of memory side cache in bytes.
+#
+# @total: the total cache levels for this memory proximity domain.
+#
+# @level: the cache level described in this structure.
+#
+# @assoc: the cache associativity, none/direct-mapped/complex (complex cache indexing).
+#
+# @policy: the write policy, none/write-back/write-through.
+#
+# @line: the cache line size in bytes.
+#
+# Since: 4.1
+##
+{ 'struct': 'NumaHmatCacheOptions',
+  'data': {
+   'node-id': 'uint32',
+   'size': 'size',
+   'total': 'uint8',
+   'level': 'uint8',
+   'assoc': 'HmatCacheAssociativity',
+   'policy': 'HmatCacheWritePolicy',
+   'line': 'uint16' }}
+
 ##
 # @HostMemPolicy:
 #
-- 
2.17.1




* [Qemu-devel] [PATCH v4 10/11] acpi: introduce build_acpi_aml_common for NFIT generalizations
  2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
                   ` (8 preceding siblings ...)
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 09/11] numa: Extend the command-line to provide memory side cache information Tao Xu
@ 2019-05-08  6:17 ` Tao Xu
  2019-06-06 17:00   ` Igor Mammedov
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 11/11] hmat acpi: Implement _HMA method to update HMAT at runtime Tao Xu
  2019-05-31  4:55   ` Dan Williams
  11 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-05-08  6:17 UTC (permalink / raw)
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

The aim of this patch is to move some of the NFIT AML-build code into
build_acpi_aml_common(), so that both NFIT and HMAT can use it.
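
A sketch of how another caller might use the helper after this change;
the HMAT side (11/11) is assumed to pass its own method names and status
codes, roughly like this (the HMAT_RET_* identifiers are hypothetical):

    build_acpi_aml_common(method, buf, buf_size, call_result, buf_name,
                          dev, "RHMA", "_HMA",
                          HMAT_RET_STATUS_SUCCESS,      /* hypothetical */
                          HMAT_RET_STATUS_HMA_CHANGED   /* hypothetical */);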

Reviewed-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v3 -> v4:
    - Split 8/8 of patch v3 into two parts, introduces NFIT
    generalizations (build_acpi_aml_common)
---
 hw/acpi/nvdimm.c        | 49 +++++++++++++++++++++++++++--------------
 include/hw/mem/nvdimm.h |  6 +++++
 2 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
index 9fdad6dc3f..e2be79a8b7 100644
--- a/hw/acpi/nvdimm.c
+++ b/hw/acpi/nvdimm.c
@@ -1140,12 +1140,11 @@ static void nvdimm_build_device_dsm(Aml *dev, uint32_t handle)
 
 static void nvdimm_build_fit(Aml *dev)
 {
-    Aml *method, *pkg, *buf, *buf_size, *offset, *call_result;
-    Aml *whilectx, *ifcond, *ifctx, *elsectx, *fit;
+    Aml *method, *pkg, *buf, *buf_name, *buf_size, *call_result;
 
     buf = aml_local(0);
     buf_size = aml_local(1);
-    fit = aml_local(2);
+    buf_name = aml_local(2);
 
     aml_append(dev, aml_name_decl(NVDIMM_DSM_RFIT_STATUS, aml_int(0)));
 
@@ -1164,6 +1163,22 @@ static void nvdimm_build_fit(Aml *dev)
                             aml_int(1) /* Revision 1 */,
                             aml_int(0x1) /* Read FIT */,
                             pkg, aml_int(NVDIMM_QEMU_RSVD_HANDLE_ROOT));
+
+    build_acpi_aml_common(method, buf, buf_size,
+                          call_result, buf_name, dev,
+                          "RFIT", "_FIT",
+                          NVDIMM_DSM_RET_STATUS_SUCCESS,
+                          NVDIMM_DSM_RET_STATUS_FIT_CHANGED);
+}
+
+void build_acpi_aml_common(Aml *method, Aml *buf, Aml *buf_size,
+                           Aml *call_result, Aml *buf_name, Aml *dev,
+                           const char *help_function, const char *method_name,
+                           int ret_status_success,
+                           int ret_status_changed)
+{
+    Aml *offset, *whilectx, *ifcond, *ifctx, *elsectx;
+
     aml_append(method, aml_store(call_result, buf));
 
     /* handle _DSM result. */
@@ -1174,7 +1189,7 @@ static void nvdimm_build_fit(Aml *dev)
                                  aml_name(NVDIMM_DSM_RFIT_STATUS)));
 
      /* if something is wrong during _DSM. */
-    ifcond = aml_equal(aml_int(NVDIMM_DSM_RET_STATUS_SUCCESS),
+    ifcond = aml_equal(aml_int(ret_status_success),
                        aml_name("STAU"));
     ifctx = aml_if(aml_lnot(ifcond));
     aml_append(ifctx, aml_return(aml_buffer(0, NULL)));
@@ -1185,7 +1200,7 @@ static void nvdimm_build_fit(Aml *dev)
                                     aml_int(4) /* the size of "STAU" */,
                                     buf_size));
 
-    /* if we read the end of fit. */
+    /* if we read the end of fit or hma. */
     ifctx = aml_if(aml_equal(buf_size, aml_int(0)));
     aml_append(ifctx, aml_return(aml_buffer(0, NULL)));
     aml_append(method, ifctx);
@@ -1196,38 +1211,38 @@ static void nvdimm_build_fit(Aml *dev)
     aml_append(method, aml_return(aml_name("BUFF")));
     aml_append(dev, method);
 
-    /* build _FIT. */
-    method = aml_method("_FIT", 0, AML_SERIALIZED);
+    /* build _FIT or _HMA. */
+    method = aml_method(method_name, 0, AML_SERIALIZED);
     offset = aml_local(3);
 
-    aml_append(method, aml_store(aml_buffer(0, NULL), fit));
+    aml_append(method, aml_store(aml_buffer(0, NULL), buf_name));
     aml_append(method, aml_store(aml_int(0), offset));
 
     whilectx = aml_while(aml_int(1));
-    aml_append(whilectx, aml_store(aml_call1("RFIT", offset), buf));
+    aml_append(whilectx, aml_store(aml_call1(help_function, offset), buf));
     aml_append(whilectx, aml_store(aml_sizeof(buf), buf_size));
 
     /*
-     * if fit buffer was changed during RFIT, read from the beginning
-     * again.
+     * if buffer was changed during RFIT or RHMA,
+     * read from the beginning again.
      */
     ifctx = aml_if(aml_equal(aml_name(NVDIMM_DSM_RFIT_STATUS),
-                             aml_int(NVDIMM_DSM_RET_STATUS_FIT_CHANGED)));
-    aml_append(ifctx, aml_store(aml_buffer(0, NULL), fit));
+                             aml_int(ret_status_changed)));
+    aml_append(ifctx, aml_store(aml_buffer(0, NULL), buf_name));
     aml_append(ifctx, aml_store(aml_int(0), offset));
     aml_append(whilectx, ifctx);
 
     elsectx = aml_else();
 
-    /* finish fit read if no data is read out. */
+    /* finish fit or hma read if no data is read out. */
     ifctx = aml_if(aml_equal(buf_size, aml_int(0)));
-    aml_append(ifctx, aml_return(fit));
+    aml_append(ifctx, aml_return(buf_name));
     aml_append(elsectx, ifctx);
 
     /* update the offset. */
     aml_append(elsectx, aml_add(offset, buf_size, offset));
-    /* append the data we read out to the fit buffer. */
-    aml_append(elsectx, aml_concatenate(fit, buf, fit));
+    /* append the data we read out to the fit or hma buffer. */
+    aml_append(elsectx, aml_concatenate(buf_name, buf, buf_name));
     aml_append(whilectx, elsectx);
     aml_append(method, whilectx);
 
diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
index 523a9b3d4a..6f04eddb40 100644
--- a/include/hw/mem/nvdimm.h
+++ b/include/hw/mem/nvdimm.h
@@ -25,6 +25,7 @@
 
 #include "hw/mem/pc-dimm.h"
 #include "hw/acpi/bios-linker-loader.h"
+#include "hw/acpi/aml-build.h"
 
 #define NVDIMM_DEBUG 0
 #define nvdimm_debug(fmt, ...)                                \
@@ -150,4 +151,9 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
                        uint32_t ram_slots);
 void nvdimm_plug(NVDIMMState *state);
 void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev);
+void build_acpi_aml_common(Aml *method, Aml *buf, Aml *buf_size,
+                           Aml *call_result, Aml *buf_name, Aml *dev,
+                           const char *help_function, const char *method_name,
+                           int ret_status_success,
+                           int ret_status_changed);
 #endif
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [Qemu-devel] [PATCH v4 11/11] hmat acpi: Implement _HMA method to update HMAT at runtime
  2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
                   ` (9 preceding siblings ...)
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 10/11] acpi: introduce build_acpi_aml_common for NFIT generalizations Tao Xu
@ 2019-05-08  6:17 ` Tao Xu
  2019-06-16 20:07   ` Igor Mammedov
  2019-05-31  4:55   ` Dan Williams
  11 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-05-08  6:17 UTC (permalink / raw)
  To: imammedo, mst, eblake, ehabkost, xiaoguangrong.eric
  Cc: pbonzini, tao3.xu, jingqi.liu, qemu-devel, rth

From: Liu Jingqi <jingqi.liu@intel.com>

OSPM evaluates HMAT only during system initialization.
Any changes to the HMAT state at runtime or information
regarding HMAT for hot plug are communicated using the _HMA method.

_HMA is an optional object that enables the platform to provide
the OS with updated Heterogeneous Memory Attributes information
at runtime. _HMA provides OSPM with the latest HMAT in its entirety,
overriding the existing HMAT.
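
The resulting runtime read path, sketched with the names used in this patch:

    guest: evaluates _HMA, which loops calling RHMA(offset) and concatenates
           the returned chunks into one buffer
    RHMA : packages the offset and calls HMAC (HMA_COMMON_METHOD)
    HMAC : stores the offset into the shared OFFT field, then writes the
           patched HMA memory address to IO port 0x0a19 (NTFI), trapping
           to QEMU
    QEMU : hmat_hma_method_write() -> hmat_handle_hma_method() returns up to
           HMAM_MEMORY_SIZE - 8 bytes of the current HMA buffer plus a status
    guest: stops at a zero-length read; if the status reports
           HMAM_RET_STATUS_HMA_CHANGED, it restarts from offset 0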

Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

Changes in v4 -> v3:
    - move AcpiHmaState from PCMachineState to MachineState
    to make HMAT more generic (Igor)
    - use build_acpi_aml_common() introduced in patch 10/11 to
    simplify hmat_build_aml (Igor)
---
 hw/acpi/hmat.c          | 296 ++++++++++++++++++++++++++++++++++++++++
 hw/acpi/hmat.h          |  72 ++++++++++
 hw/core/machine.c       |   3 +
 hw/i386/acpi-build.c    |   2 +
 hw/i386/pc.c            |   3 +
 hw/i386/pc_piix.c       |   4 +
 hw/i386/pc_q35.c        |   4 +
 include/hw/boards.h     |   1 +
 include/qemu/typedefs.h |   1 +
 9 files changed, 386 insertions(+)

diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
index 3a8c41162d..bc2dffd079 100644
--- a/hw/acpi/hmat.c
+++ b/hw/acpi/hmat.c
@@ -28,6 +28,7 @@
 #include "hw/i386/pc.h"
 #include "hw/acpi/hmat.h"
 #include "hw/nvram/fw_cfg.h"
+#include "hw/mem/nvdimm.h"
 
 static uint32_t initiator_pxm[MAX_NODES], target_pxm[MAX_NODES];
 static uint32_t num_initiator, num_target;
@@ -262,6 +263,270 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
     }
 }
 
+static uint64_t
+hmat_hma_method_read(void *opaque, hwaddr addr, unsigned size)
+{
+    printf("BUG: we never read _HMA IO Port.\n");
+    return 0;
+}
+
+/* _HMA Method: read HMA data. */
+static void hmat_handle_hma_method(AcpiHmaState *state,
+                                   HmatHmamIn *in, hwaddr hmam_mem_addr)
+{
+    HmatHmaBuffer *hma_buf = &state->hma_buf;
+    HmatHmamOut *read_hma_out;
+    GArray *hma;
+    uint32_t read_len = 0, ret_status;
+    int size;
+
+    if (in != NULL) {
+        le32_to_cpus(&in->offset);
+    }
+
+    hma = hma_buf->hma;
+    if (in->offset > hma->len) {
+        ret_status = HMAM_RET_STATUS_INVALID;
+        goto exit;
+    }
+
+    /* It is the first time to read HMA. */
+    if (!in->offset) {
+        hma_buf->dirty = false;
+    } else if (hma_buf->dirty) {
+        /* HMA has been changed during Reading HMA. */
+        ret_status = HMAM_RET_STATUS_HMA_CHANGED;
+        goto exit;
+    }
+
+    ret_status = HMAM_RET_STATUS_SUCCESS;
+    read_len = MIN(hma->len - in->offset,
+                   HMAM_MEMORY_SIZE - 2 * sizeof(uint32_t));
+exit:
+    size = sizeof(HmatHmamOut) + read_len;
+    read_hma_out = g_malloc(size);
+
+    read_hma_out->len = cpu_to_le32(size);
+    read_hma_out->ret_status = cpu_to_le32(ret_status);
+    memcpy(read_hma_out->data, hma->data + in->offset, read_len);
+
+    cpu_physical_memory_write(hmam_mem_addr, read_hma_out, size);
+
+    g_free(read_hma_out);
+}
+
+static void
+hmat_hma_method_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
+{
+    AcpiHmaState *state = opaque;
+    hwaddr hmam_mem_addr = val;
+    HmatHmamIn *in;
+
+    in = g_new(HmatHmamIn, 1);
+    cpu_physical_memory_read(hmam_mem_addr, in, sizeof(*in));
+
+    hmat_handle_hma_method(state, in, hmam_mem_addr);
+}
+
+static const MemoryRegionOps hmat_hma_method_ops = {
+    .read = hmat_hma_method_read,
+    .write = hmat_hma_method_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .valid = {
+        .min_access_size = 4,
+        .max_access_size = 4,
+    },
+};
+
+static void hmat_init_hma_buffer(HmatHmaBuffer *hma_buf)
+{
+    hma_buf->hma = g_array_new(false, true /* clear */, 1);
+}
+
+static uint8_t hmat_acpi_table_checksum(uint8_t *buffer, uint32_t length)
+{
+    uint8_t sum = 0;
+    uint8_t *end = buffer + length;
+
+    while (buffer < end) {
+        sum = (uint8_t) (sum + *(buffer++));
+    }
+    return (uint8_t)(0 - sum);
+}
+
+static void hmat_build_header(AcpiTableHeader *h,
+             const char *sig, int len, uint8_t rev,
+             const char *oem_id, const char *oem_table_id)
+{
+    memcpy(&h->signature, sig, 4);
+    h->length = cpu_to_le32(len);
+    h->revision = rev;
+
+    if (oem_id) {
+        strncpy((char *)h->oem_id, oem_id, sizeof h->oem_id);
+    } else {
+        memcpy(h->oem_id, ACPI_BUILD_APPNAME6, 6);
+    }
+
+    if (oem_table_id) {
+        strncpy((char *)h->oem_table_id, oem_table_id, sizeof(h->oem_table_id));
+    } else {
+        memcpy(h->oem_table_id, ACPI_BUILD_APPNAME4, 4);
+        memcpy(h->oem_table_id + 4, sig, 4);
+    }
+
+    h->oem_revision = cpu_to_le32(1);
+    memcpy(h->asl_compiler_id, ACPI_BUILD_APPNAME4, 4);
+    h->asl_compiler_revision = cpu_to_le32(1);
+
+    /* Calculate the checksum of the ACPI table. */
+    h->checksum = 0;
+    h->checksum = hmat_acpi_table_checksum((uint8_t *)h, len);
+}
+
+static void hmat_build_hma_buffer(MachineState *ms)
+{
+    HmatHmaBuffer *hma_buf = &(ms->acpi_hma_state->hma_buf);
+
+    /* Free the old hma buffer before new allocation. */
+    g_array_free(hma_buf->hma, true);
+
+    hma_buf->hma = g_array_new(false, true /* clear */, 1);
+    acpi_data_push(hma_buf->hma, 40);
+
+    /* build HMAT in a given buffer. */
+    hmat_build_hma(hma_buf->hma, ms);
+    hmat_build_header((void *)hma_buf->hma->data,
+                      "HMAT", hma_buf->hma->len, 1, NULL, NULL);
+    hma_buf->dirty = true;
+}
+
+static void hmat_build_common_aml(Aml *dev)
+{
+    Aml *method, *ifctx, *hmam_mem;
+    Aml *unsupport;
+    Aml *pckg, *pckg_index, *pckg_buf, *field;
+    Aml *hmam_out_buf, *hmam_out_buf_size;
+    uint8_t byte_list[1];
+
+    method = aml_method(HMA_COMMON_METHOD, 1, AML_SERIALIZED);
+    hmam_mem = aml_local(6);
+    hmam_out_buf = aml_local(7);
+
+    aml_append(method, aml_store(aml_name(HMAM_ACPI_MEM_ADDR), hmam_mem));
+
+    /* map _HMA memory and IO into ACPI namespace. */
+    aml_append(method, aml_operation_region(HMAM_IOPORT, AML_SYSTEM_IO,
+               aml_int(HMAM_ACPI_IO_BASE), HMAM_ACPI_IO_LEN));
+    aml_append(method, aml_operation_region(HMAM_MEMORY,
+               AML_SYSTEM_MEMORY, hmam_mem, HMAM_MEMORY_SIZE));
+
+    /*
+     * _HMAC notifier:
+     * HMAM_NOTIFY: write the address of the DSM memory and notify QEMU
+     *              to emulate the access.
+     *
+     * It is an IO port, so accessing it causes a VM exit and control
+     * is transferred to QEMU.
+     */
+    field = aml_field(HMAM_IOPORT, AML_DWORD_ACC, AML_NOLOCK,
+                      AML_PRESERVE);
+    aml_append(field, aml_named_field(HMAM_NOTIFY,
+               sizeof(uint32_t) * BITS_PER_BYTE));
+    aml_append(method, field);
+
+    /*
+     * _HMAC input:
+     * HMAM_OFFSET: stores the current offset into the _HMA buffer.
+     *
+     * It is a RAM mapping on the host, so these accesses never cause a VM exit.
+     */
+    field = aml_field(HMAM_MEMORY, AML_DWORD_ACC, AML_NOLOCK,
+                      AML_PRESERVE);
+    aml_append(field, aml_named_field(HMAM_OFFSET,
+               sizeof(typeof_field(HmatHmamIn, offset)) * BITS_PER_BYTE));
+    aml_append(method, field);
+
+    /*
+     * _HMAC output:
+     * HMAM_OUT_BUF_SIZE: the size of the buffer filled by QEMU.
+     * HMAM_OUT_BUF: the buffer QEMU uses to store the result.
+     *
+     * Since the page is reused for both input and output, the input data
+     * is lost once the new result is stored into ODAT, so we must fetch
+     * all the input data before writing the result.
+     */
+    field = aml_field(HMAM_MEMORY, AML_DWORD_ACC, AML_NOLOCK,
+                      AML_PRESERVE);
+    aml_append(field, aml_named_field(HMAM_OUT_BUF_SIZE,
+               sizeof(typeof_field(HmatHmamOut, len)) * BITS_PER_BYTE));
+    aml_append(field, aml_named_field(HMAM_OUT_BUF,
+       (sizeof(HmatHmamOut) - sizeof(uint32_t)) * BITS_PER_BYTE));
+    aml_append(method, field);
+
+    /*
+     * do not support any method if HMA memory address has not been
+     * patched.
+     */
+    unsupport = aml_if(aml_equal(hmam_mem, aml_int(0x0)));
+    byte_list[0] = HMAM_RET_STATUS_UNSUPPORT;
+    aml_append(unsupport, aml_return(aml_buffer(1, byte_list)));
+    aml_append(method, unsupport);
+
+    /* The parameter (Arg0) of _HMAC is a package which contains a buffer. */
+    pckg = aml_arg(0);
+    ifctx = aml_if(aml_and(aml_equal(aml_object_type(pckg),
+                   aml_int(4 /* Package */)) /* It is a Package? */,
+                   aml_equal(aml_sizeof(pckg), aml_int(1)) /* 1 element */,
+                   NULL));
+
+    pckg_index = aml_local(2);
+    pckg_buf = aml_local(3);
+    aml_append(ifctx, aml_store(aml_index(pckg, aml_int(0)), pckg_index));
+    aml_append(ifctx, aml_store(aml_derefof(pckg_index), pckg_buf));
+    aml_append(ifctx, aml_store(pckg_buf, aml_name(HMAM_OFFSET)));
+    aml_append(method, ifctx);
+
+    /*
+     * tell QEMU about the real address of HMA memory, then QEMU
+     * gets the control and fills the result in _HMAC memory.
+     */
+    aml_append(method, aml_store(hmam_mem, aml_name(HMAM_NOTIFY)));
+
+    hmam_out_buf_size = aml_local(1);
+    /* RLEN is not included in the payload returned to guest. */
+    aml_append(method, aml_subtract(aml_name(HMAM_OUT_BUF_SIZE),
+                                aml_int(4), hmam_out_buf_size));
+    aml_append(method, aml_store(aml_shiftleft(hmam_out_buf_size, aml_int(3)),
+                                 hmam_out_buf_size));
+    aml_append(method, aml_create_field(aml_name(HMAM_OUT_BUF),
+                                aml_int(0), hmam_out_buf_size, "OBUF"));
+    aml_append(method, aml_concatenate(aml_buffer(0, NULL), aml_name("OBUF"),
+                                hmam_out_buf));
+    aml_append(method, aml_return(hmam_out_buf));
+    aml_append(dev, method);
+}
+
+void hmat_init_acpi_state(AcpiHmaState *state, MemoryRegion *io,
+                          FWCfgState *fw_cfg, Object *owner)
+{
+    memory_region_init_io(&state->io_mr, owner, &hmat_hma_method_ops, state,
+                          "hma-acpi-io", HMAM_ACPI_IO_LEN);
+    memory_region_add_subregion(io, HMAM_ACPI_IO_BASE, &state->io_mr);
+
+    state->hmam_mem = g_array_new(false, true /* clear */, 1);
+    fw_cfg_add_file(fw_cfg, HMAM_MEM_FILE, state->hmam_mem->data,
+                    state->hmam_mem->len);
+
+    hmat_init_hma_buffer(&state->hma_buf);
+}
+
+void hmat_update(MachineState *ms)
+{
+    /* build HMAT in a given buffer. */
+    hmat_build_hma_buffer(ms);
+}
+
 void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
 {
     uint64_t hmat_start, hmat_len;
@@ -276,3 +541,34 @@ void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
                  (void *)(table_data->data + hmat_start),
                  "HMAT", hmat_len, 1, NULL, NULL);
 }
+
+void hmat_build_aml(Aml *dev)
+{
+    Aml *method, *pkg, *buf, *buf_name, *buf_size, *call_result;
+
+    hmat_build_common_aml(dev);
+
+    buf = aml_local(0);
+    buf_size = aml_local(1);
+    buf_name = aml_local(2);
+
+    aml_append(dev, aml_name_decl(HMAM_RHMA_STATUS, aml_int(0)));
+
+    /* build helper function, RHMA. */
+    method = aml_method("RHMA", 1, AML_SERIALIZED);
+    aml_append(method, aml_name_decl("OFST", aml_int(0)));
+
+    /* prepare input package. */
+    pkg = aml_package(1);
+    aml_append(method, aml_store(aml_arg(0), aml_name("OFST")));
+    aml_append(pkg, aml_name("OFST"));
+
+    /* call Read HMA function. */
+    call_result = aml_call1(HMA_COMMON_METHOD, pkg);
+
+    build_acpi_aml_common(method, buf, buf_size,
+                          call_result, buf_name, dev,
+                          "RHMA", "_HMA",
+                          HMAM_RET_STATUS_SUCCESS,
+                          HMAM_RET_STATUS_HMA_CHANGED);
+}
diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
index 8f563f19dd..7b24a3327f 100644
--- a/hw/acpi/hmat.h
+++ b/hw/acpi/hmat.h
@@ -102,6 +102,78 @@ struct HMAT_Cache_Info {
     uint16_t    num_smbios_handles;
 };
 
+#define HMAM_MEMORY_SIZE    4096
+#define HMAM_MEM_FILE       "etc/acpi/hma-mem"
+
+/*
+ * A 32-bit IO port starting at 0x0a19 in the guest is reserved for
+ * HMA ACPI emulation.
+ */
+#define HMAM_ACPI_IO_BASE     0x0a19
+#define HMAM_ACPI_IO_LEN      4
+
+#define HMAM_ACPI_MEM_ADDR  "HMTA"
+#define HMAM_MEMORY         "HRAM"
+#define HMAM_IOPORT         "HPIO"
+
+#define HMAM_NOTIFY         "NTFI"
+#define HMAM_OUT_BUF_SIZE   "RLEN"
+#define HMAM_OUT_BUF        "ODAT"
+
+#define HMAM_RHMA_STATUS    "RSTA"
+#define HMA_COMMON_METHOD   "HMAC"
+#define HMAM_OFFSET         "OFFT"
+
+#define HMAM_RET_STATUS_SUCCESS        0 /* Success */
+#define HMAM_RET_STATUS_UNSUPPORT      1 /* Not Supported */
+#define HMAM_RET_STATUS_INVALID        2 /* Invalid Input Parameters */
+#define HMAM_RET_STATUS_HMA_CHANGED    0x100 /* HMA Changed */
+
+/*
+ * HmatHmaBuffer:
+ * @hma: HMA buffer with the updated HMAT. It is updated when
+ *   the memory device is plugged or unplugged.
+ * @dirty: It allows OSPM to detect changes and restart read if there is any.
+ */
+struct HmatHmaBuffer {
+    GArray *hma;
+    bool dirty;
+};
+typedef struct HmatHmaBuffer HmatHmaBuffer;
+
+struct AcpiHmaState {
+    /* detect if HMA support is enabled. */
+    bool is_enabled;
+
+    /* the data of the fw_cfg file HMAM_MEM_FILE. */
+    GArray *hmam_mem;
+
+    HmatHmaBuffer hma_buf;
+
+    /* the IO region used by OSPM to transfer control to QEMU. */
+    MemoryRegion io_mr;
+};
+
+typedef struct AcpiHmaState AcpiHmaState;
+
+struct HmatHmamIn {
+    /* the offset in the _HMA buffer */
+    uint32_t offset;
+} QEMU_PACKED;
+typedef struct HmatHmamIn HmatHmamIn;
+
+struct HmatHmamOut {
+    /* the size of buffer filled by QEMU. */
+    uint32_t len;
+    uint32_t ret_status;   /* return status code. */
+    uint8_t data[4088];
+} QEMU_PACKED;
+typedef struct HmatHmamOut HmatHmamOut;
+
 void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms);
+void hmat_build_aml(Aml *dsdt);
+void hmat_init_acpi_state(AcpiHmaState *state, MemoryRegion *io,
+                          FWCfgState *fw_cfg, Object *owner);
+void hmat_update(MachineState *ms);
 
 #endif
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 90bebb8d3a..f4a6dc5b2e 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -23,6 +23,7 @@
 #include "sysemu/qtest.h"
 #include "hw/pci/pci.h"
 #include "hw/mem/nvdimm.h"
+#include "hw/acpi/hmat.h"
 
 GlobalProperty hw_compat_4_0[] = {};
 const size_t hw_compat_4_0_len = G_N_ELEMENTS(hw_compat_4_0);
@@ -859,6 +860,7 @@ static void machine_initfn(Object *obj)
 
     if (mc->numa_supported) {
         ms->numa_state = g_new0(NumaState, 1);
+        ms->acpi_hma_state = g_new0(AcpiHmaState, 1);
     } else {
         ms->numa_state = NULL;
     }
@@ -883,6 +885,7 @@ static void machine_finalize(Object *obj)
     g_free(ms->device_memory);
     g_free(ms->nvdimms_state);
     g_free(ms->numa_state);
+    g_free(ms->acpi_hma_state);
 }
 
 bool machine_usb(MachineState *machine)
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index d3d8c93631..d869c5ae7b 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1844,6 +1844,8 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
         build_q35_pci0_int(dsdt);
     }
 
+    hmat_build_aml(dsdt);
+
     if (pcmc->legacy_cpu_hotplug) {
         build_legacy_cpu_hotplug_aml(dsdt, machine, pm->cpu_hp_io_base);
     } else {
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 1c7b2a97bc..3021375144 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -77,6 +77,7 @@
 #include "hw/i386/intel_iommu.h"
 #include "hw/net/ne2000-isa.h"
 #include "standard-headers/asm-x86/bootparam.h"
+#include "hw/acpi/hmat.h"
 
 /* debug PC/ISA interrupts */
 //#define DEBUG_IRQ
@@ -2130,6 +2131,8 @@ static void pc_memory_plug(HotplugHandler *hotplug_dev,
         nvdimm_plug(ms->nvdimms_state);
     }
 
+    hmat_update(ms);
+
     hotplug_handler_plug(HOTPLUG_HANDLER(pcms->acpi_dev), dev, &error_abort);
 out:
     error_propagate(errp, local_err);
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index c07c4a5b38..966d98d619 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -58,6 +58,7 @@
 #include "migration/misc.h"
 #include "kvm_i386.h"
 #include "sysemu/numa.h"
+#include "hw/acpi/hmat.h"
 
 #define MAX_IDE_BUS 2
 
@@ -301,6 +302,9 @@ static void pc_init1(MachineState *machine,
         nvdimm_init_acpi_state(machine->nvdimms_state, system_io,
                                pcms->fw_cfg, OBJECT(pcms));
     }
+
+    hmat_init_acpi_state(machine->acpi_hma_state, system_io,
+                         pcms->fw_cfg, OBJECT(pcms));
 }
 
 /* Looking for a pc_compat_2_4() function? It doesn't exist.
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 37dd350511..610b10467a 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -54,6 +54,7 @@
 #include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "sysemu/numa.h"
+#include "hw/acpi/hmat.h"
 
 /* ICH9 AHCI has 6 ports */
 #define MAX_SATA_PORTS     6
@@ -333,6 +334,9 @@ static void pc_q35_init(MachineState *machine)
         nvdimm_init_acpi_state(machine->nvdimms_state, system_io,
                                pcms->fw_cfg, OBJECT(pcms));
     }
+
+    hmat_init_acpi_state(machine->acpi_hma_state, system_io,
+                         pcms->fw_cfg, OBJECT(pcms));
 }
 
 #define DEFINE_Q35_MACHINE(suffix, name, compatfn, optionfn) \
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 8609f923d9..e8d94a69b5 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -315,6 +315,7 @@ struct MachineState {
     CPUArchIdList *possible_cpus;
     struct NVDIMMState *nvdimms_state;
     NumaState *numa_state;
+    AcpiHmaState *acpi_hma_state;
 };
 
 #define DEFINE_MACHINE(namestr, machine_initfn) \
diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
index d971f5109e..a207cc1f88 100644
--- a/include/qemu/typedefs.h
+++ b/include/qemu/typedefs.h
@@ -5,6 +5,7 @@
    pull in all the real definitions.  */
 
 /* Please keep this list in case-insensitive alphabetical order */
+typedef struct AcpiHmaState AcpiHmaState;
 typedef struct AdapterInfo AdapterInfo;
 typedef struct AddressSpace AddressSpace;
 typedef struct AioContext AioContext;
-- 
2.17.1
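
For what it's worth, one way to inspect the result from a Linux guest is the
usual acpica-tools flow (not part of this series, only a hint):

    acpidump -n HMAT > hmat.dump
    acpixtract -a hmat.dump      # extracts hmat.dat
    iasl -d hmat.dat             # disassembles to hmat.dsl

    # the _HMA method itself lands in the DSDT
    acpidump -n DSDT > dsdt.dump && acpixtract -a dsdt.dump && iasl -d dsdt.dat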



^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 01/11] numa: move numa global variable nb_numa_nodes into MachineState
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 01/11] numa: move numa global variable nb_numa_nodes into MachineState Tao Xu
@ 2019-05-23 13:04   ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-05-23 13:04 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, pbonzini, rth, ehabkost

On Wed,  8 May 2019 14:17:16 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> The aim of this patch is to add struct NumaState in MachineState
> and move existing numa global nb_numa_nodes(renamed as "num_nodes")
> into NumaState. And add variable numa_support into MachineClass to
> decide which submachines support NUMA.

patch looks fine to me (modulo minor comments to be addressed/answered).

> 
> Reviewed-by: Liu Jingqi <jingqi.liu@intel.com>
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - send the patch together with HMAT patches
> 
> Changes in v3 -> v2:
>     - rename the "NumaState::nb_numa_nodes" as "NumaState::num_nodes"
>     (Eduardo)
>     - use machine_num_numa_nodes(MachineState *ms) to check if
>     ms->numa_state is NULL before using NumaState::num_nodes (Eduardo)
>     - check if ms->numa_state == NULL in the set_numa_options to avoid
>     using -numa in a machine-type which don't support numa
> 
> Changes in v2:
>     - fix the mistake in numa_complete_configuration in numa.c
>     - add MachineState into some functions to avoid using
>     qdev_get_machine
>     - add some if experssion to avoid the NumaState is null
> ---
>  exec.c                              |  5 ++-
>  hw/acpi/aml-build.c                 |  3 +-
>  hw/arm/boot.c                       |  2 ++
>  hw/arm/virt-acpi-build.c            |  8 +++--
>  hw/arm/virt.c                       |  5 ++-
>  hw/core/machine.c                   | 21 ++++++++---
>  hw/i386/acpi-build.c                |  2 +-
>  hw/i386/pc.c                        |  7 +++-
>  hw/mem/pc-dimm.c                    |  2 ++
>  hw/pci-bridge/pci_expander_bridge.c |  2 ++
>  hw/ppc/spapr.c                      | 12 ++++++-
>  include/hw/acpi/aml-build.h         |  2 +-
>  include/hw/boards.h                 | 10 ++++++
>  include/sysemu/numa.h               |  3 +-
>  monitor.c                           |  4 ++-
>  numa.c                              | 54 ++++++++++++++++++-----------
>  16 files changed, 105 insertions(+), 37 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 4e734770c2..c7eb4af42d 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1733,6 +1733,7 @@ long qemu_minrampagesize(void)
>      long hpsize = LONG_MAX;
>      long mainrampagesize;
>      Object *memdev_root;
> +    MachineState *ms = MACHINE(qdev_get_machine());
>  
>      mainrampagesize = qemu_mempath_getpagesize(mem_path);
>  
> @@ -1760,7 +1761,9 @@ long qemu_minrampagesize(void)
>       * so if its page size is smaller we have got to report that size instead.
>       */
>      if (hpsize > mainrampagesize &&
> -        (nb_numa_nodes == 0 || numa_info[0].node_memdev == NULL)) {
> +        (ms->numa_state == NULL ||
> +         ms->numa_state->num_nodes == 0 ||
> +         numa_info[0].node_memdev == NULL)) {
>          static bool warned;
>          if (!warned) {
>              error_report("Huge page support disabled (n/a for main memory).");
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index 555c24f21d..c67f4561a4 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1726,10 +1726,11 @@ void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
>   * ACPI spec 5.2.17 System Locality Distance Information Table
>   * (Revision 2.0 or later)
>   */
> -void build_slit(GArray *table_data, BIOSLinker *linker)
> +void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms)
>  {
>      int slit_start, i, j;
>      slit_start = table_data->len;
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      acpi_data_push(table_data, sizeof(AcpiTableHeader));
>  
> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> index a830655e1a..8ff08814fd 100644
> --- a/hw/arm/boot.c
> +++ b/hw/arm/boot.c
> @@ -532,6 +532,8 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
>      hwaddr mem_base, mem_len;
>      char **node_path;
>      Error *err = NULL;
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
Instead of calling qdev_get_machine() here, I suggest adding an
nb_numa_nodes field to arm_boot_info and making the board that cares
about NUMA (virt) set it to the configured value.
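
Something along these lines, as a rough sketch (field name and exact placement
are just assumptions):

    /* arm_boot_info: hypothetical new field */
    struct arm_boot_info {
        /* ... existing fields ... */
        int nb_numa_nodes;   /* set by NUMA-aware boards (virt), 0 otherwise */
    };

    /* hw/arm/virt.c, before arm_load_kernel() */
    vms->bootinfo.nb_numa_nodes = machine_num_numa_nodes(MACHINE(vms));

    /* hw/arm/boot.c, arm_load_dtb() */
    int nb_numa_nodes = binfo->nb_numa_nodes;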

>  
>      if (binfo->dtb_filename) {
>          char *filename;
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index bf9c0bc2f4..6805b4de51 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -516,7 +516,9 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      int i, srat_start;
>      uint64_t mem_base;
>      MachineClass *mc = MACHINE_GET_CLASS(vms);
> -    const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(MACHINE(vms));
> +    MachineState *ms = MACHINE(vms);
> +    const CPUArchIdList *cpu_list = mc->possible_cpu_arch_ids(ms);
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      srat_start = table_data->len;
>      srat = acpi_data_push(table_data, sizeof(*srat));
> @@ -780,6 +782,8 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
>      GArray *table_offsets;
>      unsigned dsdt, xsdt;
>      GArray *tables_blob = tables->table_data;
> +    MachineState *ms = MACHINE(vms);
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      table_offsets = g_array_new(false, true /* clear */,
>                                          sizeof(uint32_t));
> @@ -813,7 +817,7 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
>          build_srat(tables_blob, tables->linker, vms);
>          if (have_numa_distance) {
>              acpi_add_table(table_offsets, tables_blob);
> -            build_slit(tables_blob, tables->linker);
> +            build_slit(tables_blob, tables->linker, ms);
>          }
>      }
>  
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 16ba67f7a7..70954b658d 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -195,6 +195,8 @@ static bool cpu_type_valid(const char *cpu)
>  
>  static void create_fdt(VirtMachineState *vms)
>  {
> +    MachineState *ms = MACHINE(vms);
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>      void *fdt = create_device_tree(&vms->fdt_size);
>  
>      if (!fdt) {
> @@ -1780,7 +1782,7 @@ virt_cpu_index_to_props(MachineState *ms, unsigned cpu_index)
>  
>  static int64_t virt_get_default_cpu_node_id(const MachineState *ms, int idx)
>  {
> -    return idx % nb_numa_nodes;
> +    return idx % machine_num_numa_nodes(ms);
>  }
>  
>  static const CPUArchIdList *virt_possible_cpu_arch_ids(MachineState *ms)
> @@ -1886,6 +1888,7 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
>      mc->kvm_type = virt_kvm_type;
>      assert(!mc->get_hotplug_handler);
>      mc->get_hotplug_handler = virt_machine_get_hotplug_handler;
> +    mc->numa_supported = true;
>      hc->plug = virt_machine_device_plug_cb;
>  }
>  
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 5d046a43e3..90bebb8d3a 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -857,6 +857,11 @@ static void machine_initfn(Object *obj)
>                                          NULL);
>      }
>  
> +    if (mc->numa_supported) {
> +        ms->numa_state = g_new0(NumaState, 1);
> +    } else {

> +        ms->numa_state = NULL;
It's not necessary; all QOM objects are zero-initialized on allocation.


> +    }
>  
>      /* Register notifier when init is done for sysbus sanity checks */
>      ms->sysbus_notifier.notify = machine_init_notify;
> @@ -877,6 +882,7 @@ static void machine_finalize(Object *obj)
>      g_free(ms->firmware);
>      g_free(ms->device_memory);
>      g_free(ms->nvdimms_state);
> +    g_free(ms->numa_state);
>  }
>  
>  bool machine_usb(MachineState *machine)
> @@ -919,6 +925,11 @@ bool machine_mem_merge(MachineState *machine)
>      return machine->mem_merge;
>  }
>  
> +int machine_num_numa_nodes(const MachineState *machine)
> +{
> +    return machine->numa_state ? machine->numa_state->num_nodes : 0;
> +}
The wrapper looks unnecessary; I'd drop it and use
  machine->numa_state->num_nodes
directly at call sites.


>  static char *cpu_slot_to_string(const CPUArchId *cpu)
>  {
>      GString *s = g_string_new(NULL);
> @@ -948,7 +959,7 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>      const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
>  
> -    assert(nb_numa_nodes);
> +    assert(machine_num_numa_nodes(machine));
>      for (i = 0; i < possible_cpus->len; i++) {
>          if (possible_cpus->cpus[i].props.has_node_id) {
>              break;
> @@ -994,9 +1005,11 @@ void machine_run_board_init(MachineState *machine)
>  {
>      MachineClass *machine_class = MACHINE_GET_CLASS(machine);
>  
> -    numa_complete_configuration(machine);
> -    if (nb_numa_nodes) {
> -        machine_numa_finish_cpu_init(machine);
> +    if (machine_class->numa_supported) {
> +        numa_complete_configuration(machine);
> +        if (machine->numa_state->num_nodes) {
> +            machine_numa_finish_cpu_init(machine);
> +        }
>      }
>  
>      /* If the machine supports the valid_cpu_types check and the user
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 416da318ae..7d9bc88ac9 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2687,7 +2687,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
>          build_srat(tables_blob, tables->linker, machine);
>          if (have_numa_distance) {
>              acpi_add_table(table_offsets, tables_blob);
> -            build_slit(tables_blob, tables->linker);
> +            build_slit(tables_blob, tables->linker, machine);
>          }
>      }
>      if (acpi_get_mcfg(&mcfg)) {
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index d98b737b8f..6404ae508e 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -999,6 +999,8 @@ static FWCfgState *bochs_bios_init(AddressSpace *as, PCMachineState *pcms)
>      int i;
>      const CPUArchIdList *cpus;
>      MachineClass *mc = MACHINE_GET_CLASS(pcms);
> +    MachineState *ms = MACHINE(pcms);
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      fw_cfg = fw_cfg_init_io_dma(FW_CFG_IO_BASE, FW_CFG_IO_BASE + 4, as);
>      fw_cfg_add_i16(fw_cfg, FW_CFG_NB_CPUS, pcms->boot_cpus);
> @@ -1675,6 +1677,8 @@ void pc_machine_done(Notifier *notifier, void *data)
>  void pc_guest_info_init(PCMachineState *pcms)
>  {
>      int i;
> +    MachineState *ms = MACHINE(pcms);
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      pcms->apic_xrupt_override = kvm_allows_irq0_override();
>      pcms->numa_nodes = nb_numa_nodes;
> @@ -2658,7 +2662,7 @@ static int64_t pc_get_default_cpu_node_id(const MachineState *ms, int idx)
>     assert(idx < ms->possible_cpus->len);
>     x86_topo_ids_from_apicid(ms->possible_cpus->cpus[idx].arch_id,
>                              smp_cores, smp_threads, &topo);
> -   return topo.pkg_id % nb_numa_nodes;
> +   return topo.pkg_id % machine_num_numa_nodes(ms);
>  }
>  
>  static const CPUArchIdList *pc_possible_cpu_arch_ids(MachineState *ms)
> @@ -2752,6 +2756,7 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>      nc->nmi_monitor_handler = x86_nmi;
>      mc->default_cpu_type = TARGET_DEFAULT_CPU_TYPE;
>      mc->nvdimm_supported = true;
> +    mc->numa_supported = true;
>  
>      object_class_property_add(oc, PC_MACHINE_DEVMEM_REGION_SIZE, "int",
>          pc_machine_get_device_memory_region_size, NULL,
> diff --git a/hw/mem/pc-dimm.c b/hw/mem/pc-dimm.c
> index 152400b1fc..48cbd53e6b 100644
> --- a/hw/mem/pc-dimm.c
> +++ b/hw/mem/pc-dimm.c
> @@ -160,6 +160,8 @@ static void pc_dimm_realize(DeviceState *dev, Error **errp)
>  {
>      PCDIMMDevice *dimm = PC_DIMM(dev);
>      PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      if (!dimm->hostmem) {
>          error_setg(errp, "'" PC_DIMM_MEMDEV_PROP "' property is not set");
> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index e62de4218f..d0590c0973 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -217,6 +217,8 @@ static void pxb_dev_realize_common(PCIDevice *dev, bool pcie, Error **errp)
>      PCIBus *bus;
>      const char *dev_name = NULL;
>      Error *local_err = NULL;
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      if (pxb->numa_node != NUMA_NODE_UNASSIGNED &&
>          pxb->numa_node >= nb_numa_nodes) {
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 2ef3ce4362..4f0a8d4e2e 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -290,6 +290,8 @@ static int spapr_fixup_cpu_dt(void *fdt, SpaprMachineState *spapr)
>      CPUState *cs;
>      char cpu_model[32];
>      uint32_t pft_size_prop[] = {0, cpu_to_be32(spapr->htab_shift)};
> +    MachineState *ms = MACHINE(spapr);
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      CPU_FOREACH(cs) {
>          PowerPCCPU *cpu = POWERPC_CPU(cs);
> @@ -344,6 +346,7 @@ static int spapr_fixup_cpu_dt(void *fdt, SpaprMachineState *spapr)
>  
>  static hwaddr spapr_node0_size(MachineState *machine)
>  {
> +    int nb_numa_nodes = machine_num_numa_nodes(machine);
>      if (nb_numa_nodes) {
>          int i;
>          for (i = 0; i < nb_numa_nodes; ++i) {
> @@ -390,6 +393,7 @@ static int spapr_populate_memory_node(void *fdt, int nodeid, hwaddr start,
>  static int spapr_populate_memory(SpaprMachineState *spapr, void *fdt)
>  {
>      MachineState *machine = MACHINE(spapr);
> +    int nb_numa_nodes = machine_num_numa_nodes(machine);
>      hwaddr mem_start, node_size;
>      int i, nb_nodes = nb_numa_nodes;
>      NodeInfo *nodes = numa_info;
> @@ -444,6 +448,8 @@ static void spapr_populate_cpu_dt(CPUState *cs, void *fdt, int offset,
>      PowerPCCPU *cpu = POWERPC_CPU(cs);
>      CPUPPCState *env = &cpu->env;
>      PowerPCCPUClass *pcc = POWERPC_CPU_GET_CLASS(cs);
> +    MachineState *ms = MACHINE(spapr);
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>      int index = spapr_get_vcpu_id(cpu);
>      uint32_t segs[] = {cpu_to_be32(28), cpu_to_be32(40),
>                         0xffffffff, 0xffffffff};
> @@ -849,6 +855,7 @@ static int spapr_populate_drmem_v1(SpaprMachineState *spapr, void *fdt,
>  static int spapr_populate_drconf_memory(SpaprMachineState *spapr, void *fdt)
>  {
>      MachineState *machine = MACHINE(spapr);
> +    int nb_numa_nodes = machine_num_numa_nodes(machine);
>      int ret, i, offset;
>      uint64_t lmb_size = SPAPR_MEMORY_BLOCK_SIZE;
>      uint32_t prop_lmb_size[] = {0, cpu_to_be32(lmb_size)};
> @@ -1693,6 +1700,7 @@ static void spapr_machine_reset(void)
>  {
>      MachineState *machine = MACHINE(qdev_get_machine());
>      SpaprMachineState *spapr = SPAPR_MACHINE(machine);
> +    int nb_numa_nodes = machine_num_numa_nodes(machine);
>      PowerPCCPU *first_ppc_cpu;
>      uint32_t rtas_limit;
>      hwaddr rtas_addr, fdt_addr;
> @@ -2509,6 +2517,7 @@ static void spapr_create_lmb_dr_connectors(SpaprMachineState *spapr)
>  static void spapr_validate_node_memory(MachineState *machine, Error **errp)
>  {
>      int i;
> +    int nb_numa_nodes = machine_num_numa_nodes(machine);
>  
>      if (machine->ram_size % SPAPR_MEMORY_BLOCK_SIZE) {
>          error_setg(errp, "Memory size 0x" RAM_ADDR_FMT
> @@ -4111,7 +4120,7 @@ spapr_cpu_index_to_props(MachineState *machine, unsigned cpu_index)
>  
>  static int64_t spapr_get_default_cpu_node_id(const MachineState *ms, int idx)
>  {
> -    return idx / smp_cores % nb_numa_nodes;
> +    return idx / smp_cores % machine_num_numa_nodes(ms);
>  }
>  
>  static const CPUArchIdList *spapr_possible_cpu_arch_ids(MachineState *machine)
> @@ -4315,6 +4324,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>      smc->update_dt_enabled = true;
>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power9_v2.0");
>      mc->has_hotpluggable_cpus = true;
> +    mc->numa_supported = true;
>      smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;
>      fwc->get_dev_path = spapr_get_fw_dev_path;
>      nc->nmi_monitor_handler = spapr_nmi;
> diff --git a/include/hw/acpi/aml-build.h b/include/hw/acpi/aml-build.h
> index 1a563ad756..991cf05134 100644
> --- a/include/hw/acpi/aml-build.h
> +++ b/include/hw/acpi/aml-build.h
> @@ -414,7 +414,7 @@ build_append_gas_from_struct(GArray *table, const struct AcpiGenericAddress *s)
>  void build_srat_memory(AcpiSratMemoryAffinity *numamem, uint64_t base,
>                         uint64_t len, int node, MemoryAffinityFlags flags);
>  
> -void build_slit(GArray *table_data, BIOSLinker *linker);
> +void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms);
>  
>  void build_fadt(GArray *tbl, BIOSLinker *linker, const AcpiFadtData *f,
>                  const char *oem_id, const char *oem_table_id);
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 6f7916f88f..5f102e3075 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -5,6 +5,7 @@
>  
>  #include "sysemu/blockdev.h"
>  #include "sysemu/accel.h"
> +#include "sysemu/sysemu.h"
Why is it here?

>  #include "hw/qdev.h"
>  #include "qom/object.h"
>  #include "qom/cpu.h"
> @@ -68,6 +69,7 @@ int machine_kvm_shadow_mem(MachineState *machine);
>  int machine_phandle_start(MachineState *machine);
>  bool machine_dump_guest_core(MachineState *machine);
>  bool machine_mem_merge(MachineState *machine);
> +int machine_num_numa_nodes(const MachineState *machine);
>  HotpluggableCPUList *machine_query_hotpluggable_cpus(MachineState *machine);
>  void machine_set_cpu_numa_node(MachineState *machine,
>                                 const CpuInstanceProperties *props,
> @@ -210,6 +212,7 @@ struct MachineClass {
>      bool ignore_boot_device_suffixes;
>      bool smbus_no_migration_support;
>      bool nvdimm_supported;
> +    bool numa_supported;
>  
>      HotplugHandler *(*get_hotplug_handler)(MachineState *machine,
>                                             DeviceState *dev);
> @@ -230,6 +233,12 @@ typedef struct DeviceMemoryState {
>      MemoryRegion mr;
>  } DeviceMemoryState;
>  
> +typedef struct NumaState {
> +    /* Number of NUMA nodes */
> +    int num_nodes;
> +
> +} NumaState;
> +
>  /**
>   * MachineState:
>   */
> @@ -273,6 +282,7 @@ struct MachineState {
>      AccelState *accelerator;
>      CPUArchIdList *possible_cpus;
>      struct NVDIMMState *nvdimms_state;
> +    NumaState *numa_state;
>  };
>  
>  #define DEFINE_MACHINE(namestr, machine_initfn) \
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index b6ac7de43e..a55e2be563 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -6,7 +6,6 @@
>  #include "sysemu/hostmem.h"
>  #include "hw/boards.h"
>  
> -extern int nb_numa_nodes;   /* Number of NUMA nodes */
>  extern bool have_numa_distance;
>  
>  struct NodeInfo {
> @@ -24,7 +23,7 @@ struct NumaNodeMem {
>  extern NodeInfo numa_info[MAX_NODES];
>  void parse_numa_opts(MachineState *ms);
>  void numa_complete_configuration(MachineState *ms);
> -void query_numa_node_mem(NumaNodeMem node_mem[]);
> +void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms);
>  extern QemuOptsList qemu_numa_opts;
>  void numa_legacy_auto_assign_ram(MachineClass *mc, NodeInfo *nodes,
>                                   int nb_nodes, ram_addr_t size);
> diff --git a/monitor.c b/monitor.c
> index bb48997913..28ea45a731 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -1926,11 +1926,13 @@ static void hmp_info_numa(Monitor *mon, const QDict *qdict)
>      int i;
>      NumaNodeMem *node_mem;
>      CpuInfoList *cpu_list, *cpu;
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      cpu_list = qmp_query_cpus(&error_abort);
>      node_mem = g_new0(NumaNodeMem, nb_numa_nodes);
>  
> -    query_numa_node_mem(node_mem);
> +    query_numa_node_mem(node_mem, ms);
>      monitor_printf(mon, "%d nodes\n", nb_numa_nodes);
>      for (i = 0; i < nb_numa_nodes; i++) {
>          monitor_printf(mon, "node %d cpus:", i);
> diff --git a/numa.c b/numa.c
> index 3875e1efda..343fcaf13f 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -52,7 +52,6 @@ static int have_memdevs = -1;
>  static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
>                               * For all nodes, nodeid < max_numa_nodeid
>                               */
> -int nb_numa_nodes;
>  bool have_numa_distance;
>  NodeInfo numa_info[MAX_NODES];
>  
> @@ -68,7 +67,7 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>      if (node->has_nodeid) {
>          nodenr = node->nodeid;
>      } else {
> -        nodenr = nb_numa_nodes;
> +        nodenr = machine_num_numa_nodes(ms);
>      }
>  
>      if (nodenr >= MAX_NODES) {
> @@ -136,10 +135,11 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>      }
>      numa_info[nodenr].present = true;
>      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> -    nb_numa_nodes++;
> +    ms->numa_state->num_nodes++;
>  }
>  
> -static void parse_numa_distance(NumaDistOptions *dist, Error **errp)
> +static
> +void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
>  {
>      uint16_t src = dist->src;
>      uint16_t dst = dist->dst;
> @@ -179,6 +179,11 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
>  {
>      Error *err = NULL;
>  
> +    if (ms->numa_state == NULL) {
I'd use MachineClass::numa_supported here instead.

> +        error_setg(errp, "NUMA is not supported by this machine-type");
> +        goto end;
> +    }
> +
>      switch (object->type) {
>      case NUMA_OPTIONS_TYPE_NODE:
>          parse_numa_node(ms, &object->u.node, &err);
> @@ -187,7 +192,7 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
>          }
>          break;
>      case NUMA_OPTIONS_TYPE_DIST:
> -        parse_numa_distance(&object->u.dist, &err);
> +        parse_numa_distance(ms, &object->u.dist, &err);
>          if (err) {
>              goto end;
>          }
> @@ -252,10 +257,11 @@ end:
>   * distance from a node to itself is always NUMA_DISTANCE_MIN,
>   * so providing it is never necessary.
>   */
> -static void validate_numa_distance(void)
> +static void validate_numa_distance(MachineState *ms)
>  {
>      int src, dst;
>      bool is_asymmetrical = false;
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      for (src = 0; src < nb_numa_nodes; src++) {
>          for (dst = src; dst < nb_numa_nodes; dst++) {
> @@ -293,9 +299,10 @@ static void validate_numa_distance(void)
>      }
>  }
>  
> -static void complete_init_numa_distance(void)
> +static void complete_init_numa_distance(MachineState *ms)
>  {
>      int src, dst;
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      /* Fixup NUMA distance by symmetric policy because if it is an
>       * asymmetric distance table, it should be a complete table and
> @@ -369,7 +376,7 @@ void numa_complete_configuration(MachineState *ms)
>       *
>       * Enable NUMA implicitly by adding a new NUMA node automatically.
>       */
> -    if (ms->ram_slots > 0 && nb_numa_nodes == 0 &&
> +    if (ms->ram_slots > 0 && ms->numa_state->num_nodes == 0 &&
>          mc->auto_enable_numa_with_memhp) {
>              NumaNodeOptions node = { };
>              parse_numa_node(ms, &node, &error_abort);
> @@ -387,30 +394,33 @@ void numa_complete_configuration(MachineState *ms)
>      }
>  
>      /* This must be always true if all nodes are present: */
> -    assert(nb_numa_nodes == max_numa_nodeid);
> +    assert(ms->numa_state->num_nodes == max_numa_nodeid);
>  
> -    if (nb_numa_nodes > 0) {
> +    if (ms->numa_state->num_nodes > 0) {
>          uint64_t numa_total;
>  
> -        if (nb_numa_nodes > MAX_NODES) {
> -            nb_numa_nodes = MAX_NODES;
> +        if (ms->numa_state->num_nodes > MAX_NODES) {
> +            ms->numa_state->num_nodes = MAX_NODES;
>          }
>  
>          /* If no memory size is given for any node, assume the default case
>           * and distribute the available memory equally across all nodes
>           */
> -        for (i = 0; i < nb_numa_nodes; i++) {
> +        for (i = 0; i < ms->numa_state->num_nodes; i++) {
>              if (numa_info[i].node_mem != 0) {
>                  break;
>              }
>          }
> -        if (i == nb_numa_nodes) {
> +        if (i == ms->numa_state->num_nodes) {
>              assert(mc->numa_auto_assign_ram);
> -            mc->numa_auto_assign_ram(mc, numa_info, nb_numa_nodes, ram_size);
> +            mc->numa_auto_assign_ram(mc,
> +                                     numa_info,
> +                                     ms->numa_state->num_nodes,
> +                                     ram_size);
>          }
>  
>          numa_total = 0;
> -        for (i = 0; i < nb_numa_nodes; i++) {
> +        for (i = 0; i < ms->numa_state->num_nodes; i++) {
>              numa_total += numa_info[i].node_mem;
>          }
>          if (numa_total != ram_size) {
> @@ -434,10 +444,10 @@ void numa_complete_configuration(MachineState *ms)
>           */
>          if (have_numa_distance) {
>              /* Validate enough NUMA distance information was provided. */
> -            validate_numa_distance();
> +            validate_numa_distance(ms);
>  
>              /* Validation succeeded, now fill in any missing distances. */
> -            complete_init_numa_distance();
> +            complete_init_numa_distance(ms);
>          }
>      }
>  }
> @@ -513,6 +523,8 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
>  {
>      uint64_t addr = 0;
>      int i;
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    int nb_numa_nodes = machine_num_numa_nodes(ms);
>  
>      if (nb_numa_nodes == 0 || !have_memdevs) {
>          allocate_system_memory_nonnuma(mr, owner, name, ram_size);
> @@ -578,16 +590,16 @@ static void numa_stat_memory_devices(NumaNodeMem node_mem[])
>      qapi_free_MemoryDeviceInfoList(info_list);
>  }
>  
> -void query_numa_node_mem(NumaNodeMem node_mem[])
> +void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms)
>  {
>      int i;
>  
> -    if (nb_numa_nodes <= 0) {
> +    if (ms->numa_state == NULL || ms->numa_state->num_nodes <= 0) {
>          return;
>      }
>  
>      numa_stat_memory_devices(node_mem);
> -    for (i = 0; i < nb_numa_nodes; i++) {
> +    for (i = 0; i < ms->numa_state->num_nodes; i++) {
>          node_mem[i].node_mem += numa_info[i].node_mem;
>      }
>  }



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 02/11] numa: move numa global variable have_numa_distance into MachineState
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 02/11] numa: move numa global variable have_numa_distance " Tao Xu
@ 2019-05-23 13:07   ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-05-23 13:07 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Wed,  8 May 2019 14:17:17 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> The aim of this patch is to move existing numa global have_numa_distance
> into NumaState.

s/The aim of this patch is to//

> Reviewed-by: Liu Jingqi <jingqi.liu@intel.com>
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---

Reviewed-by: Igor Mammedov <imammedo@redhat.com>

> Changes in v4 -> v3:
>     - send the patch together with HMAT patches
> ---
>  hw/arm/virt-acpi-build.c | 2 +-
>  hw/arm/virt.c            | 2 +-
>  hw/i386/acpi-build.c     | 2 +-
>  include/hw/boards.h      | 2 ++
>  include/sysemu/numa.h    | 2 --
>  numa.c                   | 5 ++---
>  6 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 6805b4de51..65f070843c 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -815,7 +815,7 @@ void virt_acpi_build(VirtMachineState *vms, AcpiBuildTables *tables)
>      if (nb_numa_nodes > 0) {
>          acpi_add_table(table_offsets, tables_blob);
>          build_srat(tables_blob, tables->linker, vms);
> -        if (have_numa_distance) {
> +        if (ms->numa_state->have_numa_distance) {
>              acpi_add_table(table_offsets, tables_blob);
>              build_slit(tables_blob, tables->linker, ms);
>          }
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 70954b658d..f0818ef597 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -228,7 +228,7 @@ static void create_fdt(VirtMachineState *vms)
>                                  "clk24mhz");
>      qemu_fdt_setprop_cell(fdt, "/apb-pclk", "phandle", vms->clock_phandle);
>  
> -    if (have_numa_distance) {
> +    if (nb_numa_nodes > 0 && ms->numa_state->have_numa_distance) {
>          int size = nb_numa_nodes * nb_numa_nodes * 3 * sizeof(uint32_t);
>          uint32_t *matrix = g_malloc0(size);
>          int idx, i, j;
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 7d9bc88ac9..43a807c483 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2685,7 +2685,7 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
>      if (pcms->numa_nodes) {
>          acpi_add_table(table_offsets, tables_blob);
>          build_srat(tables_blob, tables->linker, machine);
> -        if (have_numa_distance) {
> +        if (machine->numa_state->have_numa_distance) {
>              acpi_add_table(table_offsets, tables_blob);
>              build_slit(tables_blob, tables->linker, machine);
>          }
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 5f102e3075..c3c678b7ff 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -237,6 +237,8 @@ typedef struct NumaState {
>      /* Number of NUMA nodes */
>      int num_nodes;
>  
> +    /* Allow setting NUMA distance for different NUMA nodes */
> +    bool have_numa_distance;
>  } NumaState;
>  
>  /**
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index a55e2be563..1a29408db9 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -6,8 +6,6 @@
>  #include "sysemu/hostmem.h"
>  #include "hw/boards.h"
>  
> -extern bool have_numa_distance;
> -
>  struct NodeInfo {
>      uint64_t node_mem;
>      struct HostMemoryBackend *node_memdev;
> diff --git a/numa.c b/numa.c
> index 343fcaf13f..d4f5ff5193 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -52,7 +52,6 @@ static int have_memdevs = -1;
>  static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
>                               * For all nodes, nodeid < max_numa_nodeid
>                               */
> -bool have_numa_distance;
>  NodeInfo numa_info[MAX_NODES];
>  
>  
> @@ -171,7 +170,7 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
>      }
>  
>      numa_info[src].distance[dst] = val;
> -    have_numa_distance = true;
> +    ms->numa_state->have_numa_distance = true;
>  }
>  
>  static
> @@ -442,7 +441,7 @@ void numa_complete_configuration(MachineState *ms)
>           * asymmetric. In this case, the distances for both directions
>           * of all node pairs are required.
>           */
> -        if (have_numa_distance) {
> +        if (ms->numa_state->have_numa_distance) {
>              /* Validate enough NUMA distance information was provided. */
>              validate_numa_distance(ms);
>  



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] numa: move numa global variable numa_info into MachineState
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 03/11] numa: move numa global variable numa_info " Tao Xu
@ 2019-05-23 13:47   ` Igor Mammedov
  2019-05-28  7:43     ` Tao Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2019-05-23 13:47 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost,
	qemu-ppc, pbonzini, david, rth

On Wed,  8 May 2019 14:17:18 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> The aim of this patch is to move existing numa global numa_info
> (renamed as "nodes") into NumaState.

s/The aim of this patch is to //

there is a repeated pattern you use in these patches

  ms->numa_state ? ms->numa_state->FOO : NULL

which might not be justified; plain use of

  ms->numa_state->FOO

would be sufficient.
The places where NULL could be used are probably broken
and should be fixed by not dereferencing
ms->numa_state in the first place.
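
For illustration, a minimal sketch of the simplification (using ->nodes as an
example member):

  /* before: */
  NodeInfo *numa_info = ms->numa_state ? ms->numa_state->nodes : NULL;
  /* after, in paths that can only run with numa_state set: */
  NodeInfo *numa_info = ms->numa_state->nodes;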
   

> 
> Reviewed-by: Liu Jingqi <jingqi.liu@intel.com>
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - send the patch together with HMAT patches
> 
> Changes in v3 -> v2:
>     - rename the "NumaState::numa_info" as "NumaState::nodes" (Eduardo)
> ---
>  exec.c                   |  2 +-
>  hw/acpi/aml-build.c      |  6 ++++--
>  hw/arm/boot.c            |  2 +-
>  hw/arm/virt-acpi-build.c |  7 ++++---
>  hw/arm/virt.c            |  1 +
>  hw/i386/pc.c             |  4 ++--
>  hw/ppc/spapr.c           |  8 +++++++-
>  hw/ppc/spapr_pci.c       |  2 ++
>  include/hw/boards.h      | 10 ++++++++++
>  include/sysemu/numa.h    |  8 --------
>  numa.c                   | 15 +++++++++------
>  11 files changed, 41 insertions(+), 24 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index c7eb4af42d..0e30926588 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1763,7 +1763,7 @@ long qemu_minrampagesize(void)
>      if (hpsize > mainrampagesize &&
>          (ms->numa_state == NULL ||
>           ms->numa_state->num_nodes == 0 ||
> -         numa_info[0].node_memdev == NULL)) {
> +         ms->numa_state->nodes[0].node_memdev == NULL)) {
>          static bool warned;
>          if (!warned) {
>              error_report("Huge page support disabled (n/a for main memory).");
> diff --git a/hw/acpi/aml-build.c b/hw/acpi/aml-build.c
> index c67f4561a4..b53a55cb56 100644
> --- a/hw/acpi/aml-build.c
> +++ b/hw/acpi/aml-build.c
> @@ -1737,8 +1737,10 @@ void build_slit(GArray *table_data, BIOSLinker *linker, MachineState *ms)
>      build_append_int_noprefix(table_data, nb_numa_nodes, 8);
>      for (i = 0; i < nb_numa_nodes; i++) {
>          for (j = 0; j < nb_numa_nodes; j++) {
> -            assert(numa_info[i].distance[j]);
> -            build_append_int_noprefix(table_data, numa_info[i].distance[j], 1);
> +            assert(ms->numa_state->nodes[i].distance[j]);
> +            build_append_int_noprefix(table_data,
> +                                      ms->numa_state->nodes[i].distance[j],
> +                                      1);
>          }
>      }
>  
> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
> index 8ff08814fd..845b737ab9 100644
> --- a/hw/arm/boot.c
> +++ b/hw/arm/boot.c
> @@ -602,7 +602,7 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
>      if (nb_numa_nodes > 0) {
>          mem_base = binfo->loader_start;
>          for (i = 0; i < nb_numa_nodes; i++) {
> -            mem_len = numa_info[i].node_mem;
> +            mem_len = ms->numa_state->nodes[i].node_mem;
in 1/11 I've suggested adding nb_numa_nodes, but it might be better to add
a pointer to MachineState there.
It would also help to simplify arm_load_dtb later, as there are other
bits that we copy to arm_boot_info from MachineState.
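
For example (rough, untested sketch; 'machine' would be a new back-pointer
field in arm_boot_info, it does not exist yet):

    binfo->machine = ms;                      /* set once in the board code */
    ...
    /* arm_load_dtb() then reads NUMA data directly instead of via copies */
    mem_len = binfo->machine->numa_state->nodes[i].node_mem;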

>              rc = fdt_add_memory_node(fdt, acells, mem_base,
>                                       scells, mem_len, i);
>              if (rc < 0) {
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 65f070843c..b22c3d27ad 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -535,11 +535,12 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>  
>      mem_base = vms->memmap[VIRT_MEM].base;
>      for (i = 0; i < nb_numa_nodes; ++i) {
> -        if (numa_info[i].node_mem > 0) {
> +        if (ms->numa_state->nodes[i].node_mem > 0) {
>              numamem = acpi_data_push(table_data, sizeof(*numamem));
> -            build_srat_memory(numamem, mem_base, numa_info[i].node_mem, i,
> +            build_srat_memory(numamem, mem_base,
> +                              ms->numa_state->nodes[i].node_mem, i,
>                                MEM_AFFINITY_ENABLED);
> -            mem_base += numa_info[i].node_mem;
> +            mem_base += ms->numa_state->nodes[i].node_mem;
>          }
>      }
>  
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index f0818ef597..853caf606f 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -232,6 +232,7 @@ static void create_fdt(VirtMachineState *vms)
>          int size = nb_numa_nodes * nb_numa_nodes * 3 * sizeof(uint32_t);
>          uint32_t *matrix = g_malloc0(size);
>          int idx, i, j;
> +        NodeInfo *numa_info = ms->numa_state->nodes;
>  
>          for (i = 0; i < nb_numa_nodes; i++) {
>              for (j = 0; j < nb_numa_nodes; j++) {
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 6404ae508e..1c7b2a97bc 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1043,7 +1043,7 @@ static FWCfgState *bochs_bios_init(AddressSpace *as, PCMachineState *pcms)
>      }
>      for (i = 0; i < nb_numa_nodes; i++) {
>          numa_fw_cfg[pcms->apic_id_limit + 1 + i] =
> -            cpu_to_le64(numa_info[i].node_mem);
> +            cpu_to_le64(ms->numa_state->nodes[i].node_mem);
>      }
>      fw_cfg_add_bytes(fw_cfg, FW_CFG_NUMA, numa_fw_cfg,
>                       (1 + pcms->apic_id_limit + nb_numa_nodes) *
> @@ -1685,7 +1685,7 @@ void pc_guest_info_init(PCMachineState *pcms)
>      pcms->node_mem = g_malloc0(pcms->numa_nodes *
>                                      sizeof *pcms->node_mem);
>      for (i = 0; i < nb_numa_nodes; i++) {
> -        pcms->node_mem[i] = numa_info[i].node_mem;
> +        pcms->node_mem[i] = ms->numa_state->nodes[i].node_mem;
>      }
>  
>      pcms->machine_done.notify = pc_machine_done;
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 4f0a8d4e2e..d577c2025e 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -349,6 +349,7 @@ static hwaddr spapr_node0_size(MachineState *machine)
>      int nb_numa_nodes = machine_num_numa_nodes(machine);
>      if (nb_numa_nodes) {
>          int i;
> +        NodeInfo *numa_info = machine->numa_state->nodes;
>          for (i = 0; i < nb_numa_nodes; ++i) {
>              if (numa_info[i].node_mem) {
>                  return MIN(pow2floor(numa_info[i].node_mem),
> @@ -396,7 +397,9 @@ static int spapr_populate_memory(SpaprMachineState *spapr, void *fdt)
>      int nb_numa_nodes = machine_num_numa_nodes(machine);
>      hwaddr mem_start, node_size;
>      int i, nb_nodes = nb_numa_nodes;
> -    NodeInfo *nodes = numa_info;
> +    NodeInfo *nodes = machine->numa_state ?

can machine->numa_state actually be NULL?

> +                      machine->numa_state->nodes :
> +                      NULL;
>      NodeInfo ramnode;
>  
>      /* No NUMA nodes, assume there is just one node with whole RAM */
> @@ -2518,6 +2521,9 @@ static void spapr_validate_node_memory(MachineState *machine, Error **errp)
>  {
>      int i;
>      int nb_numa_nodes = machine_num_numa_nodes(machine);
> +    NodeInfo *numa_info = machine->numa_state ?
ditto

> +                          machine->numa_state->nodes :
> +                          NULL;

Also a question to PPC folks:
  spapr_validate_node_memory()
seems to be 'broken' in the 1 implicit node case, since
spapr_populate_memory() 'creates' implicit node info
temporarily but then spapr_validate_node_memory() would use
the global nb_numa_nodes, which is 0, and skip the check.

>  
>      if (machine->ram_size % SPAPR_MEMORY_BLOCK_SIZE) {
>          error_setg(errp, "Memory size 0x" RAM_ADDR_FMT
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 97961b0128..f4e5c0f5b2 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -1660,6 +1660,8 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
>      SysBusDevice *s = SYS_BUS_DEVICE(dev);
>      SpaprPhbState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
>      PCIHostState *phb = PCI_HOST_BRIDGE(s);
> +    MachineState *ms = MACHINE(spapr);
> +    NodeInfo *numa_info = ms->numa_state ? ms->numa_state->nodes : NULL;
>      char *namebuf;
>      int i;
>      PCIBus *bus;
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index c3c678b7ff..777eed4dd9 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -233,12 +233,22 @@ typedef struct DeviceMemoryState {
>      MemoryRegion mr;
>  } DeviceMemoryState;
>  
> +struct NodeInfo {
> +    uint64_t node_mem;
> +    struct HostMemoryBackend *node_memdev;
> +    bool present;
> +    uint8_t distance[MAX_NODES];
> +};
> +
>  typedef struct NumaState {
>      /* Number of NUMA nodes */
>      int num_nodes;
>  
>      /* Allow setting NUMA distance for different NUMA nodes */
>      bool have_numa_distance;
> +
> +    /* NUMA nodes information */
> +    NodeInfo nodes[MAX_NODES];
>  } NumaState;
>  
>  /**
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index 1a29408db9..7b8011f9ea 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -6,19 +6,11 @@
>  #include "sysemu/hostmem.h"
>  #include "hw/boards.h"
>  
> -struct NodeInfo {
> -    uint64_t node_mem;
> -    struct HostMemoryBackend *node_memdev;
> -    bool present;
> -    uint8_t distance[MAX_NODES];
> -};
> -
>  struct NumaNodeMem {
>      uint64_t node_mem;
>      uint64_t node_plugged_mem;
>  };
>  
> -extern NodeInfo numa_info[MAX_NODES];
>  void parse_numa_opts(MachineState *ms);
>  void numa_complete_configuration(MachineState *ms);
>  void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms);
> diff --git a/numa.c b/numa.c
> index d4f5ff5193..ddea376d72 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -52,8 +52,6 @@ static int have_memdevs = -1;
>  static int max_numa_nodeid; /* Highest specified NUMA node ID, plus one.
>                               * For all nodes, nodeid < max_numa_nodeid
>                               */
> -NodeInfo numa_info[MAX_NODES];
> -
>  
>  static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>                              Error **errp)
> @@ -62,6 +60,7 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>      uint16_t nodenr;
>      uint16List *cpus = NULL;
>      MachineClass *mc = MACHINE_GET_CLASS(ms);
> +    NodeInfo *numa_info = ms->numa_state->nodes;
>  
>      if (node->has_nodeid) {
>          nodenr = node->nodeid;
> @@ -143,6 +142,7 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
>      uint16_t src = dist->src;
>      uint16_t dst = dist->dst;
>      uint8_t val = dist->val;
> +    NodeInfo *numa_info = ms->numa_state->nodes;
>  
>      if (src >= MAX_NODES || dst >= MAX_NODES) {
>          error_setg(errp, "Parameter '%s' expects an integer between 0 and %d",
> @@ -201,7 +201,7 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
>              error_setg(&err, "Missing mandatory node-id property");
>              goto end;
>          }
> -        if (!numa_info[object->u.cpu.node_id].present) {
> +        if (!ms->numa_state->nodes[object->u.cpu.node_id].present) {
>              error_setg(&err, "Invalid node-id=%" PRId64 ", NUMA node must be "
>                  "defined with -numa node,nodeid=ID before it's used with "
>                  "-numa cpu,node-id=ID", object->u.cpu.node_id);
> @@ -261,6 +261,7 @@ static void validate_numa_distance(MachineState *ms)
>      int src, dst;
>      bool is_asymmetrical = false;
>      int nb_numa_nodes = machine_num_numa_nodes(ms);
> +    NodeInfo *numa_info = ms->numa_state->nodes;
>  
>      for (src = 0; src < nb_numa_nodes; src++) {
>          for (dst = src; dst < nb_numa_nodes; dst++) {
> @@ -302,6 +303,7 @@ static void complete_init_numa_distance(MachineState *ms)
>  {
>      int src, dst;
>      int nb_numa_nodes = machine_num_numa_nodes(ms);
> +    NodeInfo *numa_info = ms->numa_state->nodes;
>  
>      /* Fixup NUMA distance by symmetric policy because if it is an
>       * asymmetric distance table, it should be a complete table and
> @@ -361,6 +363,7 @@ void numa_complete_configuration(MachineState *ms)
>  {
>      int i;
>      MachineClass *mc = MACHINE_GET_CLASS(ms);
> +    NodeInfo *numa_info = ms->numa_state->nodes;
>  
>      /*
>       * If memory hotplug is enabled (slots > 0) but without '-numa'
> @@ -532,8 +535,8 @@ void memory_region_allocate_system_memory(MemoryRegion *mr, Object *owner,
>  
>      memory_region_init(mr, owner, name, ram_size);
>      for (i = 0; i < nb_numa_nodes; i++) {
> -        uint64_t size = numa_info[i].node_mem;
> -        HostMemoryBackend *backend = numa_info[i].node_memdev;
> +        uint64_t size = ms->numa_state->nodes[i].node_mem;
> +        HostMemoryBackend *backend = ms->numa_state->nodes[i].node_memdev;
>          if (!backend) {
>              continue;
>          }
> @@ -599,7 +602,7 @@ void query_numa_node_mem(NumaNodeMem node_mem[], MachineState *ms)
>  
>      numa_stat_memory_devices(node_mem);
>      for (i = 0; i < ms->numa_state->num_nodes; i++) {
> -        node_mem[i].node_mem += numa_info[i].node_mem;
> +        node_mem[i].node_mem += ms->numa_state->nodes[i].node_mem;
>      }
>  }
>  



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 04/11] acpi: introduce AcpiDeviceIfClass.build_mem_ranges hook
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 04/11] acpi: introduce AcpiDeviceIfClass.build_mem_ranges hook Tao Xu
@ 2019-05-24 12:35   ` Igor Mammedov
  2019-06-06  5:15     ` Tao Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2019-05-24 12:35 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Wed,  8 May 2019 14:17:19 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> Add build_mem_ranges callback to AcpiDeviceIfClass and use
> it for generating SRAT and HMAT numa memory ranges.
> 
> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> Co-developed-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - spilt the 1/8 of v3 patch into two patches, 4/13 introduces
>     build_mem_ranges() and adding it to ACPI interface, 5/13 builds
>     HMAT (Igor)
> ---
>  hw/acpi/piix4.c                      |   1 +
>  hw/i386/acpi-build.c                 | 116 ++++++++++++++++-----------
>  hw/isa/lpc_ich9.c                    |   1 +
>  include/hw/acpi/acpi_dev_interface.h |   3 +
>  include/hw/boards.h                  |  12 +++
>  include/hw/i386/pc.h                 |   1 +
>  stubs/Makefile.objs                  |   1 +
>  stubs/pc_build_mem_ranges.c          |   6 ++
>  8 files changed, 96 insertions(+), 45 deletions(-)
>  create mode 100644 stubs/pc_build_mem_ranges.c
> 
> diff --git a/hw/acpi/piix4.c b/hw/acpi/piix4.c
> index 9c079d6834..7c320a49b2 100644
> --- a/hw/acpi/piix4.c
> +++ b/hw/acpi/piix4.c
> @@ -723,6 +723,7 @@ static void piix4_pm_class_init(ObjectClass *klass, void *data)
>      adevc->ospm_status = piix4_ospm_status;
>      adevc->send_event = piix4_send_gpe;
>      adevc->madt_cpu = pc_madt_cpu_entry;
> +    adevc->build_mem_ranges = pc_build_mem_ranges;
>  }
>  
>  static const TypeInfo piix4_pm_info = {
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 43a807c483..5598e7f780 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -2271,6 +2271,65 @@ build_tpm2(GArray *table_data, BIOSLinker *linker, GArray *tcpalog)
>  #define HOLE_640K_START  (640 * KiB)
>  #define HOLE_640K_END   (1 * MiB)
>  
> +void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *ms)
> +{
> +    uint64_t mem_len, mem_base, next_base;
> +    int i;
> +    PCMachineState *pcms = PC_MACHINE(ms);
> +    /*
> +     * the memory map is a bit tricky, it contains at least one hole
> +     * from 640k-1M and possibly another one from 3.5G-4G.
> +     */
> +    NumaMemRange *mem_ranges = ms->numa_state->mem_ranges;
> +    ms->numa_state->mem_ranges_num = 0;
> +    next_base = 0;
> +
> +    for (i = 0; i < pcms->numa_nodes; ++i) {
> +        mem_base = next_base;
> +        mem_len = pcms->node_mem[i];
> +        next_base = mem_base + mem_len;
> +
> +        /* Cut out the 640K hole */
> +        if (mem_base <= HOLE_640K_START &&
> +            next_base > HOLE_640K_START) {
> +            mem_len -= next_base - HOLE_640K_START;
> +            if (mem_len > 0) {
> +                mem_ranges[ms->numa_state->mem_ranges_num].base = mem_base;
> +                mem_ranges[ms->numa_state->mem_ranges_num].length = mem_len;
> +                mem_ranges[ms->numa_state->mem_ranges_num].node = i;
> +                ms->numa_state->mem_ranges_num++;
> +            }
> +
> +            /* Check for the rare case: 640K < RAM < 1M */
> +            if (next_base <= HOLE_640K_END) {
> +                next_base = HOLE_640K_END;
> +                continue;
> +            }
> +            mem_base = HOLE_640K_END;
> +            mem_len = next_base - HOLE_640K_END;
> +        }
> +
> +        /* Cut out the ACPI_PCI hole */
> +        if (mem_base <= pcms->below_4g_mem_size &&
> +            next_base > pcms->below_4g_mem_size) {
> +            mem_len -= next_base - pcms->below_4g_mem_size;
> +            if (mem_len > 0) {
> +                mem_ranges[ms->numa_state->mem_ranges_num].base = mem_base;
> +                mem_ranges[ms->numa_state->mem_ranges_num].length = mem_len;
> +                mem_ranges[ms->numa_state->mem_ranges_num].node = i;
> +                ms->numa_state->mem_ranges_num++;
> +            }
> +            mem_base = 1ULL << 32;
> +            mem_len = next_base - pcms->below_4g_mem_size;
> +            next_base = mem_base + mem_len;
> +        }


> +        mem_ranges[ms->numa_state->mem_ranges_num].base = mem_base;
> +        mem_ranges[ms->numa_state->mem_ranges_num].length = mem_len;
> +        mem_ranges[ms->numa_state->mem_ranges_num].node = i;
> +        ms->numa_state->mem_ranges_num++;

why did you drop the 'if (mem_len > 0) {' check that was in the original code?

> +
> +}
> +
>  static void
>  build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>  {
> @@ -2279,10 +2338,13 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>  
>      int i;
>      int srat_start, numa_start, slots;
> -    uint64_t mem_len, mem_base, next_base;
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>      const CPUArchIdList *apic_ids = mc->possible_cpu_arch_ids(machine);
>      PCMachineState *pcms = PC_MACHINE(machine);
> +    AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
> +    AcpiDeviceIf *adev = ACPI_DEVICE_IF(pcms->acpi_dev);
> +    uint32_t mem_ranges_num = machine->numa_state->mem_ranges_num;
> +    NumaMemRange *mem_ranges = machine->numa_state->mem_ranges;
>      ram_addr_t hotplugabble_address_space_size =
>          object_property_get_int(OBJECT(pcms), PC_MACHINE_DEVMEM_REGION_SIZE,
>                                  NULL);
> @@ -2319,57 +2381,21 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>          }
>      }
>  
> +    if (pcms->numa_nodes && !mem_ranges_num) {
> +        adevc->build_mem_ranges(adev, machine);
> +    }
>  
> -    /* the memory map is a bit tricky, it contains at least one hole
> -     * from 640k-1M and possibly another one from 3.5G-4G.
> -     */
> -    next_base = 0;
>      numa_start = table_data->len;
>  
> -    for (i = 1; i < pcms->numa_nodes + 1; ++i) {
> -        mem_base = next_base;
> -        mem_len = pcms->node_mem[i - 1];
> -        next_base = mem_base + mem_len;
> -
> -        /* Cut out the 640K hole */
> -        if (mem_base <= HOLE_640K_START &&
> -            next_base > HOLE_640K_START) {
> -            mem_len -= next_base - HOLE_640K_START;
> -            if (mem_len > 0) {
> +    for (i = 0; i < mem_ranges_num; i++) {
> +        if (mem_ranges[i].length > 0) {
>                  numamem = acpi_data_push(table_data, sizeof *numamem);
> -                build_srat_memory(numamem, mem_base, mem_len, i - 1,
> +            build_srat_memory(numamem, mem_ranges[i].base,
> +                              mem_ranges[i].length,
> +                              mem_ranges[i].node,
>                                    MEM_AFFINITY_ENABLED);
>              }
> -
> -            /* Check for the rare case: 640K < RAM < 1M */
> -            if (next_base <= HOLE_640K_END) {
> -                next_base = HOLE_640K_END;
> -                continue;
>              }
> -            mem_base = HOLE_640K_END;
> -            mem_len = next_base - HOLE_640K_END;
> -        }
> -
> -        /* Cut out the ACPI_PCI hole */
> -        if (mem_base <= pcms->below_4g_mem_size &&
> -            next_base > pcms->below_4g_mem_size) {
> -            mem_len -= next_base - pcms->below_4g_mem_size;
> -            if (mem_len > 0) {
> -                numamem = acpi_data_push(table_data, sizeof *numamem);
> -                build_srat_memory(numamem, mem_base, mem_len, i - 1,
> -                                  MEM_AFFINITY_ENABLED);
> -            }
> -            mem_base = 1ULL << 32;
> -            mem_len = next_base - pcms->below_4g_mem_size;
> -            next_base = mem_base + mem_len;
> -        }
> -
> -        if (mem_len > 0) {
> -            numamem = acpi_data_push(table_data, sizeof *numamem);
> -            build_srat_memory(numamem, mem_base, mem_len, i - 1,
> -                              MEM_AFFINITY_ENABLED);
> -        }
> -    }
>      slots = (table_data->len - numa_start) / sizeof *numamem;
>      for (; slots < pcms->numa_nodes + 2; slots++) {
>          numamem = acpi_data_push(table_data, sizeof *numamem);
> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> index ac44aa53be..4ae64846ba 100644
> --- a/hw/isa/lpc_ich9.c
> +++ b/hw/isa/lpc_ich9.c
> @@ -812,6 +812,7 @@ static void ich9_lpc_class_init(ObjectClass *klass, void *data)
>      adevc->ospm_status = ich9_pm_ospm_status;
>      adevc->send_event = ich9_send_gpe;
>      adevc->madt_cpu = pc_madt_cpu_entry;
> +    adevc->build_mem_ranges = pc_build_mem_ranges;
>  }
>  
>  static const TypeInfo ich9_lpc_info = {
> diff --git a/include/hw/acpi/acpi_dev_interface.h b/include/hw/acpi/acpi_dev_interface.h
> index 43ff119179..d8634ac1ed 100644
> --- a/include/hw/acpi/acpi_dev_interface.h
> +++ b/include/hw/acpi/acpi_dev_interface.h
> @@ -39,6 +39,7 @@ void acpi_send_event(DeviceState *dev, AcpiEventStatusBits event);
>   *           for CPU indexed by @uid in @apic_ids array,
>   *           returned structure types are:
>   *           0 - Local APIC, 9 - Local x2APIC, 0xB - GICC
> + * build_mem_ranges: build memory ranges of ACPI SRAT and HMAT

that's not exactly what it does; it only covers the above partially, leaving
out the misc and hotplug SRAT ranges.
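
e.g. something along these lines (wording is only a suggestion):

 * build_mem_ranges: build the guest RAM memory ranges (minus the 640K and
 *           PCI holes) for later reuse by SRAT/HMAT; hotplug and other
 *           SRAT entries are still built by the caller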

>   *
>   * Interface is designed for providing unified interface
>   * to generic ACPI functionality that could be used without
> @@ -54,5 +55,7 @@ typedef struct AcpiDeviceIfClass {
>      void (*send_event)(AcpiDeviceIf *adev, AcpiEventStatusBits ev);
>      void (*madt_cpu)(AcpiDeviceIf *adev, int uid,
>                       const CPUArchIdList *apic_ids, GArray *entry);
> +    void (*build_mem_ranges)(AcpiDeviceIf *adev, MachineState *ms);
> +
>  } AcpiDeviceIfClass;
>  #endif
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 777eed4dd9..9fbf921ecf 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -240,6 +240,12 @@ struct NodeInfo {
>      uint8_t distance[MAX_NODES];
>  };
>  
> +typedef struct NumaMemRange {
> +    uint64_t base;
> +    uint64_t length;
> +    uint32_t node;
> +} NumaMemRange;
> +
>  typedef struct NumaState {
>      /* Number of NUMA nodes */
>      int num_nodes;
> @@ -249,6 +255,12 @@ typedef struct NumaState {
>  
>      /* NUMA nodes information */
>      NodeInfo nodes[MAX_NODES];
> +
> +    /* Number of NUMA memory ranges */
> +    uint32_t mem_ranges_num;
> +
> +    /* NUMA memory ranges */
> +    NumaMemRange mem_ranges[MAX_NODES + 2];
why MAX_NODES + 2 ???
I'd use a GArray here instead of the 2 fields above
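
e.g. (untested sketch; 'range' stands for a filled-in NumaMemRange):

    /* in NumaState */
    GArray *mem_ranges;    /* of NumaMemRange */

    /* at build time */
    numa_state->mem_ranges = g_array_new(false, true, sizeof(NumaMemRange));
    g_array_append_val(numa_state->mem_ranges, range);
    /* the range count is then simply numa_state->mem_ranges->len */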

>  } NumaState;
>  
>  /**
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index 43df7230a2..1e4ee404ae 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -281,6 +281,7 @@ void pc_system_firmware_init(PCMachineState *pcms, MemoryRegion *rom_memory);
>  /* acpi-build.c */
>  void pc_madt_cpu_entry(AcpiDeviceIf *adev, int uid,
>                         const CPUArchIdList *apic_ids, GArray *entry);
> +void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *ms);
>  
>  /* e820 types */
>  #define E820_RAM        1
> diff --git a/stubs/Makefile.objs b/stubs/Makefile.objs
> index 269dfa5832..7e0a962815 100644
> --- a/stubs/Makefile.objs
> +++ b/stubs/Makefile.objs
> @@ -33,6 +33,7 @@ stub-obj-y += qmp_memory_device.o
>  stub-obj-y += target-monitor-defs.o
>  stub-obj-y += target-get-monitor-def.o
>  stub-obj-y += pc_madt_cpu_entry.o
> +stub-obj-y += pc_build_mem_ranges.o
>  stub-obj-y += vmgenid.o
>  stub-obj-y += xen-common.o
>  stub-obj-y += xen-hvm.o
> diff --git a/stubs/pc_build_mem_ranges.c b/stubs/pc_build_mem_ranges.c
> new file mode 100644
> index 0000000000..0f104ba79d
> --- /dev/null
> +++ b/stubs/pc_build_mem_ranges.c
> @@ -0,0 +1,6 @@
> +#include "qemu/osdep.h"
> +#include "hw/i386/pc.h"
> +
> +void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *machine)
> +{
> +}

why do you need a stub?




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 05/11] hmat acpi: Build Memory Subsystem Address Range Structure(s) in ACPI HMAT
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 05/11] hmat acpi: Build Memory Subsystem Address Range Structure(s) in ACPI HMAT Tao Xu
@ 2019-05-24 14:16   ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-05-24 14:16 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Wed,  8 May 2019 14:17:20 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> HMAT is defined in ACPI 6.2: 5.2.27 Heterogeneous Memory Attribute Table (HMAT).
> The specification references below link:
> http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> 
> It describes the memory attributes, such as memory side cache
> attributes and bandwidth and latency details, related to the
> System Physical Address (SPA) Memory Ranges. The software is
> expected to use this information as hint for optimization.
> 
> This structure describes the System Physical Address(SPA) range
> occupied by memory subsystem and its associativity with processor
> proximity domain as well as hint for memory usage.
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - spilt the 1/8 of v3 patch into two patches, 4/13 introduces
>     build_mem_ranges() and adding it to ACPI interface, 5/13 builds
>     HMAT (Igor)
>     - use MachineState instead of PCMachineState to build HMAT more
>     generalic (Igor)
>     - move hmat_build_spa() inside of hmat_build_hma() (Igor)
> ---
>  hw/acpi/Kconfig       |   5 ++
>  hw/acpi/Makefile.objs |   1 +
>  hw/acpi/hmat.c        | 135 ++++++++++++++++++++++++++++++++++++++++++
>  hw/acpi/hmat.h        |  43 ++++++++++++++
>  hw/i386/acpi-build.c  |  11 ++--
>  include/hw/boards.h   |   2 +
>  numa.c                |   6 ++
>  7 files changed, 199 insertions(+), 4 deletions(-)
>  create mode 100644 hw/acpi/hmat.c
>  create mode 100644 hw/acpi/hmat.h
> 
> diff --git a/hw/acpi/Kconfig b/hw/acpi/Kconfig
> index eca3beed75..074dbd5a42 100644
> --- a/hw/acpi/Kconfig
> +++ b/hw/acpi/Kconfig
> @@ -7,6 +7,7 @@ config ACPI_X86
>      select ACPI_NVDIMM
>      select ACPI_CPU_HOTPLUG
>      select ACPI_MEMORY_HOTPLUG
> +    select ACPI_HMAT
>  
>  config ACPI_X86_ICH
>      bool
> @@ -27,3 +28,7 @@ config ACPI_VMGENID
>      bool
>      default y
>      depends on PC
> +
> +config ACPI_HMAT
> +    bool
> +    depends on ACPI
> diff --git a/hw/acpi/Makefile.objs b/hw/acpi/Makefile.objs
> index 2d46e3789a..932ba42d13 100644
> --- a/hw/acpi/Makefile.objs
> +++ b/hw/acpi/Makefile.objs
> @@ -6,6 +6,7 @@ common-obj-$(CONFIG_ACPI_MEMORY_HOTPLUG) += memory_hotplug.o
>  common-obj-$(CONFIG_ACPI_CPU_HOTPLUG) += cpu.o
>  common-obj-$(CONFIG_ACPI_NVDIMM) += nvdimm.o
>  common-obj-$(CONFIG_ACPI_VMGENID) += vmgenid.o
> +common-obj-$(CONFIG_ACPI_HMAT) += hmat.o
>  common-obj-$(call lnot,$(CONFIG_ACPI_X86)) += acpi-stub.o
>  
>  common-obj-y += acpi_interface.o
> diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
> new file mode 100644
> index 0000000000..bffe453280
> --- /dev/null
> +++ b/hw/acpi/hmat.c
> @@ -0,0 +1,135 @@
> +/*
> + * HMAT ACPI Implementation
> + *
> + * Copyright(C) 2019 Intel Corporation.
> + *
> + * Author:
> + *  Liu jingqi <jingqi.liu@linux.intel.com>
> + *  Tao Xu <tao3.xu@intel.com>
> + *
> + * HMAT is defined in ACPI 6.2.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>
> + */
> +
> +#include "qemu/osdep.h"
> +#include "sysemu/numa.h"
> +#include "hw/i386/pc.h"
the table is generic, pls make the code generic too so it can be reused elsewhere

> +#include "hw/acpi/hmat.h"
> +#include "hw/nvram/fw_cfg.h"
why do you need this header?

> +
> +/* Build Memory Subsystem Address Range Structure */
when creating APIs that build ACPI spec primitives, pls add the
earliest version of the spec they are supported in and reference the
chapter/table in that spec version where they are described.

see hw/acpi/aml-build.c for examples:

a typical comment should look like:
 /* ACPI 1.0b: x.x.x.x chapter foo: Table y-y */

the point is that it should be trivial for a reader to find the spec
and grep for the referenced chapter/table in it by just copy-pasting
the description from the code.
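
e.g. for the structure built here, something like (exact chapter/table number
needs double-checking against the 6.2 spec):

 /* ACPI 6.2: 5.2.27.2 Memory Subsystem Address Range Structure */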

> +static void build_hmat_spa(GArray *table_data, MachineState *ms,
> +                           uint64_t base, uint64_t length, int node)
> +{
> +    uint16_t flags = 0;
> +
> +    if (ms->numa_state->nodes[node].is_initiator) {
you use only ms->numa_state->nodes from the machine state here;
I'd suggest passing is_initiator/is_target as arguments
so the API won't depend on the machine state
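
e.g. (sketch, untested; parameter names are only suggestions):

 static void build_hmat_spa(GArray *table_data, uint64_t base, uint64_t length,
                            int node, bool is_initiator, bool is_target)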

> +        flags |= HMAT_SPA_PROC_VALID;
> +    }
> +    if (ms->numa_state->nodes[node].is_target) {
> +        flags |= HMAT_SPA_MEM_VALID;
> +    }
> +
> +    /* Memory Subsystem Address Range Structure */
> +    /* Type */
> +    build_append_int_noprefix(table_data, 0, 2);
> +    /* Reserved */
> +    build_append_int_noprefix(table_data, 0, 2);
> +    /* Length */
> +    build_append_int_noprefix(table_data, 40, 4);
> +    /* Flags */
> +    build_append_int_noprefix(table_data, flags, 2);
> +    /* Reserved */
> +    build_append_int_noprefix(table_data, 0, 2);
> +    /* Process Proximity Domain */
> +    build_append_int_noprefix(table_data, node, 4);
> +    /* Memory Proximity Domain */
> +    build_append_int_noprefix(table_data, node, 4);
> +    /* Reserved */
> +    build_append_int_noprefix(table_data, 0, 4);
> +    /* System Physical Address Range Base */
> +    build_append_int_noprefix(table_data, base, 8);
> +    /* System Physical Address Range Length */
> +    build_append_int_noprefix(table_data, length, 8);
> +}
> +
> +static int pc_dimm_device_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> +        *list = g_slist_append(*list, DEVICE(obj));
> +    }

missing 'if (dev->realized)' check, see memory_device_build_list()
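
e.g. (untested sketch mirroring memory_device_build_list()):

     if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
         DeviceState *dev = DEVICE(obj);

         if (dev->realized) { /* only realized DIMMs matter */
             *list = g_slist_append(*list, dev);
         }
     }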

> +
> +    object_child_foreach(obj, pc_dimm_device_list, opaque);
> +    return 0;
> +}
> +
> +/*
> + * The Proximity Domain of System Physical Address ranges defined
> + * in the HMAT, NFIT and SRAT tables shall match each other.
> + */

where does this comment come from? (pointer to the spec pls)

> +static void hmat_build_hma(GArray *table_data, MachineState *ms)
where does _hma come from?
What you are building here is a "Memory Subsystem Address Range Structure",
so I'd rather use the acronym: msar

> +{
> +    GSList *device_list = NULL;
> +    uint64_t mem_base, mem_len;
> +    int i;
> +    uint32_t mem_ranges_num = ms->numa_state->mem_ranges_num;
> +    NumaMemRange *mem_ranges = ms->numa_state->mem_ranges;
> +
> +    PCMachineState *pcms = PC_MACHINE(ms);
> +    AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
> +    AcpiDeviceIf *adev = ACPI_DEVICE_IF(pcms->acpi_dev);
> +
> +    /* Build HMAT Memory Subsystem Address Range. */
> +    if (pcms->numa_nodes && !mem_ranges_num) {
well, you've just moved a bunch of numa globals into MachineState,
so why do you still use PCMachineState here, making the code depend on PCMachine?
I'd suggest making it machine agnostic if possible,
using MachineState instead.

With your refactoring the duplicated PCMachineState numa fields probably
aren't necessary anymore and should be removed.
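
e.g. (sketch) the check itself doesn't need PCMachineState at all:

     if (ms->numa_state->num_nodes && !mem_ranges_num) {
         adevc->build_mem_ranges(adev, ms);
     }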

> +        adevc->build_mem_ranges(adev, ms);
> +    }
> +
> +    for (i = 0; i < mem_ranges_num; i++) {
> +        build_hmat_spa(table_data, ms, mem_ranges[i].base,
> +                       mem_ranges[i].length,
> +                       mem_ranges[i].node);
> +    }
> +
> +    /* Build HMAT SPA structures for PC-DIMM devices. */
> +    object_child_foreach(qdev_get_machine(),
> +                         pc_dimm_device_list, &device_list);
> +
> +    for (; device_list; device_list = device_list->next) {
> +        PCDIMMDevice *dimm = device_list->data;
> +        mem_base = object_property_get_uint(OBJECT(dimm), PC_DIMM_ADDR_PROP,
> +                                            NULL);
> +        mem_len = object_property_get_uint(OBJECT(dimm), PC_DIMM_SIZE_PROP,
> +                                           NULL);
> +        i = object_property_get_uint(OBJECT(dimm), PC_DIMM_NODE_PROP, NULL);
> +        build_hmat_spa(table_data, ms, mem_base, mem_len, i);
> +    }
> +}
> +
> +void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
> +{
> +    uint64_t hmat_start, hmat_len;
> +
> +    hmat_start = table_data->len;

    +  /* reserve space for HMAT header  */

> +    acpi_data_push(table_data, 40);
> +
> +    hmat_build_hma(table_data, ms);
> +    hmat_len = table_data->len - hmat_start;
> +
> +    build_header(linker, table_data,
> +                 (void *)(table_data->data + hmat_start),
> +                 "HMAT", hmat_len, 1, NULL, NULL);

s/hmat_len/table_data->len - hmat_start/


> +}
> diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
> new file mode 100644
> index 0000000000..4f480c1e43
> --- /dev/null
> +++ b/hw/acpi/hmat.h
> @@ -0,0 +1,43 @@
> +/*
> + * HMAT ACPI Implementation Header
> + *
> + * Copyright(C) 2019 Intel Corporation.
> + *
> + * Author:
> + *  Liu jingqi <jingqi.liu@linux.intel.com>
> + *  Tao Xu <tao3.xu@intel.com>
> + *
> + * HMAT is defined in ACPI 6.2.
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>
> + */
> +
> +#ifndef HMAT_H
> +#define HMAT_H
> +
> +#include "hw/acpi/acpi-defs.h"
> +#include "hw/acpi/acpi.h"
> +#include "hw/acpi/bios-linker-loader.h"
> +#include "hw/acpi/aml-build.h"
> +
> +/* the values of AcpiHmatSpaRange flag */
> +enum {
> +    HMAT_SPA_PROC_VALID       = 0x1,
> +    HMAT_SPA_MEM_VALID        = 0x2,
> +    HMAT_SPA_RESERVATION_HINT = 0x4,
> +};
> +
> +void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms);

s/hmat_build_acpi/build_hmat/

> +
> +#endif
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index 5598e7f780..d3d8c93631 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -64,6 +64,7 @@
>  #include "hw/i386/intel_iommu.h"
>  
>  #include "hw/acpi/ipmi.h"
> +#include "hw/acpi/hmat.h"
>  
>  /* These are used to size the ACPI tables for -M pc-i440fx-1.7 and
>   * -M pc-i440fx-2.0.  Even if the actual amount of AML generated grows
> @@ -2389,13 +2390,13 @@ build_srat(GArray *table_data, BIOSLinker *linker, MachineState *machine)
>  
>      for (i = 0; i < mem_ranges_num; i++) {
>          if (mem_ranges[i].length > 0) {
> -                numamem = acpi_data_push(table_data, sizeof *numamem);
> +            numamem = acpi_data_push(table_data, sizeof *numamem);
>              build_srat_memory(numamem, mem_ranges[i].base,
>                                mem_ranges[i].length,
>                                mem_ranges[i].node,
> -                                  MEM_AFFINITY_ENABLED);
> -            }
> -            }
> +                              MEM_AFFINITY_ENABLED);
> +        }
> +    }

unrelated hunk, move it to the patch that introduced the wrongly
aligned lines in the first place

>      slots = (table_data->len - numa_start) / sizeof *numamem;
>      for (; slots < pcms->numa_nodes + 2; slots++) {
>          numamem = acpi_data_push(table_data, sizeof *numamem);
> @@ -2715,6 +2716,8 @@ void acpi_build(AcpiBuildTables *tables, MachineState *machine)
>              acpi_add_table(table_offsets, tables_blob);
>              build_slit(tables_blob, tables->linker, machine);
>          }
> +        acpi_add_table(table_offsets, tables_blob);
> +        hmat_build_acpi(tables_blob, tables->linker, machine);
>      }
>      if (acpi_get_mcfg(&mcfg)) {
>          acpi_add_table(table_offsets, tables_blob);
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 9fbf921ecf..d392634e08 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -237,6 +237,8 @@ struct NodeInfo {
>      uint64_t node_mem;
>      struct HostMemoryBackend *node_memdev;
>      bool present;
> +    bool is_initiator;
> +    bool is_target;
>      uint8_t distance[MAX_NODES];
>  };
>  
> diff --git a/numa.c b/numa.c
> index ddea376d72..71b0aee02a 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -102,6 +102,10 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>          }
>      }
>  
> +    if (node->cpus) {
> +        numa_info[nodenr].is_initiator = true;
> +    }
this only takes care of the legacy '-numa node,cpus=range' option;
you also need to add handling for the '-numa cpu' option.

probably the better place to take care of all cpu options at once
is machine_numa_finish_cpu_init().
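
e.g. a rough, untested sketch for machine_numa_finish_cpu_init():

     const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);

     for (i = 0; i < possible_cpus->len; i++) {
         const CpuInstanceProperties *props = &possible_cpus->cpus[i].props;

         if (props->has_node_id) {
             machine->numa_state->nodes[props->node_id].is_initiator = true;
         }
     }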


>      if (node->has_mem && node->has_memdev) {
>          error_setg(errp, "cannot specify both mem= and memdev=");
>          return;
> @@ -118,6 +122,7 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>  
>      if (node->has_mem) {
>          numa_info[nodenr].node_mem = node->mem;
> +        numa_info[nodenr].is_target = true;
>      }
>      if (node->has_memdev) {
>          Object *o;
> @@ -130,6 +135,7 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>          object_ref(o);
>          numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
>          numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
> +        numa_info[nodenr].is_target = true;
>      }
>      numa_info[nodenr].present = true;
>      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 03/11] numa: move numa global variable numa_info into MachineState
  2019-05-23 13:47   ` Igor Mammedov
@ 2019-05-28  7:43     ` Tao Xu
  0 siblings, 0 replies; 38+ messages in thread
From: Tao Xu @ 2019-05-28  7:43 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost,
	qemu-ppc, pbonzini, david, rth


On 23/05/2019 21:47, Igor Mammedov wrote:
> On Wed,  8 May 2019 14:17:18 +0800
> Tao Xu <tao3.xu@intel.com> wrote:
> 
...
>> diff --git a/hw/arm/boot.c b/hw/arm/boot.c
>> index 8ff08814fd..845b737ab9 100644
>> --- a/hw/arm/boot.c
>> +++ b/hw/arm/boot.c
>> @@ -602,7 +602,7 @@ int arm_load_dtb(hwaddr addr, const struct arm_boot_info *binfo,
>>       if (nb_numa_nodes > 0) {
>>           mem_base = binfo->loader_start;
>>           for (i = 0; i < nb_numa_nodes; i++) {
>> -            mem_len = numa_info[i].node_mem;
>> +            mem_len = ms->numa_state->nodes[i].node_mem;
> in 1/11 I've suggested to add nb_numa_nodes, but it might be to add
> a pointer to MachineState there.
> It would also help to simplify arm_load_dtb later as there are other
> bits that we copy to arm_boot_info from MachineState.
> 

Hi Igor,

Thank you for your review. I will simplify arm_load_dtb() in the next 
version of patch and improve the other issues.

Tao


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT)
  2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
@ 2019-05-31  4:55   ` Dan Williams
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 02/11] numa: move numa global variable have_numa_distance " Tao Xu
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Dan Williams @ 2019-05-31  4:55 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, Michael S. Tsirkin, jingqi.liu, linux-nvdimm,
	qemu-devel, Paolo Bonzini, Igor Mammedov, rth, eblake, ehabkost

On Tue, May 7, 2019 at 11:32 PM Tao Xu <tao3.xu@intel.com> wrote:
>
> This series of patches will build Heterogeneous Memory Attribute Table (HMAT)
> according to the command line. The ACPI HMAT describes the memory attributes,
> such as memory side cache attributes and bandwidth and latency details,
> related to the System Physical Address (SPA) Memory Ranges.
> The software is expected to use this information as hint for optimization.
>
> OSPM evaluates HMAT only during system initialization. Any changes to the HMAT
> state at runtime or information regarding HMAT for hot plug are communicated
> using the _HMA method.
[..]

Hi,

I gave these patches a try while developing support for the new EFI
v2.8 Specific Purpose Memory attribute [1]. I have a gap / feature
request to note in order to make this implementation capable of
emulating the BIOS implementations currently shipping on persistent
memory platforms.

The NUMA configuration I tested was:

        -numa node,mem=4G,cpus=0-19,nodeid=0
        -numa node,mem=4G,cpus=20-39,nodeid=1
        -numa node,mem=4G,nodeid=2
        -numa node,mem=4G,nodeid=3

...and it produced an entry like the following for proximity domain 2.

[0C8h 0200   2]               Structure Type : 0000 [Memory Proximity Domain Attributes]
[0CAh 0202   2]                     Reserved : 0000
[0CCh 0204   4]                       Length : 00000028
[0D0h 0208   2]        Flags (decoded below) : 0002
            Processor Proximity Domain Valid : 0
[0D2h 0210   2]                    Reserved1 : 0000
[0D4h 0212   4]   Processor Proximity Domain : 00000002
[0D8h 0216   4]      Memory Proximity Domain : 00000002
[0DCh 0220   4]                    Reserved2 : 00000000
[0E0h 0224   8]                    Reserved3 : 0000000240000000
[0E8h 0232   8]                    Reserved4 : 0000000100000000

Notice that the Processor "Proximity Domain Valid" bit is clear. I
understand that the implementation is keying off of whether cpus are
defined for that same node or not, but that's not how current
persistent memory platforms implement "Processor Proximity Domain". On
these platforms persistent memory indeed has its own proximity domain,
but the Processor Proximity Domain is expected to be assigned to the
domain that houses the memory controller for that persistent memory.
So to emulate that configuration it would be useful to have a way to
specify "Processor Proximity Domain" without needing to define CPUs in
that domain.

Something like:

        -numa node,mem=4G,cpus=0-19,nodeid=0
        -numa node,mem=4G,cpus=20-39,nodeid=1
        -numa node,mem=4G,nodeid=2,localnodeid=0
        -numa node,mem=4G,nodeid=3,localnodeid=1

...to specify that node2 memory is connected / local to node0 and
node3 memory is connected / local to node1. In general HMAT specifies
that all performance differentiated memory ranges have their own
proximity domain, but those are expected to still be associated with a
local/host/home-socket memory controller.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2019-May/021668.html

^ permalink raw reply	[flat|nested] 38+ messages in thread


* Re: [Qemu-devel] [PATCH v4 06/11] hmat acpi: Build System Locality Latency and Bandwidth Information Structure(s) in ACPI HMAT
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 06/11] hmat acpi: Build System Locality Latency and Bandwidth Information " Tao Xu
@ 2019-06-04 14:43   ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-06-04 14:43 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Wed,  8 May 2019 14:17:21 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> This structure describes the memory access latency and bandwidth
> information from various memory access initiator proximity domains.
> The latency and bandwidth numbers represented in this structure
> correspond to rated latency and bandwidth for the platform.
> The software could use this information as hint for optimization.
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - use build_append_int_noprefix() to build System Locality Latency
>     and Bandwidth Information Structure(s) tables (Igor)
>     - move globals (hmat_lb_info) into MachineState (Igor)
>     - move hmat_build_lb() inside of hmat_build_hma() (Igor)
> ---
>  hw/acpi/hmat.c          | 97 ++++++++++++++++++++++++++++++++++++++++-
>  hw/acpi/hmat.h          | 39 +++++++++++++++++
>  include/hw/boards.h     |  3 ++
>  include/qemu/typedefs.h |  1 +
>  include/sysemu/sysemu.h | 22 ++++++++++
>  5 files changed, 161 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
> index bffe453280..54aabf77eb 100644
> --- a/hw/acpi/hmat.c
> +++ b/hw/acpi/hmat.c
> @@ -29,6 +29,9 @@
>  #include "hw/acpi/hmat.h"
>  #include "hw/nvram/fw_cfg.h"
>  
> +static uint32_t initiator_pxm[MAX_NODES], target_pxm[MAX_NODES];
> +static uint32_t num_initiator, num_target;
> +
>  /* Build Memory Subsystem Address Range Structure */
>  static void build_hmat_spa(GArray *table_data, MachineState *ms,
>                             uint64_t base, uint64_t length, int node)
> @@ -77,6 +80,20 @@ static int pc_dimm_device_list(Object *obj, void *opaque)
>      return 0;
>  }
>  
> +static void classify_proximity_domains(MachineState *ms)
> +{
> +    int node;
> +
> +    for (node = 0; node < ms->numa_state->num_nodes; node++) {
> +        if (ms->numa_state->nodes[node].is_initiator) {
> +            initiator_pxm[num_initiator++] = node;
> +        }
> +        if (ms->numa_state->nodes[node].is_target) {
> +            target_pxm[num_target++] = node;
> +        }
> +    }
> +}
> +
>  /*
>   * The Proximity Domain of System Physical Address ranges defined
>   * in the HMAT, NFIT and SRAT tables shall match each other.
> @@ -85,9 +102,10 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
>  {
>      GSList *device_list = NULL;
>      uint64_t mem_base, mem_len;
> -    int i;
> +    int i, j, hrchy, type;
>      uint32_t mem_ranges_num = ms->numa_state->mem_ranges_num;
>      NumaMemRange *mem_ranges = ms->numa_state->mem_ranges;
> +    HMAT_LB_Info *numa_hmat_lb;
>  
>      PCMachineState *pcms = PC_MACHINE(ms);
>      AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
> @@ -117,6 +135,83 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
>          i = object_property_get_uint(OBJECT(dimm), PC_DIMM_NODE_PROP, NULL);
>          build_hmat_spa(table_data, ms, mem_base, mem_len, i);
>      }
> 

Considering the part below is sufficiently big, I'd move it into a separate
function, e.g. build_hmat_lb()

> +    if (!num_initiator && !num_target) {
> +        classify_proximity_domains(ms);
This part I'd just inline instead of making it a separate function,
and make initiator_pxm, target_pxm, num_initiator, num_target local variables
instead of globals. (is there a reason why they weren't made locals?)
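
e.g. (untested sketch, based on the code above):

 static void build_hmat_lb(GArray *table_data, MachineState *ms)
 {
     uint32_t initiator_pxm[MAX_NODES], target_pxm[MAX_NODES];
     uint32_t num_initiator = 0, num_target = 0;
     int node;

     for (node = 0; node < ms->numa_state->num_nodes; node++) {
         if (ms->numa_state->nodes[node].is_initiator) {
             initiator_pxm[num_initiator++] = node;
         }
         if (ms->numa_state->nodes[node].is_target) {
             target_pxm[num_target++] = node;
         }
     }
     ...
 }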

> +    }
> +
> +    /* Build HMAT System Locality Latency and Bandwidth Information. */
> +    for (hrchy = HMAT_LB_MEM_MEMORY;
> +         hrchy <= HMAT_LB_MEM_CACHE_3RD_LEVEL; hrchy++) {
> +        for (type = HMAT_LB_DATA_ACCESS_LATENCY;
> +             type <= HMAT_LB_DATA_WRITE_BANDWIDTH; type++) {
> +            numa_hmat_lb = ms->numa_state->hmat_lb[hrchy][type];
> +
> +            if (numa_hmat_lb) {
> +                uint32_t s = num_initiator;
> +                uint32_t t = num_target;
> +                uint8_t m, n;
> +
> +                /* Type */
> +                build_append_int_noprefix(table_data, 1, 2);
> +                /* Reserved */
> +                build_append_int_noprefix(table_data, 0, 2);
> +                /* Length */
> +                build_append_int_noprefix(table_data,
> +                                          32 + 4 * s + 4 * t + 2 * s * t, 4);
> +                /* Flags */
> +                build_append_int_noprefix(table_data,
> +                                          numa_hmat_lb->hierarchy, 1);
> +                /* Data Type */
> +                build_append_int_noprefix(table_data,
> +                                          numa_hmat_lb->data_type, 1);
> +                /* Reserved */
> +                build_append_int_noprefix(table_data, 0, 2);
> +                /* Number of Initiator Proximity Domains (s) */
> +                build_append_int_noprefix(table_data, s, 4);
> +                /* Number of Target Proximity Domains (t) */
> +                build_append_int_noprefix(table_data, t, 4);
> +                /* Reserved */
> +                build_append_int_noprefix(table_data, 0, 4);
> +
> +                /* Entry Base Unit */
> +                if (type <= HMAT_LB_DATA_WRITE_LATENCY) {
> +                    build_append_int_noprefix(table_data,
> +                                              numa_hmat_lb->base_lat, 8);
> +                } else {
> +                    build_append_int_noprefix(table_data,
> +                                              numa_hmat_lb->base_bw, 8);
> +                }
> +
> +                /* Initiator Proximity Domain List */
> +                for (i = 0; i < s; i++) {
> +                    build_append_int_noprefix(table_data, initiator_pxm[i], 4);
> +                }
> +
> +                /* Target Proximity Domain List */
> +                for (i = 0; i < t; i++) {
> +                    build_append_int_noprefix(table_data, target_pxm[i], 4);
> +                }
> +
> +                /* Latency or Bandwidth Entries */
> +                for (i = 0; i < s; i++) {
> +                    m = initiator_pxm[i];
> +                    for (j = 0; j < t; j++) {
> +                        n = target_pxm[j];
> +                        uint16_t entry;
> +
> +                        if (type <= HMAT_LB_DATA_WRITE_LATENCY) {
> +                            entry = numa_hmat_lb->latency[m][n];
> +                        } else {
> +                            entry = numa_hmat_lb->bandwidth[m][n];
> +                        }
> +
> +                        build_append_int_noprefix(table_data, entry, 2);
> +                    }
> +                }
> +            }
> +        }
> +    }
>  }
>  
>  void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
> diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
> index 4f480c1e43..f37e30e533 100644
> --- a/hw/acpi/hmat.h
> +++ b/hw/acpi/hmat.h
> @@ -38,6 +38,45 @@ enum {
>      HMAT_SPA_RESERVATION_HINT = 0x4,
>  };
>  
> +struct HMAT_LB_Info {
> +    /*
> +     * Indicates total number of Proximity Domains
> +     * that can initiate memory access requests.
> +     */
> +    uint32_t    num_initiator;
> +    /*
> +     * Indicates total number of Proximity Domains
> +     * that can act as target.
> +     */
> +    uint32_t    num_target;
> +    /*
> +     * Indicates it's memory or
> +     * the specified level memory side cache.
> +     */
> +    uint8_t     hierarchy;
> +    /*
> +     * Present the type of data,
> +     * access/read/write latency or bandwidth.
> +     */
> +    uint8_t     data_type;
> +    /* The base unit for latency in nanoseconds. */
> +    uint64_t    base_lat;
> +    /* The base unit for bandwidth in megabytes per second(MB/s). */
> +    uint64_t    base_bw;
> +    /*
> +     * latency[i][j]:
> +     * Indicates the latency based on base_lat
> +     * from Initiator Proximity Domain i to Target Proximity Domain j.
> +     */
> +    uint16_t    latency[MAX_NODES][MAX_NODES];
> +    /*
> +     * bandwidth[i][j]:
> +     * Indicates the bandwidth based on base_bw
> +     * from Initiator Proximity Domain i to Target Proximity Domain j.
> +     */
> +    uint16_t    bandwidth[MAX_NODES][MAX_NODES];
> +};
> +
>  void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms);
>  
>  #endif
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index d392634e08..e0169b0a64 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -263,6 +263,9 @@ typedef struct NumaState {
>  
>      /* NUMA memory ranges */
>      NumaMemRange mem_ranges[MAX_NODES + 2];
> +
> +    /* NUMA modes HMAT Locality Latency and Bandwidth Information */
> +    HMAT_LB_Info *hmat_lb[HMAT_LB_LEVELS][HMAT_LB_TYPES];
>  } NumaState;
>  
>  /**
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index fcdaae58c4..c0257e936b 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -33,6 +33,7 @@ typedef struct FWCfgEntry FWCfgEntry;
>  typedef struct FWCfgIoState FWCfgIoState;
>  typedef struct FWCfgMemState FWCfgMemState;
>  typedef struct FWCfgState FWCfgState;
> +typedef struct HMAT_LB_Info HMAT_LB_Info;
>  typedef struct HVFX86EmulatorState HVFX86EmulatorState;
>  typedef struct I2CBus I2CBus;
>  typedef struct I2SCodec I2SCodec;
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 5f133cae83..da51a9bc26 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -124,6 +124,28 @@ extern int mem_prealloc;
>  #define NUMA_DISTANCE_MAX         254
>  #define NUMA_DISTANCE_UNREACHABLE 255
>  
> +/* the value of AcpiHmatLBInfo flags */
> +enum {
> +    HMAT_LB_MEM_MEMORY           = 0,
> +    HMAT_LB_MEM_CACHE_LAST_LEVEL = 1,
> +    HMAT_LB_MEM_CACHE_1ST_LEVEL  = 2,
> +    HMAT_LB_MEM_CACHE_2ND_LEVEL  = 3,
> +    HMAT_LB_MEM_CACHE_3RD_LEVEL  = 4,
> +};
> +
> +/* the value of AcpiHmatLBInfo data type */
> +enum {
> +    HMAT_LB_DATA_ACCESS_LATENCY   = 0,
> +    HMAT_LB_DATA_READ_LATENCY     = 1,
> +    HMAT_LB_DATA_WRITE_LATENCY    = 2,
> +    HMAT_LB_DATA_ACCESS_BANDWIDTH = 3,
> +    HMAT_LB_DATA_READ_BANDWIDTH   = 4,
> +    HMAT_LB_DATA_WRITE_BANDWIDTH  = 5,
> +};
> +
> +#define HMAT_LB_LEVELS    (HMAT_LB_MEM_CACHE_3RD_LEVEL + 1)
> +#define HMAT_LB_TYPES     (HMAT_LB_DATA_WRITE_BANDWIDTH + 1)
>  #define MAX_OPTION_ROMS 16
>  typedef struct QEMUOptionRom {
>      const char *name;




* Re: [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI HMAT
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache " Tao Xu
@ 2019-06-04 15:04   ` Igor Mammedov
  2019-06-05  6:04     ` Tao Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2019-06-04 15:04 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Wed,  8 May 2019 14:17:22 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> This structure describes memory side cache information for memory
> proximity domains if the memory side cache is present and the
> physical device(SMBIOS handle) forms the memory side cache.
> The software could use this information to effectively place
> the data in memory to maximize the performance of the system
> memory that use the memory side cache.
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - use build_append_int_noprefix() to build Memory Side Cache
>     Information Structure(s) tables (Igor)
>     - move globals (hmat_cache_info) into MachineState (Igor)
>     - move hmat_build_cache() inside of hmat_build_hma() (Igor)
> ---
>  hw/acpi/hmat.c          | 50 ++++++++++++++++++++++++++++++++++++++++-
>  hw/acpi/hmat.h          | 25 +++++++++++++++++++++
>  include/hw/boards.h     |  3 +++
>  include/qemu/typedefs.h |  1 +
>  include/sysemu/sysemu.h |  8 +++++++
>  5 files changed, 86 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
> index 54aabf77eb..3a8c41162d 100644
> --- a/hw/acpi/hmat.c
> +++ b/hw/acpi/hmat.c
> @@ -102,10 +102,11 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
>  {
>      GSList *device_list = NULL;
>      uint64_t mem_base, mem_len;
> -    int i, j, hrchy, type;
> +    int i, j, hrchy, type, level;
>      uint32_t mem_ranges_num = ms->numa_state->mem_ranges_num;
>      NumaMemRange *mem_ranges = ms->numa_state->mem_ranges;
>      HMAT_LB_Info *numa_hmat_lb;
> +    HMAT_Cache_Info *numa_hmat_cache = NULL;
>  
>      PCMachineState *pcms = PC_MACHINE(ms);
>      AcpiDeviceIfClass *adevc = ACPI_DEVICE_IF_GET_CLASS(pcms->acpi_dev);
> @@ -212,6 +213,53 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
>              }
>          }
>      }
> +
> +    /* Build HMAT Memory Side Cache Information. */
> +    for (i = 0; i < ms->numa_state->num_nodes; i++) {
> +        for (level = 0; level <= MAX_HMAT_CACHE_LEVEL; level++) {
> +            numa_hmat_cache = ms->numa_state->hmat_cache[i][level];
> +            if (numa_hmat_cache) {
> +                uint16_t n = numa_hmat_cache->num_smbios_handles;


> +                uint32_t cache_attr = HMAT_CACHE_TOTAL_LEVEL(
> +                                      numa_hmat_cache->total_levels);
> +                cache_attr |= HMAT_CACHE_CURRENT_LEVEL(
> +                              numa_hmat_cache->level);
> +                cache_attr |= HMAT_CACHE_ASSOC(
> +                                          numa_hmat_cache->associativity);
> +                cache_attr |= HMAT_CACHE_WRITE_POLICY(
> +                                          numa_hmat_cache->write_policy);
> +                cache_attr |= HMAT_CACHE_LINE_SIZE(
> +                                          numa_hmat_cache->line_size);
I don't see the merit of hiding the bitfield manipulation behind macros.
I'd suggest dropping the macros here and doing the mask+shift directly here.
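i.e. something like (untested, just expanding the masks and shifts the
macros encode):

    uint32_t cache_attr = (numa_hmat_cache->total_levels & 0xF) |
                          ((numa_hmat_cache->level & 0xF) << 4) |
                          ((numa_hmat_cache->associativity & 0xF) << 8) |
                          ((numa_hmat_cache->write_policy & 0xF) << 12) |
                          ((numa_hmat_cache->line_size & 0xFFFF) << 16);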

> +                cache_attr = cpu_to_le32(cache_attr);
> +
> +                /* Memory Side Cache Information Structure */
> +                /* Type */
> +                build_append_int_noprefix(table_data, 2, 2);
> +                /* Reserved */
> +                build_append_int_noprefix(table_data, 0, 2);
> +                /* Length */
> +                build_append_int_noprefix(table_data, 32 + 2 * n, 4);
> +                /* Proximity Domain for the Memory */
> +                build_append_int_noprefix(table_data,
> +                                          numa_hmat_cache->mem_proximity, 4);
> +                /* Reserved */
> +                build_append_int_noprefix(table_data, 0, 4);
> +                /* Memory Side Cache Size */
> +                build_append_int_noprefix(table_data,
> +                                          numa_hmat_cache->size, 8);
> +                /* Cache Attributes */
> +                build_append_int_noprefix(table_data, cache_attr, 4);
> +                /* Reserved */
> +                build_append_int_noprefix(table_data, 0, 2);
> +                /* Number of SMBIOS handles (n) */
> +                build_append_int_noprefix(table_data, n, 2);
> +
> +                /* SMBIOS Handles */
> +                /* TBD: set smbios handles */
> +                build_append_int_noprefix(table_data, 0, 2 * n);
Is memory side cache structure useful at all without pointing to SMBIOS entries?

> +            }
> +        }
> +    }
>  }
>  
>  void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
> diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
> index f37e30e533..8f563f19dd 100644
> --- a/hw/acpi/hmat.h
> +++ b/hw/acpi/hmat.h
> @@ -77,6 +77,31 @@ struct HMAT_LB_Info {
>      uint16_t    bandwidth[MAX_NODES][MAX_NODES];
>  };
>  
> +struct HMAT_Cache_Info {
> +    /* The memory proximity domain to which the memory belongs. */
> +    uint32_t    mem_proximity;
> +    /* Size of memory side cache in bytes. */
> +    uint64_t    size;
> +    /*
> +     * Total cache levels for this memory
> +     * proximity domain.
> +     */
> +    uint8_t     total_levels;
> +    /* Cache level described in this structure. */
> +    uint8_t     level;
> +    /* Cache Associativity: None/Direct Mapped/Complex Cache Indexing */
> +    uint8_t     associativity;
> +    /* Write Policy: None/Write Back(WB)/Write Through(WT) */
> +    uint8_t     write_policy;
> +    /* Cache Line size in bytes. */
> +    uint16_t    line_size;
> +    /*
> +     * Number of SMBIOS handles that contributes to
> +     * the memory side cache physical devices.
> +     */
> +    uint16_t    num_smbios_handles;
> +};
> +
>  void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms);
>  
>  #endif
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index e0169b0a64..8609f923d9 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -266,6 +266,9 @@ typedef struct NumaState {
>  
>      /* NUMA modes HMAT Locality Latency and Bandwidth Information */
>      HMAT_LB_Info *hmat_lb[HMAT_LB_LEVELS][HMAT_LB_TYPES];
> +
> +    /* Memory Side Cache Information Structure */
> +    HMAT_Cache_Info *hmat_cache[MAX_NODES][MAX_HMAT_CACHE_LEVEL + 1];
>  } NumaState;
>  
>  /**
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index c0257e936b..d971f5109e 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -33,6 +33,7 @@ typedef struct FWCfgEntry FWCfgEntry;
>  typedef struct FWCfgIoState FWCfgIoState;
>  typedef struct FWCfgMemState FWCfgMemState;
>  typedef struct FWCfgState FWCfgState;
> +typedef struct HMAT_Cache_Info HMAT_Cache_Info;
>  typedef struct HMAT_LB_Info HMAT_LB_Info;
>  typedef struct HVFX86EmulatorState HVFX86EmulatorState;
>  typedef struct I2CBus I2CBus;
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index da51a9bc26..0cfb387887 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -143,9 +143,17 @@ enum {
>      HMAT_LB_DATA_WRITE_BANDWIDTH  = 5,
>  };
>  
> +#define MAX_HMAT_CACHE_LEVEL        3
> +
>  #define HMAT_LB_LEVELS    (HMAT_LB_MEM_CACHE_3RD_LEVEL + 1)
>  #define HMAT_LB_TYPES     (HMAT_LB_DATA_WRITE_BANDWIDTH + 1)
>  
> +#define HMAT_CACHE_TOTAL_LEVEL(level)      (level & 0xF)
> +#define HMAT_CACHE_CURRENT_LEVEL(level)    ((level & 0xF) << 4)
> +#define HMAT_CACHE_ASSOC(assoc)            ((assoc & 0xF) << 8)
> +#define HMAT_CACHE_WRITE_POLICY(policy)    ((policy & 0xF) << 12)
> +#define HMAT_CACHE_LINE_SIZE(size)         ((size & 0xFFFF) << 16)
> +
>  #define MAX_OPTION_ROMS 16
>  typedef struct QEMUOptionRom {
>      const char *name;




* Re: [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI HMAT
  2019-06-04 15:04   ` Igor Mammedov
@ 2019-06-05  6:04     ` Tao Xu
  2019-06-05 12:12       ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-06-05  6:04 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On 6/4/2019 11:04 PM, Igor Mammedov wrote:
> On Wed,  8 May 2019 14:17:22 +0800
> Tao Xu <tao3.xu@intel.com> wrote:
> 
>> From: Liu Jingqi <jingqi.liu@intel.com>
>>
>> This structure describes memory side cache information for memory
>> proximity domains if the memory side cache is present and the
>> physical device(SMBIOS handle) forms the memory side cache.
>> The software could use this information to effectively place
>> the data in memory to maximize the performance of the system
>> memory that use the memory side cache.
>>
>> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
>> ---
>>
...
>> +
>> +                /* SMBIOS Handles */
>> +                /* TBD: set smbios handles */
>> +                build_append_int_noprefix(table_data, 0, 2 * n);
> Is memory side cache structure useful at all without pointing to SMBIOS entries?
> 
They are not useful yet, and the kernel 5.1 HMAT sysfs doesn't show
SMBIOS entries. We can update it if it becomes useful in the future.



* Re: [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI HMAT
  2019-06-05  6:04     ` Tao Xu
@ 2019-06-05 12:12       ` Igor Mammedov
  2019-06-06  3:00         ` Tao Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2019-06-05 12:12 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, pbonzini, rth, ehabkost

On Wed, 5 Jun 2019 14:04:10 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> On 6/4/2019 11:04 PM, Igor Mammedov wrote:
> > On Wed,  8 May 2019 14:17:22 +0800
> > Tao Xu <tao3.xu@intel.com> wrote:
> >   
> >> From: Liu Jingqi <jingqi.liu@intel.com>
> >>
> >> This structure describes memory side cache information for memory
> >> proximity domains if the memory side cache is present and the
> >> physical device(SMBIOS handle) forms the memory side cache.
> >> The software could use this information to effectively place
> >> the data in memory to maximize the performance of the system
> >> memory that use the memory side cache.
> >>
> >> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> >> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> >> ---
> >>  
> ...
> >> +
> >> +                /* SMBIOS Handles */
> >> +                /* TBD: set smbios handles */
> >> +                build_append_int_noprefix(table_data, 0, 2 * n);  
> > Is memory side cache structure useful at all without pointing to SMBIOS entries?
> >   
> They are not useful yet, and the kernel 5.1 HMAT sysfs doesn't show 
> SMBIOS entries. We can update it if it useful in the future.

In that case I'd suggest dropping it for now, until this table is properly
populated and ready for consumption (i.e. drop this patch and the
corresponding CLI patch 9/11).



* Re: [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information Tao Xu
@ 2019-06-05 14:40   ` Igor Mammedov
  2019-06-06  7:47     ` Tao Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2019-06-05 14:40 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Wed,  8 May 2019 14:17:23 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> Add -numa hmat-lb option to provide System Locality Latency and
> Bandwidth Information. These memory attributes help to build
> System Locality Latency and Bandwidth Information Structure(s)
> in ACPI Heterogeneous Memory Attribute Table (HMAT).
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - update the version tag from 4.0 to 4.1
> ---
>  numa.c          | 127 ++++++++++++++++++++++++++++++++++++++++++++++++
>  qapi/misc.json  |  94 ++++++++++++++++++++++++++++++++++-
>  qemu-options.hx |  28 ++++++++++-
>  3 files changed, 246 insertions(+), 3 deletions(-)
> 
> diff --git a/numa.c b/numa.c
> index 71b0aee02a..1aecb7a2e9 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -40,6 +40,7 @@
>  #include "qemu/option.h"
>  #include "qemu/config-file.h"
>  #include "qemu/cutils.h"
> +#include "hw/acpi/hmat.h"
>  
>  QemuOptsList qemu_numa_opts = {
>      .name = "numa",
> @@ -179,6 +180,126 @@ void parse_numa_distance(MachineState *ms, NumaDistOptions *dist, Error **errp)
>      ms->numa_state->have_numa_distance = true;
>  }
>  
> +static void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
> +                               Error **errp)
> +{
> +    int nb_numa_nodes = ms->numa_state->num_nodes;
> +    NodeInfo *numa_info = ms->numa_state->nodes;
> +    HMAT_LB_Info *hmat_lb = NULL;
> +
> +    if (node->data_type <= HMATLB_DATA_TYPE_WRITE_LATENCY) {
> +        if (!node->has_latency) {
> +            error_setg(errp, "Missing 'latency' option.");
> +            return;
> +        }
> +        if (node->has_bandwidth) {
> +            error_setg(errp, "Invalid option 'bandwidth' since "
> +                       "the data type is latency.");
> +            return;
> +        }
> +        if (node->has_base_bw) {
> +            error_setg(errp, "Invalid option 'base_bw' since "
> +                       "the data type is latency.");
> +            return;
> +        }
> +    }
> +
> +    if (node->data_type >= HMATLB_DATA_TYPE_ACCESS_BANDWIDTH) {
> +        if (!node->has_bandwidth) {
> +            error_setg(errp, "Missing 'bandwidth' option.");
> +            return;
> +        }
> +        if (node->has_latency) {
> +            error_setg(errp, "Invalid option 'latency' since "
> +                       "the data type is bandwidth.");
> +            return;
> +        }
> +        if (node->has_base_lat) {
> +            error_setg(errp, "Invalid option 'base_lat' since "
> +                       "the data type is bandwidth.");
> +            return;
> +        }
> +    }
> +
> +    if (node->initiator >= nb_numa_nodes) {
> +        error_setg(errp, "Invalid initiator=%"
> +                   PRIu16 ", it should be less than %d.",
> +                   node->initiator, nb_numa_nodes);
> +        return;
> +    }
> +    if (!numa_info[node->initiator].is_initiator) {
> +        error_setg(errp, "Invalid initiator=%"
> +                   PRIu16 ", it isn't an initiator proximity domain.",
> +                   node->initiator);
> +        return;
> +    }
> +
> +    if (node->target >= nb_numa_nodes) {
> +        error_setg(errp, "Invalid initiator=%"
Shouldn't it be 'target' here?

> +                   PRIu16 ", it should be less than %d.",
> +                   node->target, nb_numa_nodes);
> +        return;
> +    }
> +    if (!numa_info[node->target].is_target) {
> +        error_setg(errp, "Invalid target=%"
> +                   PRIu16 ", it isn't a target proximity domain.",
> +                   node->target);
> +        return;
> +    }
> +
> +    if (node->has_latency) {
> +        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
> +
> +        if (!hmat_lb) {
> +            hmat_lb = g_malloc0(sizeof(*hmat_lb));
> +            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
> +        } else if (hmat_lb->latency[node->initiator][node->target]) {
> +            error_setg(errp, "Duplicate configuration of the latency for "
> +                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
> +                       node->initiator, node->target);
> +            return;
> +        }
> +
> +        /* Only the first time of setting the base unit is valid. */
> +        if ((hmat_lb->base_lat == 0) && (node->has_base_lat)) {
> +            hmat_lb->base_lat = node->base_lat;
> +        }
> +
> +        hmat_lb->latency[node->initiator][node->target] = node->latency;
> +    }
> +
> +    if (node->has_bandwidth) {
> +        hmat_lb = ms->numa_state->hmat_lb[node->hierarchy][node->data_type];
> +
> +        if (!hmat_lb) {
> +            hmat_lb = g_malloc0(sizeof(*hmat_lb));
> +            ms->numa_state->hmat_lb[node->hierarchy][node->data_type] = hmat_lb;
> +        } else if (hmat_lb->bandwidth[node->initiator][node->target]) {
> +            error_setg(errp, "Duplicate configuration of the bandwidth for "
> +                       "initiator=%" PRIu16 " and target=%" PRIu16 ".",
> +                       node->initiator, node->target);
> +            return;
> +        }
> +
> +        /* Only the first time of setting the base unit is valid. */
> +        if (hmat_lb->base_bw == 0) {
> +            if (!node->has_base_bw) {
> +                error_setg(errp, "Missing 'base-bw' option");
> +                return;
> +            } else {
> +                hmat_lb->base_bw = node->base_bw;
> +            }
> +        }
> +
> +        hmat_lb->bandwidth[node->initiator][node->target] = node->bandwidth;
> +    }
> +
> +    if (hmat_lb) {
> +        hmat_lb->hierarchy = node->hierarchy;
> +        hmat_lb->data_type = node->data_type;
> +    }
> +}
> +
>  static
>  void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
>  {
> @@ -217,6 +338,12 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
>          machine_set_cpu_numa_node(ms, qapi_NumaCpuOptions_base(&object->u.cpu),
>                                    &err);
>          break;
> +    case NUMA_OPTIONS_TYPE_HMAT_LB:
> +        parse_numa_hmat_lb(ms, &object->u.hmat_lb, &err);
> +        if (err) {
> +            goto end;
> +        }
> +        break;
>      default:
>          abort();
>      }
> diff --git a/qapi/misc.json b/qapi/misc.json
> index 8b3ca4fdd3..d7fce75702 100644
> --- a/qapi/misc.json
> +++ b/qapi/misc.json
> @@ -2539,10 +2539,12 @@
>  #
>  # @cpu: property based CPU(s) to node mapping (Since: 2.10)
>  #
> +# @hmat-lb: memory latency and bandwidth information (Since: 4.1)
> +#
>  # Since: 2.1
>  ##
>  { 'enum': 'NumaOptionsType',
> -  'data': [ 'node', 'dist', 'cpu' ] }
> +  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
>  
>  ##
>  # @NumaOptions:
> @@ -2557,7 +2559,8 @@
>    'data': {
>      'node': 'NumaNodeOptions',
>      'dist': 'NumaDistOptions',
> -    'cpu': 'NumaCpuOptions' }}
> +    'cpu': 'NumaCpuOptions',
> +    'hmat-lb': 'NumaHmatLBOptions' }}
>  
>  ##
>  # @NumaNodeOptions:
> @@ -2620,6 +2623,93 @@
>     'base': 'CpuInstanceProperties',
>     'data' : {} }
>  
> +##
> +# @HmatLBMemoryHierarchy:
> +#
> +# The memory hierarchy in the System Locality Latency
> +# and Bandwidth Information Structure of HMAT (Heterogeneous
> +# Memory Attribute Table)
> +#
> +# @memory: the structure represents the memory performance
> +#
> +# @last-level: last level memory of memory side cached memory
> +#
> +# @first-level: first level memory of memory side cached memory
> +#
> +# @second-level: second level memory of memory side cached memory
> +#
> +# @third-level: third level memory of memory side cached memory
> +#
> +# Since: 4.1
> +##
> +{ 'enum': 'HmatLBMemoryHierarchy',
> +  'data': [ 'memory', 'last-level', 'first-level',
> +            'second-level', 'third-level' ] }
> +
> +##
> +# @HmatLBDataType:
> +#
> +# Data type in the System Locality Latency
> +# and Bandwidth Information Structure of HMAT (Heterogeneous
> +# Memory Attribute Table)
> +#
> +# @access-latency: access latency (nanoseconds)
> +#
> +# @read-latency: read latency (nanoseconds)
> +#
> +# @write-latency: write latency (nanoseconds)
> +#
> +# @access-bandwidth: access bandwidth (MB/s)
> +#
> +# @read-bandwidth: read bandwidth (MB/s)
> +#
> +# @write-bandwidth: write bandwidth (MB/s)
> +#
> +# Since: 4.1
> +##
> +{ 'enum': 'HmatLBDataType',
> +  'data': [ 'access-latency', 'read-latency', 'write-latency',
> +            'access-bandwidth', 'read-bandwidth', 'write-bandwidth' ] }
> +
> +##
> +# @NumaHmatLBOptions:
> +#
> +# Set the system locality latency and bandwidth information
> +# between Initiator and Target proximity Domains.
> +#
> +# @initiator: the Initiator Proximity Domain.
> +#
> +# @target: the Target Proximity Domain.
> +#
> +# @hierarchy: the Memory Hierarchy. Indicates the performance
> +#             of memory or side cache.
> +#
> +# @data-type: presents the type of data, access/read/write
> +#             latency or hit latency.
> +#
> +# @base-lat: the base unit for latency in nanoseconds.
> +#
> +# @base-bw: the base unit for bandwidth in megabytes per second(MB/s).
> +#
> +# @latency: the value of latency based on Base Unit from @initiator
> +#           to @target proximity domain.
> +#
> +# @bandwidth: the value of bandwidth based on Base Unit between
> +#             @initiator and @target proximity domain.
> +#
> +# Since: 4.1
> +##
> +{ 'struct': 'NumaHmatLBOptions',
> +  'data': {
> +   'initiator': 'uint16',
> +   'target': 'uint16',
> +   'hierarchy': 'HmatLBMemoryHierarchy',
> +   'data-type': 'HmatLBDataType',
I think a union would be better here, with data-type used as the
discriminator; on top of that you'd be able to drop a bit of the error
checking above, since a QAPI union will not allow the user to mix latency
and bandwidth.
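Roughly something like this (an untested sketch; NumaHmatLBLatencyOptions
and NumaHmatLBBandwidthOptions are names I made up to hold the latency- and
bandwidth-specific fields):

    { 'struct': 'NumaHmatLBLatencyOptions',
      'data': { '*base-lat': 'uint64',
                'latency': 'uint16' } }

    { 'struct': 'NumaHmatLBBandwidthOptions',
      'data': { '*base-bw': 'uint64',
                'bandwidth': 'uint16' } }

    { 'union': 'NumaHmatLBOptions',
      'base': { 'initiator': 'uint16',
                'target': 'uint16',
                'hierarchy': 'HmatLBMemoryHierarchy',
                'data-type': 'HmatLBDataType' },
      'discriminator': 'data-type',
      'data': {
        'access-latency': 'NumaHmatLBLatencyOptions',
        'read-latency': 'NumaHmatLBLatencyOptions',
        'write-latency': 'NumaHmatLBLatencyOptions',
        'access-bandwidth': 'NumaHmatLBBandwidthOptions',
        'read-bandwidth': 'NumaHmatLBBandwidthOptions',
        'write-bandwidth': 'NumaHmatLBBandwidthOptions' } }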

> +   '*base-lat': 'uint64',
> +   '*base-bw': 'uint64',
> +   '*latency': 'uint16',
> +   '*bandwidth': 'uint16' }}
> +
>  ##
>  # @HostMemPolicy:
>  #
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 51802cbb26..5351b0e453 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -163,16 +163,19 @@ DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>      "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
>      "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
>      "-numa dist,src=source,dst=destination,val=distance\n"
> -    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
> +    "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n"
> +    "-numa hmat-lb,initiator=node,target=node,hierarchy=memory|last-level,data-type=access-latency|read-latency|write-latency[,base-lat=blat][,base-bw=bbw][,latency=lat][,bandwidth=bw]\n",
>      QEMU_ARCH_ALL)
>  STEXI
>  @item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
>  @itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
>  @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
>  @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
> +@itemx -numa hmat-lb,initiator=@var{node},target=@var{node},hierarchy=@var{str},data-type=@var{str}[,base-lat=@var{blat}][,base-bw=@var{bbw}][,latency=@var{lat}][,bandwidth=@var{bw}]
>  @findex -numa
>  Define a NUMA node and assign RAM and VCPUs to it.
>  Set the NUMA distance from a source node to a destination node.
> +Set the ACPI Heterogeneous Memory Attribute for the given nodes.

s/Attribute/Attributes/

>  
>  Legacy VCPU assignment uses @samp{cpus} option where
>  @var{firstcpu} and @var{lastcpu} are CPU indexes. Each
> @@ -230,6 +233,29 @@ specified resources, it just assigns existing resources to NUMA
>  nodes. This means that one still has to use the @option{-m},
>  @option{-smp} options to allocate RAM and VCPUs respectively.
>  
> +Use 'hmat-lb' to set System Locality Latency and Bandwidth Information
> +between initiator NUMA node and target NUMA node to build ACPI Heterogeneous Attribute Memory Table (HMAT).
s/ initiator NUMA node and target NUMA node .*\./
initiator and target NUMA nodes in ACPI Heterogeneous Attribute Memory Table (HMAT)./

Also I don't see any description of possible values/units


> +Initiator NUMA node can create memory requests, usually including one or more processors.
> +Target NUMA node contains addressable memory.
> +
> +For example:
> +@example
> +-m 2G \
> +-smp 3,sockets=2,maxcpus=3 \
> +-numa node,cpus=0-1,nodeid=0 \
> +-numa node,mem=1G,cpus=2,nodeid=1 \
> +-numa node,mem=1G,nodeid=2 \

pls use '-numa cpu' and '-numa memdev' instead of legacy 'cpus/mem' options in examples.
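e.g. something along these lines for the node/cpu part of this example
(untested; the socket topology is simplified so each vCPU is its own socket,
and whether a memory-less node 0 is accepted next to memdev nodes may need
checking), with the two hmat-lb lines kept as they are:

    -m 2G \
    -object memory-backend-ram,id=m1,size=1G \
    -object memory-backend-ram,id=m2,size=1G \
    -smp 3,sockets=3,maxcpus=3 \
    -numa node,nodeid=0 \
    -numa node,nodeid=1,memdev=m1 \
    -numa node,nodeid=2,memdev=m2 \
    -numa cpu,node-id=0,socket-id=0 \
    -numa cpu,node-id=0,socket-id=1 \
    -numa cpu,node-id=1,socket-id=2 \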

> +-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,base-lat=10,base-bw=20,latency=10,bandwidth=10 \
> +-numa hmat-lb,initiator=1,target=2,hierarchy=first-level,data-type=access-latency,base-bw=10,bandwidth=20
> +@end example
> +
> +When the processors in NUMA node 0 access memory in NUMA node 1,
> +the first line containing 'hmat-lb' sets the latency and bandwidth information.
What does it set "FOO information" for?

> +The latency is @var{lat} multiplied by @var{blat} and the bandwidth is @var{bw} multiplied by @var{bbw}.
that's rather cryptic and probably not necessary at all.

> +
> +When the processors in NUMA node 1 access memory in NUMA node 2 that acts as 2nd level memory side cache,
> +the second line containing 'hmat-lb' sets the access hit bandwidth information.
> +
>  ETEXI
>  
>  DEF("add-fd", HAS_ARG, QEMU_OPTION_add_fd,




* Re: [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI HMAT
  2019-06-05 12:12       ` Igor Mammedov
@ 2019-06-06  3:00         ` Tao Xu
  2019-06-06 16:45           ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-06-06  3:00 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: xiaoguangrong.eric, mst, Liu, Jingqi, qemu-devel, pbonzini, rth,
	ehabkost

On 6/5/2019 8:12 PM, Igor Mammedov wrote:
> On Wed, 5 Jun 2019 14:04:10 +0800
> Tao Xu <tao3.xu@intel.com> wrote:
> 
>> On 6/4/2019 11:04 PM, Igor Mammedov wrote:
>>> On Wed,  8 May 2019 14:17:22 +0800
>>> Tao Xu <tao3.xu@intel.com> wrote:
>>>    
...
>>>> +
>>>> +                /* SMBIOS Handles */
>>>> +                /* TBD: set smbios handles */
>>>> +                build_append_int_noprefix(table_data, 0, 2 * n);
>>> Is memory side cache structure useful at all without pointing to SMBIOS entries?
>>>    
>> They are not useful yet, and the kernel 5.1 HMAT sysfs doesn't show
>> SMBIOS entries. We can update it if it useful in the future.
> 
> In that case I'd suggest to drop it for now until this table is properly
> populated and ready for consumption. (i.e. drop this patch and corresponding
> CLI 9/11 patch).
> 

But the kernel HMAT code can read the other Memory Side Cache Information
except the SMBIOS entries, and the host HMAT tables also have no SMBIOS
Handles; they show the Number of SMBIOS handles (n) as 0 as well. So I am
wondering if it is better to set "SMBIOS handles (n)" to 0, remove the TODO
and add a comment explaining why it is set to 0?



* Re: [Qemu-devel] [PATCH v4 04/11] acpi: introduce AcpiDeviceIfClass.build_mem_ranges hook
  2019-05-24 12:35   ` Igor Mammedov
@ 2019-06-06  5:15     ` Tao Xu
  2019-06-06 16:25       ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-06-06  5:15 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On 5/24/2019 8:35 PM, Igor Mammedov wrote:
> On Wed,  8 May 2019 14:17:19 +0800
> Tao Xu <tao3.xu@intel.com> wrote:
> 
>> Add build_mem_ranges callback to AcpiDeviceIfClass and use
>> it for generating SRAT and HMAT numa memory ranges.
>>
>> Suggested-by: Igor Mammedov <imammedo@redhat.com>
>> Co-developed-by: Liu Jingqi <jingqi.liu@intel.com>
>> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
>> ---
...
>> diff --git a/stubs/pc_build_mem_ranges.c b/stubs/pc_build_mem_ranges.c
>> new file mode 100644
>> index 0000000000..0f104ba79d
>> --- /dev/null
>> +++ b/stubs/pc_build_mem_ranges.c
>> @@ -0,0 +1,6 @@
>> +#include "qemu/osdep.h"
>> +#include "hw/i386/pc.h"
>> +
>> +void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *machine)
>> +{
>> +}
> 
> why do you need stub?
> 
Hi Igor,

I have a question here: I use a stub because we add the hook pointer in
piix4.c, but other architectures such as mips also use piix4. Without the
stub, compilation fails, just as with pc_madt_cpu_entry.
Or is there another way to make it used only by PC?

Thank you!




* Re: [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information
  2019-06-05 14:40   ` Igor Mammedov
@ 2019-06-06  7:47     ` Tao Xu
  2019-06-06 13:23       ` Eric Blake
  0 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-06-06  7:47 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On 6/5/2019 10:40 PM, Igor Mammedov wrote:
> On Wed,  8 May 2019 14:17:23 +0800
> Tao Xu <tao3.xu@intel.com> wrote:
> 
>> From: Liu Jingqi <jingqi.liu@intel.com>
>>
>> Add -numa hmat-lb option to provide System Locality Latency and
>> Bandwidth Information. These memory attributes help to build
>> System Locality Latency and Bandwidth Information Structure(s)
>> in ACPI Heterogeneous Memory Attribute Table (HMAT).
>>
>> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
>> ---
...
>> +##
>> +{ 'struct': 'NumaHmatLBOptions',
>> +  'data': {
>> +   'initiator': 'uint16',
>> +   'target': 'uint16',
>> +   'hierarchy': 'HmatLBMemoryHierarchy',
>> +   'data-type': 'HmatLBDataType',
> I think union will be better here with data-type used as discriminator,
> on top of that you'll be able to drop a bit of error checking above since
> QAPI's union will not allow user to mix latency and bandwidth.
> 
Hi Igor,

I have a question here: 'hmat-lb' is a member of the union 'NumaOptions',
and it seems we can't use a union as a member of another union.



* Re: [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information
  2019-06-06  7:47     ` Tao Xu
@ 2019-06-06 13:23       ` Eric Blake
  2019-06-06 16:50         ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Eric Blake @ 2019-06-06 13:23 UTC (permalink / raw)
  To: Tao Xu, Igor Mammedov
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost,
	Markus Armbruster, pbonzini, rth


On 6/6/19 2:47 AM, Tao Xu wrote:
> On 6/5/2019 10:40 PM, Igor Mammedov wrote:
>> On Wed,  8 May 2019 14:17:23 +0800
>> Tao Xu <tao3.xu@intel.com> wrote:
>>
>>> From: Liu Jingqi <jingqi.liu@intel.com>
>>>
>>> Add -numa hmat-lb option to provide System Locality Latency and
>>> Bandwidth Information. These memory attributes help to build
>>> System Locality Latency and Bandwidth Information Structure(s)
>>> in ACPI Heterogeneous Memory Attribute Table (HMAT).
>>>
>>> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
>>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
>>> ---
> ...
>>> +##
>>> +{ 'struct': 'NumaHmatLBOptions',
>>> +  'data': {
>>> +   'initiator': 'uint16',
>>> +   'target': 'uint16',
>>> +   'hierarchy': 'HmatLBMemoryHierarchy',
>>> +   'data-type': 'HmatLBDataType',
>> I think union will be better here with data-type used as discriminator,
>> on top of that you'll be able to drop a bit of error checking above since
>> QAPI's union will not allow user to mix latency and bandwidth.
>>
> Hi Igor,
> 
> I have quesion here, the 'hmat-lb' is a member of a union 'NumaOptions',
> it seems can' use a union as a member of union.

It should be technically possible to expand the QAPI generators to allow
one union as a branch within another union, so long as there are no
collisions in identifiers, if that makes for the smartest on-the-wire
representation.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [Qemu-devel] [PATCH v4 04/11] acpi: introduce AcpiDeviceIfClass.build_mem_ranges hook
  2019-06-06  5:15     ` Tao Xu
@ 2019-06-06 16:25       ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-06-06 16:25 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Thu, 6 Jun 2019 13:15:43 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> On 5/24/2019 8:35 PM, Igor Mammedov wrote:
> > On Wed,  8 May 2019 14:17:19 +0800
> > Tao Xu <tao3.xu@intel.com> wrote:
> >   
> >> Add build_mem_ranges callback to AcpiDeviceIfClass and use
> >> it for generating SRAT and HMAT numa memory ranges.
> >>
> >> Suggested-by: Igor Mammedov <imammedo@redhat.com>
> >> Co-developed-by: Liu Jingqi <jingqi.liu@intel.com>
> >> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> >> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> >> ---  
> ...
> >> diff --git a/stubs/pc_build_mem_ranges.c b/stubs/pc_build_mem_ranges.c
> >> new file mode 100644
> >> index 0000000000..0f104ba79d
> >> --- /dev/null
> >> +++ b/stubs/pc_build_mem_ranges.c
> >> @@ -0,0 +1,6 @@
> >> +#include "qemu/osdep.h"
> >> +#include "hw/i386/pc.h"
> >> +
> >> +void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *machine)
> >> +{
> >> +}  
> > 
> > why do you need stub?
> >   
> Hi Igor,
> 
> I have questions here, I use stub here because we add hook pointer in 
> piix4.c but other arch such mips use piix4. Without stub, it will failed 
> when compile, like pc_madt_cpu_entry.
> Or there are other way to make it use just in pc?
I forgot that piix4 is used by mips as well; it's perfectly fine to add a
stub in this case.
Though, I'd add a comment above the stub explaining why it's there, to
avoid questions.
Such a comment might make life easier for whoever touches this code later,
so they wouldn't have to figure out the mips dependency the hard way.
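e.g. something like (just a sketch of the kind of comment I mean):

    /*
     * Empty stub for boards that link in piix4 but do not implement
     * PC-style ACPI memory ranges (e.g. mips also builds piix4),
     * similar to the pc_madt_cpu_entry stub.
     */
    void pc_build_mem_ranges(AcpiDeviceIf *adev, MachineState *machine)
    {
    }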

> 
> Thank you!
> 




* Re: [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI HMAT
  2019-06-06  3:00         ` Tao Xu
@ 2019-06-06 16:45           ` Igor Mammedov
  2019-06-10 13:39             ` Tao Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2019-06-06 16:45 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, Liu, Jingqi, qemu-devel, pbonzini, rth,
	ehabkost

On Thu, 6 Jun 2019 11:00:33 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> On 6/5/2019 8:12 PM, Igor Mammedov wrote:
> > On Wed, 5 Jun 2019 14:04:10 +0800
> > Tao Xu <tao3.xu@intel.com> wrote:
> >   
> >> On 6/4/2019 11:04 PM, Igor Mammedov wrote:  
> >>> On Wed,  8 May 2019 14:17:22 +0800
> >>> Tao Xu <tao3.xu@intel.com> wrote:
> >>>      
> ...
> >>>> +
> >>>> +                /* SMBIOS Handles */
> >>>> +                /* TBD: set smbios handles */
> >>>> +                build_append_int_noprefix(table_data, 0, 2 * n);  
> >>> Is memory side cache structure useful at all without pointing to SMBIOS entries?
> >>>      
> >> They are not useful yet, and the kernel 5.1 HMAT sysfs doesn't show
> >> SMBIOS entries. We can update it if it useful in the future.  
> > 
> > In that case I'd suggest to drop it for now until this table is properly
> > populated and ready for consumption. (i.e. drop this patch and corresponding
> > CLI 9/11 patch).
> >   
> 
> But the kernel HMAT can read othe Memory Side Cache Information except 
> SMBIOS entries and the host HMAT tables also haven’t SMBIOS Handles it 
> also shows Number of SMBIOS handles (n) as 0. So I am wondering if it is 
> better to setting "SMBIOS handles (n)" as 0, remove TODO and comment the 
> reason why set it 0?

My understanding is that SMBIOS handles are used to associate side cache
descriptions with the RAM pointed to by those handles, so that the OS would
be able to figure out which RAM modules are cached by which cache.
Hence I suspect that the side cache table is useless at best without
valid references to SMBIOS handles.
(I might be totally mistaken, but the matter requires clarification before
we commit to it.)



* Re: [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information
  2019-06-06 13:23       ` Eric Blake
@ 2019-06-06 16:50         ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-06-06 16:50 UTC (permalink / raw)
  To: Eric Blake
  Cc: xiaoguangrong.eric, mst, jingqi.liu, Tao Xu, qemu-devel,
	Markus Armbruster, pbonzini, rth, ehabkost

On Thu, 6 Jun 2019 08:23:47 -0500
Eric Blake <eblake@redhat.com> wrote:

> On 6/6/19 2:47 AM, Tao Xu wrote:
> > On 6/5/2019 10:40 PM, Igor Mammedov wrote:  
> >> On Wed,  8 May 2019 14:17:23 +0800
> >> Tao Xu <tao3.xu@intel.com> wrote:
> >>  
> >>> From: Liu Jingqi <jingqi.liu@intel.com>
> >>>
> >>> Add -numa hmat-lb option to provide System Locality Latency and
> >>> Bandwidth Information. These memory attributes help to build
> >>> System Locality Latency and Bandwidth Information Structure(s)
> >>> in ACPI Heterogeneous Memory Attribute Table (HMAT).
> >>>
> >>> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> >>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> >>> ---  
> > ...  
> >>> +##
> >>> +{ 'struct': 'NumaHmatLBOptions',
> >>> +  'data': {
> >>> +   'initiator': 'uint16',
> >>> +   'target': 'uint16',
> >>> +   'hierarchy': 'HmatLBMemoryHierarchy',
> >>> +   'data-type': 'HmatLBDataType',  
> >> I think union will be better here with data-type used as discriminator,
> >> on top of that you'll be able to drop a bit of error checking above since
> >> QAPI's union will not allow user to mix latency and bandwidth.
> >>  
> > Hi Igor,
> > 
> > I have quesion here, the 'hmat-lb' is a member of a union 'NumaOptions',
> > it seems can' use a union as a member of union.  
> 
> It should be technically possible to expand the QAPI generators to allow
> one union as a branch within another union, so long as there are no
> collisions in identifiers, if that makes for the smartest on-the-wire
> representation.

It would save quite a bit of boilerplate error checking in the numa code,
but since I don't know enough about QAPI to make a meaningful suggestion
on how to implement it, I won't insist on using a union.



* Re: [Qemu-devel] [PATCH v4 10/11] acpi: introduce build_acpi_aml_common for NFIT generalizations
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 10/11] acpi: introduce build_acpi_aml_common for NFIT generalizations Tao Xu
@ 2019-06-06 17:00   ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-06-06 17:00 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Wed,  8 May 2019 14:17:25 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> The aim of this patch is to move some of the NFIT Aml-build codes into
> build_acpi_aml_common(), and then NFIT and HMAT can both use it.
The function name is too generic, please name it so that it expresses what
it's doing.

The same applies to the commit message; from this one I have no idea what
is being done and why (even if it was me who suggested the change).
The commit message should describe what functionality is being generalized
and why, so that it is clear to anyone.

> Reviewed-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - Split 8/8 of patch v3 into two parts, introduces NFIT
>     generalizations (build_acpi_aml_common)
> ---
>  hw/acpi/nvdimm.c        | 49 +++++++++++++++++++++++++++--------------
>  include/hw/mem/nvdimm.h |  6 +++++
>  2 files changed, 38 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/acpi/nvdimm.c b/hw/acpi/nvdimm.c
> index 9fdad6dc3f..e2be79a8b7 100644
> --- a/hw/acpi/nvdimm.c
> +++ b/hw/acpi/nvdimm.c
> @@ -1140,12 +1140,11 @@ static void nvdimm_build_device_dsm(Aml *dev, uint32_t handle)
>  
>  static void nvdimm_build_fit(Aml *dev)
>  {
> -    Aml *method, *pkg, *buf, *buf_size, *offset, *call_result;
> -    Aml *whilectx, *ifcond, *ifctx, *elsectx, *fit;
> +    Aml *method, *pkg, *buf, *buf_name, *buf_size, *call_result;
>  
>      buf = aml_local(0);
>      buf_size = aml_local(1);
> -    fit = aml_local(2);
> +    buf_name = aml_local(2);
>  
>      aml_append(dev, aml_name_decl(NVDIMM_DSM_RFIT_STATUS, aml_int(0)));
>  
> @@ -1164,6 +1163,22 @@ static void nvdimm_build_fit(Aml *dev)
>                              aml_int(1) /* Revision 1 */,
>                              aml_int(0x1) /* Read FIT */,
>                              pkg, aml_int(NVDIMM_QEMU_RSVD_HANDLE_ROOT));
> +
> +    build_acpi_aml_common(method, buf, buf_size,
> +                          call_result, buf_name, dev,
> +                          "RFIT", "_FIT",
> +                          NVDIMM_DSM_RET_STATUS_SUCCESS,
> +                          NVDIMM_DSM_RET_STATUS_FIT_CHANGED);
> +}
> +
> +void build_acpi_aml_common(Aml *method, Aml *buf, Aml *buf_size,
> +                           Aml *call_result, Aml *buf_name, Aml *dev,
> +                           const char *help_function, const char *method_name,
> +                           int ret_status_success,
> +                           int ret_status_changed)
> +{
> +    Aml *offset, *whilectx, *ifcond, *ifctx, *elsectx;
> +
>      aml_append(method, aml_store(call_result, buf));
>  
>      /* handle _DSM result. */
> @@ -1174,7 +1189,7 @@ static void nvdimm_build_fit(Aml *dev)
>                                   aml_name(NVDIMM_DSM_RFIT_STATUS)));
>  
>       /* if something is wrong during _DSM. */
> -    ifcond = aml_equal(aml_int(NVDIMM_DSM_RET_STATUS_SUCCESS),
> +    ifcond = aml_equal(aml_int(ret_status_success),
>                         aml_name("STAU"));
>      ifctx = aml_if(aml_lnot(ifcond));
>      aml_append(ifctx, aml_return(aml_buffer(0, NULL)));
> @@ -1185,7 +1200,7 @@ static void nvdimm_build_fit(Aml *dev)
>                                      aml_int(4) /* the size of "STAU" */,
>                                      buf_size));
>  
> -    /* if we read the end of fit. */
> +    /* if we read the end of fit or hma. */
>      ifctx = aml_if(aml_equal(buf_size, aml_int(0)));
>      aml_append(ifctx, aml_return(aml_buffer(0, NULL)));
>      aml_append(method, ifctx);
> @@ -1196,38 +1211,38 @@ static void nvdimm_build_fit(Aml *dev)
>      aml_append(method, aml_return(aml_name("BUFF")));
>      aml_append(dev, method);
>  
> -    /* build _FIT. */
> -    method = aml_method("_FIT", 0, AML_SERIALIZED);
> +    /* build _FIT or _HMA. */
> +    method = aml_method(method_name, 0, AML_SERIALIZED);
>      offset = aml_local(3);
>  
> -    aml_append(method, aml_store(aml_buffer(0, NULL), fit));
> +    aml_append(method, aml_store(aml_buffer(0, NULL), buf_name));
>      aml_append(method, aml_store(aml_int(0), offset));
>  
>      whilectx = aml_while(aml_int(1));
> -    aml_append(whilectx, aml_store(aml_call1("RFIT", offset), buf));
> +    aml_append(whilectx, aml_store(aml_call1(help_function, offset), buf));
>      aml_append(whilectx, aml_store(aml_sizeof(buf), buf_size));
>  
>      /*
> -     * if fit buffer was changed during RFIT, read from the beginning
> -     * again.
> +     * if buffer was changed during RFIT or RHMA,
> +     * read from the beginning again.
>       */
>      ifctx = aml_if(aml_equal(aml_name(NVDIMM_DSM_RFIT_STATUS),
> -                             aml_int(NVDIMM_DSM_RET_STATUS_FIT_CHANGED)));
> -    aml_append(ifctx, aml_store(aml_buffer(0, NULL), fit));
> +                             aml_int(ret_status_changed)));
> +    aml_append(ifctx, aml_store(aml_buffer(0, NULL), buf_name));
>      aml_append(ifctx, aml_store(aml_int(0), offset));
>      aml_append(whilectx, ifctx);
>  
>      elsectx = aml_else();
>  
> -    /* finish fit read if no data is read out. */
> +    /* finish fit or hma read if no data is read out. */
>      ifctx = aml_if(aml_equal(buf_size, aml_int(0)));
> -    aml_append(ifctx, aml_return(fit));
> +    aml_append(ifctx, aml_return(buf_name));
>      aml_append(elsectx, ifctx);
>  
>      /* update the offset. */
>      aml_append(elsectx, aml_add(offset, buf_size, offset));
> -    /* append the data we read out to the fit buffer. */
> -    aml_append(elsectx, aml_concatenate(fit, buf, fit));
> +    /* append the data we read out to the fit or hma buffer. */
> +    aml_append(elsectx, aml_concatenate(buf_name, buf, buf_name));
>      aml_append(whilectx, elsectx);
>      aml_append(method, whilectx);
>  
> diff --git a/include/hw/mem/nvdimm.h b/include/hw/mem/nvdimm.h
> index 523a9b3d4a..6f04eddb40 100644
> --- a/include/hw/mem/nvdimm.h
> +++ b/include/hw/mem/nvdimm.h
> @@ -25,6 +25,7 @@
>  
>  #include "hw/mem/pc-dimm.h"
>  #include "hw/acpi/bios-linker-loader.h"
> +#include "hw/acpi/aml-build.h"
>  
>  #define NVDIMM_DEBUG 0
>  #define nvdimm_debug(fmt, ...)                                \
> @@ -150,4 +151,9 @@ void nvdimm_build_acpi(GArray *table_offsets, GArray *table_data,
>                         uint32_t ram_slots);
>  void nvdimm_plug(NVDIMMState *state);
>  void nvdimm_acpi_plug_cb(HotplugHandler *hotplug_dev, DeviceState *dev);
> +void build_acpi_aml_common(Aml *method, Aml *buf, Aml *buf_size,
> +                           Aml *call_result, Aml *buf_name, Aml *dev,
> +                           const char *help_function, const char *method_name,
> +                           int ret_status_success,
> +                           int ret_status_changed);
>  #endif




* Re: [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI HMAT
  2019-06-06 16:45           ` Igor Mammedov
@ 2019-06-10 13:39             ` Tao Xu
  2019-06-16 19:41               ` Igor Mammedov
  0 siblings, 1 reply; 38+ messages in thread
From: Tao Xu @ 2019-06-10 13:39 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: xiaoguangrong.eric, mst, Liu, Jingqi, qemu-devel, pbonzini, rth,
	ehabkost

On 6/7/2019 12:45 AM, Igor Mammedov wrote:
> On Thu, 6 Jun 2019 11:00:33 +0800
> Tao Xu <tao3.xu@intel.com> wrote:
> 
...
>>
>> But the kernel HMAT can read othe Memory Side Cache Information except
>> SMBIOS entries and the host HMAT tables also haven’t SMBIOS Handles it
>> also shows Number of SMBIOS handles (n) as 0. So I am wondering if it is
>> better to setting "SMBIOS handles (n)" as 0, remove TODO and comment the
>> reason why set it 0?
> 
> My understanding is that SMBIOS handles are used to associate side cache
> descriptions with RAM pointed by SMBIOS handles, so that OS would be
> able to figure out what RAM modules are cached by what cache.
> Hence I suspect that side cache table is useless in the best case without
> valid references to SMBIOS handles.
> (I might be totally mistaken but the matter requires clarification before
> we commit to it)
> 

I am sorry for not providing a detailed description of the Memory Side
Cache use case. I will add a more detailed description in the next version
of the patch.

As the commit message and the kernel's
/Documentation/admin-guide/mm/numaperf.rst (quoted below) describe, the
Memory Side Cache Structure provides cache information about system memory
for software to use. Software can then maximize performance because it can
choose the best node to use.

The Memory Side Cache Information Structure and the System Locality Latency
and Bandwidth Information Structure both provide more information than the
NUMA distance for software to see. Back to SMBIOS: in the spec, SMBIOS
handles point to the memory side cache physical devices, but they are
informational only and do not contribute to the performance of the
described memory. The field "Proximity Domain for the Memory" already
identifies the described memory.

I am wondering whether this explanation is clear. Thank you.

"System memory may be constructed in a hierarchy of elements with 
various performance characteristics in order to provide large address 
space of slower performing memory cached by a smaller higher performing 
memory."

"An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this 
information, for example with ACPI HMAT (Heterogeneous Memory Attribute 
Table), the kernel will append these attributes to the NUMA node memory 
target."

"Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::

	# tree sys/devices/system/node/node0/memory_side_cache/
	/sys/devices/system/node/node0/memory_side_cache/
	|-- index1
	|   |-- indexing
	|   |-- line_size
	|   |-- size
	|   `-- write_policy
"



* Re: [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache Information Structure(s) in ACPI HMAT
  2019-06-10 13:39             ` Tao Xu
@ 2019-06-16 19:41               ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-06-16 19:41 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, Liu, Jingqi, qemu-devel, ehabkost,
	pbonzini, rth

On Mon, 10 Jun 2019 21:39:12 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> On 6/7/2019 12:45 AM, Igor Mammedov wrote:
> > On Thu, 6 Jun 2019 11:00:33 +0800
> > Tao Xu <tao3.xu@intel.com> wrote:
> >   
> ...
> >>
> >> But the kernel HMAT can read othe Memory Side Cache Information except
> >> SMBIOS entries and the host HMAT tables also haven’t SMBIOS Handles it
> >> also shows Number of SMBIOS handles (n) as 0. So I am wondering if it is
> >> better to setting "SMBIOS handles (n)" as 0, remove TODO and comment the
> >> reason why set it 0?  
> > 
> > My understanding is that SMBIOS handles are used to associate side cache
> > descriptions with RAM pointed by SMBIOS handles, so that OS would be
> > able to figure out what RAM modules are cached by what cache.
> > Hence I suspect that side cache table is useless in the best case without
> > valid references to SMBIOS handles.
> > (I might be totally mistaken but the matter requires clarification before
> > we commit to it)
> >   
> 
> I am sorry for not providing a detailed description for Memory Side 
> Cache use case. I will add more detailed description in next version of 
> patch.
> 
> As the commit message and /Documentation/admin-guide/mm/numaperf.rst of 
> Kernel HMAT(listed blow), Memory Side Cache Structure is used to provide 
> the cache information about System memory for the software to use. Then 
> the software can maximize the performance because it can choose the best 
> node to use.
> 
> Memory Side Cache Information Structure and System Locality Latency and 
> Bandwidth Information Structure can both provide more information than 
> numa distance for software to see. So back to the SMBIOS, in spec, 
> SMBIOS handles point to the memory side cache physical devices, but they 
> are also information and not contribute to the performance of the 
> described memory. The field "Proximity Domain for the Memory" can show 
> the described memory.
> 
> I am wondering if this explanation is clear? Thank you.

I didn't manage to find a definite answer in the spec as to what the SMBIOS
entry should describe. Another use of 'Physical Memory Component' is in the
PMTT table, and it looks to me like a type 17 entry should refer to a DIMM
device.

But well, considering the spec isn't clear on the subject and the Linux
kernel doesn't seem to use these entries, let's use the structure without
SMBIOS entries for now. Like you suggested, let's set the number of SMBIOS
handles to 0 and drop num_smbios_handles so that the user won't be able to
provide any.
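i.e. in hmat_build_hma() roughly (untested sketch):

    /* Length: fixed at 32 bytes since no SMBIOS Handles list follows */
    build_append_int_noprefix(table_data, 32, 4);
    ...
    /* Number of SMBIOS handles (n) */
    build_append_int_noprefix(table_data, 0, 2);
    /* no SMBIOS Handles are emitted for now */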


> "System memory may be constructed in a hierarchy of elements with 
> various performance characteristics in order to provide large address 
> space of slower performing memory cached by a smaller higher performing 
> memory."
> 
> "An application does not need to know about caching attributes in order
> to use the system. Software may optionally query the memory cache
> attributes in order to maximize the performance out of such a setup.
> If the system provides a way for the kernel to discover this 
> information, for example with ACPI HMAT (Heterogeneous Memory Attribute 
> Table), the kernel will append these attributes to the NUMA node memory 
> target."
> 
> "Each cache level's directory provides its attributes. For example, the
> following shows a single cache level and the attributes available for
> software to query::
> 
> 	# tree sys/devices/system/node/node0/memory_side_cache/
> 	/sys/devices/system/node/node0/memory_side_cache/
> 	|-- index1
> 	|   |-- indexing
> 	|   |-- line_size
> 	|   |-- size
> 	|   `-- write_policy
> "
> 



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 09/11] numa: Extend the command-line to provide memory side cache information
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 09/11] numa: Extend the command-line to provide memory side cache information Tao Xu
@ 2019-06-16 19:52   ` Igor Mammedov
  0 siblings, 0 replies; 38+ messages in thread
From: Igor Mammedov @ 2019-06-16 19:52 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, pbonzini, rth, ehabkost

On Wed,  8 May 2019 14:17:24 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> Add -numa hmat-cache option to provide Memory Side Cache Information.
> These memory attributes help to build Memory Side Cache Information
> Structure(s) in ACPI Heterogeneous Memory Attribute Table (HMAT).
> 
> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - update the version tag from 4.0 to 4.1
> ---
>  numa.c         | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  qapi/misc.json | 72 ++++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 145 insertions(+), 2 deletions(-)
> 
> diff --git a/numa.c b/numa.c
> index 1aecb7a2e9..4866736fc8 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -300,6 +300,75 @@ static void parse_numa_hmat_lb(MachineState *ms, NumaHmatLBOptions *node,
>      }
>  }
>  
> +static
> +void parse_numa_hmat_cache(MachineState *ms, NumaHmatCacheOptions *node,
> +                            Error **errp)
> +{
> +    int nb_numa_nodes = ms->numa_state->num_nodes;
> +    HMAT_Cache_Info *hmat_cache = NULL;
> +
> +    if (node->node_id >= nb_numa_nodes) {
> +        error_setg(errp, "Invalid node-id=%" PRIu32
> +                   ", it should be less than %d.",
> +                   node->node_id, nb_numa_nodes);
> +        return;
> +    }
> +    if (!ms->numa_state->nodes[node->node_id].is_target) {
> +        error_setg(errp, "Invalid node-id=%" PRIu32
> +                   ", it isn't a target proximity domain.",
> +                   node->node_id);
> +        return;
> +    }
> +
> +    if (node->total > MAX_HMAT_CACHE_LEVEL) {
> +        error_setg(errp, "Invalid total=%" PRIu8
> +                   ", it should be less than or equal to %d.",
> +                   node->total, MAX_HMAT_CACHE_LEVEL);
> +        return;
> +    }
> +    if (node->level > node->total) {
> +        error_setg(errp, "Invalid level=%" PRIu8
> +                   ", it should be less than or equal to"
> +                   " total=%" PRIu8 ".",
> +                   node->level, node->total);
> +        return;
> +    }
> +    if (ms->numa_state->hmat_cache[node->node_id][node->level]) {
> +        error_setg(errp, "Duplicate configuration of the side cache for "
> +                   "node-id=%" PRIu32 " and level=%" PRIu8 ".",
> +                   node->node_id, node->level);
> +        return;
> +    }
> +
> +    if ((node->level > 1) &&
> +        ms->numa_state->hmat_cache[node->node_id][node->level - 1] &&
> +        (node->size >=
> +            ms->numa_state->hmat_cache[node->node_id][node->level - 1]->size)) {
> +        error_setg(errp, "Invalid size=0x%" PRIx64
> +                   ", the size of level=%" PRIu8
> +                   " should be less than the size(0x%" PRIx64
> +                   ") of level=%" PRIu8 ".",
> +                   node->size, node->level,
> +                   ms->numa_state->hmat_cache[node->node_id]
> +                                             [node->level - 1]->size,
> +                   node->level - 1);
> +        return;
> +    }
> +
> +    hmat_cache = g_malloc0(sizeof(*hmat_cache));
> +
> +    hmat_cache->mem_proximity = node->node_id;
> +    hmat_cache->size = node->size;
> +    hmat_cache->total_levels = node->total;
> +    hmat_cache->level = node->level;
> +    hmat_cache->associativity = node->assoc;
> +    hmat_cache->write_policy = node->policy;
> +    hmat_cache->line_size = node->line;
> +    hmat_cache->num_smbios_handles = 0;
> +
> +    ms->numa_state->hmat_cache[node->node_id][node->level] = hmat_cache;
> +}
> +
>  static
>  void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
>  {
> @@ -344,6 +413,12 @@ void set_numa_options(MachineState *ms, NumaOptions *object, Error **errp)
>              goto end;
>          }
>          break;
> +    case NUMA_OPTIONS_TYPE_HMAT_CACHE:
> +        parse_numa_hmat_cache(ms, &object->u.hmat_cache, &err);
> +        if (err) {
> +            goto end;
> +        }
> +        break;
>      default:
>          abort();
>      }
> diff --git a/qapi/misc.json b/qapi/misc.json
> index d7fce75702..2b7e34b469 100644
> --- a/qapi/misc.json
> +++ b/qapi/misc.json
> @@ -2541,10 +2541,12 @@
>  #
>  # @hmat-lb: memory latency and bandwidth information (Since: 4.1)
>  #
> +# @hmat-cache: memory side cache information (Since: 4.1)
> +#
>  # Since: 2.1
>  ##
>  { 'enum': 'NumaOptionsType',
> -  'data': [ 'node', 'dist', 'cpu', 'hmat-lb' ] }
> +   'data': [ 'node', 'dist', 'cpu', 'hmat-lb', 'hmat-cache' ] }
stray whitespace in front???


>  ##
>  # @NumaOptions:
> @@ -2560,7 +2562,8 @@
>      'node': 'NumaNodeOptions',
>      'dist': 'NumaDistOptions',
>      'cpu': 'NumaCpuOptions',
> -    'hmat-lb': 'NumaHmatLBOptions' }}
> +    'hmat-lb': 'NumaHmatLBOptions',
> +    'hmat-cache': 'NumaHmatCacheOptions' }}
>  
>  ##
>  # @NumaNodeOptions:
> @@ -2710,6 +2713,71 @@
>     '*latency': 'uint16',
>     '*bandwidth': 'uint16' }}
>  
> +##
> +# @HmatCacheAssociativity:
> +#
> +# Cache associativity in the Memory Side Cache
> +# Information Structure of HMAT
> +#
> +# @none: None
> +#
> +# @direct: Direct Mapped
> +#
> +# @complex: Complex Cache Indexing (implementation specific)
It would be good to add a reference to the spec, as we do for the ACPI
API functions, so that the reader knows where to look for the values and
their meaning.

PS:
This applies to all fields that come from the spec (in this and the
previous patches that add QAPI structures).

> +#
> +# Since: 4.1
> +##
> +{ 'enum': 'HmatCacheAssociativity',
> +  'data': [ 'none', 'direct', 'complex' ] }
> +
> +##
> +# @HmatCacheWritePolicy:
> +#
> +# Cache write policy in the Memory Side Cache
> +# Information Structure of HMAT
> +#
> +# @none: None
> +#
> +# @write-back: Write Back (WB)
> +#
> +# @write-through: Write Through (WT)
> +#
> +# Since: 4.1
> +##
> +{ 'enum': 'HmatCacheWritePolicy',
> +  'data': [ 'none', 'write-back', 'write-through' ] }
> +
> +##
> +# @NumaHmatCacheOptions:
> +#
> +# Set the memory side cache information for a given memory domain.
> +#
> +# @node-id: the memory proximity domain to which the memory belongs.
> +#
> +# @size: the size of memory side cache in bytes.
> +#
> +# @total: the total cache levels for this memory proximity domain.
> +#
> +# @level: the cache level described in this structure.
> +#
> +# @assoc: the cache associativity, none/direct-mapped/complex(complex cache indexing).
> +
> +# @policy: the write policy, none/write-back/write-through.
> +#
> +# @line: the cache Line size in bytes.
> +#
> +# Since: 4.1
> +##
> +{ 'struct': 'NumaHmatCacheOptions',
> +  'data': {
> +   'node-id': 'uint32',
> +   'size': 'size',
> +   'total': 'uint8',
> +   'level': 'uint8',
> +   'assoc': 'HmatCacheAssociativity',
> +   'policy': 'HmatCacheWritePolicy',
> +   'line': 'uint16' }}
> +
>  ##
>  # @HostMemPolicy:
>  #



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 11/11] hmat acpi: Implement _HMA method to update HMAT at runtime
  2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 11/11] hmat acpi: Implement _HMA method to update HMAT at runtime Tao Xu
@ 2019-06-16 20:07   ` Igor Mammedov
  2019-06-17  7:19     ` Tao Xu
  0 siblings, 1 reply; 38+ messages in thread
From: Igor Mammedov @ 2019-06-16 20:07 UTC (permalink / raw)
  To: Tao Xu
  Cc: xiaoguangrong.eric, mst, jingqi.liu, qemu-devel, ehabkost, pbonzini, rth

On Wed,  8 May 2019 14:17:26 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> From: Liu Jingqi <jingqi.liu@intel.com>
> 
> OSPM evaluates HMAT only during system initialization.
> Any changes to the HMAT state at runtime or information
> regarding HMAT for hot plug are communicated using _HMA method.
> 
> _HMA is an optional object that enables the platform to provide
> the OS with updated Heterogeneous Memory Attributes information
> at runtime. _HMA provides OSPM with the latest HMAT in entirety
> overriding existing HMAT.

It seems that there isn't any user interface to actually introduce new
HMAT data at runtime. If so, let's drop 10-11/11 for now; you can add
them back later when/if you add a QMP interface to update/replace the
HMAT at runtime.

> Signed-off-by: Liu Jingqi <jingqi.liu@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> Changes in v4 -> v3:
>     - move AcpiHmaState from PCMachineState to MachineState
>     to make HMAT more generalic (Igor)
>     - use build_acpi_aml_common() introduced in patch 10/11 to
>     simplify hmat_build_aml (Igor)
> ---
>  hw/acpi/hmat.c          | 296 ++++++++++++++++++++++++++++++++++++++++
>  hw/acpi/hmat.h          |  72 ++++++++++
>  hw/core/machine.c       |   3 +
>  hw/i386/acpi-build.c    |   2 +
>  hw/i386/pc.c            |   3 +
>  hw/i386/pc_piix.c       |   4 +
>  hw/i386/pc_q35.c        |   4 +
>  include/hw/boards.h     |   1 +
>  include/qemu/typedefs.h |   1 +
>  9 files changed, 386 insertions(+)
> 
> diff --git a/hw/acpi/hmat.c b/hw/acpi/hmat.c
> index 3a8c41162d..bc2dffd079 100644
> --- a/hw/acpi/hmat.c
> +++ b/hw/acpi/hmat.c
> @@ -28,6 +28,7 @@
>  #include "hw/i386/pc.h"
>  #include "hw/acpi/hmat.h"
>  #include "hw/nvram/fw_cfg.h"
> +#include "hw/mem/nvdimm.h"
>  
>  static uint32_t initiator_pxm[MAX_NODES], target_pxm[MAX_NODES];
>  static uint32_t num_initiator, num_target;
> @@ -262,6 +263,270 @@ static void hmat_build_hma(GArray *table_data, MachineState *ms)
>      }
>  }
>  
> +static uint64_t
> +hmat_hma_method_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    printf("BUG: we never read _HMA IO Port.\n");
What would real hardware do in this case?
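
One common pattern for such never-expected reads (sketched here only as a
point of comparison, not as a requirement for this series) is to report a
guest error instead of calling printf():

    #include "qemu/osdep.h"
    #include "qemu/log.h"
    #include "exec/hwaddr.h"

    static uint64_t
    hmat_hma_method_read(void *opaque, hwaddr addr, unsigned size)
    {
        /* Reads of this IO port are never expected; log and return 0. */
        qemu_log_mask(LOG_GUEST_ERROR,
                      "hmat: unexpected %u-byte read from _HMA port 0x%"
                      HWADDR_PRIx "\n", size, addr);
        return 0;
    }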

> +    return 0;
> +}
> +
> +/* _HMA Method: read HMA data. */
> +static void hmat_handle_hma_method(AcpiHmaState *state,
> +                                   HmatHmamIn *in, hwaddr hmam_mem_addr)
> +{
> +    HmatHmaBuffer *hma_buf = &state->hma_buf;
> +    HmatHmamOut *read_hma_out;
> +    GArray *hma;
> +    uint32_t read_len = 0, ret_status;
> +    int size;
> +
> +    if (in != NULL) {
> +        le32_to_cpus(&in->offset);
> +    }
> +
> +    hma = hma_buf->hma;
> +    if (in->offset > hma->len) {
> +        ret_status = HMAM_RET_STATUS_INVALID;
> +        goto exit;
> +    }
> +
> +   /* It is the first time to read HMA. */
> +    if (!in->offset) {
> +        hma_buf->dirty = false;
> +    } else if (hma_buf->dirty) {
> +        /* HMA has been changed during Reading HMA. */
> +        ret_status = HMAM_RET_STATUS_HMA_CHANGED;
> +        goto exit;
> +    }
> +
> +    ret_status = HMAM_RET_STATUS_SUCCESS;
> +    read_len = MIN(hma->len - in->offset,
> +                   HMAM_MEMORY_SIZE - 2 * sizeof(uint32_t));
> +exit:
> +    size = sizeof(HmatHmamOut) + read_len;
> +    read_hma_out = g_malloc(size);
> +
> +    read_hma_out->len = cpu_to_le32(size);
> +    read_hma_out->ret_status = cpu_to_le32(ret_status);
> +    memcpy(read_hma_out->data, hma->data + in->offset, read_len);
> +
> +    cpu_physical_memory_write(hmam_mem_addr, read_hma_out, size);
> +
> +    g_free(read_hma_out);
> +}
> +
> +static void
> +hmat_hma_method_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
> +{
> +    AcpiHmaState *state = opaque;
> +    hwaddr hmam_mem_addr = val;
> +    HmatHmamIn *in;
> +
> +    in = g_new(HmatHmamIn, 1);
> +    cpu_physical_memory_read(hmam_mem_addr, in, sizeof(*in));
> +
> +    hmat_handle_hma_method(state, in, hmam_mem_addr);
> +}
> +
> +static const MemoryRegionOps hmat_hma_method_ops = {
> +    .read = hmat_hma_method_read,
> +    .write = hmat_hma_method_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +    .valid = {
> +        .min_access_size = 4,
> +        .max_access_size = 4,
> +    },
> +};
> +
> +static void hmat_init_hma_buffer(HmatHmaBuffer *hma_buf)
> +{
> +    hma_buf->hma = g_array_new(false, true /* clear */, 1);
> +}
> +
> +static uint8_t hmat_acpi_table_checksum(uint8_t *buffer, uint32_t length)
> +{
> +    uint8_t sum = 0;
> +    uint8_t *end = buffer + length;
> +
> +    while (buffer < end) {
> +        sum = (uint8_t) (sum + *(buffer++));
> +    }
> +    return (uint8_t)(0 - sum);
> +}
> +
> +static void hmat_build_header(AcpiTableHeader *h,
> +             const char *sig, int len, uint8_t rev,
> +             const char *oem_id, const char *oem_table_id)
> +{
> +    memcpy(&h->signature, sig, 4);
> +    h->length = cpu_to_le32(len);
> +    h->revision = rev;
> +
> +    if (oem_id) {
> +        strncpy((char *)h->oem_id, oem_id, sizeof h->oem_id);
> +    } else {
> +        memcpy(h->oem_id, ACPI_BUILD_APPNAME6, 6);
> +    }
> +
> +    if (oem_table_id) {
> +        strncpy((char *)h->oem_table_id, oem_table_id, sizeof(h->oem_table_id));
> +    } else {
> +        memcpy(h->oem_table_id, ACPI_BUILD_APPNAME4, 4);
> +        memcpy(h->oem_table_id + 4, sig, 4);
> +    }
> +
> +    h->oem_revision = cpu_to_le32(1);
> +    memcpy(h->asl_compiler_id, ACPI_BUILD_APPNAME4, 4);
> +    h->asl_compiler_revision = cpu_to_le32(1);
> +
> +    /* Caculate the checksum of acpi table. */
> +    h->checksum = 0;
> +    h->checksum = hmat_acpi_table_checksum((uint8_t *)h, len);
> +}
> +
> +static void hmat_build_hma_buffer(MachineState *ms)
> +{
> +    HmatHmaBuffer *hma_buf = &(ms->acpi_hma_state->hma_buf);
> +
> +    /* Free the old hma buffer before new allocation. */
> +    g_array_free(hma_buf->hma, true);
> +
> +    hma_buf->hma = g_array_new(false, true /* clear */, 1);
> +    acpi_data_push(hma_buf->hma, 40);
> +
> +    /* build HMAT in a given buffer. */
> +    hmat_build_hma(hma_buf->hma, ms);
> +    hmat_build_header((void *)hma_buf->hma->data,
> +                      "HMAT", hma_buf->hma->len, 1, NULL, NULL);
> +    hma_buf->dirty = true;
> +}
> +
> +static void hmat_build_common_aml(Aml *dev)
> +{
> +    Aml *method, *ifctx, *hmam_mem;
> +    Aml *unsupport;
> +    Aml *pckg, *pckg_index, *pckg_buf, *field;
> +    Aml *hmam_out_buf, *hmam_out_buf_size;
> +    uint8_t byte_list[1];
> +
> +    method = aml_method(HMA_COMMON_METHOD, 1, AML_SERIALIZED);
> +    hmam_mem = aml_local(6);
> +    hmam_out_buf = aml_local(7);
> +
> +    aml_append(method, aml_store(aml_name(HMAM_ACPI_MEM_ADDR), hmam_mem));
> +
> +    /* map _HMA memory and IO into ACPI namespace. */
> +    aml_append(method, aml_operation_region(HMAM_IOPORT, AML_SYSTEM_IO,
> +               aml_int(HMAM_ACPI_IO_BASE), HMAM_ACPI_IO_LEN));
> +    aml_append(method, aml_operation_region(HMAM_MEMORY,
> +               AML_SYSTEM_MEMORY, hmam_mem, HMAM_MEMORY_SIZE));
> +
> +    /*
> +     * _HMAC notifier:
> +     * HMAM_NOTIFY: write the address of DSM memory and notify QEMU to
> +     *                    emulate the access.
> +     *
> +     * It is the IO port so that accessing them will cause VM-exit, the
> +     * control will be transferred to QEMU.
> +     */
> +    field = aml_field(HMAM_IOPORT, AML_DWORD_ACC, AML_NOLOCK,
> +                      AML_PRESERVE);
> +    aml_append(field, aml_named_field(HMAM_NOTIFY,
> +               sizeof(uint32_t) * BITS_PER_BYTE));
> +    aml_append(method, field);
> +
> +    /*
> +     * _HMAC input:
> +     * HMAM_OFFSET: store the current offset of _HMA buffer.
> +     *
> +     * They are RAM mapping on host so that these accesses never cause VMExit.
> +     */
> +    field = aml_field(HMAM_MEMORY, AML_DWORD_ACC, AML_NOLOCK,
> +                      AML_PRESERVE);
> +    aml_append(field, aml_named_field(HMAM_OFFSET,
> +               sizeof(typeof_field(HmatHmamIn, offset)) * BITS_PER_BYTE));
> +    aml_append(method, field);
> +
> +    /*
> +     * _HMAC output:
> +     * HMAM_OUT_BUF_SIZE: the size of the buffer filled by QEMU.
> +     * HMAM_OUT_BUF: the buffer QEMU uses to store the result.
> +     *
> +     * Since the page is reused by both input and out, the input data
> +     * will be lost after storing new result into ODAT so we should fetch
> +     * all the input data before writing the result.
> +     */
> +    field = aml_field(HMAM_MEMORY, AML_DWORD_ACC, AML_NOLOCK,
> +                      AML_PRESERVE);
> +    aml_append(field, aml_named_field(HMAM_OUT_BUF_SIZE,
> +               sizeof(typeof_field(HmatHmamOut, len)) * BITS_PER_BYTE));
> +    aml_append(field, aml_named_field(HMAM_OUT_BUF,
> +       (sizeof(HmatHmamOut) - sizeof(uint32_t)) * BITS_PER_BYTE));
> +    aml_append(method, field);
> +
> +    /*
> +     * do not support any method if HMA memory address has not been
> +     * patched.
> +     */
> +    unsupport = aml_if(aml_equal(hmam_mem, aml_int(0x0)));
> +    byte_list[0] = HMAM_RET_STATUS_UNSUPPORT;
> +    aml_append(unsupport, aml_return(aml_buffer(1, byte_list)));
> +    aml_append(method, unsupport);
> +
> +    /* The parameter (Arg0) of _HMAC is a package which contains a buffer. */
> +    pckg = aml_arg(0);
> +    ifctx = aml_if(aml_and(aml_equal(aml_object_type(pckg),
> +                   aml_int(4 /* Package */)) /* It is a Package? */,
> +                   aml_equal(aml_sizeof(pckg), aml_int(1)) /* 1 element */,
> +                   NULL));
> +
> +    pckg_index = aml_local(2);
> +    pckg_buf = aml_local(3);
> +    aml_append(ifctx, aml_store(aml_index(pckg, aml_int(0)), pckg_index));
> +    aml_append(ifctx, aml_store(aml_derefof(pckg_index), pckg_buf));
> +    aml_append(ifctx, aml_store(pckg_buf, aml_name(HMAM_OFFSET)));
> +    aml_append(method, ifctx);
> +
> +    /*
> +     * tell QEMU about the real address of HMA memory, then QEMU
> +     * gets the control and fills the result in _HMAC memory.
> +     */
> +    aml_append(method, aml_store(hmam_mem, aml_name(HMAM_NOTIFY)));
> +
> +    hmam_out_buf_size = aml_local(1);
> +    /* RLEN is not included in the payload returned to guest. */
> +    aml_append(method, aml_subtract(aml_name(HMAM_OUT_BUF_SIZE),
> +                                aml_int(4), hmam_out_buf_size));
> +    aml_append(method, aml_store(aml_shiftleft(hmam_out_buf_size, aml_int(3)),
> +                                 hmam_out_buf_size));
> +    aml_append(method, aml_create_field(aml_name(HMAM_OUT_BUF),
> +                                aml_int(0), hmam_out_buf_size, "OBUF"));
> +    aml_append(method, aml_concatenate(aml_buffer(0, NULL), aml_name("OBUF"),
> +                                hmam_out_buf));
> +    aml_append(method, aml_return(hmam_out_buf));
> +    aml_append(dev, method);
> +}
> +
> +void hmat_init_acpi_state(AcpiHmaState *state, MemoryRegion *io,
> +                          FWCfgState *fw_cfg, Object *owner)
> +{
> +    memory_region_init_io(&state->io_mr, owner, &hmat_hma_method_ops, state,
> +                          "hma-acpi-io", HMAM_ACPI_IO_LEN);
> +    memory_region_add_subregion(io, HMAM_ACPI_IO_BASE, &state->io_mr);
> +
> +    state->hmam_mem = g_array_new(false, true /* clear */, 1);
> +    fw_cfg_add_file(fw_cfg, HMAM_MEM_FILE, state->hmam_mem->data,
> +                    state->hmam_mem->len);
> +
> +    hmat_init_hma_buffer(&state->hma_buf);
> +}
> +
> +void hmat_update(MachineState *ms)
> +{
> +    /* build HMAT in a given buffer. */
> +    hmat_build_hma_buffer(ms);
> +}
> +
>  void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
>  {
>      uint64_t hmat_start, hmat_len;
> @@ -276,3 +541,34 @@ void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms)
>                   (void *)(table_data->data + hmat_start),
>                   "HMAT", hmat_len, 1, NULL, NULL);
>  }
> +
> +void hmat_build_aml(Aml *dev)
> +{
> +    Aml *method, *pkg, *buf, *buf_name, *buf_size, *call_result;
> +
> +    hmat_build_common_aml(dev);
> +
> +    buf = aml_local(0);
> +    buf_size = aml_local(1);
> +    buf_name = aml_local(2);
> +
> +    aml_append(dev, aml_name_decl(HMAM_RHMA_STATUS, aml_int(0)));
> +
> +    /* build helper function, RHMA. */
> +    method = aml_method("RHMA", 1, AML_SERIALIZED);
> +    aml_append(method, aml_name_decl("OFST", aml_int(0)));
> +
> +    /* prepare input package. */
> +    pkg = aml_package(1);
> +    aml_append(method, aml_store(aml_arg(0), aml_name("OFST")));
> +    aml_append(pkg, aml_name("OFST"));
> +
> +    /* call Read HMA function. */
> +    call_result = aml_call1(HMA_COMMON_METHOD, pkg);
> +
> +    build_acpi_aml_common(method, buf, buf_size,
> +                          call_result, buf_name, dev,
> +                          "RHMA", "_HMA",
> +                          HMAM_RET_STATUS_SUCCESS,
> +                          HMAM_RET_STATUS_HMA_CHANGED);
> +}
> diff --git a/hw/acpi/hmat.h b/hw/acpi/hmat.h
> index 8f563f19dd..7b24a3327f 100644
> --- a/hw/acpi/hmat.h
> +++ b/hw/acpi/hmat.h
> @@ -102,6 +102,78 @@ struct HMAT_Cache_Info {
>      uint16_t    num_smbios_handles;
>  };
>  
> +#define HMAM_MEMORY_SIZE    4096
> +#define HMAM_MEM_FILE       "etc/acpi/hma-mem"
> +
> +/*
> + * 32 bits IO port starting from 0x0a19 in guest is reserved for
> + * HMA ACPI emulation.
> + */
> +#define HMAM_ACPI_IO_BASE     0x0a19
> +#define HMAM_ACPI_IO_LEN      4
> +
> +#define HMAM_ACPI_MEM_ADDR  "HMTA"
> +#define HMAM_MEMORY         "HRAM"
> +#define HMAM_IOPORT         "HPIO"
> +
> +#define HMAM_NOTIFY         "NTFI"
> +#define HMAM_OUT_BUF_SIZE   "RLEN"
> +#define HMAM_OUT_BUF        "ODAT"
> +
> +#define HMAM_RHMA_STATUS    "RSTA"
> +#define HMA_COMMON_METHOD   "HMAC"
> +#define HMAM_OFFSET         "OFFT"
> +
> +#define HMAM_RET_STATUS_SUCCESS        0 /* Success */
> +#define HMAM_RET_STATUS_UNSUPPORT      1 /* Not Supported */
> +#define HMAM_RET_STATUS_INVALID        2 /* Invalid Input Parameters */
> +#define HMAM_RET_STATUS_HMA_CHANGED    0x100 /* HMA Changed */
> +
> +/*
> + * HmatHmaBuffer:
> + * @hma: HMA buffer with the updated HMAT. It is updated when
> + *   the memory device is plugged or unplugged.
> + * @dirty: It allows OSPM to detect changes and restart read if there is any.
> + */
> +struct HmatHmaBuffer {
> +    GArray *hma;
> +    bool dirty;
> +};
> +typedef struct HmatHmaBuffer HmatHmaBuffer;
> +
> +struct AcpiHmaState {
> +    /* detect if HMA support is enabled. */
> +    bool is_enabled;
> +
> +    /* the data of the fw_cfg file HMAM_MEM_FILE. */
> +    GArray *hmam_mem;
> +
> +    HmatHmaBuffer hma_buf;
> +
> +    /* the IO region used by OSPM to transfer control to QEMU. */
> +    MemoryRegion io_mr;
> +};
> +
> +typedef struct AcpiHmaState AcpiHmaState;
> +
> +struct HmatHmamIn {
> +    /* the offset in the _HMA buffer */
> +    uint32_t offset;
> +} QEMU_PACKED;
> +typedef struct HmatHmamIn HmatHmamIn;
> +
> +struct HmatHmamOut {
> +    /* the size of buffer filled by QEMU. */
> +    uint32_t len;
> +    uint32_t ret_status;   /* return status code. */
> +    uint8_t data[4088];
> +} QEMU_PACKED;
> +typedef struct HmatHmamOut HmatHmamOut;
> +
>  void hmat_build_acpi(GArray *table_data, BIOSLinker *linker, MachineState *ms);
> +void hmat_build_aml(Aml *dsdt);
> +void hmat_init_acpi_state(AcpiHmaState *state, MemoryRegion *io,
> +                          FWCfgState *fw_cfg, Object *owner);
> +void hmat_update(MachineState *ms);
>  
>  #endif
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 90bebb8d3a..f4a6dc5b2e 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -23,6 +23,7 @@
>  #include "sysemu/qtest.h"
>  #include "hw/pci/pci.h"
>  #include "hw/mem/nvdimm.h"
> +#include "hw/acpi/hmat.h"
>  
>  GlobalProperty hw_compat_4_0[] = {};
>  const size_t hw_compat_4_0_len = G_N_ELEMENTS(hw_compat_4_0);
> @@ -859,6 +860,7 @@ static void machine_initfn(Object *obj)
>  
>      if (mc->numa_supported) {
>          ms->numa_state = g_new0(NumaState, 1);
> +        ms->acpi_hma_state = g_new0(AcpiHmaState, 1);
>      } else {
>          ms->numa_state = NULL;
>      }
> @@ -883,6 +885,7 @@ static void machine_finalize(Object *obj)
>      g_free(ms->device_memory);
>      g_free(ms->nvdimms_state);
>      g_free(ms->numa_state);
> +    g_free(ms->acpi_hma_state);
>  }
>  
>  bool machine_usb(MachineState *machine)
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index d3d8c93631..d869c5ae7b 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -1844,6 +1844,8 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
>          build_q35_pci0_int(dsdt);
>      }
>  
> +    hmat_build_aml(dsdt);
> +
>      if (pcmc->legacy_cpu_hotplug) {
>          build_legacy_cpu_hotplug_aml(dsdt, machine, pm->cpu_hp_io_base);
>      } else {
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 1c7b2a97bc..3021375144 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -77,6 +77,7 @@
>  #include "hw/i386/intel_iommu.h"
>  #include "hw/net/ne2000-isa.h"
>  #include "standard-headers/asm-x86/bootparam.h"
> +#include "hw/acpi/hmat.h"
>  
>  /* debug PC/ISA interrupts */
>  //#define DEBUG_IRQ
> @@ -2130,6 +2131,8 @@ static void pc_memory_plug(HotplugHandler *hotplug_dev,
>          nvdimm_plug(ms->nvdimms_state);
>      }
>  
> +    hmat_update(ms);
> +
>      hotplug_handler_plug(HOTPLUG_HANDLER(pcms->acpi_dev), dev, &error_abort);
>  out:
>      error_propagate(errp, local_err);
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index c07c4a5b38..966d98d619 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -58,6 +58,7 @@
>  #include "migration/misc.h"
>  #include "kvm_i386.h"
>  #include "sysemu/numa.h"
> +#include "hw/acpi/hmat.h"
>  
>  #define MAX_IDE_BUS 2
>  
> @@ -301,6 +302,9 @@ static void pc_init1(MachineState *machine,
>          nvdimm_init_acpi_state(machine->nvdimms_state, system_io,
>                                 pcms->fw_cfg, OBJECT(pcms));
>      }
> +
> +    hmat_init_acpi_state(machine->acpi_hma_state, system_io,
> +                         pcms->fw_cfg, OBJECT(pcms));
>  }
>  
>  /* Looking for a pc_compat_2_4() function? It doesn't exist.
> diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
> index 37dd350511..610b10467a 100644
> --- a/hw/i386/pc_q35.c
> +++ b/hw/i386/pc_q35.c
> @@ -54,6 +54,7 @@
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
>  #include "sysemu/numa.h"
> +#include "hw/acpi/hmat.h"
>  
>  /* ICH9 AHCI has 6 ports */
>  #define MAX_SATA_PORTS     6
> @@ -333,6 +334,9 @@ static void pc_q35_init(MachineState *machine)
>          nvdimm_init_acpi_state(machine->nvdimms_state, system_io,
>                                 pcms->fw_cfg, OBJECT(pcms));
>      }
> +
> +    hmat_init_acpi_state(machine->acpi_hma_state, system_io,
> +                         pcms->fw_cfg, OBJECT(pcms));
>  }
>  
>  #define DEFINE_Q35_MACHINE(suffix, name, compatfn, optionfn) \
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 8609f923d9..e8d94a69b5 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -315,6 +315,7 @@ struct MachineState {
>      CPUArchIdList *possible_cpus;
>      struct NVDIMMState *nvdimms_state;
>      NumaState *numa_state;
> +    AcpiHmaState *acpi_hma_state;
>  };
>  
>  #define DEFINE_MACHINE(namestr, machine_initfn) \
> diff --git a/include/qemu/typedefs.h b/include/qemu/typedefs.h
> index d971f5109e..a207cc1f88 100644
> --- a/include/qemu/typedefs.h
> +++ b/include/qemu/typedefs.h
> @@ -5,6 +5,7 @@
>     pull in all the real definitions.  */
>  
>  /* Please keep this list in case-insensitive alphabetical order */
> +typedef struct AcpiHmaState AcpiHmaState;
>  typedef struct AdapterInfo AdapterInfo;
>  typedef struct AddressSpace AddressSpace;
>  typedef struct AioContext AioContext;



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [PATCH v4 11/11] hmat acpi: Implement _HMA method to update HMAT at runtime
  2019-06-16 20:07   ` Igor Mammedov
@ 2019-06-17  7:19     ` Tao Xu
  0 siblings, 0 replies; 38+ messages in thread
From: Tao Xu @ 2019-06-17  7:19 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: xiaoguangrong.eric, mst, Liu, Jingqi, qemu-devel, ehabkost,
	pbonzini, rth

On 6/17/2019 4:07 AM, Igor Mammedov wrote:
> On Wed,  8 May 2019 14:17:26 +0800
> Tao Xu <tao3.xu@intel.com> wrote:
> 
>> From: Liu Jingqi <jingqi.liu@intel.com>
>>
>> OSPM evaluates HMAT only during system initialization.
>> Any changes to the HMAT state at runtime or information
>> regarding HMAT for hot plug are communicated using _HMA method.
>>
>> _HMA is an optional object that enables the platform to provide
>> the OS with updated Heterogeneous Memory Attributes information
>> at runtime. _HMA provides OSPM with the latest HMAT in entirety
>> overriding existing HMAT.
> 
> It seems that there isn't any user interface to actually introduce new
> HMAT data at runtime. If so, let's drop 10-11/11 for now; you can add
> them back later when/if you add a QMP interface to update/replace the
> HMAT at runtime.
> 

OK, thank you for your review. The v5 HMAT patches have been sent to the
QEMU mailing list without the _HMA part; I will add a QMP interface for
updating the HMAT later.



^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2019-06-17  7:19 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-08  6:17 [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Tao Xu
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 01/11] numa: move numa global variable nb_numa_nodes into MachineState Tao Xu
2019-05-23 13:04   ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 02/11] numa: move numa global variable have_numa_distance " Tao Xu
2019-05-23 13:07   ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 03/11] numa: move numa global variable numa_info " Tao Xu
2019-05-23 13:47   ` Igor Mammedov
2019-05-28  7:43     ` Tao Xu
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 04/11] acpi: introduce AcpiDeviceIfClass.build_mem_ranges hook Tao Xu
2019-05-24 12:35   ` Igor Mammedov
2019-06-06  5:15     ` Tao Xu
2019-06-06 16:25       ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 05/11] hmat acpi: Build Memory Subsystem Address Range Structure(s) in ACPI HMAT Tao Xu
2019-05-24 14:16   ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 06/11] hmat acpi: Build System Locality Latency and Bandwidth Information " Tao Xu
2019-06-04 14:43   ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 07/11] hmat acpi: Build Memory Side Cache " Tao Xu
2019-06-04 15:04   ` Igor Mammedov
2019-06-05  6:04     ` Tao Xu
2019-06-05 12:12       ` Igor Mammedov
2019-06-06  3:00         ` Tao Xu
2019-06-06 16:45           ` Igor Mammedov
2019-06-10 13:39             ` Tao Xu
2019-06-16 19:41               ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 08/11] numa: Extend the command-line to provide memory latency and bandwidth information Tao Xu
2019-06-05 14:40   ` Igor Mammedov
2019-06-06  7:47     ` Tao Xu
2019-06-06 13:23       ` Eric Blake
2019-06-06 16:50         ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 09/11] numa: Extend the command-line to provide memory side cache information Tao Xu
2019-06-16 19:52   ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 10/11] acpi: introduce build_acpi_aml_common for NFIT generalizations Tao Xu
2019-06-06 17:00   ` Igor Mammedov
2019-05-08  6:17 ` [Qemu-devel] [PATCH v4 11/11] hmat acpi: Implement _HMA method to update HMAT at runtime Tao Xu
2019-06-16 20:07   ` Igor Mammedov
2019-06-17  7:19     ` Tao Xu
2019-05-31  4:55 ` [Qemu-devel] [PATCH v4 00/11] Build ACPI Heterogeneous Memory Attribute Table (HMAT) Dan Williams
2019-05-31  4:55   ` Dan Williams
