All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v1 0/4] vfio: report NUMA nodes for device memory
@ 2023-09-15  2:45 ankita
  2023-09-15  2:45 ` [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes ankita
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: ankita @ 2023-09-15  2:45 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

For devices which allow CPU to cache coherently access their memory,
it is sensible to expose such memory as NUMA nodes separate from
the sysmem node. Qemu currently do not provide a mechanism for creation
of NUMA nodes associated with a vfio-pci device.

Implement a mechanism to create and associate a set of unique NUMA nodes
with a vfio-pci device.

NUMA node is created by inserting a series of the unique proximity
domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
at the time of bootup by the kernel to determine the NUMA configuration
and is inflexible post that. Hence this feature is incompatible with
device hotplug. The added node range associated with the device is
communicated through ACPI DSD and can be fetched by the VM kernel or
kernel modules. QEMU's VM SRAT and DSD builder code is modified
accordingly.

New command line params are introduced for admin to have a control on
the NUMA node assignment.

It is expected for a vfio-pci driver to expose this feature through
sysfs. Presence of the feature is checked to enable these code changes.

Applied over v8.1.0-rc4.

Ankit Agrawal (4):
  vfio: new command line params for device memory NUMA nodes
  vfio: assign default values to node params
  hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  acpi/gpex: patch guest DSDT for dev mem information

 hw/arm/virt-acpi-build.c    |  54 +++++++++++++
 hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
 hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h               |   2 +
 include/hw/pci/pci_device.h |   3 +
 5 files changed, 274 insertions(+)

-- 
2.17.1



^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes
  2023-09-15  2:45 [PATCH v1 0/4] vfio: report NUMA nodes for device memory ankita
@ 2023-09-15  2:45 ` ankita
  2023-09-15 14:25   ` Jonathan Cameron via
  2023-09-15  2:45 ` [PATCH v1 2/4] vfio: assign default values to node params ankita
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: ankita @ 2023-09-15  2:45 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

The CPU cache coherent device memory can be added as a set of
NUMA nodes distinct from the system memory nodes. The Qemu currently
do not provide a mechanism to support node creation for a vfio-pci
device.

Introduce new command line parameters to allow host admin provide
the desired starting NUMA node id (pxm-ns) and the number of such
nodes (pxm-nc) associated with the device. In this implementation,
a numerically consecutive nodes from pxm-ns to pxm-ns + pxm-nc
is created. Also validate the requested range of nodes to check
for conflict with other nodes and to ensure that the id do not cross
QEMU limit.

Since the QEMU's SRAT and DST builder code needs the proximity
domain (PXM) id range, expose PXM start and count as device object
properties.

The device driver module communicates support for such feature through
sysfs. Check the presence of the feature to activate the code.

E.g. the following argument adds 8 PXM nodes starting from id 0x10.
-device vfio-pci-nohotplug,host=<pci-bdf>,pxm-ns=0x10,pxm-nc=8

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/vfio/pci.c               | 144 ++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h               |   2 +
 include/hw/pci/pci_device.h |   3 +
 3 files changed, 149 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a205c6b113..cc0c516161 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -42,6 +42,8 @@
 #include "qapi/error.h"
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
+#include "qapi/visitor.h"
+#include "include/hw/boards.h"
 
 #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
 
@@ -2955,6 +2957,22 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
     }
 }
 
+static void vfio_pci_get_dev_mem_pxm_start(Object *obj, Visitor *v,
+                                           const char *name,
+                                           void *opaque, Error **errp)
+{
+    uint64_t pxm_start = (uintptr_t) opaque;
+    visit_type_uint64(v, name, &pxm_start, errp);
+}
+
+static void vfio_pci_get_dev_mem_pxm_count(Object *obj, Visitor *v,
+                                           const char *name,
+                                           void *opaque, Error **errp)
+{
+    uint64_t pxm_count = (uintptr_t) opaque;
+    visit_type_uint64(v, name, &pxm_count, errp);
+}
+
 static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
 {
     Error *err = NULL;
@@ -2974,6 +2992,125 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
     vdev->req_enabled = false;
 }
 
+static int validate_dev_numa(uint32_t dev_node_start, uint32_t num_nodes)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    unsigned int i;
+
+    if (num_nodes >= MAX_NODES) {
+        return -EINVAL;
+    }
+
+    for (i = 0; i < num_nodes; i++) {
+        if (ms->numa_state->nodes[dev_node_start + i].present) {
+            return -EBUSY;
+        }
+    }
+
+    return 0;
+}
+
+static int mark_dev_node_present(uint32_t dev_node_start, uint32_t num_nodes)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    unsigned int i;
+
+    for (i = 0; i < num_nodes; i++) {
+        ms->numa_state->nodes[dev_node_start + i].present = true;
+    }
+
+    return 0;
+}
+
+
+static bool vfio_pci_read_cohmem_support_sysfs(VFIODevice *vdev)
+{
+    gchar *contents = NULL;
+    gsize length;
+    char *path;
+    bool ret = false;
+    uint32_t supported;
+
+    path = g_strdup_printf("%s/coherent_mem", vdev->sysfsdev);
+    if (g_file_get_contents(path, &contents, &length, NULL) && length > 0) {
+        if ((sscanf(contents, "%u", &supported) == 1) && supported) {
+            ret = true;
+        }
+    }
+
+    if (length) {
+        g_free(contents);
+    }
+    g_free(path);
+
+    return ret;
+}
+
+static int vfio_pci_dev_mem_probe(VFIOPCIDevice *vPciDev,
+                                         Error **errp)
+{
+    Object *obj = NULL;
+    VFIODevice *vdev = &vPciDev->vbasedev;
+    MachineState *ms = MACHINE(qdev_get_machine());
+    int ret = 0;
+    uint32_t dev_node_start = vPciDev->dev_node_start;
+    uint32_t dev_node_count = vPciDev->dev_nodes;
+
+    if (!vdev->sysfsdev || !vfio_pci_read_cohmem_support_sysfs(vdev)) {
+        ret = -ENODEV;
+        goto done;
+    }
+
+    if (vdev->type == VFIO_DEVICE_TYPE_PCI) {
+        obj = vfio_pci_get_object(vdev);
+    }
+
+    /* Since this device creates new NUMA node, hotplug is not supported. */
+    if (!obj || DEVICE_CLASS(object_get_class(obj))->hotpluggable) {
+        ret = -EINVAL;
+        goto done;
+    }
+
+    /*
+     * This device has memory that is coherently accessible from the CPU.
+     * The memory can be represented seperate memory-only NUMA nodes.
+     */
+    vPciDev->pdev.has_coherent_memory = true;
+
+    /*
+     * The device can create several NUMA nodes with consecutive IDs
+     * from dev_node_start to dev_node_start + dev_node_count.
+     * Verify
+     * - whether any node ID is occupied in the desired range.
+     * - Node ID is not crossing MAX_NODE.
+     */
+    ret = validate_dev_numa(dev_node_start, dev_node_count);
+    if (ret) {
+        goto done;
+    }
+
+    /* Reserve the node by marking as present */
+    mark_dev_node_present(dev_node_start, dev_node_count);
+
+    /*
+     * To have multiple unique nodes in the VM, a series of PXM nodes are
+     * required to be added to VM's SRAT. Send the information about the
+     * starting node ID and the node count to the ACPI builder code.
+     */
+    object_property_add(OBJECT(vPciDev), "dev_mem_pxm_start", "uint64",
+                        vfio_pci_get_dev_mem_pxm_start, NULL, NULL,
+                        (void *) (uintptr_t) dev_node_start);
+
+    object_property_add(OBJECT(vPciDev), "dev_mem_pxm_count", "uint64",
+                        vfio_pci_get_dev_mem_pxm_count, NULL, NULL,
+                        (void *) (uintptr_t) dev_node_count);
+
+    ms->numa_state->num_nodes += dev_node_count;
+
+done:
+    return ret;
+}
+
 static void vfio_realize(PCIDevice *pdev, Error **errp)
 {
     VFIOPCIDevice *vdev = VFIO_PCI(pdev);
@@ -3291,6 +3428,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         }
     }
 
+    ret = vfio_pci_dev_mem_probe(vdev, errp);
+    if (ret && ret != -ENODEV) {
+        error_report("Failed to setup device memory with error %d", ret);
+    }
+
     vfio_register_err_notifier(vdev);
     vfio_register_req_notifier(vdev);
     vfio_setup_resetfn_quirk(vdev);
@@ -3454,6 +3596,8 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
                        sub_device_id, PCI_ANY_ID),
     DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
+    DEFINE_PROP_UINT32("pxm-ns", VFIOPCIDevice, dev_node_start, 0),
+    DEFINE_PROP_UINT32("pxm-nc", VFIOPCIDevice, dev_nodes, 0),
     DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
                                    nv_gpudirect_clique,
                                    qdev_prop_nv_gpudirect_clique, uint8_t),
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index a2771b9ff3..eef5ddfd06 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -158,6 +158,8 @@ struct VFIOPCIDevice {
     uint32_t display_yres;
     int32_t bootindex;
     uint32_t igd_gms;
+    uint32_t dev_node_start;
+    uint32_t dev_nodes;
     OffAutoPCIBAR msix_relo;
     uint8_t pm_cap;
     uint8_t nv_gpudirect_clique;
diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
index d3dd0f64b2..aacd2279ae 100644
--- a/include/hw/pci/pci_device.h
+++ b/include/hw/pci/pci_device.h
@@ -157,6 +157,9 @@ struct PCIDevice {
     MSIVectorReleaseNotifier msix_vector_release_notifier;
     MSIVectorPollNotifier msix_vector_poll_notifier;
 
+    /* GPU coherent memory */
+    bool has_coherent_memory;
+
     /* ID of standby device in net_failover pair */
     char *failover_pair_id;
     uint32_t acpi_index;
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 2/4] vfio: assign default values to node params
  2023-09-15  2:45 [PATCH v1 0/4] vfio: report NUMA nodes for device memory ankita
  2023-09-15  2:45 ` [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes ankita
@ 2023-09-15  2:45 ` ankita
  2023-09-15  2:45 ` [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes ankita
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 41+ messages in thread
From: ankita @ 2023-09-15  2:45 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

It may be desirable for some deployments to have QEMU automatically
pick a range and create the NUMA nodes. So the admin need not care
about passing any additional params. Another advantage is that the
feature is not dependent on newer libvirt that support the new
parameters pxm-ns and pxm-nc.

Assign default values to pxm-ns (first available node) and pxm-nc (8).
This makes the new params optional and the feature will work on older
libvirt.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/vfio/pci.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index cc0c516161..0bba161172 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3053,8 +3053,10 @@ static int vfio_pci_dev_mem_probe(VFIOPCIDevice *vPciDev,
     VFIODevice *vdev = &vPciDev->vbasedev;
     MachineState *ms = MACHINE(qdev_get_machine());
     int ret = 0;
-    uint32_t dev_node_start = vPciDev->dev_node_start;
-    uint32_t dev_node_count = vPciDev->dev_nodes;
+    uint32_t dev_node_start = vPciDev->dev_node_start ?
+                              vPciDev->dev_node_start :
+                              ms->numa_state->num_nodes;
+    uint32_t dev_node_count = vPciDev->dev_nodes ? vPciDev->dev_nodes : 8;
 
     if (!vdev->sysfsdev || !vfio_pci_read_cohmem_support_sysfs(vdev)) {
         ret = -ENODEV;
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-15  2:45 [PATCH v1 0/4] vfio: report NUMA nodes for device memory ankita
  2023-09-15  2:45 ` [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes ankita
  2023-09-15  2:45 ` [PATCH v1 2/4] vfio: assign default values to node params ankita
@ 2023-09-15  2:45 ` ankita
  2023-09-15 14:37   ` Jonathan Cameron via
  2023-09-15 14:52   ` Igor Mammedov
  2023-09-15  2:45 ` [PATCH v1 4/4] acpi/gpex: patch guest DSDT for dev mem information ankita
  2023-09-15 14:19 ` [PATCH v1 0/4] vfio: report NUMA nodes for device memory Cédric Le Goater
  4 siblings, 2 replies; 41+ messages in thread
From: ankita @ 2023-09-15  2:45 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

During bootup, Linux kernel parse the ACPI SRAT to determine the PXM ids.
This allows for the creation of NUMA nodes for each unique id.

Insert a series of the unique PXM ids in the VM SRAT ACPI table. The
range of nodes can be determined from the "dev_mem_pxm_start" and
"dev_mem_pxm_count" object properties associated with the device. These
nodes as made MEM_AFFINITY_HOTPLUGGABLE. This allows the kernel to create
memory-less NUMA nodes on bootup to which a subrange (or entire range) of
device memory can be added/removed.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/arm/virt-acpi-build.c | 54 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 6b674231c2..6d1e3b6b8a 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -46,6 +46,7 @@
 #include "hw/acpi/hmat.h"
 #include "hw/pci/pcie_host.h"
 #include "hw/pci/pci.h"
+#include "hw/vfio/pci.h"
 #include "hw/pci/pci_bus.h"
 #include "hw/pci-host/gpex.h"
 #include "hw/arm/virt.h"
@@ -515,6 +516,57 @@ build_spcr(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     acpi_table_end(linker, &table);
 }
 
+static int devmem_device_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_VFIO_PCI)) {
+        *list = g_slist_append(*list, DEVICE(obj));
+    }
+
+    object_child_foreach(obj, devmem_device_list, opaque);
+    return 0;
+}
+
+static GSList *devmem_get_device_list(void)
+{
+    GSList *list = NULL;
+
+    object_child_foreach(qdev_get_machine(), devmem_device_list, &list);
+    return list;
+}
+
+static void build_srat_devmem(GArray *table_data)
+{
+    GSList *device_list, *list = devmem_get_device_list();
+
+    for (device_list = list; device_list; device_list = device_list->next) {
+        DeviceState *dev = device_list->data;
+        Object *obj = OBJECT(dev);
+        VFIOPCIDevice *pcidev
+            = ((VFIOPCIDevice *)object_dynamic_cast(OBJECT(obj),
+               TYPE_VFIO_PCI));
+
+        if (pcidev->pdev.has_coherent_memory) {
+            uint64_t start_node = object_property_get_uint(obj,
+                                  "dev_mem_pxm_start", &error_abort);
+            uint64_t node_count = object_property_get_uint(obj,
+                                  "dev_mem_pxm_count", &error_abort);
+            uint64_t node_index;
+
+            /*
+             * Add the node_count PXM domains starting from start_node as
+             * hot pluggable. The VM kernel parse the PXM domains and
+             * creates NUMA nodes.
+             */
+            for (node_index = 0; node_index < node_count; node_index++)
+                build_srat_memory(table_data, 0, 0, start_node + node_index,
+                    MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE);
+        }
+    }
+    g_slist_free(list);
+}
+
 /*
  * ACPI spec, Revision 5.1
  * 5.2.16 System Resource Affinity Table (SRAT)
@@ -569,6 +621,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
                           MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
     }
 
+    build_srat_devmem(table_data);
+
     acpi_table_end(linker, &table);
 }
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v1 4/4] acpi/gpex: patch guest DSDT for dev mem information
  2023-09-15  2:45 [PATCH v1 0/4] vfio: report NUMA nodes for device memory ankita
                   ` (2 preceding siblings ...)
  2023-09-15  2:45 ` [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes ankita
@ 2023-09-15  2:45 ` ankita
  2023-09-15 15:13   ` Igor Mammedov
  2023-09-27 11:42   ` Jonathan Cameron via
  2023-09-15 14:19 ` [PATCH v1 0/4] vfio: report NUMA nodes for device memory Cédric Le Goater
  4 siblings, 2 replies; 41+ messages in thread
From: ankita @ 2023-09-15  2:45 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

To add the memory in the guest as NUMA nodes, it needs the PXM node index
and the total count of nodes associated with the memory. The range of
proximity domains are communicated to the VM as part of the guest ACPI
using the nvidia,gpu-mem-pxm-start and nvidia,gpu-mem-pxm-count DSD
properties. These value respectively represent the staring proximity
domain id and the count. Kernel modules can then fetch this information
and determine the numa node id using pxm_to_node().

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/pci-host/gpex-acpi.c | 69 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index 7c7316bc96..0548feace1 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -49,6 +49,72 @@ static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq)
     }
 }
 
+static void acpi_dsdt_add_cohmem_device(Aml *dev, int32_t devfn,
+                                        uint64_t dev_mem_pxm_start,
+                                        uint64_t dev_mem_pxm_count)
+{
+    Aml *memdev = aml_device("CMD%X", PCI_SLOT(devfn));
+    Aml *pkg = aml_package(2);
+    Aml *pkg1 = aml_package(2);
+    Aml *pkg2 = aml_package(2);
+    Aml *dev_pkg = aml_package(2);
+    Aml *UUID;
+
+    aml_append(memdev, aml_name_decl("_ADR", aml_int(PCI_SLOT(devfn) << 16)));
+
+    aml_append(pkg1, aml_string("dev-mem-pxm-start"));
+    aml_append(pkg1, aml_int(dev_mem_pxm_start));
+
+    aml_append(pkg2, aml_string("dev-mem-pxm-count"));
+    aml_append(pkg2, aml_int(dev_mem_pxm_count));
+
+    aml_append(pkg, pkg1);
+    aml_append(pkg, pkg2);
+
+    UUID = aml_touuid("DAFFD814-6EBA-4D8C-8A91-BC9BBF4AA301");
+    aml_append(dev_pkg, UUID);
+    aml_append(dev_pkg, pkg);
+
+    aml_append(memdev, aml_name_decl("_DSD", dev_pkg));
+    aml_append(dev, memdev);
+}
+
+static void find_mem_device(PCIBus *bus, PCIDevice *pdev,
+                            void *opaque)
+{
+    Aml *dev = (Aml *)opaque;
+
+    if (bus == NULL) {
+        return;
+    }
+
+    if (pdev->has_coherent_memory) {
+        Object *po = OBJECT(pdev);
+
+        if (po == NULL) {
+            return;
+        }
+
+        uint64_t pxm_start
+           = object_property_get_uint(po, "dev_mem_pxm_start", NULL);
+        uint64_t pxm_count
+           = object_property_get_uint(po, "dev_mem_pxm_count", NULL);
+
+        acpi_dsdt_add_cohmem_device(dev, pdev->devfn, pxm_start, pxm_count);
+    }
+}
+
+static void acpi_dsdt_find_and_add_cohmem_device(PCIBus *bus, Aml *dev)
+{
+    if (bus == NULL) {
+        return;
+    }
+
+    pci_for_each_device_reverse(bus, pci_bus_num(bus),
+                                find_mem_device, dev);
+
+}
+
 static void acpi_dsdt_add_pci_osc(Aml *dev)
 {
     Aml *method, *UUID, *ifctx, *ifctx1, *elsectx, *buf;
@@ -207,7 +273,10 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
 
     acpi_dsdt_add_pci_route_table(dev, cfg->irq);
 
+    acpi_dsdt_find_and_add_cohmem_device(cfg->bus, dev);
+
     method = aml_method("_CBA", 0, AML_NOTSERIALIZED);
+
     aml_append(method, aml_return(aml_int(cfg->ecam.base)));
     aml_append(dev, method);
 
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-15  2:45 [PATCH v1 0/4] vfio: report NUMA nodes for device memory ankita
                   ` (3 preceding siblings ...)
  2023-09-15  2:45 ` [PATCH v1 4/4] acpi/gpex: patch guest DSDT for dev mem information ankita
@ 2023-09-15 14:19 ` Cédric Le Goater
  2023-09-15 14:47   ` Alex Williamson
  4 siblings, 1 reply; 41+ messages in thread
From: Cédric Le Goater @ 2023-09-15 14:19 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, shannon.zhaosl, peter.maydell, ani
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

Hello Ankit,

On 9/15/23 04:45, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> For devices which allow CPU to cache coherently access their memory,
> it is sensible to expose such memory as NUMA nodes separate from
> the sysmem node. Qemu currently do not provide a mechanism for creation
> of NUMA nodes associated with a vfio-pci device.
> 
> Implement a mechanism to create and associate a set of unique NUMA nodes
> with a vfio-pci device.>
> NUMA node is created by inserting a series of the unique proximity
> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
> at the time of bootup by the kernel to determine the NUMA configuration
> and is inflexible post that. Hence this feature is incompatible with
> device hotplug. The added node range associated with the device is
> communicated through ACPI DSD and can be fetched by the VM kernel or
> kernel modules. QEMU's VM SRAT and DSD builder code is modified
> accordingly.
> 
> New command line params are introduced for admin to have a control on
> the NUMA node assignment.

This approach seems to bypass the NUMA framework in place in QEMU and
will be a challenge for the upper layers. QEMU is generally used from
libvirt when dealing with KVM guests.

Typically, a command line for a virt machine with NUMA nodes would look
like :

   -object memory-backend-ram,id=ram-node0,size=1G \
   -numa node,nodeid=0,memdev=ram-node0 \
   -object memory-backend-ram,id=ram-node1,size=1G \
   -numa node,nodeid=1,cpus=0-3,memdev=ram-node1

which defines 2 nodes, one with memory and all CPUs and a second with
only memory.

   # numactl -H
   available: 2 nodes (0-1)
   node 0 cpus: 0 1 2 3
   node 0 size: 1003 MB
   node 0 free: 734 MB
   node 1 cpus:
   node 1 size: 975 MB
   node 1 free: 968 MB
   node distances:
   node   0   1
     0:  10  20
     1:  20  10

   
Could it be a new type of host memory backend ?  Have you considered
this approach ?

Thanks,

C.

> 
> It is expected for a vfio-pci driver to expose this feature through
> sysfs. Presence of the feature is checked to enable these code changes.
> 
> Applied over v8.1.0-rc4.
> 
> Ankit Agrawal (4):
>    vfio: new command line params for device memory NUMA nodes
>    vfio: assign default values to node params
>    hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
>    acpi/gpex: patch guest DSDT for dev mem information
> 
>   hw/arm/virt-acpi-build.c    |  54 +++++++++++++
>   hw/pci-host/gpex-acpi.c     |  69 +++++++++++++++++
>   hw/vfio/pci.c               | 146 ++++++++++++++++++++++++++++++++++++
>   hw/vfio/pci.h               |   2 +
>   include/hw/pci/pci_device.h |   3 +
>   5 files changed, 274 insertions(+)
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes
  2023-09-15  2:45 ` [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes ankita
@ 2023-09-15 14:25   ` Jonathan Cameron via
  2023-09-15 14:48     ` Igor Mammedov
  0 siblings, 1 reply; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-15 14:25 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

On Thu, 14 Sep 2023 19:45:56 -0700
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The CPU cache coherent device memory can be added as a set of
> NUMA nodes distinct from the system memory nodes. The Qemu currently
> do not provide a mechanism to support node creation for a vfio-pci
> device.
> 
> Introduce new command line parameters to allow host admin provide
> the desired starting NUMA node id (pxm-ns) and the number of such
> nodes (pxm-nc) associated with the device. In this implementation,
> a numerically consecutive nodes from pxm-ns to pxm-ns + pxm-nc
> is created. Also validate the requested range of nodes to check


Hi Ankit,

That's not a particularly intuitive bit of interface!

pxm-start
pxm-number
perhaps?  However, in QEMU commmand lines the node terminology is used so
numa-node-start
numa-node-number

Though in general this feels backwards compared to how the rest of
the numa definition is currently done.

Could the current interface be extended.

-numa node,vfio-device=X

at the cost of a bit of house keeping and lookup.

We need something similar for generic initiators, so maybe
vfio-mem=X? (might not even need to be vfio specific - even
if we only support it for now on VFIO devices).
leaving
initiator=X available for later...

Also, good to say why multiple nodes per device are needed.

Jonathan

> for conflict with other nodes and to ensure that the id do not cross
> QEMU limit.
> 
> Since the QEMU's SRAT and DST builder code needs the proximity
> domain (PXM) id range, expose PXM start and count as device object
> properties.
> 
> The device driver module communicates support for such feature through
> sysfs. Check the presence of the feature to activate the code.
> 
> E.g. the following argument adds 8 PXM nodes starting from id 0x10.
> -device vfio-pci-nohotplug,host=<pci-bdf>,pxm-ns=0x10,pxm-nc=8
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/vfio/pci.c               | 144 ++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h               |   2 +
>  include/hw/pci/pci_device.h |   3 +
>  3 files changed, 149 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a205c6b113..cc0c516161 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -42,6 +42,8 @@
>  #include "qapi/error.h"
>  #include "migration/blocker.h"
>  #include "migration/qemu-file.h"
> +#include "qapi/visitor.h"
> +#include "include/hw/boards.h"
>  
>  #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
>  
> @@ -2955,6 +2957,22 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static void vfio_pci_get_dev_mem_pxm_start(Object *obj, Visitor *v,
> +                                           const char *name,
> +                                           void *opaque, Error **errp)
> +{
> +    uint64_t pxm_start = (uintptr_t) opaque;
> +    visit_type_uint64(v, name, &pxm_start, errp);
> +}
> +
> +static void vfio_pci_get_dev_mem_pxm_count(Object *obj, Visitor *v,
> +                                           const char *name,
> +                                           void *opaque, Error **errp)
> +{
> +    uint64_t pxm_count = (uintptr_t) opaque;
> +    visit_type_uint64(v, name, &pxm_count, errp);
> +}
> +
>  static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>  {
>      Error *err = NULL;
> @@ -2974,6 +2992,125 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
>      vdev->req_enabled = false;
>  }
>  
> +static int validate_dev_numa(uint32_t dev_node_start, uint32_t num_nodes)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    unsigned int i;
> +
> +    if (num_nodes >= MAX_NODES) {
> +        return -EINVAL;
> +    }
> +
> +    for (i = 0; i < num_nodes; i++) {
> +        if (ms->numa_state->nodes[dev_node_start + i].present) {
> +            return -EBUSY;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int mark_dev_node_present(uint32_t dev_node_start, uint32_t num_nodes)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    unsigned int i;
> +
> +    for (i = 0; i < num_nodes; i++) {
> +        ms->numa_state->nodes[dev_node_start + i].present = true;
> +    }
> +
> +    return 0;
> +}
> +
> +
> +static bool vfio_pci_read_cohmem_support_sysfs(VFIODevice *vdev)
> +{
> +    gchar *contents = NULL;
> +    gsize length;
> +    char *path;
> +    bool ret = false;
> +    uint32_t supported;
> +
> +    path = g_strdup_printf("%s/coherent_mem", vdev->sysfsdev);

If someone has asked for it, why should we care if they specify it on a
device where it doesn't do anything useful?  This to me feels like something
to check at a higher level of the stack.

> +    if (g_file_get_contents(path, &contents, &length, NULL) && length > 0) {
> +        if ((sscanf(contents, "%u", &supported) == 1) && supported) {
> +            ret = true;
> +        }
> +    }
> +
> +    if (length) {
> +        g_free(contents);

g_autofree will clean this up for you I think

> +    }
> +    g_free(path);

and this.

> +
> +    return ret;
> +}
> +
> +static int vfio_pci_dev_mem_probe(VFIOPCIDevice *vPciDev,
> +                                         Error **errp)
> +{
> +    Object *obj = NULL;
> +    VFIODevice *vdev = &vPciDev->vbasedev;
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    int ret = 0;
> +    uint32_t dev_node_start = vPciDev->dev_node_start;
> +    uint32_t dev_node_count = vPciDev->dev_nodes;
> +
> +    if (!vdev->sysfsdev || !vfio_pci_read_cohmem_support_sysfs(vdev)) {
> +        ret = -ENODEV;
return -ENODEV; 

and similar in all the other cases as no cleanup to do.

> +        goto done;
> +    }
> +
> +    if (vdev->type == VFIO_DEVICE_TYPE_PCI) {

nicer to handle one condition at a time.

    if (vdev->type != VFIO_DEVICE_TYPE_PCI) {
        return -EINVAL;
    }

    obj = vfio_pci_get_object(vdev);
this can't fail

Also get rid of assigning it to NULL above.

   if (DEVICE_CLASS(object...)) {
      return -EINVAL;
   }


> +        obj = vfio_pci_get_object(vdev);
> +    }
> +
> +    /* Since this device creates new NUMA node, hotplug is not supported. */
> +    if (!obj || DEVICE_CLASS(object_get_class(obj))->hotpluggable) {
> +        ret = -EINVAL;
> +        goto done;
> +    }
> +
> +    /*
> +     * This device has memory that is coherently accessible from the CPU.
> +     * The memory can be represented seperate memory-only NUMA nodes.
> +     */
> +    vPciDev->pdev.has_coherent_memory = true;
> +
> +    /*
> +     * The device can create several NUMA nodes with consecutive IDs
> +     * from dev_node_start to dev_node_start + dev_node_count.
> +     * Verify
> +     * - whether any node ID is occupied in the desired range.
> +     * - Node ID is not crossing MAX_NODE.
> +     */
> +    ret = validate_dev_numa(dev_node_start, dev_node_count);
> +    if (ret) {
> +        goto done;
        return ret;

> +    }
> +
> +    /* Reserve the node by marking as present */
> +    mark_dev_node_present(dev_node_start, dev_node_count);
> +
> +    /*
> +     * To have multiple unique nodes in the VM, a series of PXM nodes are
> +     * required to be added to VM's SRAT. Send the information about the
> +     * starting node ID and the node count to the ACPI builder code.
> +     */
> +    object_property_add(OBJECT(vPciDev), "dev_mem_pxm_start", "uint64",
> +                        vfio_pci_get_dev_mem_pxm_start, NULL, NULL,
> +                        (void *) (uintptr_t) dev_node_start);
> +
> +    object_property_add(OBJECT(vPciDev), "dev_mem_pxm_count", "uint64",
> +                        vfio_pci_get_dev_mem_pxm_count, NULL, NULL,
> +                        (void *) (uintptr_t) dev_node_count);
> +
> +    ms->numa_state->num_nodes += dev_node_count;
> +
> +done:
> +    return ret;
> +}
> +
>  static void vfio_realize(PCIDevice *pdev, Error **errp)
>  {
>      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> @@ -3291,6 +3428,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          }
>      }
>  
> +    ret = vfio_pci_dev_mem_probe(vdev, errp);
> +    if (ret && ret != -ENODEV) {
> +        error_report("Failed to setup device memory with error %d", ret);
> +    }
> +
>      vfio_register_err_notifier(vdev);
>      vfio_register_req_notifier(vdev);
>      vfio_setup_resetfn_quirk(vdev);
> @@ -3454,6 +3596,8 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
>                         sub_device_id, PCI_ANY_ID),
>      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> +    DEFINE_PROP_UINT32("pxm-ns", VFIOPCIDevice, dev_node_start, 0),
> +    DEFINE_PROP_UINT32("pxm-nc", VFIOPCIDevice, dev_nodes, 0),
>      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
>                                     nv_gpudirect_clique,
>                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index a2771b9ff3..eef5ddfd06 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -158,6 +158,8 @@ struct VFIOPCIDevice {
>      uint32_t display_yres;
>      int32_t bootindex;
>      uint32_t igd_gms;
> +    uint32_t dev_node_start;
> +    uint32_t dev_nodes;
>      OffAutoPCIBAR msix_relo;
>      uint8_t pm_cap;
>      uint8_t nv_gpudirect_clique;
> diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> index d3dd0f64b2..aacd2279ae 100644
> --- a/include/hw/pci/pci_device.h
> +++ b/include/hw/pci/pci_device.h
> @@ -157,6 +157,9 @@ struct PCIDevice {
>      MSIVectorReleaseNotifier msix_vector_release_notifier;
>      MSIVectorPollNotifier msix_vector_poll_notifier;
>  
> +    /* GPU coherent memory */
> +    bool has_coherent_memory;
> +
>      /* ID of standby device in net_failover pair */
>      char *failover_pair_id;
>      uint32_t acpi_index;



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-15  2:45 ` [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes ankita
@ 2023-09-15 14:37   ` Jonathan Cameron via
  2023-09-22  5:49     ` Ankit Agrawal
  2023-09-15 14:52   ` Igor Mammedov
  1 sibling, 1 reply; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-15 14:37 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

On Thu, 14 Sep 2023 19:45:58 -0700
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> During bootup, Linux kernel parse the ACPI SRAT to determine the PXM ids.
> This allows for the creation of NUMA nodes for each unique id.
> 
> Insert a series of the unique PXM ids in the VM SRAT ACPI table. The
> range of nodes can be determined from the "dev_mem_pxm_start" and
> "dev_mem_pxm_count" object properties associated with the device. These
> nodes as made MEM_AFFINITY_HOTPLUGGABLE. This allows the kernel to create
> memory-less NUMA nodes on bootup to which a subrange (or entire range) of
> device memory can be added/removed.
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/arm/virt-acpi-build.c | 54 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 6b674231c2..6d1e3b6b8a 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -46,6 +46,7 @@
>  #include "hw/acpi/hmat.h"
>  #include "hw/pci/pcie_host.h"
>  #include "hw/pci/pci.h"
> +#include "hw/vfio/pci.h"
>  #include "hw/pci/pci_bus.h"
>  #include "hw/pci-host/gpex.h"
>  #include "hw/arm/virt.h"
> @@ -515,6 +516,57 @@ build_spcr(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      acpi_table_end(linker, &table);
>  }
>  
> +static int devmem_device_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_VFIO_PCI)) {
> +        *list = g_slist_append(*list, DEVICE(obj));
> +    }
> +
> +    object_child_foreach(obj, devmem_device_list, opaque);
> +    return 0;
> +}
> +
> +static GSList *devmem_get_device_list(void)
> +{
> +    GSList *list = NULL;
> +
> +    object_child_foreach(qdev_get_machine(), devmem_device_list, &list);
> +    return list;
> +}
> +
> +static void build_srat_devmem(GArray *table_data)
> +{
> +    GSList *device_list, *list = devmem_get_device_list();

Why build a list for this purpose?  Easier to just do all this handling in the
callback.

	object_child_foreach_recursive(qdev_get_machine(), add_devmem_srat, table);
All the code below goes in the callback.

Though if you follow the suggestion I made in patch 1 this code all changes anyway
as the linkage is the other way.


> +
> +    for (device_list = list; device_list; device_list = device_list->next) {
> +        DeviceState *dev = device_list->data;
> +        Object *obj = OBJECT(dev);
> +        VFIOPCIDevice *pcidev
> +            = ((VFIOPCIDevice *)object_dynamic_cast(OBJECT(obj),
> +               TYPE_VFIO_PCI));
> +
> +        if (pcidev->pdev.has_coherent_memory) {
> +            uint64_t start_node = object_property_get_uint(obj,
> +                                  "dev_mem_pxm_start", &error_abort);
> +            uint64_t node_count = object_property_get_uint(obj,
> +                                  "dev_mem_pxm_count", &error_abort);
> +            uint64_t node_index;
> +
> +            /*
> +             * Add the node_count PXM domains starting from start_node as
> +             * hot pluggable. The VM kernel parse the PXM domains and
> +             * creates NUMA nodes.
> +             */
> +            for (node_index = 0; node_index < node_count; node_index++)
> +                build_srat_memory(table_data, 0, 0, start_node + node_index,
> +                    MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE);

0 size SRAT entries for memory? That's not valid.

Seems like you've run into the same issue CXL has with dynamic addition of
nodes to the kernel and all you want to do here is make sure it thinks there
are enough nodes so initializes various structures large enough.

I don't like the CXL solution (add one per CFWMS) either and think this needs
fixing in Linux.  This doesn't feel like a valid way to do that to me.

You 'could' let EDK2 or whatever you are running finish enumerating
all the PCI bus - at which point I'm guessing your coherent ranges are
visible?  At that point, if the description provided to QEMU is sufficiently
detailed (which BAR etc) then in the rebuild of tables that occurs via
virt_acpi_build_update()

I'll add that as you are want this in Virt, chances are Peter is going to
ask for DT support (welcome to my world of pain as I have no idea what
that would look like :)  I've still not figure out how to fixup the misisng
DT support for PXBs.

> +        }
> +    }
> +    g_slist_free(list);
> +}
> +
>  /*
>   * ACPI spec, Revision 5.1
>   * 5.2.16 System Resource Affinity Table (SRAT)
> @@ -569,6 +621,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>                            MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
>      }
>  
> +    build_srat_devmem(table_data);
> +
>      acpi_table_end(linker, &table);
>  }
>  



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-15 14:19 ` [PATCH v1 0/4] vfio: report NUMA nodes for device memory Cédric Le Goater
@ 2023-09-15 14:47   ` Alex Williamson
  2023-09-15 18:34     ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2023-09-15 14:47 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: ankita, jgg, shannon.zhaosl, peter.maydell, ani, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, qemu-arm, qemu-devel

On Fri, 15 Sep 2023 16:19:29 +0200
Cédric Le Goater <clg@redhat.com> wrote:

> Hello Ankit,
> 
> On 9/15/23 04:45, ankita@nvidia.com wrote:
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > For devices which allow CPU to cache coherently access their memory,
> > it is sensible to expose such memory as NUMA nodes separate from
> > the sysmem node. Qemu currently do not provide a mechanism for creation
> > of NUMA nodes associated with a vfio-pci device.
> > 
> > Implement a mechanism to create and associate a set of unique NUMA nodes
> > with a vfio-pci device.>
> > NUMA node is created by inserting a series of the unique proximity
> > domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
> > at the time of bootup by the kernel to determine the NUMA configuration
> > and is inflexible post that. Hence this feature is incompatible with
> > device hotplug. The added node range associated with the device is
> > communicated through ACPI DSD and can be fetched by the VM kernel or
> > kernel modules. QEMU's VM SRAT and DSD builder code is modified
> > accordingly.
> > 
> > New command line params are introduced for admin to have a control on
> > the NUMA node assignment.  
> 
> This approach seems to bypass the NUMA framework in place in QEMU and
> will be a challenge for the upper layers. QEMU is generally used from
> libvirt when dealing with KVM guests.
> 
> Typically, a command line for a virt machine with NUMA nodes would look
> like :
> 
>    -object memory-backend-ram,id=ram-node0,size=1G \
>    -numa node,nodeid=0,memdev=ram-node0 \
>    -object memory-backend-ram,id=ram-node1,size=1G \
>    -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
> 
> which defines 2 nodes, one with memory and all CPUs and a second with
> only memory.
> 
>    # numactl -H
>    available: 2 nodes (0-1)
>    node 0 cpus: 0 1 2 3
>    node 0 size: 1003 MB
>    node 0 free: 734 MB
>    node 1 cpus:
>    node 1 size: 975 MB
>    node 1 free: 968 MB
>    node distances:
>    node   0   1
>      0:  10  20
>      1:  20  10
> 
>    
> Could it be a new type of host memory backend ?  Have you considered
> this approach ?

Good idea.  Fundamentally the device should not be creating NUMA nodes,
the VM should be configured with NUMA nodes and the device memory
associated with those nodes.

I think we're also dealing with a lot of very, very device specific
behavior, so I question whether we shouldn't create a separate device
for this beyond vifo-pci or vfio-pci-nohotplug.

In particular, a PCI device typically only has association to a single
proximity domain, so what sense does it make to describe the coherent
memory as a PCI BAR to only then create a confusing mapping where the
device has a proximity domain separate from a resources associated with
the device?

It's seeming like this device should create memory objects that can be
associated as memory backing for command line specified NUMA nodes.
Thanks,

Alex



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes
  2023-09-15 14:25   ` Jonathan Cameron via
@ 2023-09-15 14:48     ` Igor Mammedov
  2023-09-22  5:44       ` Ankit Agrawal
  2023-09-25 14:08       ` Jonathan Cameron via
  0 siblings, 2 replies; 41+ messages in thread
From: Igor Mammedov @ 2023-09-15 14:48 UTC (permalink / raw)
  To: Jonathan Cameron via
  Cc: Jonathan Cameron, ankita, jgg, alex.williamson, clg,
	shannon.zhaosl, peter.maydell, ani, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, qemu-arm, mst

On Fri, 15 Sep 2023 15:25:09 +0100
Jonathan Cameron via <qemu-devel@nongnu.org> wrote:

> On Thu, 14 Sep 2023 19:45:56 -0700
> <ankita@nvidia.com> wrote:
> 
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > The CPU cache coherent device memory can be added as a set of
> > NUMA nodes distinct from the system memory nodes. The Qemu currently
> > do not provide a mechanism to support node creation for a vfio-pci
> > device.
> > 
> > Introduce new command line parameters to allow host admin provide
> > the desired starting NUMA node id (pxm-ns) and the number of such
> > nodes (pxm-nc) associated with the device. In this implementation,
> > a numerically consecutive nodes from pxm-ns to pxm-ns + pxm-nc
> > is created. Also validate the requested range of nodes to check  
> 
> 
> Hi Ankit,
> 
> That's not a particularly intuitive bit of interface!
> 
> pxm-start
> pxm-number
> perhaps?  However, in QEMU commmand lines the node terminology is used so
> numa-node-start
> numa-node-number
> 
> Though in general this feels backwards compared to how the rest of
> the numa definition is currently done.

QEMU have already means to assign NUMA node affinity
to PCI hierarchies in generic way by using a PBX per node
(also done 'backwards') by setting node option on it.
So every device behind it should belong to that node as well
and guest OS shall pickup device affinity from PCI tree it belongs to.
(I'd suspect that CXL is supposed to work the same way).

PS:
But then, I don't know much about PCI
(ccing Michael)

> 
> Could the current interface be extended.
> 
> -numa node,vfio-device=X
> 
> at the cost of a bit of house keeping and lookup.
> 
> We need something similar for generic initiators, so maybe
> vfio-mem=X? (might not even need to be vfio specific - even
> if we only support it for now on VFIO devices).
> leaving
> initiator=X available for later...
> 
> Also, good to say why multiple nodes per device are needed.
> 
> Jonathan
> 
> > for conflict with other nodes and to ensure that the id do not cross
> > QEMU limit.
> > 
> > Since the QEMU's SRAT and DST builder code needs the proximity
> > domain (PXM) id range, expose PXM start and count as device object
> > properties.
> > 
> > The device driver module communicates support for such feature through
> > sysfs. Check the presence of the feature to activate the code.
> > 
> > E.g. the following argument adds 8 PXM nodes starting from id 0x10.
> > -device vfio-pci-nohotplug,host=<pci-bdf>,pxm-ns=0x10,pxm-nc=8
> > 
> > Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> > ---
> >  hw/vfio/pci.c               | 144 ++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.h               |   2 +
> >  include/hw/pci/pci_device.h |   3 +
> >  3 files changed, 149 insertions(+)
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index a205c6b113..cc0c516161 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -42,6 +42,8 @@
> >  #include "qapi/error.h"
> >  #include "migration/blocker.h"
> >  #include "migration/qemu-file.h"
> > +#include "qapi/visitor.h"
> > +#include "include/hw/boards.h"
> >  
> >  #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
> >  
> > @@ -2955,6 +2957,22 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
> >      }
> >  }
> >  
> > +static void vfio_pci_get_dev_mem_pxm_start(Object *obj, Visitor *v,
> > +                                           const char *name,
> > +                                           void *opaque, Error **errp)
> > +{
> > +    uint64_t pxm_start = (uintptr_t) opaque;
> > +    visit_type_uint64(v, name, &pxm_start, errp);
> > +}
> > +
> > +static void vfio_pci_get_dev_mem_pxm_count(Object *obj, Visitor *v,
> > +                                           const char *name,
> > +                                           void *opaque, Error **errp)
> > +{
> > +    uint64_t pxm_count = (uintptr_t) opaque;
> > +    visit_type_uint64(v, name, &pxm_count, errp);
> > +}
> > +
> >  static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
> >  {
> >      Error *err = NULL;
> > @@ -2974,6 +2992,125 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
> >      vdev->req_enabled = false;
> >  }
> >  
> > +static int validate_dev_numa(uint32_t dev_node_start, uint32_t num_nodes)
> > +{
> > +    MachineState *ms = MACHINE(qdev_get_machine());
> > +    unsigned int i;
> > +
> > +    if (num_nodes >= MAX_NODES) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    for (i = 0; i < num_nodes; i++) {
> > +        if (ms->numa_state->nodes[dev_node_start + i].present) {
> > +            return -EBUSY;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int mark_dev_node_present(uint32_t dev_node_start, uint32_t num_nodes)
> > +{
> > +    MachineState *ms = MACHINE(qdev_get_machine());
> > +    unsigned int i;
> > +
> > +    for (i = 0; i < num_nodes; i++) {
> > +        ms->numa_state->nodes[dev_node_start + i].present = true;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> > +static bool vfio_pci_read_cohmem_support_sysfs(VFIODevice *vdev)
> > +{
> > +    gchar *contents = NULL;
> > +    gsize length;
> > +    char *path;
> > +    bool ret = false;
> > +    uint32_t supported;
> > +
> > +    path = g_strdup_printf("%s/coherent_mem", vdev->sysfsdev);  
> 
> If someone has asked for it, why should we care if they specify it on a
> device where it doesn't do anything useful?  This to me feels like something
> to check at a higher level of the stack.
> 
> > +    if (g_file_get_contents(path, &contents, &length, NULL) && length > 0) {
> > +        if ((sscanf(contents, "%u", &supported) == 1) && supported) {
> > +            ret = true;
> > +        }
> > +    }
> > +
> > +    if (length) {
> > +        g_free(contents);  
> 
> g_autofree will clean this up for you I think
> 
> > +    }
> > +    g_free(path);  
> 
> and this.
> 
> > +
> > +    return ret;
> > +}
> > +
> > +static int vfio_pci_dev_mem_probe(VFIOPCIDevice *vPciDev,
> > +                                         Error **errp)
> > +{
> > +    Object *obj = NULL;
> > +    VFIODevice *vdev = &vPciDev->vbasedev;
> > +    MachineState *ms = MACHINE(qdev_get_machine());
> > +    int ret = 0;
> > +    uint32_t dev_node_start = vPciDev->dev_node_start;
> > +    uint32_t dev_node_count = vPciDev->dev_nodes;
> > +
> > +    if (!vdev->sysfsdev || !vfio_pci_read_cohmem_support_sysfs(vdev)) {
> > +        ret = -ENODEV;  
> return -ENODEV; 
> 
> and similar in all the other cases as no cleanup to do.
> 
> > +        goto done;
> > +    }
> > +
> > +    if (vdev->type == VFIO_DEVICE_TYPE_PCI) {  
> 
> nicer to handle one condition at a time.
> 
>     if (vdev->type != VFIO_DEVICE_TYPE_PCI) {
>         return -EINVAL;
>     }
> 
>     obj = vfio_pci_get_object(vdev);
> this can't fail
> 
> Also get rid of assigning it to NULL above.
> 
>    if (DEVICE_CLASS(object...)) {
>       return -EINVAL;
>    }
> 
> 
> > +        obj = vfio_pci_get_object(vdev);
> > +    }
> > +
> > +    /* Since this device creates new NUMA node, hotplug is not supported. */
> > +    if (!obj || DEVICE_CLASS(object_get_class(obj))->hotpluggable) {
> > +        ret = -EINVAL;
> > +        goto done;
> > +    }
> > +
> > +    /*
> > +     * This device has memory that is coherently accessible from the CPU.
> > +     * The memory can be represented seperate memory-only NUMA nodes.
> > +     */
> > +    vPciDev->pdev.has_coherent_memory = true;
> > +
> > +    /*
> > +     * The device can create several NUMA nodes with consecutive IDs
> > +     * from dev_node_start to dev_node_start + dev_node_count.
> > +     * Verify
> > +     * - whether any node ID is occupied in the desired range.
> > +     * - Node ID is not crossing MAX_NODE.
> > +     */
> > +    ret = validate_dev_numa(dev_node_start, dev_node_count);
> > +    if (ret) {
> > +        goto done;  
>         return ret;
> 
> > +    }
> > +
> > +    /* Reserve the node by marking as present */
> > +    mark_dev_node_present(dev_node_start, dev_node_count);
> > +
> > +    /*
> > +     * To have multiple unique nodes in the VM, a series of PXM nodes are
> > +     * required to be added to VM's SRAT. Send the information about the
> > +     * starting node ID and the node count to the ACPI builder code.
> > +     */
> > +    object_property_add(OBJECT(vPciDev), "dev_mem_pxm_start", "uint64",
> > +                        vfio_pci_get_dev_mem_pxm_start, NULL, NULL,
> > +                        (void *) (uintptr_t) dev_node_start);
> > +
> > +    object_property_add(OBJECT(vPciDev), "dev_mem_pxm_count", "uint64",
> > +                        vfio_pci_get_dev_mem_pxm_count, NULL, NULL,
> > +                        (void *) (uintptr_t) dev_node_count);
> > +
> > +    ms->numa_state->num_nodes += dev_node_count;
> > +
> > +done:
> > +    return ret;
> > +}
> > +
> >  static void vfio_realize(PCIDevice *pdev, Error **errp)
> >  {
> >      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> > @@ -3291,6 +3428,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >          }
> >      }
> >  
> > +    ret = vfio_pci_dev_mem_probe(vdev, errp);
> > +    if (ret && ret != -ENODEV) {
> > +        error_report("Failed to setup device memory with error %d", ret);
> > +    }
> > +
> >      vfio_register_err_notifier(vdev);
> >      vfio_register_req_notifier(vdev);
> >      vfio_setup_resetfn_quirk(vdev);
> > @@ -3454,6 +3596,8 @@ static Property vfio_pci_dev_properties[] = {
> >      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
> >                         sub_device_id, PCI_ANY_ID),
> >      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> > +    DEFINE_PROP_UINT32("pxm-ns", VFIOPCIDevice, dev_node_start, 0),
> > +    DEFINE_PROP_UINT32("pxm-nc", VFIOPCIDevice, dev_nodes, 0),
> >      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
> >                                     nv_gpudirect_clique,
> >                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index a2771b9ff3..eef5ddfd06 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -158,6 +158,8 @@ struct VFIOPCIDevice {
> >      uint32_t display_yres;
> >      int32_t bootindex;
> >      uint32_t igd_gms;
> > +    uint32_t dev_node_start;
> > +    uint32_t dev_nodes;
> >      OffAutoPCIBAR msix_relo;
> >      uint8_t pm_cap;
> >      uint8_t nv_gpudirect_clique;
> > diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> > index d3dd0f64b2..aacd2279ae 100644
> > --- a/include/hw/pci/pci_device.h
> > +++ b/include/hw/pci/pci_device.h
> > @@ -157,6 +157,9 @@ struct PCIDevice {
> >      MSIVectorReleaseNotifier msix_vector_release_notifier;
> >      MSIVectorPollNotifier msix_vector_poll_notifier;
> >  
> > +    /* GPU coherent memory */
> > +    bool has_coherent_memory;
> > +
> >      /* ID of standby device in net_failover pair */
> >      char *failover_pair_id;
> >      uint32_t acpi_index;  
> 
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-15  2:45 ` [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes ankita
  2023-09-15 14:37   ` Jonathan Cameron via
@ 2023-09-15 14:52   ` Igor Mammedov
  2023-09-15 15:49     ` David Hildenbrand
  1 sibling, 1 reply; 41+ messages in thread
From: Igor Mammedov @ 2023-09-15 14:52 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel, David Hildenbrand

On Thu, 14 Sep 2023 19:45:58 -0700
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> During bootup, Linux kernel parse the ACPI SRAT to determine the PXM ids.
> This allows for the creation of NUMA nodes for each unique id.
> 
> Insert a series of the unique PXM ids in the VM SRAT ACPI table. The
> range of nodes can be determined from the "dev_mem_pxm_start" and
> "dev_mem_pxm_count" object properties associated with the device. These
> nodes as made MEM_AFFINITY_HOTPLUGGABLE. This allows the kernel to create
> memory-less NUMA nodes on bootup to which a subrange (or entire range) of
> device memory can be added/removed.

QEMU already has 'memory devices'. perhaps this case belongs to the same class
CCing David.

> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/arm/virt-acpi-build.c | 54 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
> 
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 6b674231c2..6d1e3b6b8a 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -46,6 +46,7 @@
>  #include "hw/acpi/hmat.h"
>  #include "hw/pci/pcie_host.h"
>  #include "hw/pci/pci.h"
> +#include "hw/vfio/pci.h"
>  #include "hw/pci/pci_bus.h"
>  #include "hw/pci-host/gpex.h"
>  #include "hw/arm/virt.h"
> @@ -515,6 +516,57 @@ build_spcr(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      acpi_table_end(linker, &table);
>  }
>  
> +static int devmem_device_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_VFIO_PCI)) {
> +        *list = g_slist_append(*list, DEVICE(obj));
> +    }
> +
> +    object_child_foreach(obj, devmem_device_list, opaque);
> +    return 0;
> +}
> +
> +static GSList *devmem_get_device_list(void)
> +{
> +    GSList *list = NULL;
> +
> +    object_child_foreach(qdev_get_machine(), devmem_device_list, &list);
> +    return list;
> +}
> +
> +static void build_srat_devmem(GArray *table_data)
> +{
> +    GSList *device_list, *list = devmem_get_device_list();
> +
> +    for (device_list = list; device_list; device_list = device_list->next) {
> +        DeviceState *dev = device_list->data;
> +        Object *obj = OBJECT(dev);
> +        VFIOPCIDevice *pcidev
> +            = ((VFIOPCIDevice *)object_dynamic_cast(OBJECT(obj),
> +               TYPE_VFIO_PCI));
> +
> +        if (pcidev->pdev.has_coherent_memory) {
> +            uint64_t start_node = object_property_get_uint(obj,
> +                                  "dev_mem_pxm_start", &error_abort);
> +            uint64_t node_count = object_property_get_uint(obj,
> +                                  "dev_mem_pxm_count", &error_abort);
> +            uint64_t node_index;
> +
> +            /*
> +             * Add the node_count PXM domains starting from start_node as
> +             * hot pluggable. The VM kernel parse the PXM domains and
> +             * creates NUMA nodes.
> +             */
> +            for (node_index = 0; node_index < node_count; node_index++)
> +                build_srat_memory(table_data, 0, 0, start_node + node_index,
> +                    MEM_AFFINITY_ENABLED | MEM_AFFINITY_HOTPLUGGABLE);
> +        }
> +    }
> +    g_slist_free(list);
> +}
> +
>  /*
>   * ACPI spec, Revision 5.1
>   * 5.2.16 System Resource Affinity Table (SRAT)
> @@ -569,6 +621,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>                            MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
>      }
>  
> +    build_srat_devmem(table_data);
> +
>      acpi_table_end(linker, &table);
>  }
>  



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 4/4] acpi/gpex: patch guest DSDT for dev mem information
  2023-09-15  2:45 ` [PATCH v1 4/4] acpi/gpex: patch guest DSDT for dev mem information ankita
@ 2023-09-15 15:13   ` Igor Mammedov
  2023-09-27 11:42   ` Jonathan Cameron via
  1 sibling, 0 replies; 41+ messages in thread
From: Igor Mammedov @ 2023-09-15 15:13 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

On Thu, 14 Sep 2023 19:45:59 -0700
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> To add the memory in the guest as NUMA nodes, it needs the PXM node index
> and the total count of nodes associated with the memory. The range of
> proximity domains are communicated to the VM as part of the guest ACPI

> using the nvidia,gpu-mem-pxm-start and nvidia,gpu-mem-pxm-count DSD
above examples should use devices that are (or to be) available in QEMU,
not some out of tree ones.

> properties. These value respectively represent the staring proximity
> domain id and the count. Kernel modules can then fetch this information
> and determine the numa node id using pxm_to_node().
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/pci-host/gpex-acpi.c | 69 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 69 insertions(+)
> 
> diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
> index 7c7316bc96..0548feace1 100644
> --- a/hw/pci-host/gpex-acpi.c
> +++ b/hw/pci-host/gpex-acpi.c
> @@ -49,6 +49,72 @@ static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq)
>      }
>  }
>  
> +static void acpi_dsdt_add_cohmem_device(Aml *dev, int32_t devfn,
> +                                        uint64_t dev_mem_pxm_start,
> +                                        uint64_t dev_mem_pxm_count)
> +{
> +    Aml *memdev = aml_device("CMD%X", PCI_SLOT(devfn));
> +    Aml *pkg = aml_package(2);
> +    Aml *pkg1 = aml_package(2);
> +    Aml *pkg2 = aml_package(2);
> +    Aml *dev_pkg = aml_package(2);
> +    Aml *UUID;
> +
> +    aml_append(memdev, aml_name_decl("_ADR", aml_int(PCI_SLOT(devfn) << 16)));

PCI devices (especially endpoints) are typically enumerated by
bus specific means (i.e not by ACPI).

And whether OSPM will honor the remainder of AML here is very questionable.

> +
> +    aml_append(pkg1, aml_string("dev-mem-pxm-start"));
> +    aml_append(pkg1, aml_int(dev_mem_pxm_start));
> +
> +    aml_append(pkg2, aml_string("dev-mem-pxm-count"));
> +    aml_append(pkg2, aml_int(dev_mem_pxm_count));
> +
> +    aml_append(pkg, pkg1);
> +    aml_append(pkg, pkg2);
> +
> +    UUID = aml_touuid("DAFFD814-6EBA-4D8C-8A91-BC9BBF4AA301");

I'm not a fun of free form UUIDs and above one seems to be the case:
https://uefi.org/sites/default/files/resources/_DSD-device-properties-UUID.pdf

looking at above doc this UUID also requires HID/ACPI ID
to describe data structure definition which this patch is missing.
It's also questionable whether _HID and _ADR are allowed to go together.

PS:
Commit message and comments here should have appropriate pointers
to relevant specs.

> +    aml_append(dev_pkg, UUID);
> +    aml_append(dev_pkg, pkg);
> +
> +    aml_append(memdev, aml_name_decl("_DSD", dev_pkg));
> +    aml_append(dev, memdev);
> +}
> +
> +static void find_mem_device(PCIBus *bus, PCIDevice *pdev,
> +                            void *opaque)
> +{
> +    Aml *dev = (Aml *)opaque;
> +
> +    if (bus == NULL) {
> +        return;
> +    }
> +
> +    if (pdev->has_coherent_memory) {
> +        Object *po = OBJECT(pdev);
> +
> +        if (po == NULL) {
> +            return;
> +        }
> +
> +        uint64_t pxm_start
> +           = object_property_get_uint(po, "dev_mem_pxm_start", NULL);
> +        uint64_t pxm_count
> +           = object_property_get_uint(po, "dev_mem_pxm_count", NULL);
> +
> +        acpi_dsdt_add_cohmem_device(dev, pdev->devfn, pxm_start, pxm_count);
> +    }
> +}
> +
> +static void acpi_dsdt_find_and_add_cohmem_device(PCIBus *bus, Aml *dev)
> +{
> +    if (bus == NULL) {
> +        return;
> +    }
> +
> +    pci_for_each_device_reverse(bus, pci_bus_num(bus),
> +                                find_mem_device, dev);
> +
> +}
> +
>  static void acpi_dsdt_add_pci_osc(Aml *dev)
>  {
>      Aml *method, *UUID, *ifctx, *ifctx1, *elsectx, *buf;
> @@ -207,7 +273,10 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
>  
>      acpi_dsdt_add_pci_route_table(dev, cfg->irq);
>  
> +    acpi_dsdt_find_and_add_cohmem_device(cfg->bus, dev);
> +
>      method = aml_method("_CBA", 0, AML_NOTSERIALIZED);
> +
>      aml_append(method, aml_return(aml_int(cfg->ecam.base)));
>      aml_append(dev, method);
>  



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-15 14:52   ` Igor Mammedov
@ 2023-09-15 15:49     ` David Hildenbrand
  0 siblings, 0 replies; 41+ messages in thread
From: David Hildenbrand @ 2023-09-15 15:49 UTC (permalink / raw)
  To: Igor Mammedov, ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

On 15.09.23 16:52, Igor Mammedov wrote:
> On Thu, 14 Sep 2023 19:45:58 -0700
> <ankita@nvidia.com> wrote:
> 
>> From: Ankit Agrawal <ankita@nvidia.com>
>>
>> During bootup, Linux kernel parse the ACPI SRAT to determine the PXM ids.
>> This allows for the creation of NUMA nodes for each unique id.
>>
>> Insert a series of the unique PXM ids in the VM SRAT ACPI table. The
>> range of nodes can be determined from the "dev_mem_pxm_start" and
>> "dev_mem_pxm_count" object properties associated with the device. These
>> nodes as made MEM_AFFINITY_HOTPLUGGABLE. This allows the kernel to create
>> memory-less NUMA nodes on bootup to which a subrange (or entire range) of
>> device memory can be added/removed.
> 
> QEMU already has 'memory devices'. perhaps this case belongs to the same class
> CCing David.

There is demand for something similar for memory devices (DIMMs, 
virtio-mem) as well.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-15 14:47   ` Alex Williamson
@ 2023-09-15 18:34     ` David Hildenbrand
  2023-09-22  8:11       ` Ankit Agrawal
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2023-09-15 18:34 UTC (permalink / raw)
  To: Alex Williamson, Cédric Le Goater
  Cc: ankita, jgg, shannon.zhaosl, peter.maydell, ani, aniketa, cjia,
	kwankhede, targupta, vsethi, acurrid, qemu-arm, qemu-devel

On 15.09.23 16:47, Alex Williamson wrote:
> On Fri, 15 Sep 2023 16:19:29 +0200
> Cédric Le Goater <clg@redhat.com> wrote:
> 
>> Hello Ankit,
>>
>> On 9/15/23 04:45, ankita@nvidia.com wrote:
>>> From: Ankit Agrawal <ankita@nvidia.com>
>>>
>>> For devices which allow CPU to cache coherently access their memory,
>>> it is sensible to expose such memory as NUMA nodes separate from
>>> the sysmem node. Qemu currently do not provide a mechanism for creation
>>> of NUMA nodes associated with a vfio-pci device.
>>>
>>> Implement a mechanism to create and associate a set of unique NUMA nodes
>>> with a vfio-pci device.>
>>> NUMA node is created by inserting a series of the unique proximity
>>> domains (PXM) in the VM SRAT ACPI table. The ACPI tables are read once
>>> at the time of bootup by the kernel to determine the NUMA configuration
>>> and is inflexible post that. Hence this feature is incompatible with
>>> device hotplug. The added node range associated with the device is
>>> communicated through ACPI DSD and can be fetched by the VM kernel or
>>> kernel modules. QEMU's VM SRAT and DSD builder code is modified
>>> accordingly.
>>>
>>> New command line params are introduced for admin to have a control on
>>> the NUMA node assignment.
>>
>> This approach seems to bypass the NUMA framework in place in QEMU and
>> will be a challenge for the upper layers. QEMU is generally used from
>> libvirt when dealing with KVM guests.
>>
>> Typically, a command line for a virt machine with NUMA nodes would look
>> like :
>>
>>     -object memory-backend-ram,id=ram-node0,size=1G \
>>     -numa node,nodeid=0,memdev=ram-node0 \
>>     -object memory-backend-ram,id=ram-node1,size=1G \
>>     -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
>>
>> which defines 2 nodes, one with memory and all CPUs and a second with
>> only memory.
>>
>>     # numactl -H
>>     available: 2 nodes (0-1)
>>     node 0 cpus: 0 1 2 3
>>     node 0 size: 1003 MB
>>     node 0 free: 734 MB
>>     node 1 cpus:
>>     node 1 size: 975 MB
>>     node 1 free: 968 MB
>>     node distances:
>>     node   0   1
>>       0:  10  20
>>       1:  20  10
>>
>>     
>> Could it be a new type of host memory backend ?  Have you considered
>> this approach ?
> 
> Good idea.  Fundamentally the device should not be creating NUMA nodes,
> the VM should be configured with NUMA nodes and the device memory
> associated with those nodes.

+1. That would also make it fly with DIMMs and virtio-mem, where you 
would want NUMA-less nodes ass well (imagine passing CXL memory to a VM 
using virtio-mem).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes
  2023-09-15 14:48     ` Igor Mammedov
@ 2023-09-22  5:44       ` Ankit Agrawal
  2023-09-25 14:08       ` Jonathan Cameron via
  1 sibling, 0 replies; 41+ messages in thread
From: Ankit Agrawal @ 2023-09-22  5:44 UTC (permalink / raw)
  To: Igor Mammedov, Jonathan Cameron via, alex.williamson
  Cc: Jonathan Cameron, Jason Gunthorpe, clg, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, mst

> Also, good to say why multiple nodes per device are needed.
This is to support the GPU's MIG (Mult-Instance GPUs) feature,
(https://www.nvidia.com/en-in/technologies/multi-instance-gpu/) which
allows partitioning of the GPU device resources (including device memory) into
several isolated instances. We are creating multiple NUMA nodes to give
each partition their own node. Now the partitions are not fixed and they 
can be created/deleted and updated in (memory) sizes at runtime. This is
the reason these nodes are tagged as MEM_AFFINITY_HOTPLUGGABLE. Such
setting allows flexibility in the VM to associate a desired partition/range 
of device memory to a node (that is adjustable). Note that we are replicating
the behavior on baremetal here.

I will also put this detail on the cover letter in the next version.

> QEMU have already means to assign NUMA node affinity
> to PCI hierarchies in generic way by using a PBX per node
> (also done 'backwards') by setting node option on it.
> So every device behind it should belong to that node as well
> and guest OS shall pickup device affinity from PCI tree it belongs to.

Yes, but the problem is that only one node may be associated this way
and we have several.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-15 14:37   ` Jonathan Cameron via
@ 2023-09-22  5:49     ` Ankit Agrawal
  2023-09-25 13:54       ` Jonathan Cameron via
  0 siblings, 1 reply; 41+ messages in thread
From: Ankit Agrawal @ 2023-09-22  5:49 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Jason Gunthorpe, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

Hi Jonathan

> > +        if (pcidev->pdev.has_coherent_memory) {
> > +            uint64_t start_node = object_property_get_uint(obj,
> > +                                  "dev_mem_pxm_start", &error_abort);
> > +            uint64_t node_count = object_property_get_uint(obj,
> > +                                  "dev_mem_pxm_count", &error_abort);
> > +            uint64_t node_index;
> > +
> > +            /*
> > +             * Add the node_count PXM domains starting from start_node as
> > +             * hot pluggable. The VM kernel parse the PXM domains and
> > +             * creates NUMA nodes.
> > +             */
> > +            for (node_index = 0; node_index < node_count; node_index++)
> > +                build_srat_memory(table_data, 0, 0, start_node + node_index,
> > +                    MEM_AFFINITY_ENABLED |
> > + MEM_AFFINITY_HOTPLUGGABLE);
> 
> 0 size SRAT entries for memory? That's not valid.

Can you explain in what sense are these invalid? The Linux kernel accepts
such setting and I had tested it.

> Seems like you've run into the same issue CXL has with dynamic addition of
> nodes to the kernel and all you want to do here is make sure it thinks there are
> enough nodes so initializes various structures large enough.
>
Yes, exactly.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-15 18:34     ` David Hildenbrand
@ 2023-09-22  8:11       ` Ankit Agrawal
  2023-09-22  8:15         ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Ankit Agrawal @ 2023-09-22  8:11 UTC (permalink / raw)
  To: David Hildenbrand, Alex Williamson, Cédric Le Goater
  Cc: Jason Gunthorpe, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

> >>
> >> Typically, a command line for a virt machine with NUMA nodes would
> >> look like :
> >>
> >>     -object memory-backend-ram,id=ram-node0,size=1G \
> >>     -numa node,nodeid=0,memdev=ram-node0 \
> >>     -object memory-backend-ram,id=ram-node1,size=1G \
> >>     -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
> >>
> >> which defines 2 nodes, one with memory and all CPUs and a second with
> >> only memory.
> >>
> >>     # numactl -H
> >>     available: 2 nodes (0-1)
> >>     node 0 cpus: 0 1 2 3
> >>     node 0 size: 1003 MB
> >>     node 0 free: 734 MB
> >>     node 1 cpus:
> >>     node 1 size: 975 MB
> >>     node 1 free: 968 MB
> >>     node distances:
> >>     node   0   1
> >>       0:  10  20
> >>       1:  20  10
> >>
> >>
> >> Could it be a new type of host memory backend ?  Have you considered
> >> this approach ?
> >
> > Good idea.  Fundamentally the device should not be creating NUMA
> > nodes, the VM should be configured with NUMA nodes and the device
> > memory associated with those nodes.
> 
> +1. That would also make it fly with DIMMs and virtio-mem, where you
> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
> using virtio-mem).
> 

We actually do not add the device memory on the host, instead
map it into the Qemu VMA using remap_pfn_range(). Please checkout the
mmap function in vfio-pci variant driver code managing the device.
https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
And I think host memory backend would need memory that is added on the
host.

Moreover since we want to passthrough the entire device memory, the
-object memory-backend-ram would have to be passed a size that is equal
to the device memory. I wonder if that would be too much of a trouble
for an admin (or libvirt) triggering the Qemu process.

Both these items are avoided by exposing the device memory as BAR as in the
current  implementation (referenced above) since it lets Qemu to naturally
discover the device memory region and do mmap.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-22  8:11       ` Ankit Agrawal
@ 2023-09-22  8:15         ` David Hildenbrand
  2023-09-26 14:52           ` Ankit Agrawal
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2023-09-22  8:15 UTC (permalink / raw)
  To: Ankit Agrawal, Alex Williamson, Cédric Le Goater
  Cc: Jason Gunthorpe, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

On 22.09.23 10:11, Ankit Agrawal wrote:
>>>>
>>>> Typically, a command line for a virt machine with NUMA nodes would
>>>> look like :
>>>>
>>>>      -object memory-backend-ram,id=ram-node0,size=1G \
>>>>      -numa node,nodeid=0,memdev=ram-node0 \
>>>>      -object memory-backend-ram,id=ram-node1,size=1G \
>>>>      -numa node,nodeid=1,cpus=0-3,memdev=ram-node1
>>>>
>>>> which defines 2 nodes, one with memory and all CPUs and a second with
>>>> only memory.
>>>>
>>>>      # numactl -H
>>>>      available: 2 nodes (0-1)
>>>>      node 0 cpus: 0 1 2 3
>>>>      node 0 size: 1003 MB
>>>>      node 0 free: 734 MB
>>>>      node 1 cpus:
>>>>      node 1 size: 975 MB
>>>>      node 1 free: 968 MB
>>>>      node distances:
>>>>      node   0   1
>>>>        0:  10  20
>>>>        1:  20  10
>>>>
>>>>
>>>> Could it be a new type of host memory backend ?  Have you considered
>>>> this approach ?
>>>
>>> Good idea.  Fundamentally the device should not be creating NUMA
>>> nodes, the VM should be configured with NUMA nodes and the device
>>> memory associated with those nodes.
>>
>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>> using virtio-mem).
>>
> 
> We actually do not add the device memory on the host, instead
> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
> mmap function in vfio-pci variant driver code managing the device.
> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> And I think host memory backend would need memory that is added on the
> host.
> 
> Moreover since we want to passthrough the entire device memory, the
> -object memory-backend-ram would have to be passed a size that is equal
> to the device memory. I wonder if that would be too much of a trouble
> for an admin (or libvirt) triggering the Qemu process.
> 
> Both these items are avoided by exposing the device memory as BAR as in the
> current  implementation (referenced above) since it lets Qemu to naturally
> discover the device memory region and do mmap.
> 

Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured 
on the device, not on the memory backend.

e.g., -device pc-dimm,node=3,memdev=mem1,...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-22  5:49     ` Ankit Agrawal
@ 2023-09-25 13:54       ` Jonathan Cameron via
  2023-09-25 14:03         ` Jason Gunthorpe
  0 siblings, 1 reply; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-25 13:54 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jason Gunthorpe, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

On Fri, 22 Sep 2023 05:49:46 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> Hi Jonathan

Hi Ankit,

> 
> > > +        if (pcidev->pdev.has_coherent_memory) {
> > > +            uint64_t start_node = object_property_get_uint(obj,
> > > +                                  "dev_mem_pxm_start", &error_abort);
> > > +            uint64_t node_count = object_property_get_uint(obj,
> > > +                                  "dev_mem_pxm_count", &error_abort);
> > > +            uint64_t node_index;
> > > +
> > > +            /*
> > > +             * Add the node_count PXM domains starting from start_node as
> > > +             * hot pluggable. The VM kernel parse the PXM domains and
> > > +             * creates NUMA nodes.
> > > +             */
> > > +            for (node_index = 0; node_index < node_count; node_index++)
> > > +                build_srat_memory(table_data, 0, 0, start_node + node_index,
> > > +                    MEM_AFFINITY_ENABLED |
> > > + MEM_AFFINITY_HOTPLUGGABLE);  
> > 
> > 0 size SRAT entries for memory? That's not valid.  
> 
> Can you explain in what sense are these invalid? The Linux kernel accepts
> such setting and I had tested it.

ACPI specification doesn't define any means of 'updating' the memory range,
so whilst I guess they are not specifically disallowed without a spec definition
of what it means this is walking into a mine field. In particular the 
description of the hot pluggable bit worries me:
"The system hardware supports hot-add and hot-remove of this memory region."
So I think your definition is calling out that you can hot plug memory into
a region of zero size. To me that's nonsensical so a paranoid OS writer
might just spit out firmware error message and refuse to boot.

There is no guarantee other operating systems won't blow up if they see one
of these. To be able to do this safely I think you probably need an ACPI
spec update to say what such a zero length, zero base region means.

Possible the ASWG folk would say this is fine and I'm reading too much into
the spec, but I'd definitely suggest asking them via the appropriate path,
or throwing in a code first proposal for a comment on this special case and
see what response you get - my guess is it will be 'fix Linux' :(

> 
> > Seems like you've run into the same issue CXL has with dynamic addition of
> > nodes to the kernel and all you want to do here is make sure it thinks there are
> > enough nodes so initializes various structures large enough.
> >  
> Yes, exactly.
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-25 13:54       ` Jonathan Cameron via
@ 2023-09-25 14:03         ` Jason Gunthorpe
  2023-09-25 14:53           ` Jonathan Cameron via
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2023-09-25 14:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Ankit Agrawal, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

On Mon, Sep 25, 2023 at 02:54:40PM +0100, Jonathan Cameron wrote:

> Possible the ASWG folk would say this is fine and I'm reading too much into
> the spec, but I'd definitely suggest asking them via the appropriate path,
> or throwing in a code first proposal for a comment on this special case and
> see what response you get - my guess is it will be 'fix Linux' :(

The goal here is for qemu to emulate what the bare metal environment
is doing.

There may be a legitimate question if what the bare metal FW has done
is legitimate (though let's face it, there are lots of creative ACPI
things around), but I don't quite see how this is a qemu question?

Unless you are taking the position that qemu should not emulate this
HW?

Jason


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes
  2023-09-15 14:48     ` Igor Mammedov
  2023-09-22  5:44       ` Ankit Agrawal
@ 2023-09-25 14:08       ` Jonathan Cameron via
  1 sibling, 0 replies; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-25 14:08 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Jonathan Cameron via, ankita, jgg, alex.williamson, clg,
	shannon.zhaosl, peter.maydell, ani, aniketa, cjia, kwankhede,
	targupta, vsethi, acurrid, qemu-arm, mst

On Fri, 15 Sep 2023 16:48:41 +0200
Igor Mammedov <imammedo@redhat.com> wrote:

> On Fri, 15 Sep 2023 15:25:09 +0100
> Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
> 
> > On Thu, 14 Sep 2023 19:45:56 -0700
> > <ankita@nvidia.com> wrote:
> >   
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > > 
> > > The CPU cache coherent device memory can be added as a set of
> > > NUMA nodes distinct from the system memory nodes. The Qemu currently
> > > do not provide a mechanism to support node creation for a vfio-pci
> > > device.
> > > 
> > > Introduce new command line parameters to allow host admin provide
> > > the desired starting NUMA node id (pxm-ns) and the number of such
> > > nodes (pxm-nc) associated with the device. In this implementation,
> > > a numerically consecutive nodes from pxm-ns to pxm-ns + pxm-nc
> > > is created. Also validate the requested range of nodes to check    
> > 
> > 
> > Hi Ankit,
> > 
> > That's not a particularly intuitive bit of interface!
> > 
> > pxm-start
> > pxm-number
> > perhaps?  However, in QEMU commmand lines the node terminology is used so
> > numa-node-start
> > numa-node-number
> > 
> > Though in general this feels backwards compared to how the rest of
> > the numa definition is currently done.  
> 
> QEMU have already means to assign NUMA node affinity
> to PCI hierarchies in generic way by using a PBX per node
> (also done 'backwards') by setting node option on it.
> So every device behind it should belong to that node as well
> and guest OS shall pickup device affinity from PCI tree it belongs to.
> (I'd suspect that CXL is supposed to work the same way).

Whilst QEMU doesn't yet support it (I think), there is nothing stopping
people using ACPI to assign _PXM to PCI devices. The early CXL specs specifically
call out that you should do this (before things were more discoverable) Note that the
node has to be defined in SRAT though. For most PCI devices that means
using a Generic Initiator entry.

CXL 2+ has some extra toys to play with, so of which apply to PCI as well:
* Generic Port Affinity Structures in ACPI.  The idea is you can
  define latency and bandwidth to the head of a discoverable PCI/CXL topology.
* Discoverable properties for all the CXL components via a mailbox interface:
  - Latency and bandwidth across switches
  - Latency and bandwidth to separate regions of memory.
  Note that whilst it came from CXL that stuff is in a UEFI maintained spec
  so 'could' be implemented on non CXL Devices.  See the CDAT specification.

Idea is that the OS can discover actual properties and then create appropriate
NUMA description if it wants to.

The Linux kernel currently relies on what I consider a hack: We also
have host memory windows (CXL Fixed Memory Windows) that have QoS associated
with them along with interleave etc. So a rough assumption is that if you map
CXL memory into a particular CFMWS then it is good enough to group it with
other memory mapped in that CFMWS and use a single NUMA node to describe it.
Hence the kernel creates a NUMA node per CFMWS entry and the properties can
be established by exploring the route to one of the devices being accesses
(and interleave etc).

This is an approximation I think we'll need to fix long term, but it seems
good enough for first generation or two of CXL devices where people tend
not to mix devices of different performances, or put some but not all of
them behind a switch.

So the problem for the devices being discussed here is to either fix
it properly (figure out how to do dynamic creation of NUMA nodes in Linux)
or find a good proxy to ensure enough spare nodes can be created early in boot.
That's early enough that it pretty much has to come from a static ACPI table.

Jonathan

> 
> PS:
> But then, I don't know much about PCI
> (ccing Michael)
> 
> > 
> > Could the current interface be extended.
> > 
> > -numa node,vfio-device=X
> > 
> > at the cost of a bit of house keeping and lookup.
> > 
> > We need something similar for generic initiators, so maybe
> > vfio-mem=X? (might not even need to be vfio specific - even
> > if we only support it for now on VFIO devices).
> > leaving
> > initiator=X available for later...
> > 
> > Also, good to say why multiple nodes per device are needed.
> > 
> > Jonathan
> >   
> > > for conflict with other nodes and to ensure that the id do not cross
> > > QEMU limit.
> > > 
> > > Since the QEMU's SRAT and DST builder code needs the proximity
> > > domain (PXM) id range, expose PXM start and count as device object
> > > properties.
> > > 
> > > The device driver module communicates support for such feature through
> > > sysfs. Check the presence of the feature to activate the code.
> > > 
> > > E.g. the following argument adds 8 PXM nodes starting from id 0x10.
> > > -device vfio-pci-nohotplug,host=<pci-bdf>,pxm-ns=0x10,pxm-nc=8
> > > 
> > > Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> > > ---
> > >  hw/vfio/pci.c               | 144 ++++++++++++++++++++++++++++++++++++
> > >  hw/vfio/pci.h               |   2 +
> > >  include/hw/pci/pci_device.h |   3 +
> > >  3 files changed, 149 insertions(+)
> > > 
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index a205c6b113..cc0c516161 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -42,6 +42,8 @@
> > >  #include "qapi/error.h"
> > >  #include "migration/blocker.h"
> > >  #include "migration/qemu-file.h"
> > > +#include "qapi/visitor.h"
> > > +#include "include/hw/boards.h"
> > >  
> > >  #define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
> > >  
> > > @@ -2955,6 +2957,22 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
> > >      }
> > >  }
> > >  
> > > +static void vfio_pci_get_dev_mem_pxm_start(Object *obj, Visitor *v,
> > > +                                           const char *name,
> > > +                                           void *opaque, Error **errp)
> > > +{
> > > +    uint64_t pxm_start = (uintptr_t) opaque;
> > > +    visit_type_uint64(v, name, &pxm_start, errp);
> > > +}
> > > +
> > > +static void vfio_pci_get_dev_mem_pxm_count(Object *obj, Visitor *v,
> > > +                                           const char *name,
> > > +                                           void *opaque, Error **errp)
> > > +{
> > > +    uint64_t pxm_count = (uintptr_t) opaque;
> > > +    visit_type_uint64(v, name, &pxm_count, errp);
> > > +}
> > > +
> > >  static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
> > >  {
> > >      Error *err = NULL;
> > > @@ -2974,6 +2992,125 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
> > >      vdev->req_enabled = false;
> > >  }
> > >  
> > > +static int validate_dev_numa(uint32_t dev_node_start, uint32_t num_nodes)
> > > +{
> > > +    MachineState *ms = MACHINE(qdev_get_machine());
> > > +    unsigned int i;
> > > +
> > > +    if (num_nodes >= MAX_NODES) {
> > > +        return -EINVAL;
> > > +    }
> > > +
> > > +    for (i = 0; i < num_nodes; i++) {
> > > +        if (ms->numa_state->nodes[dev_node_start + i].present) {
> > > +            return -EBUSY;
> > > +        }
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int mark_dev_node_present(uint32_t dev_node_start, uint32_t num_nodes)
> > > +{
> > > +    MachineState *ms = MACHINE(qdev_get_machine());
> > > +    unsigned int i;
> > > +
> > > +    for (i = 0; i < num_nodes; i++) {
> > > +        ms->numa_state->nodes[dev_node_start + i].present = true;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +
> > > +static bool vfio_pci_read_cohmem_support_sysfs(VFIODevice *vdev)
> > > +{
> > > +    gchar *contents = NULL;
> > > +    gsize length;
> > > +    char *path;
> > > +    bool ret = false;
> > > +    uint32_t supported;
> > > +
> > > +    path = g_strdup_printf("%s/coherent_mem", vdev->sysfsdev);    
> > 
> > If someone has asked for it, why should we care if they specify it on a
> > device where it doesn't do anything useful?  This to me feels like something
> > to check at a higher level of the stack.
> >   
> > > +    if (g_file_get_contents(path, &contents, &length, NULL) && length > 0) {
> > > +        if ((sscanf(contents, "%u", &supported) == 1) && supported) {
> > > +            ret = true;
> > > +        }
> > > +    }
> > > +
> > > +    if (length) {
> > > +        g_free(contents);    
> > 
> > g_autofree will clean this up for you I think
> >   
> > > +    }
> > > +    g_free(path);    
> > 
> > and this.
> >   
> > > +
> > > +    return ret;
> > > +}
> > > +
> > > +static int vfio_pci_dev_mem_probe(VFIOPCIDevice *vPciDev,
> > > +                                         Error **errp)
> > > +{
> > > +    Object *obj = NULL;
> > > +    VFIODevice *vdev = &vPciDev->vbasedev;
> > > +    MachineState *ms = MACHINE(qdev_get_machine());
> > > +    int ret = 0;
> > > +    uint32_t dev_node_start = vPciDev->dev_node_start;
> > > +    uint32_t dev_node_count = vPciDev->dev_nodes;
> > > +
> > > +    if (!vdev->sysfsdev || !vfio_pci_read_cohmem_support_sysfs(vdev)) {
> > > +        ret = -ENODEV;    
> > return -ENODEV; 
> > 
> > and similar in all the other cases as no cleanup to do.
> >   
> > > +        goto done;
> > > +    }
> > > +
> > > +    if (vdev->type == VFIO_DEVICE_TYPE_PCI) {    
> > 
> > nicer to handle one condition at a time.
> > 
> >     if (vdev->type != VFIO_DEVICE_TYPE_PCI) {
> >         return -EINVAL;
> >     }
> > 
> >     obj = vfio_pci_get_object(vdev);
> > this can't fail
> > 
> > Also get rid of assigning it to NULL above.
> > 
> >    if (DEVICE_CLASS(object...)) {
> >       return -EINVAL;
> >    }
> > 
> >   
> > > +        obj = vfio_pci_get_object(vdev);
> > > +    }
> > > +
> > > +    /* Since this device creates new NUMA node, hotplug is not supported. */
> > > +    if (!obj || DEVICE_CLASS(object_get_class(obj))->hotpluggable) {
> > > +        ret = -EINVAL;
> > > +        goto done;
> > > +    }
> > > +
> > > +    /*
> > > +     * This device has memory that is coherently accessible from the CPU.
> > > +     * The memory can be represented seperate memory-only NUMA nodes.
> > > +     */
> > > +    vPciDev->pdev.has_coherent_memory = true;
> > > +
> > > +    /*
> > > +     * The device can create several NUMA nodes with consecutive IDs
> > > +     * from dev_node_start to dev_node_start + dev_node_count.
> > > +     * Verify
> > > +     * - whether any node ID is occupied in the desired range.
> > > +     * - Node ID is not crossing MAX_NODE.
> > > +     */
> > > +    ret = validate_dev_numa(dev_node_start, dev_node_count);
> > > +    if (ret) {
> > > +        goto done;    
> >         return ret;
> >   
> > > +    }
> > > +
> > > +    /* Reserve the node by marking as present */
> > > +    mark_dev_node_present(dev_node_start, dev_node_count);
> > > +
> > > +    /*
> > > +     * To have multiple unique nodes in the VM, a series of PXM nodes are
> > > +     * required to be added to VM's SRAT. Send the information about the
> > > +     * starting node ID and the node count to the ACPI builder code.
> > > +     */
> > > +    object_property_add(OBJECT(vPciDev), "dev_mem_pxm_start", "uint64",
> > > +                        vfio_pci_get_dev_mem_pxm_start, NULL, NULL,
> > > +                        (void *) (uintptr_t) dev_node_start);
> > > +
> > > +    object_property_add(OBJECT(vPciDev), "dev_mem_pxm_count", "uint64",
> > > +                        vfio_pci_get_dev_mem_pxm_count, NULL, NULL,
> > > +                        (void *) (uintptr_t) dev_node_count);
> > > +
> > > +    ms->numa_state->num_nodes += dev_node_count;
> > > +
> > > +done:
> > > +    return ret;
> > > +}
> > > +
> > >  static void vfio_realize(PCIDevice *pdev, Error **errp)
> > >  {
> > >      VFIOPCIDevice *vdev = VFIO_PCI(pdev);
> > > @@ -3291,6 +3428,11 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> > >          }
> > >      }
> > >  
> > > +    ret = vfio_pci_dev_mem_probe(vdev, errp);
> > > +    if (ret && ret != -ENODEV) {
> > > +        error_report("Failed to setup device memory with error %d", ret);
> > > +    }
> > > +
> > >      vfio_register_err_notifier(vdev);
> > >      vfio_register_req_notifier(vdev);
> > >      vfio_setup_resetfn_quirk(vdev);
> > > @@ -3454,6 +3596,8 @@ static Property vfio_pci_dev_properties[] = {
> > >      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
> > >                         sub_device_id, PCI_ANY_ID),
> > >      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> > > +    DEFINE_PROP_UINT32("pxm-ns", VFIOPCIDevice, dev_node_start, 0),
> > > +    DEFINE_PROP_UINT32("pxm-nc", VFIOPCIDevice, dev_nodes, 0),
> > >      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
> > >                                     nv_gpudirect_clique,
> > >                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> > > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > > index a2771b9ff3..eef5ddfd06 100644
> > > --- a/hw/vfio/pci.h
> > > +++ b/hw/vfio/pci.h
> > > @@ -158,6 +158,8 @@ struct VFIOPCIDevice {
> > >      uint32_t display_yres;
> > >      int32_t bootindex;
> > >      uint32_t igd_gms;
> > > +    uint32_t dev_node_start;
> > > +    uint32_t dev_nodes;
> > >      OffAutoPCIBAR msix_relo;
> > >      uint8_t pm_cap;
> > >      uint8_t nv_gpudirect_clique;
> > > diff --git a/include/hw/pci/pci_device.h b/include/hw/pci/pci_device.h
> > > index d3dd0f64b2..aacd2279ae 100644
> > > --- a/include/hw/pci/pci_device.h
> > > +++ b/include/hw/pci/pci_device.h
> > > @@ -157,6 +157,9 @@ struct PCIDevice {
> > >      MSIVectorReleaseNotifier msix_vector_release_notifier;
> > >      MSIVectorPollNotifier msix_vector_poll_notifier;
> > >  
> > > +    /* GPU coherent memory */
> > > +    bool has_coherent_memory;
> > > +
> > >      /* ID of standby device in net_failover pair */
> > >      char *failover_pair_id;
> > >      uint32_t acpi_index;    
> > 
> >   
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-25 14:03         ` Jason Gunthorpe
@ 2023-09-25 14:53           ` Jonathan Cameron via
  2023-09-25 16:00             ` Jason Gunthorpe
  0 siblings, 1 reply; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-25 14:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ankit Agrawal, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

On Mon, 25 Sep 2023 11:03:28 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Sep 25, 2023 at 02:54:40PM +0100, Jonathan Cameron wrote:
> 
> > Possible the ASWG folk would say this is fine and I'm reading too much into
> > the spec, but I'd definitely suggest asking them via the appropriate path,
> > or throwing in a code first proposal for a comment on this special case and
> > see what response you get - my guess is it will be 'fix Linux' :(  
> 
> The goal here is for qemu to emulate what the bare metal environment
> is doing.
> 
> There may be a legitimate question if what the bare metal FW has done
> is legitimate (though let's face it, there are lots of creative ACPI
> things around), but I don't quite see how this is a qemu question?
> 
> Unless you are taking the position that qemu should not emulate this
> HW?

Ok. I'd failed to register that the bare metal code was doing this though
with hindsight I guess that is obvious. Though without more info or
a bare metal example being given its also possible the BIOS was doing
enumeration etc (like CXL does for all < 2.0 devices) and hence was
building SRAT with the necessary memory ranges in place - even if the
driver then does the hot add dance later.

That's dubious and likely to break at some point unless the spec
comprehends this use case, but meh, so are lots of other things and
the hardware vendor gets to pick up the pieces and deal with grumpy
customers.

I don't currently see this as a safe solution for the proposed other
use cases however that are virtualization only.

Jonathan

> 
> Jason



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-25 14:53           ` Jonathan Cameron via
@ 2023-09-25 16:00             ` Jason Gunthorpe
  2023-09-25 17:00               ` Jonathan Cameron via
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2023-09-25 16:00 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Ankit Agrawal, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

On Mon, Sep 25, 2023 at 03:53:51PM +0100, Jonathan Cameron wrote:
> On Mon, 25 Sep 2023 11:03:28 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Sep 25, 2023 at 02:54:40PM +0100, Jonathan Cameron wrote:
> > 
> > > Possible the ASWG folk would say this is fine and I'm reading too much into
> > > the spec, but I'd definitely suggest asking them via the appropriate path,
> > > or throwing in a code first proposal for a comment on this special case and
> > > see what response you get - my guess is it will be 'fix Linux' :(  
> > 
> > The goal here is for qemu to emulate what the bare metal environment
> > is doing.
> > 
> > There may be a legitimate question if what the bare metal FW has done
> > is legitimate (though let's face it, there are lots of creative ACPI
> > things around), but I don't quite see how this is a qemu question?
> > 
> > Unless you are taking the position that qemu should not emulate this
> > HW?
> 
> Ok. I'd failed to register that the bare metal code was doing this though
> with hindsight I guess that is obvious. Though without more info or
> a bare metal example being given its also possible the BIOS was doing
> enumeration etc (like CXL does for all < 2.0 devices) and hence was
> building SRAT with the necessary memory ranges in place - even if the
> driver then does the hot add dance later.

Ankit, maybe you can share some relavent ACPI dumps from the physical
hardware and explain how this compares?

> That's dubious and likely to break at some point unless the spec
> comprehends this use case, but meh, so are lots of other things and
> the hardware vendor gets to pick up the pieces and deal with grumpy
> customers.

Yes.

> I don't currently see this as a safe solution for the proposed other
> use cases however that are virtualization only.

So, how should that translate into a command line experiance? Sounds
like the broad concept is general but this actual specific HW is not?

Thanks,
Jason


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-25 16:00             ` Jason Gunthorpe
@ 2023-09-25 17:00               ` Jonathan Cameron via
  2023-09-26 14:54                 ` Ankit Agrawal
  0 siblings, 1 reply; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-25 17:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ankit Agrawal, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

On Mon, 25 Sep 2023 13:00:43 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Sep 25, 2023 at 03:53:51PM +0100, Jonathan Cameron wrote:
> > On Mon, 25 Sep 2023 11:03:28 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Sep 25, 2023 at 02:54:40PM +0100, Jonathan Cameron wrote:
> > >   
> > > > Possible the ASWG folk would say this is fine and I'm reading too much into
> > > > the spec, but I'd definitely suggest asking them via the appropriate path,
> > > > or throwing in a code first proposal for a comment on this special case and
> > > > see what response you get - my guess is it will be 'fix Linux' :(    
> > > 
> > > The goal here is for qemu to emulate what the bare metal environment
> > > is doing.
> > > 
> > > There may be a legitimate question if what the bare metal FW has done
> > > is legitimate (though let's face it, there are lots of creative ACPI
> > > things around), but I don't quite see how this is a qemu question?
> > > 
> > > Unless you are taking the position that qemu should not emulate this
> > > HW?  
> > 
> > Ok. I'd failed to register that the bare metal code was doing this though
> > with hindsight I guess that is obvious. Though without more info or
> > a bare metal example being given its also possible the BIOS was doing
> > enumeration etc (like CXL does for all < 2.0 devices) and hence was
> > building SRAT with the necessary memory ranges in place - even if the
> > driver then does the hot add dance later.  
> 
> Ankit, maybe you can share some relavent ACPI dumps from the physical
> hardware and explain how this compares?
> 
> > That's dubious and likely to break at some point unless the spec
> > comprehends this use case, but meh, so are lots of other things and
> > the hardware vendor gets to pick up the pieces and deal with grumpy
> > customers.  
> 
> Yes.

With my grumpy hat on, good to actually go try and get this
clarification in the spec anyway.  We moan about people doing evil
things or spec gaps, but don't always circle back to fix them.
Obviously I'm not claiming to be good on this front either...

> 
> > I don't currently see this as a safe solution for the proposed other
> > use cases however that are virtualization only.  
> 
> So, how should that translate into a command line experiance? Sounds
> like the broad concept is general but this actual specific HW is not?

I'm not following.  I think, to enable this platform, this will need
to be vfio tied (or some new similar device). The other usecases
were things like virtio-mem where I'd just suggest leaving restricted
to only referring to existing nodes until a well defined solution is
in place to provide the memory only nodes + hotplug mix.

Without a specification clarification, we'd have to fix this in Linux
and enable dynamic NUMA node creation (or just enable a pool of extra
ones to grab from which is sort of a hacky way to do the same).

With an ACPI spec clarification agreed then I'm fine with
using this for all the cases that have come up in this thread.
Or a good argument that this is valid in under existing ACPI spec.

Jonathan


> 
> Thanks,
> Jason



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-22  8:15         ` David Hildenbrand
@ 2023-09-26 14:52           ` Ankit Agrawal
  2023-09-26 16:54             ` David Hildenbrand
  0 siblings, 1 reply; 41+ messages in thread
From: Ankit Agrawal @ 2023-09-26 14:52 UTC (permalink / raw)
  To: David Hildenbrand, Alex Williamson, Cédric Le Goater
  Cc: Jason Gunthorpe, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

>>>> Good idea.  Fundamentally the device should not be creating NUMA
>>>> nodes, the VM should be configured with NUMA nodes and the device
>>>> memory associated with those nodes.
>>>
>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>>> using virtio-mem).
>>>
>>
>> We actually do not add the device memory on the host, instead
>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
>> mmap function in vfio-pci variant driver code managing the device.
>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
>> And I think host memory backend would need memory that is added on the
>> host.
>>
>> Moreover since we want to passthrough the entire device memory, the
>> -object memory-backend-ram would have to be passed a size that is equal
>> to the device memory. I wonder if that would be too much of a trouble
>> for an admin (or libvirt) triggering the Qemu process.
>>
>> Both these items are avoided by exposing the device memory as BAR as in the
>> current  implementation (referenced above) since it lets Qemu to naturally
>> discover the device memory region and do mmap.
>>
>
> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
> on the device, not on the memory backend.
> 
> e.g., -device pc-dimm,node=3,memdev=mem1,...

Agreed, but still we will have the aforementioned issues viz.
1. The backing memory for the memory device would need to be allocated
on the host. However, we do not add the device memory on the host in this
case. Instead the Qemu VMA is mapped to the device memory physical
address using remap_pfn_range().
2. The memory device need to be passed an allocation size such that all of
the device memory is mapped into the Qemu VMA. This may not be readily
available to the admin/libvirt.

Based on the suggestions here, can we consider something like the 
following?
1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
structure to make it hotpluggable.
2. Create several NUMA nodes with 'devnode' which are supposed to be
associated with the vfio-pci device.
3. Pass the numa node start and count to associate the nodes created.

So, the command would look something like the following.
...
        -numa node,nodeid=2,devnode=on \
        -numa node,nodeid=3,devnode=on \
        -numa node,nodeid=4,devnode=on \
        -numa node,nodeid=5,devnode=on \
        -numa node,nodeid=6,devnode=on \
        -numa node,nodeid=7,devnode=on \
        -numa node,nodeid=8,devnode=on \
        -numa node,nodeid=9,devnode=on \
        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Thoughts?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-25 17:00               ` Jonathan Cameron via
@ 2023-09-26 14:54                 ` Ankit Agrawal
  2023-09-27  7:06                   ` Ankit Agrawal
  2023-09-27 11:01                   ` Jonathan Cameron via
  0 siblings, 2 replies; 41+ messages in thread
From: Ankit Agrawal @ 2023-09-26 14:54 UTC (permalink / raw)
  To: Jonathan Cameron, Jason Gunthorpe
  Cc: alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

> With an ACPI spec clarification agreed then I'm fine with
> using this for all the cases that have come up in this thread.
> Or a good argument that this is valid in under existing ACPI spec.

Hi Jonathan

I looked at the Section 5.2.16 in ACPI spec doc, but could not see
any mention of whether size == 0 is invalid or be avoided. 
https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf

If that is not convincing, do you have suggestions as to how I may
confirm this?




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-26 14:52           ` Ankit Agrawal
@ 2023-09-26 16:54             ` David Hildenbrand
  2023-09-26 19:14               ` Alex Williamson
  0 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2023-09-26 16:54 UTC (permalink / raw)
  To: Ankit Agrawal, Alex Williamson, Cédric Le Goater
  Cc: Jason Gunthorpe, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel, Gavin Shan

On 26.09.23 16:52, Ankit Agrawal wrote:
>>>>> Good idea.  Fundamentally the device should not be creating NUMA
>>>>> nodes, the VM should be configured with NUMA nodes and the device
>>>>> memory associated with those nodes.
>>>>
>>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
>>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
>>>> using virtio-mem).
>>>>
>>>
>>> We actually do not add the device memory on the host, instead
>>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
>>> mmap function in vfio-pci variant driver code managing the device.
>>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
>>> And I think host memory backend would need memory that is added on the
>>> host.
>>>
>>> Moreover since we want to passthrough the entire device memory, the
>>> -object memory-backend-ram would have to be passed a size that is equal
>>> to the device memory. I wonder if that would be too much of a trouble
>>> for an admin (or libvirt) triggering the Qemu process.
>>>
>>> Both these items are avoided by exposing the device memory as BAR as in the
>>> current  implementation (referenced above) since it lets Qemu to naturally
>>> discover the device memory region and do mmap.
>>>
>>
>> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
>> on the device, not on the memory backend.
>>
>> e.g., -device pc-dimm,node=3,memdev=mem1,...
> 

Alco CCing Gavin, I remember he once experimented with virtio-mem + 
multiple memory-less nodes and it was quite working (because of 
MEM_AFFINITY_HOTPLUGGABLE only on the last node, below).

> Agreed, but still we will have the aforementioned issues viz.
> 1. The backing memory for the memory device would need to be allocated
> on the host. However, we do not add the device memory on the host in this
> case. Instead the Qemu VMA is mapped to the device memory physical
> address using remap_pfn_range().

I don't see why that would be necessary ...

> 2. The memory device need to be passed an allocation size such that all of
> the device memory is mapped into the Qemu VMA. This may not be readily
> available to the admin/libvirt.

... or that. But your proposal roughly looks like what I had in mind, so 
let's focus on that.

> 
> Based on the suggestions here, can we consider something like the
> following?
> 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> structure to make it hotpluggable.

Is that "devnode=on" parameter required? Can't we simply expose any node 
that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?

Right now, with "ordinary", fixed-location memory devices 
(DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that 
covers the device memory region for these devices with 
MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, 
which does not quite work IIRC. All applicable nodes that don't have 
boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.

In your example, which memory ranges would we use for these nodes in SRAT?

> 2. Create several NUMA nodes with 'devnode' which are supposed to be
> associated with the vfio-pci device.
> 3. Pass the numa node start and count to associate the nodes created.
> 
> So, the command would look something like the following.
> ...
>          -numa node,nodeid=2,devnode=on \
>          -numa node,nodeid=3,devnode=on \
>          -numa node,nodeid=4,devnode=on \
>          -numa node,nodeid=5,devnode=on \
>          -numa node,nodeid=6,devnode=on \
>          -numa node,nodeid=7,devnode=on \
>          -numa node,nodeid=8,devnode=on \
>          -numa node,nodeid=9,devnode=on \
>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \

Better an array/list like "numa-nodes=2-9"

... but how would the device actually use these nodes? (which for which?)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-26 16:54             ` David Hildenbrand
@ 2023-09-26 19:14               ` Alex Williamson
  2023-09-27  7:14                 ` Ankit Agrawal
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2023-09-26 19:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ankit Agrawal, Cédric Le Goater, Jason Gunthorpe,
	shannon.zhaosl, peter.maydell, ani, Aniket Agashe, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel, Gavin Shan

On Tue, 26 Sep 2023 18:54:53 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 26.09.23 16:52, Ankit Agrawal wrote:
> >>>>> Good idea.  Fundamentally the device should not be creating NUMA
> >>>>> nodes, the VM should be configured with NUMA nodes and the device
> >>>>> memory associated with those nodes.  
> >>>>
> >>>> +1. That would also make it fly with DIMMs and virtio-mem, where you
> >>>> would want NUMA-less nodes ass well (imagine passing CXL memory to a VM
> >>>> using virtio-mem).
> >>>>  
> >>>
> >>> We actually do not add the device memory on the host, instead
> >>> map it into the Qemu VMA using remap_pfn_range(). Please checkout the
> >>> mmap function in vfio-pci variant driver code managing the device.
> >>> https://lore.kernel.org/all/20230915025415.6762-1-ankita@nvidia.com/
> >>> And I think host memory backend would need memory that is added on the
> >>> host.
> >>>
> >>> Moreover since we want to passthrough the entire device memory, the
> >>> -object memory-backend-ram would have to be passed a size that is equal
> >>> to the device memory. I wonder if that would be too much of a trouble
> >>> for an admin (or libvirt) triggering the Qemu process.
> >>>
> >>> Both these items are avoided by exposing the device memory as BAR as in the
> >>> current  implementation (referenced above) since it lets Qemu to naturally
> >>> discover the device memory region and do mmap.
> >>>  
> >>
> >> Just to clarify: nNUMA nodes for DIMMs/NVDIMMs/virtio-mem are configured
> >> on the device, not on the memory backend.
> >>
> >> e.g., -device pc-dimm,node=3,memdev=mem1,...  
> >   
> 
> Alco CCing Gavin, I remember he once experimented with virtio-mem + 
> multiple memory-less nodes and it was quite working (because of 
> MEM_AFFINITY_HOTPLUGGABLE only on the last node, below).
> 
> > Agreed, but still we will have the aforementioned issues viz.
> > 1. The backing memory for the memory device would need to be allocated
> > on the host. However, we do not add the device memory on the host in this
> > case. Instead the Qemu VMA is mapped to the device memory physical
> > address using remap_pfn_range().  
> 
> I don't see why that would be necessary ...
> 
> > 2. The memory device need to be passed an allocation size such that all of
> > the device memory is mapped into the Qemu VMA. This may not be readily
> > available to the admin/libvirt.  
> 
> ... or that. But your proposal roughly looks like what I had in mind, so 
> let's focus on that.
> 
> > 
> > Based on the suggestions here, can we consider something like the
> > following?
> > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > structure to make it hotpluggable.  
> 
> Is that "devnode=on" parameter required? Can't we simply expose any node 
> that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?
> 
> Right now, with "ordinary", fixed-location memory devices 
> (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that 
> covers the device memory region for these devices with 
> MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine, 
> which does not quite work IIRC. All applicable nodes that don't have 
> boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.
> 
> In your example, which memory ranges would we use for these nodes in SRAT?
> 
> > 2. Create several NUMA nodes with 'devnode' which are supposed to be
> > associated with the vfio-pci device.
> > 3. Pass the numa node start and count to associate the nodes created.
> > 
> > So, the command would look something like the following.
> > ...
> >          -numa node,nodeid=2,devnode=on \
> >          -numa node,nodeid=3,devnode=on \
> >          -numa node,nodeid=4,devnode=on \
> >          -numa node,nodeid=5,devnode=on \
> >          -numa node,nodeid=6,devnode=on \
> >          -numa node,nodeid=7,devnode=on \
> >          -numa node,nodeid=8,devnode=on \
> >          -numa node,nodeid=9,devnode=on \
> >          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,numa-node-start=2,numa-node-count=8 \  

I don't see how these numa-node args on a vfio-pci device have any
general utility.  They're only used to create a firmware table, so why
don't we be explicit about it and define the firmware table as an
object?  For example:

	-numa node,nodeid=2 \
	-numa node,nodeid=3 \
	-numa node,nodeid=4 \
	-numa node,nodeid=5 \
	-numa node,nodeid=6 \
	-numa node,nodeid=7 \
	-numa node,nodeid=8 \
	-numa node,nodeid=9 \
	-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
	-object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

There are some suggestions in this thread that CXL could have similar
requirements, but I haven't found any evidence that these
dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
rather than an NVIDIA specific one.

It seems like we could almost meet the requirement for this table via
-acpitable, but I think we'd like to avoid the VM orchestration tool
from creating, compiling, and passing ACPI data blobs into the VM.
Thanks,

Alex



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-26 14:54                 ` Ankit Agrawal
@ 2023-09-27  7:06                   ` Ankit Agrawal
  2023-09-27 11:01                   ` Jonathan Cameron via
  1 sibling, 0 replies; 41+ messages in thread
From: Ankit Agrawal @ 2023-09-27  7:06 UTC (permalink / raw)
  To: Jonathan Cameron, Jason Gunthorpe
  Cc: alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

>> Ok. I'd failed to register that the bare metal code was doing this though
>> with hindsight I guess that is obvious. Though without more info or
>> a bare metal example being given its also possible the BIOS was doing
>> enumeration etc (like CXL does for all < 2.0 devices) and hence was
>> building SRAT with the necessary memory ranges in place - even if the
>> driver then does the hot add dance later.
>
> Ankit, maybe you can share some relavent ACPI dumps from the physical
> hardware and explain how this compares?

Yeah, we are pretty much emulating the baremetal behavior here (I should have
been clearer). The PXM domains 8-15 are the "device associated" NUMA nodes
on the host. Here are the kernel logs from the baremetal system related to these
NUMA nodes.

[    0.000000] ACPI: SRAT: Node 1 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 2 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 3 PXM 10 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 4 PXM 11 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 5 PXM 12 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 6 PXM 13 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 7 PXM 14 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 8 PXM 15 [mem 0x00000000-0xffffffffffffffff] hotplug

...

[    0.000000] Initmem setup node 2 as memoryless
[    0.000000] Initmem setup node 3 as memoryless
[    0.000000] Initmem setup node 4 as memoryless
[    0.000000] Initmem setup node 5 as memoryless
[    0.000000] Initmem setup node 6 as memoryless
[    0.000000] Initmem setup node 7 as memoryless
[    0.000000] Initmem setup node 8 as memoryless
[    0.000000] Initmem setup node 9 as memoryless



On the VM, it looks like the following with the PXM 2-9 associated with the device.

[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-26 19:14               ` Alex Williamson
@ 2023-09-27  7:14                 ` Ankit Agrawal
  2023-09-27 11:33                   ` Jonathan Cameron via
  0 siblings, 1 reply; 41+ messages in thread
From: Ankit Agrawal @ 2023-09-27  7:14 UTC (permalink / raw)
  To: Alex Williamson, David Hildenbrand
  Cc: Cédric Le Goater, Jason Gunthorpe, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel, Gavin Shan

> >
> > Based on the suggestions here, can we consider something like the
> > following?
> > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > structure to make it hotpluggable.
>
> Is that "devnode=on" parameter required? Can't we simply expose any node
> that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?
>
> Right now, with "ordinary", fixed-location memory devices
> (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that
> covers the device memory region for these devices with
> MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> which does not quite work IIRC. All applicable nodes that don't have
> boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.

Yeah, you're right that it isn't required. Exposing the node without any memory as
MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than using
"devnode=on".

> In your example, which memory ranges would we use for these nodes in SRAT?

We are setting the Base Address and the Size as 0 in the SRAT memory affinity
structures. This is done through the following:
build_srat_memory(table_data, 0, 0, i,
                  MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);

This results in the following logs in the VM from the Linux ACPI SRAT parsing code:
[    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
[    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug

I would re-iterate that we are just emulating the baremetal behavior here.


> I don't see how these numa-node args on a vfio-pci device have any
> general utility.  They're only used to create a firmware table, so why
> don't we be explicit about it and define the firmware table as an
> object?  For example:
>
>        -numa node,nodeid=2 \
>        -numa node,nodeid=3 \
>        -numa node,nodeid=4 \
>        -numa node,nodeid=5 \
>        -numa node,nodeid=6 \
>        -numa node,nodeid=7 \
>        -numa node,nodeid=8 \
>        -numa node,nodeid=9 \
>        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
>        -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \

Yeah, that is fine with me. If we agree with this approach, I can go
implement it.


> There are some suggestions in this thread that CXL could have similar
> requirements, but I haven't found any evidence that these
> dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
> rather than an NVIDIA specific one.

Maybe Jason, Jonathan can chime in on this?


> It seems like we could almost meet the requirement for this table via
> -acpitable, but I think we'd like to avoid the VM orchestration tool
> from creating, compiling, and passing ACPI data blobs into the VM.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes
  2023-09-26 14:54                 ` Ankit Agrawal
  2023-09-27  7:06                   ` Ankit Agrawal
@ 2023-09-27 11:01                   ` Jonathan Cameron via
  1 sibling, 0 replies; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-27 11:01 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jason Gunthorpe, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel

On Tue, 26 Sep 2023 14:54:36 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> > With an ACPI spec clarification agreed then I'm fine with
> > using this for all the cases that have come up in this thread.
> > Or a good argument that this is valid in under existing ACPI spec.  
> 
> Hi Jonathan
> 
> I looked at the Section 5.2.16 in ACPI spec doc, but could not see
> any mention of whether size == 0 is invalid or be avoided. 
> https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf
> 
> If that is not convincing, do you have suggestions as to how I may
> confirm this?
>

It's not so much whether 0 size is valid that concerns me (as you
say it isn't ruled out, though given the explanatory text is
is non sensical) but rather whether you are allowed to use
such an entry to add memory that is not within the range later.

To my reading the statement under "Memory Affinity Structure" suggests
not.

"The Memory Affinity structure provides the following topology information
statically to the operating system:

* The association between a _memory range_ and the proximity to which it belongs.
* Information about whether the memory range can be hot-plugged.
"

That doesn't leave space for using it to define a proximity node without
providing the range.  With my 'occasional' contributor to ACPI spec hat on,
(obviously there are people way more versed in this than me!)
I'd suggest that ASWG will ask the obvious question of why does the ACPI
table needs to describe a software entity that is entirely discoverable by
other means?  After all your driver is clearly going to pick up these
nodes and assign them later - so it can just create them.  ACPI spec
doesn't care if Linux can do this or not :(

There are some hacks in place in the kernel (or at least under review)
to deal with a CXL case where there are BIOSes that assign part of the
CXL Fixed Memory Window (a bit of host PA space) to an SRAT entry, but
not the whole of it.  However, I think those are a workaround for a bug
(maybe not everyone agrees with that though).

Perhaps I'm being overly hopeful for clarity and it is fine to
do what you have here.

Jonathan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27  7:14                 ` Ankit Agrawal
@ 2023-09-27 11:33                   ` Jonathan Cameron via
  2023-09-27 13:53                     ` Jason Gunthorpe
  0 siblings, 1 reply; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-27 11:33 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Alex Williamson, David Hildenbrand, Cédric Le Goater,
	Jason Gunthorpe, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel, Gavin Shan,
	ira.weiny, navneet.singh

On Wed, 27 Sep 2023 07:14:28 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> > >
> > > Based on the suggestions here, can we consider something like the
> > > following?
> > > 1. Introduce a new -numa subparam 'devnode', which tells Qemu to mark
> > > the node with MEM_AFFINITY_HOTPLUGGABLE in the SRAT's memory affinity
> > > structure to make it hotpluggable.  
> >
> > Is that "devnode=on" parameter required? Can't we simply expose any node
> > that does *not* have any boot memory assigned as MEM_AFFINITY_HOTPLUGGABLE?

That needs some checking for what extra stuff we'll instantiate on CPU only
(or once we implement them) Generic Initiator / Generic Port nodes -
I'm definitely not keen on doing so for generic ports (which QEMU doesn't yet
do though there have been some RFCs I think).

> > Right now, with "ordinary", fixed-location memory devices
> > (DIMM/NVDIMM/virtio-mem/virtio-pmem), we create an srat entry that
> > covers the device memory region for these devices with
> > MEM_AFFINITY_HOTPLUGGABLE. We use the highest NUMA node in the machine,
> > which does not quite work IIRC. All applicable nodes that don't have
> > boot memory would need MEM_AFFINITY_HOTPLUGGABLE for Linux to create them.  
> 
> Yeah, you're right that it isn't required. Exposing the node without any memory as
> MEM_AFFINITY_HOTPLUGGABLE seems like a better approach than using
> "devnode=on".
> 
> > In your example, which memory ranges would we use for these nodes in SRAT?  
> 
> We are setting the Base Address and the Size as 0 in the SRAT memory affinity
> structures. This is done through the following:
> build_srat_memory(table_data, 0, 0, i,
>                   MEM_AFFINITY_HOTPLUGGABLE | MEM_AFFINITY_ENABLED);
> 
> This results in the following logs in the VM from the Linux ACPI SRAT parsing code:
> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 4 PXM 4 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 5 PXM 5 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 6 PXM 6 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 7 PXM 7 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 8 PXM 8 [mem 0x00000000-0xffffffffffffffff] hotplug
> [    0.000000] ACPI: SRAT: Node 9 PXM 9 [mem 0x00000000-0xffffffffffffffff] hotplug
> 
> I would re-iterate that we are just emulating the baremetal behavior here.
> 
> 
> > I don't see how these numa-node args on a vfio-pci device have any
> > general utility.  They're only used to create a firmware table, so why
> > don't we be explicit about it and define the firmware table as an
> > object?  For example:
> >
> >        -numa node,nodeid=2 \
> >        -numa node,nodeid=3 \
> >        -numa node,nodeid=4 \
> >        -numa node,nodeid=5 \
> >        -numa node,nodeid=6 \
> >        -numa node,nodeid=7 \
> >        -numa node,nodeid=8 \
> >        -numa node,nodeid=9 \
> >        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=nvgrace0 \
> >        -object nvidia-gpu-mem-acpi,devid=nvgrace0,nodeset=2-9 \  
> 
> Yeah, that is fine with me. If we agree with this approach, I can go
> implement it.
> 
> 
> > There are some suggestions in this thread that CXL could have similar
> > requirements,

For CXL side of things, if talking memory devices (type 3), I'm not
sure what the usecase will be of this feature.
Either we treat them as normal memory in which case it will all be static
at boot of the VM (for SRAT anyway - we might plug things in and
out of ranges), or it will be whole device hotplug and look like pc-dimm
hotplug (which should be into a statically defined range in SRAT).
 Longer term if we look at virtualizing dynamic capacity
devices (not sure we need to other that possibly to leverage
sparse DAX etc on top of them) then we might want to provide
emulated CXL Fixed memory windows in the guest (which get their own 
NUMA nodes anyway) + plug the memory into that. We'd probably hide
away interleaving etc in the host as all the guest should care about
is performance information and I doubt we'd want to emulate the complexity
of address routing complexities.

Similar to host PA ranges used in CXL fixed memory windows, I'm not sure
we wouldn't just allow for the guest to have 'all' possible setups that
might get plugged later by just burning a lot of HPA space and hence
just be able to use static SRAT nodes covering each region.
This would be less painful than for real PAs because as we are
emulating the CXL devices, probably as one emulated type 3 device per
potential set of real devices in an interleave set we can avoid
all the ordering constraints of CXL address decoders that end up eating
up Host PA space.

Virtualizing DCD is going to be a fun topic (that's next year's
plumbers CXL uconf session sorted ;), but I can see it might be done completely
differently and look nothing like a CXL device, in which case maybe
what you have here will make sense.

Come to think of it, you 'could' potentially do that for your use case
if the regions are reasonably bound in maximum size at the cost of
large GPA usage?

CXL accelerators / GPUs etc are a different question but who has one
of those anyway? :)


> > but I haven't found any evidence that these
> > dev-mem-pxm-{start,count} attributes in the _DSD are standardized in
> > any way.  If they are, maybe this would be a dev-mem-pxm-acpi object
> > rather than an NVIDIA specific one.  
> 
> Maybe Jason, Jonathan can chime in on this?

I'm not aware of anything general around this.  A PCI device
can have a _PXM and I think you could define subdevices each with a
_PXM of their own?  Those subdevices would need drivers to interpret
the structure anyway so not real benefit over a _DSD that I can
immediately think of...

If we think this will be common long term, anyone want to take
multiple _PXM per device support as a proposal to ACPI?

So agreed, it's not general, so if it's acceptable to have 0 length
NUMA nodes (and I think we have to emulate them given that's what real
hardware is doing even if some of us think the real hardware shouldn't
have done that!) then just spinning them up explicitly as nodes
+ device specific stuff for the NVIDIA device seems fine to me.

> 
> 
> > It seems like we could almost meet the requirement for this table via
> > -acpitable, but I think we'd like to avoid the VM orchestration tool
> > from creating, compiling, and passing ACPI data blobs into the VM.  
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 4/4] acpi/gpex: patch guest DSDT for dev mem information
  2023-09-15  2:45 ` [PATCH v1 4/4] acpi/gpex: patch guest DSDT for dev mem information ankita
  2023-09-15 15:13   ` Igor Mammedov
@ 2023-09-27 11:42   ` Jonathan Cameron via
  1 sibling, 0 replies; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-27 11:42 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, qemu-arm,
	qemu-devel

On Thu, 14 Sep 2023 19:45:59 -0700
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> To add the memory in the guest as NUMA nodes, it needs the PXM node index
> and the total count of nodes associated with the memory. The range of
> proximity domains are communicated to the VM as part of the guest ACPI
> using the nvidia,gpu-mem-pxm-start and nvidia,gpu-mem-pxm-count DSD
> properties. These value respectively represent the staring proximity
> domain id and the count. Kernel modules can then fetch this information
> and determine the numa node id using pxm_to_node().
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Hi Ankit,

I'm not a fan of reading AML blobs, but they are better than reading
code that is generating AML.  Can we have an example of the DSDT blob
this generates?

In particular I'm not sure what the relationship of this new device
is to the related PCI bits and pieces.  Looks like it's not doing what
I'd normally expect which would be to add the _DSD to an entry for the
PCI EP.  Obviously might be too late to do that given need to match
physical system, but I'd like the opportunity to moan about it anyway :)

As a note we have various _DSD and _DSM associated with PCI devices on our
Kunpeng servers (usually for odd power control or reset corner cases or
errata work arounds) and they work very well without needing to put
the stuff in a separate device - hence I'm not really understanding
why you'd do it this way...

Jonathan

> ---
>  hw/pci-host/gpex-acpi.c | 69 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 69 insertions(+)
> 
> diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
> index 7c7316bc96..0548feace1 100644
> --- a/hw/pci-host/gpex-acpi.c
> +++ b/hw/pci-host/gpex-acpi.c
> @@ -49,6 +49,72 @@ static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq)
>      }
>  }
>  
> +static void acpi_dsdt_add_cohmem_device(Aml *dev, int32_t devfn,
> +                                        uint64_t dev_mem_pxm_start,
> +                                        uint64_t dev_mem_pxm_count)
> +{
> +    Aml *memdev = aml_device("CMD%X", PCI_SLOT(devfn));
> +    Aml *pkg = aml_package(2);
> +    Aml *pkg1 = aml_package(2);
> +    Aml *pkg2 = aml_package(2);
> +    Aml *dev_pkg = aml_package(2);
> +    Aml *UUID;
> +
> +    aml_append(memdev, aml_name_decl("_ADR", aml_int(PCI_SLOT(devfn) << 16)));
> +
> +    aml_append(pkg1, aml_string("dev-mem-pxm-start"));
> +    aml_append(pkg1, aml_int(dev_mem_pxm_start));
> +
> +    aml_append(pkg2, aml_string("dev-mem-pxm-count"));
> +    aml_append(pkg2, aml_int(dev_mem_pxm_count));
> +
> +    aml_append(pkg, pkg1);
> +    aml_append(pkg, pkg2);
> +
> +    UUID = aml_touuid("DAFFD814-6EBA-4D8C-8A91-BC9BBF4AA301");
> +    aml_append(dev_pkg, UUID);
> +    aml_append(dev_pkg, pkg);
> +
> +    aml_append(memdev, aml_name_decl("_DSD", dev_pkg));
> +    aml_append(dev, memdev);
> +}
> +
> +static void find_mem_device(PCIBus *bus, PCIDevice *pdev,
> +                            void *opaque)
> +{
> +    Aml *dev = (Aml *)opaque;
> +
> +    if (bus == NULL) {
> +        return;
> +    }
> +
> +    if (pdev->has_coherent_memory) {
> +        Object *po = OBJECT(pdev);
> +
> +        if (po == NULL) {
> +            return;
> +        }
> +
> +        uint64_t pxm_start
> +           = object_property_get_uint(po, "dev_mem_pxm_start", NULL);
> +        uint64_t pxm_count
> +           = object_property_get_uint(po, "dev_mem_pxm_count", NULL);
> +
> +        acpi_dsdt_add_cohmem_device(dev, pdev->devfn, pxm_start, pxm_count);
> +    }
> +}
> +
> +static void acpi_dsdt_find_and_add_cohmem_device(PCIBus *bus, Aml *dev)
> +{
> +    if (bus == NULL) {
> +        return;
> +    }
> +
> +    pci_for_each_device_reverse(bus, pci_bus_num(bus),
> +                                find_mem_device, dev);
> +
> +}
> +
>  static void acpi_dsdt_add_pci_osc(Aml *dev)
>  {
>      Aml *method, *UUID, *ifctx, *ifctx1, *elsectx, *buf;
> @@ -207,7 +273,10 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
>  
>      acpi_dsdt_add_pci_route_table(dev, cfg->irq);
>  
> +    acpi_dsdt_find_and_add_cohmem_device(cfg->bus, dev);
> +
>      method = aml_method("_CBA", 0, AML_NOTSERIALIZED);
> +
>      aml_append(method, aml_return(aml_int(cfg->ecam.base)));
>      aml_append(dev, method);
>  



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27 11:33                   ` Jonathan Cameron via
@ 2023-09-27 13:53                     ` Jason Gunthorpe
  2023-09-27 14:24                       ` Alex Williamson
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2023-09-27 13:53 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Ankit Agrawal, Alex Williamson, David Hildenbrand,
	Cédric Le Goater, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel, Gavin Shan,
	ira.weiny, navneet.singh

On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:

> CXL accelerators / GPUs etc are a different question but who has one
> of those anyway? :)

That's exactly what I mean when I say CXL will need it too. I keep
describing this current Grace & Hopper as pre-CXL HW. You can easially
imagine draping CXL around it. CXL doesn't solve the problem that
motivates all this hackying - Linux can't dynamically create NUMA
nodes.

So even with CXL baremetal FW will have to create these nodes so Linux
can consume them. Hackity hack hack..

Broadly when you start to look at all of CXL, any device with coherent
memory and the ability to create SRIOV VF like things is going to have
a problem that Linux driver sofware will want a NUMA node for every
possible VF.

So we can view this as some generic NVIDIA hack, but really it is a
"coherent memory and SRIOV" hack that should be needed forany device
of this class.

I was hoping the CXL world would fix this stuff, but I was told they
also doubled down and are creating unnecessary ACPI subtly to avoid
needing dynamic NUMA nodes in Linux.. Sigh.

Jason


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27 13:53                     ` Jason Gunthorpe
@ 2023-09-27 14:24                       ` Alex Williamson
  2023-09-27 15:03                         ` Vikram Sethi
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2023-09-27 14:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jonathan Cameron, Ankit Agrawal, David Hildenbrand,
	Cédric Le Goater, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, qemu-arm, qemu-devel, Gavin Shan,
	ira.weiny, navneet.singh

On Wed, 27 Sep 2023 10:53:36 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> 
> > CXL accelerators / GPUs etc are a different question but who has one
> > of those anyway? :)  
> 
> That's exactly what I mean when I say CXL will need it too. I keep
> describing this current Grace & Hopper as pre-CXL HW. You can easially
> imagine draping CXL around it. CXL doesn't solve the problem that
> motivates all this hackying - Linux can't dynamically create NUMA
> nodes.

Why is that and why aren't we pushing towards a solution of removing
that barrier so that we don't require the machine topology to be
configured to support this use case and guest OS limitations?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27 14:24                       ` Alex Williamson
@ 2023-09-27 15:03                         ` Vikram Sethi
  2023-09-27 15:42                           ` Jason Gunthorpe
                                             ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Vikram Sethi @ 2023-09-27 15:03 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Jonathan Cameron, Ankit Agrawal, David Hildenbrand,
	Cédric Le Goater, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Andy Currid, qemu-arm, qemu-devel, Gavin Shan, ira.weiny,
	navneet.singh

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, September 27, 2023 9:25 AM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> 
> 
> On Wed, 27 Sep 2023 10:53:36 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> >
> > > CXL accelerators / GPUs etc are a different question but who has one
> > > of those anyway? :)
> >
> > That's exactly what I mean when I say CXL will need it too. I keep
> > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > imagine draping CXL around it. CXL doesn't solve the problem that
> > motivates all this hackying - Linux can't dynamically create NUMA
> > nodes.
> 
> Why is that and why aren't we pushing towards a solution of removing that
> barrier so that we don't require the machine topology to be configured to
> support this use case and guest OS limitations?  Thanks,
>

Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, 
there is additional information FW knows that the kernel doesn't. For example,
what the distance/latency between CPU and the device NUMA node is. While CXL devices
present a CDAT table which gives latency type attributes within the device, there would still be some
guesswork needed in the kernel as to what the end to end latency/distance is. 
It's probably not the best outcome to just consider this generically far memory" because 
is it further than Intersocket memory access or not matters. 
Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has
no idea if this latency/distance is less than or more than inter socket memory access latency
even.
So specially for devices present at boot, FW knows this information and should provide it. 
Similarly, QEMU should pass along this information to VMs for the best outcomes.  

Thanks
> Alex



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27 15:03                         ` Vikram Sethi
@ 2023-09-27 15:42                           ` Jason Gunthorpe
  2023-09-28 16:15                             ` Jonathan Cameron via
  2023-09-27 16:37                           ` Alex Williamson
  2023-09-28 16:04                           ` Jonathan Cameron via
  2 siblings, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2023-09-27 15:42 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: Alex Williamson, Jonathan Cameron, Ankit Agrawal,
	David Hildenbrand, Cédric Le Goater, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Andy Currid, qemu-arm, qemu-devel, Gavin Shan, ira.weiny,
	navneet.singh

On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hackying - Linux can't dynamically create NUMA
> > > nodes.
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,

I haven't looked myself, but I've been told this is very
hard. Probably the NUMA concept needs to be split a bit so that the
memory allocator pool handle is not tied to the ACPI.

> Even if Linux could create NUMA nodes dynamically for coherent CXL
> or CXL type devices, there is additional information FW knows that
> the kernel doesn't. For example, what the distance/latency between
> CPU and the device NUMA node is.

But that should just be the existing normal PCIy stuff to define
affinity of the PCI function. Since the memory is logically linked to
the PCI function we have no issue from a topology perspective.

> While CXL devices present a CDAT table which gives latency type
> attributes within the device, there would still be some guesswork
> needed in the kernel as to what the end to end latency/distance is.

Is it non-uniform per CXL function?

> knows this information and should provide it.  Similarly, QEMU
> should pass along this information to VMs for the best outcomes.

Well yes, ideally qemu would pass vCPU affinity for the vPCI functions
so existing NUMA aware allocations in PCI drivers are optimized. eg we
put queues in memory close to the PCI function already.

I think it is important to keep seperated the need to know the
topology/affinity/etc and the need for the driver to have a Linux NUMA
node handle to operate the memory alocator pool APIs.

Regardless, it is what it is for the foreseeable future :(

Jason


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27 15:03                         ` Vikram Sethi
  2023-09-27 15:42                           ` Jason Gunthorpe
@ 2023-09-27 16:37                           ` Alex Williamson
  2023-09-28 16:29                             ` Jonathan Cameron via
  2023-09-28 16:04                           ` Jonathan Cameron via
  2 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2023-09-27 16:37 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: Jason Gunthorpe, Jonathan Cameron, Ankit Agrawal,
	David Hildenbrand, Cédric Le Goater, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Andy Currid, qemu-arm, qemu-devel, Gavin Shan, ira.weiny,
	navneet.singh

On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >  
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)  
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hackying - Linux can't dynamically create NUMA
> > > nodes.  
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,
> >  
> 
> Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, 
> there is additional information FW knows that the kernel doesn't. For example,
> what the distance/latency between CPU and the device NUMA node is. While CXL devices
> present a CDAT table which gives latency type attributes within the device, there would still be some
> guesswork needed in the kernel as to what the end to end latency/distance is. 
> It's probably not the best outcome to just consider this generically far memory" because 
> is it further than Intersocket memory access or not matters. 
> Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has
> no idea if this latency/distance is less than or more than inter socket memory access latency
> even.
> So specially for devices present at boot, FW knows this information and should provide it. 
> Similarly, QEMU should pass along this information to VMs for the best outcomes.  

Yeah, AFAICT we're not doing any of that in this series.  We're only
creating some number of nodes for the guest driver to make use of and
describing in the generated _DSD the set of nodes associated for use by
the device.  How many nodes are required, how the guest driver
partitions coherent memory among the nodes, and how the guest assigns a
distance to the nodes is unspecified.

A glance at the CDAT spec seems brilliant in this regard, is there such
a table for this device or could/should the vfio-pci variant driver
provide one?  I imagine this is how the VM admin or orchestration tool
would learn to configure nodes and the VMM would further virtualize
this table for the guest OS and drivers.  Thanks,

Alex



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27 15:03                         ` Vikram Sethi
  2023-09-27 15:42                           ` Jason Gunthorpe
  2023-09-27 16:37                           ` Alex Williamson
@ 2023-09-28 16:04                           ` Jonathan Cameron via
  2 siblings, 0 replies; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-28 16:04 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: Alex Williamson, Jason Gunthorpe, Ankit Agrawal,
	David Hildenbrand, Cédric Le Goater, shannon.zhaosl,
	peter.maydell, ani, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Andy Currid, qemu-arm, qemu-devel, Gavin Shan, ira.weiny,
	navneet.singh, Dave Jiang

On Wed, 27 Sep 2023 15:03:09 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, September 27, 2023 9:25 AM
> > To: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > 
> > 
> > On Wed, 27 Sep 2023 10:53:36 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > >  
> > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > of those anyway? :)  
> > >
> > > That's exactly what I mean when I say CXL will need it too. I keep
> > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > motivates all this hackying - Linux can't dynamically create NUMA
> > > nodes.  
> > 
> > Why is that and why aren't we pushing towards a solution of removing that
> > barrier so that we don't require the machine topology to be configured to
> > support this use case and guest OS limitations?  Thanks,
> >  
> 
> Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, 
> there is additional information FW knows that the kernel doesn't. For example,
> what the distance/latency between CPU and the device NUMA node is. While CXL devices
> present a CDAT table which gives latency type attributes within the device, there would still be some
> guesswork needed in the kernel as to what the end to end latency/distance is. 
FWIW

Shouldn't be guess work needed (for light load case anyway which is what would be
in HMAT). That's what the Generic Ports were added to SRAT for. Dave Jiang has a
patch set
https://lore.kernel.org/all/168695160531.3031571.4875512229068707023.stgit@djiang5-mobl3/
to do the maths...  For CXL there is no problem fully describing the access characteristics.

> It's probably not the best outcome to just consider this generically far memory" because 
> is it further than Intersocket memory access or not matters. 
> Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has
> no idea if this latency/distance is less than or more than inter socket memory access latency
> even.

Just because I'm feeling cheeky - you could emulate a DOE and CDAT :)?
Though I suppose you don't want to teach the guest driver about it.

> So specially for devices present at boot, FW knows this information and should provide it. 
> Similarly, QEMU should pass along this information to VMs for the best outcomes.
No problem with the argument that FW has the info and should provide it,
just on the 'how' part.

Jonathan

> 
> Thanks
> > Alex  
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27 15:42                           ` Jason Gunthorpe
@ 2023-09-28 16:15                             ` Jonathan Cameron via
  0 siblings, 0 replies; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-28 16:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Vikram Sethi, Alex Williamson, Ankit Agrawal, David Hildenbrand,
	Cédric Le Goater, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Andy Currid, qemu-arm, qemu-devel, Gavin Shan, ira.weiny,
	navneet.singh

On Wed, 27 Sep 2023 12:42:40 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 27, 2023 at 03:03:09PM +0000, Vikram Sethi wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, September 27, 2023 9:25 AM
> > > To: Jason Gunthorpe <jgg@nvidia.com>
> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > > 
> > > 
> > > On Wed, 27 Sep 2023 10:53:36 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > > >  
> > > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > > of those anyway? :)  
> > > >
> > > > That's exactly what I mean when I say CXL will need it too. I keep
> > > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > > motivates all this hackying - Linux can't dynamically create NUMA
> > > > nodes.  
> > > 
> > > Why is that and why aren't we pushing towards a solution of removing that
> > > barrier so that we don't require the machine topology to be configured to
> > > support this use case and guest OS limitations?  Thanks,  
> 
> I haven't looked myself, but I've been told this is very
> hard. Probably the NUMA concept needs to be split a bit so that the
> memory allocator pool handle is not tied to the ACPI.
> 
> > Even if Linux could create NUMA nodes dynamically for coherent CXL
> > or CXL type devices, there is additional information FW knows that
> > the kernel doesn't. For example, what the distance/latency between
> > CPU and the device NUMA node is.  
> 
> But that should just be the existing normal PCIy stuff to define
> affinity of the PCI function. Since the memory is logically linked to
> the PCI function we have no issue from a topology perspective.

Agreed - if there is a node to associate it with, this is easy to cover.
If aim is just to make the info available, could just Generic Initiator
nodes instead.  Those were added for memory free accelerators when the
bios was describing them to fill the gap for things that aren't processors
or memory but for which bandwidth and latency numbers were useful.
This falls into your comment below on the need not being the topology
as such, but rather the number needed for the memory allocator.

> 
> > While CXL devices present a CDAT table which gives latency type
> > attributes within the device, there would still be some guesswork
> > needed in the kernel as to what the end to end latency/distance is.  
> 
> Is it non-uniform per CXL function?

Could be, but if so it is all discoverable. Though mapping is to
device PA range rather than 'function' as such.  You are certainly
allowed lots of different PA ranges with different characteristics.

> 
> > knows this information and should provide it.  Similarly, QEMU
> > should pass along this information to VMs for the best outcomes.  
> 
> Well yes, ideally qemu would pass vCPU affinity for the vPCI functions
> so existing NUMA aware allocations in PCI drivers are optimized. eg we
> put queues in memory close to the PCI function already.

I though we did that via PXBs?

> 
> I think it is important to keep seperated the need to know the
> topology/affinity/etc and the need for the driver to have a Linux NUMA
> node handle to operate the memory alocator pool APIs.
> 
> Regardless, it is what it is for the foreseeable future :(
:(

> 
> Jason



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
  2023-09-27 16:37                           ` Alex Williamson
@ 2023-09-28 16:29                             ` Jonathan Cameron via
  0 siblings, 0 replies; 41+ messages in thread
From: Jonathan Cameron via @ 2023-09-28 16:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Vikram Sethi, Jason Gunthorpe, Ankit Agrawal, David Hildenbrand,
	Cédric Le Goater, shannon.zhaosl, peter.maydell, ani,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Andy Currid, qemu-arm, qemu-devel, Gavin Shan, ira.weiny,
	navneet.singh

On Wed, 27 Sep 2023 10:37:37 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 27 Sep 2023 15:03:09 +0000
> Vikram Sethi <vsethi@nvidia.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, September 27, 2023 9:25 AM
> > > To: Jason Gunthorpe <jgg@nvidia.com>
> > > Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>; Ankit Agrawal
> > > <ankita@nvidia.com>; David Hildenbrand <david@redhat.com>; Cédric Le
> > > Goater <clg@redhat.com>; shannon.zhaosl@gmail.com;
> > > peter.maydell@linaro.org; ani@anisinha.ca; Aniket Agashe
> > > <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <ACurrid@nvidia.com>;
> > > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Gavin Shan
> > > <gshan@redhat.com>; ira.weiny@intel.com; navneet.singh@intel.com
> > > Subject: Re: [PATCH v1 0/4] vfio: report NUMA nodes for device memory
> > > 
> > > 
> > > On Wed, 27 Sep 2023 10:53:36 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >     
> > > > On Wed, Sep 27, 2023 at 12:33:18PM +0100, Jonathan Cameron wrote:
> > > >    
> > > > > CXL accelerators / GPUs etc are a different question but who has one
> > > > > of those anyway? :)    
> > > >
> > > > That's exactly what I mean when I say CXL will need it too. I keep
> > > > describing this current Grace & Hopper as pre-CXL HW. You can easially
> > > > imagine draping CXL around it. CXL doesn't solve the problem that
> > > > motivates all this hackying - Linux can't dynamically create NUMA
> > > > nodes.    
> > > 
> > > Why is that and why aren't we pushing towards a solution of removing that
> > > barrier so that we don't require the machine topology to be configured to
> > > support this use case and guest OS limitations?  Thanks,
> > >    
> > 
> > Even if Linux could create NUMA nodes dynamically for coherent CXL or CXL type devices, 
> > there is additional information FW knows that the kernel doesn't. For example,
> > what the distance/latency between CPU and the device NUMA node is. While CXL devices
> > present a CDAT table which gives latency type attributes within the device, there would still be some
> > guesswork needed in the kernel as to what the end to end latency/distance is. 
> > It's probably not the best outcome to just consider this generically far memory" because 
> > is it further than Intersocket memory access or not matters. 
> > Pre CXL devices such as for this patchset don't even have CDAT so the kernel by itself has
> > no idea if this latency/distance is less than or more than inter socket memory access latency
> > even.
> > So specially for devices present at boot, FW knows this information and should provide it. 
> > Similarly, QEMU should pass along this information to VMs for the best outcomes.    
> 
> Yeah, AFAICT we're not doing any of that in this series.  We're only
> creating some number of nodes for the guest driver to make use of and
> describing in the generated _DSD the set of nodes associated for use by
> the device.  How many nodes are required, how the guest driver
> partitions coherent memory among the nodes, and how the guest assigns a
> distance to the nodes is unspecified.
> 
> A glance at the CDAT spec seems brilliant in this regard, is there such
> a table for this device or could/should the vfio-pci variant driver
> provide one?  I imagine this is how the VM admin or orchestration tool
> would learn to configure nodes and the VMM would further virtualize
> this table for the guest OS and drivers.  Thanks,

I like this proposal. Host bit seems doable.  However I'm not sure
virtualizing it would help for device drivers that have no awareness unless
we slap some generic layer in the driver core in Linux that can use this if
available?  It would certainly be easy enough to add emulation into the guest
to the PCI function config space (as long as we aren't out of room!)

For CXL I think we will have to do this anyway. For a complex topology we
could either:
1) Present a matching topology in the guest with all the components involved
   including switches with right numbers of ports etc so the CDAT tables
   can be read directly from underlying hardware, or...
2) Present a somewhat matching topology in the guest having just messed with
   switches so that we can present their existence but not all the ports etc.
   So emulate a cut down CDAT for the switch.
3) Allow flattening everything and present a single CDAT with aggregate numbers
   at the EP. That will get a bit messy as we may need the kernel to ignore some
   parts it would normally account for (link widths on CXL buses etc.)

So I'm thinking 1 is to restrictive, hence need to emulate the table anyway
(not mention it might not be on the function being assigned anyway as they
are normally only on Function 0 IIRC).

Gah. Too much of CXL emulation still to do in QEMU already....

Jonathan




> 
> Alex
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-09-28 16:30 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-15  2:45 [PATCH v1 0/4] vfio: report NUMA nodes for device memory ankita
2023-09-15  2:45 ` [PATCH v1 1/4] vfio: new command line params for device memory NUMA nodes ankita
2023-09-15 14:25   ` Jonathan Cameron via
2023-09-15 14:48     ` Igor Mammedov
2023-09-22  5:44       ` Ankit Agrawal
2023-09-25 14:08       ` Jonathan Cameron via
2023-09-15  2:45 ` [PATCH v1 2/4] vfio: assign default values to node params ankita
2023-09-15  2:45 ` [PATCH v1 3/4] hw/arm/virt-acpi-build: patch guest SRAT for NUMA nodes ankita
2023-09-15 14:37   ` Jonathan Cameron via
2023-09-22  5:49     ` Ankit Agrawal
2023-09-25 13:54       ` Jonathan Cameron via
2023-09-25 14:03         ` Jason Gunthorpe
2023-09-25 14:53           ` Jonathan Cameron via
2023-09-25 16:00             ` Jason Gunthorpe
2023-09-25 17:00               ` Jonathan Cameron via
2023-09-26 14:54                 ` Ankit Agrawal
2023-09-27  7:06                   ` Ankit Agrawal
2023-09-27 11:01                   ` Jonathan Cameron via
2023-09-15 14:52   ` Igor Mammedov
2023-09-15 15:49     ` David Hildenbrand
2023-09-15  2:45 ` [PATCH v1 4/4] acpi/gpex: patch guest DSDT for dev mem information ankita
2023-09-15 15:13   ` Igor Mammedov
2023-09-27 11:42   ` Jonathan Cameron via
2023-09-15 14:19 ` [PATCH v1 0/4] vfio: report NUMA nodes for device memory Cédric Le Goater
2023-09-15 14:47   ` Alex Williamson
2023-09-15 18:34     ` David Hildenbrand
2023-09-22  8:11       ` Ankit Agrawal
2023-09-22  8:15         ` David Hildenbrand
2023-09-26 14:52           ` Ankit Agrawal
2023-09-26 16:54             ` David Hildenbrand
2023-09-26 19:14               ` Alex Williamson
2023-09-27  7:14                 ` Ankit Agrawal
2023-09-27 11:33                   ` Jonathan Cameron via
2023-09-27 13:53                     ` Jason Gunthorpe
2023-09-27 14:24                       ` Alex Williamson
2023-09-27 15:03                         ` Vikram Sethi
2023-09-27 15:42                           ` Jason Gunthorpe
2023-09-28 16:15                             ` Jonathan Cameron via
2023-09-27 16:37                           ` Alex Williamson
2023-09-28 16:29                             ` Jonathan Cameron via
2023-09-28 16:04                           ` Jonathan Cameron via

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.