* [PATCH v2 0/3] acpi: report numa nodes for device memory using GI
@ 2023-10-07 20:17 ankita
  2023-10-07 20:17 ` [PATCH v2 1/3] qom: new object to associate device to numa node ankita
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: ankita @ 2023-10-07 20:17 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell,
	ani, berrange, eduardo, imammedo, mst, eblake, armbru, david,
	gshan, Jonathan.Cameron
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

There are upcoming devices that allow the CPU to cache-coherently access
their memory. It is sensible to expose such memory to the OS as NUMA nodes
separate from the sysmem node. The ACPI spec provides a scheme in SRAT
called the Generic Initiator Affinity Structure [1] to allow an association
between a Proximity Domain (PXM) and a Generic Initiator (GI) (e.g.
heterogeneous processors and accelerators, GPUs, and I/O devices with
integrated compute or DMA engines).

Implement the mechanism to build the GI affinity structures, which QEMU
currently does not do. Introduce a new acpi-generic-initiator object that
links a node to a device BDF. During SRAT creation, all such objects are
identified and used to add the GI Affinity Structures.

A single node per BDF is insufficient to fully utilize the NVIDIA MIG
(Multi-Instance GPU) [2] feature. The feature allows partitioning of the GPU
device resources (including device memory) into several (up to 8) isolated
instances. Each partitioned memory range requires a dedicated NUMA node to
operate. The partitions are not fixed and can be created/deleted at runtime.

Linux does not provide a means to dynamically create/destroy NUMA nodes,
and implementing such a feature is expected to be non-trivial. The set of
nodes that the OS discovers at boot time while parsing SRAT remains fixed.
So we utilize the GI Affinity structures, which allow an association between
nodes and devices. Multiple GI structures per BDF are possible, allowing
creation of multiple nodes in the VM by exposing a unique PXM in each of
these structures. Implement a new nvidia-acpi-generic-initiator object to
associate a range of nodes with a device.

The admin will create a range of 8 nodes and associate them with the device
using the nvidia-acpi-generic-initiator object. While a configuration of
fewer than 8 nodes per device is allowed, such a configuration prevents full
utilization of the feature. This setting applies to all Grace+Hopper
systems. The following is an example of the QEMU command line arguments to
create 8 nodes and link them to the device 'dev0':

-numa node,nodeid=2 \
-numa node,nodeid=3 \
-numa node,nodeid=4 \
-numa node,nodeid=5 \
-numa node,nodeid=6 \
-numa node,nodeid=7 \
-numa node,nodeid=8 \
-numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \

The performance benefits can be realized by providing the NUMA node
distances appropriately (through libvirt tags or QEMU parameters). The admin
can obtain the distances among nodes on the hardware using `numactl -H`.
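
As an illustration only (the distance values below are placeholders rather
than measured Grace+Hopper numbers, and node 0 is assumed to be a sysmem
node), the distances can be supplied with QEMU's -numa dist option:

-numa dist,src=0,dst=2,val=80 \
-numa dist,src=2,dst=0,val=80 \
-numa dist,src=2,dst=3,val=120 \
-numa dist,src=3,dst=2,val=120 \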

This series goes along with the vfio-pci variant driver [3] currently under
review. A vfio-pci variant driver is expected to expose this feature through
sysfs; its presence is checked to enable these code changes.

Applied over v8.1.0-rc4.

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---

[1] ACPI Spec 6.5, Section 5.2.16.6
[2] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
[3] https://lore.kernel.org/all/20230912153032.19935-1-ankita@nvidia.com/

Link for v1:
https://lore.kernel.org/all/20230915024559.6565-1-ankita@nvidia.com/

v1 -> v2
- Removed dependency on sysfs to communicate the feature with the variant
  module.
- Use the GI Affinity SRAT structure instead of Memory Affinity.
- No DSDT entries are needed to communicate the PXM for the device; the SRAT
  GI structure is used instead.
- New objects introduced to establish the link between the device and nodes.

Ankit Agrawal (3):
  qom: new object to associate device to numa node
  hw/acpi: Implement the SRAT GI affinity structure
  qom: Link multiple numa nodes to device using a new object

 hw/acpi/acpi-generic-initiator.c         | 213 +++++++++++++++++++++++
 hw/acpi/meson.build                      |   1 +
 hw/arm/virt-acpi-build.c                 |   3 +
 hw/vfio/pci.c                            |   2 -
 hw/vfio/pci.h                            |   2 +
 include/hw/acpi/acpi-generic-initiator.h |  64 +++++++
 qapi/qom.json                            |  40 ++++-
 7 files changed, 321 insertions(+), 4 deletions(-)
 create mode 100644 hw/acpi/acpi-generic-initiator.c
 create mode 100644 include/hw/acpi/acpi-generic-initiator.h

-- 
2.17.1



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/3] qom: new object to associate device to numa node
  2023-10-07 20:17 [PATCH v2 0/3] acpi: report numa nodes for device memory using GI ankita
@ 2023-10-07 20:17 ` ankita
  2023-10-09 12:26     ` Jonathan Cameron
                     ` (2 more replies)
  2023-10-07 20:17 ` [PATCH v2 2/3] hw/acpi: Implement the SRAT GI affinity structure ankita
  2023-10-07 20:17 ` [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object ankita
  2 siblings, 3 replies; 26+ messages in thread
From: ankita @ 2023-10-07 20:17 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell,
	ani, berrange, eduardo, imammedo, mst, eblake, armbru, david,
	gshan, Jonathan.Cameron
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

CPU cache coherent device memory can be added as NUMA nodes distinct
from the system memory nodes. These nodes are associated with the device,
and QEMU needs a way to maintain this link.

Introduce a new acpi-generic-initiator object to allow the host admin to
provide the device and the corresponding NUMA node. QEMU maintains this
association and uses this object to build the requisite GI Affinity
Structure.

The admin provides the id of the device and the NUMA node id, as in the
following example:
-device vfio-pci-nohotplug,host=<bdf>,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,device=dev0,node=2 \

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/acpi/acpi-generic-initiator.c         | 74 ++++++++++++++++++++++++
 hw/acpi/meson.build                      |  1 +
 include/hw/acpi/acpi-generic-initiator.h | 30 ++++++++++
 qapi/qom.json                            | 20 ++++++-
 4 files changed, 123 insertions(+), 2 deletions(-)
 create mode 100644 hw/acpi/acpi-generic-initiator.c
 create mode 100644 include/hw/acpi/acpi-generic-initiator.h

diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
new file mode 100644
index 0000000000..6406736090
--- /dev/null
+++ b/hw/acpi/acpi-generic-initiator.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "qemu/osdep.h"
+#include "hw/qdev-properties.h"
+#include "qapi/error.h"
+#include "qapi/visitor.h"
+#include "qom/object_interfaces.h"
+#include "qom/object.h"
+#include "hw/qdev-core.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/pci.h"
+#include "hw/pci/pci_device.h"
+#include "sysemu/numa.h"
+#include "hw/acpi/acpi-generic-initiator.h"
+
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator,
+                   ACPI_GENERIC_INITIATOR, OBJECT,
+                   { TYPE_USER_CREATABLE },
+                   { NULL })
+
+OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
+
+static void acpi_generic_initiator_init(Object *obj)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+    gi->device = NULL;
+    gi->node = MAX_NODES;
+    gi->node_count = 1;
+}
+
+static void acpi_generic_initiator_finalize(Object *obj)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+
+    g_free(gi->device);
+}
+
+static void acpi_generic_initiator_set_device(Object *obj, const char *value,
+                                              Error **errp)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+
+    gi->device = g_strdup(value);
+}
+
+static void acpi_generic_initiator_set_node(Object *obj, Visitor *v,
+                                            const char *name, void *opaque,
+                                            Error **errp)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+    uint32_t value;
+
+    if (!visit_type_uint32(v, name, &value, errp)) {
+        return;
+    }
+
+    if (value >= MAX_NODES) {
+        return;
+    }
+
+    gi->node = value;
+}
+
+static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
+{
+    object_class_property_add_str(oc, ACPI_GENERIC_INITIATOR_DEVICE_PROP, NULL,
+                                  acpi_generic_initiator_set_device);
+    object_class_property_add(oc, ACPI_GENERIC_INITIATOR_NODE_PROP, "uint32",
+                              NULL, acpi_generic_initiator_set_node, NULL,
+                              NULL);
+}
diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
index fc1b952379..22252836ed 100644
--- a/hw/acpi/meson.build
+++ b/hw/acpi/meson.build
@@ -5,6 +5,7 @@ acpi_ss.add(files(
   'bios-linker-loader.c',
   'core.c',
   'utils.c',
+  'acpi-generic-initiator.c',
 ))
 acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
 acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
new file mode 100644
index 0000000000..e67e6e23b1
--- /dev/null
+++ b/include/hw/acpi/acpi-generic-initiator.h
@@ -0,0 +1,30 @@
+#ifndef ACPI_GENERIC_INITIATOR_H
+#define ACPI_GENERIC_INITIATOR_H
+
+#include "hw/mem/pc-dimm.h"
+#include "hw/acpi/bios-linker-loader.h"
+#include "qemu/uuid.h"
+#include "hw/acpi/aml-build.h"
+#include "qom/object.h"
+#include "qom/object_interfaces.h"
+
+#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
+
+#define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
+#define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
+
+typedef struct AcpiGenericInitiator {
+    /* private */
+    Object parent;
+
+    /* public */
+    char *device;
+    uint32_t node;
+    uint32_t node_count;
+} AcpiGenericInitiator;
+
+typedef struct AcpiGenericInitiatorClass {
+        ObjectClass parent_class;
+} AcpiGenericInitiatorClass;
+
+#endif
diff --git a/qapi/qom.json b/qapi/qom.json
index fa3e88c8e6..86c87a161c 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -779,6 +779,20 @@
 { 'struct': 'VfioUserServerProperties',
   'data': { 'socket': 'SocketAddress', 'device': 'str' } }
 
+##
+# @AcpiGenericInitiatorProperties:
+#
+# Properties for acpi-generic-initiator objects.
+#
+# @device: the ID of the device to be associated with the node
+#
+# @node: the ID of the numa node
+#
+# Since: 8.0
+##
+{ 'struct': 'AcpiGenericInitiatorProperties',
+  'data': { 'device': 'str', 'node': 'uint32' } }
+
 ##
 # @RngProperties:
 #
@@ -947,7 +961,8 @@
     'tls-creds-x509',
     'tls-cipher-suites',
     { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
-    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
+    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
+    'acpi-generic-initiator'
   ] }
 
 ##
@@ -1014,7 +1029,8 @@
       'tls-creds-x509':             'TlsCredsX509Properties',
       'tls-cipher-suites':          'TlsCredsProperties',
       'x-remote-object':            'RemoteObjectProperties',
-      'x-vfio-user-server':         'VfioUserServerProperties'
+      'x-vfio-user-server':         'VfioUserServerProperties',
+      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
   } }
 
 ##
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/3] hw/acpi: Implement the SRAT GI affinity structure
  2023-10-07 20:17 [PATCH v2 0/3] acpi: report numa nodes for device memory using GI ankita
  2023-10-07 20:17 ` [PATCH v2 1/3] qom: new object to associate device to numa node ankita
@ 2023-10-07 20:17 ` ankita
  2023-10-09 21:16   ` Alex Williamson
  2023-10-07 20:17 ` [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object ankita
  2 siblings, 1 reply; 26+ messages in thread
From: ankita @ 2023-10-07 20:17 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell,
	ani, berrange, eduardo, imammedo, mst, eblake, armbru, david,
	gshan, Jonathan.Cameron
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

The ACPI spec provides a scheme to associate "Generic Initiators" [1]
(e.g. heterogeneous processors and accelerators, GPUs, and I/O devices with
integrated compute or DMA engines) with Proximity Domains. This is achieved
using the Generic Initiator Affinity Structure in SRAT. During bootup, the
Linux kernel parses the ACPI SRAT to determine the PXM ids and creates a
NUMA node for each unique PXM ID encountered. QEMU currently does not
implement these structures while building SRAT.

Add GI structures while building the VM ACPI SRAT. The association between
a device and a PXM is stored using the acpi-generic-initiator object. Look
up all such objects and use them to build these structures.

The structure needs a PCI device handle [2], which includes the device BDF.
The vfio-pci-nohotplug device corresponding to the acpi-generic-initiator
object is located to determine the BDF.
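
For reference, the 32-byte record emitted by
build_srat_generic_initiator_affinity() corresponds to the layout below.
This is only an illustrative C sketch (the struct name is made up and not
part of this patch); the patch appends the fields directly with
build_append_int_noprefix() rather than defining a struct:

struct srat_generic_initiator_affinity {
    uint8_t  type;                /* 5: Generic Initiator Affinity */
    uint8_t  length;              /* 32 */
    uint8_t  reserved0;
    uint8_t  device_handle_type;  /* 1: PCI Device Handle */
    uint32_t proximity_domain;
    struct {                      /* PCI Device Handle, ACPI 6.5 Table 5-66 */
        uint16_t segment;
        uint16_t bdf;
        uint32_t res0;
        uint64_t res1;
    } handle;
    uint32_t flags;               /* bit 0: Enabled */
    uint32_t reserved1;
} QEMU_PACKED;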

[1] ACPI Spec 6.5, Section 5.2.16.6
[2] ACPI Spec 6.5, Table 5.66

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/acpi/acpi-generic-initiator.c         | 78 ++++++++++++++++++++++++
 hw/arm/virt-acpi-build.c                 |  3 +
 hw/vfio/pci.c                            |  2 -
 hw/vfio/pci.h                            |  2 +
 include/hw/acpi/acpi-generic-initiator.h | 22 +++++++
 5 files changed, 105 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
index 6406736090..1ae79639be 100644
--- a/hw/acpi/acpi-generic-initiator.c
+++ b/hw/acpi/acpi-generic-initiator.c
@@ -72,3 +72,81 @@ static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
                               NULL, acpi_generic_initiator_set_node, NULL,
                               NULL);
 }
+
+static int acpi_generic_initiator_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_ACPI_GENERIC_INITIATOR)) {
+        *list = g_slist_append(*list, ACPI_GENERIC_INITIATOR(obj));
+    }
+
+    object_child_foreach(obj, acpi_generic_initiator_list, opaque);
+    return 0;
+}
+
+/*
+ * Identify Generic Initiator objects and link them into the list which is
+ * returned to the caller.
+ *
+ * Note: it is the caller's responsibility to free the list to avoid
+ * memory leak.
+ */
+static GSList *acpi_generic_initiator_get_list(void)
+{
+    GSList *list = NULL;
+
+    object_child_foreach(object_get_root(), acpi_generic_initiator_list, &list);
+    return list;
+}
+
+/*
+ * ACPI spec, Revision 6.5
+ * 5.2.16.6 Generic Initiator Affinity Structure
+ */
+static void build_srat_generic_initiator_affinity(GArray *table_data, int node,
+                                                  PCIDeviceHandle *handle,
+                                                  GenericAffinityFlags flags)
+{
+    build_append_int_noprefix(table_data, 5, 1);     /* Type */
+    build_append_int_noprefix(table_data, 32, 1);    /* Length */
+    build_append_int_noprefix(table_data, 0, 1);     /* Reserved */
+    build_append_int_noprefix(table_data, 1, 1);     /* Device Handle Type */
+    build_append_int_noprefix(table_data, node, 4);  /* Proximity Domain */
+    build_append_int_noprefix(table_data, handle->segment, 2);
+    build_append_int_noprefix(table_data, handle->bdf, 2);
+    build_append_int_noprefix(table_data, handle->res0, 4);
+    build_append_int_noprefix(table_data, handle->res1, 8);
+    build_append_int_noprefix(table_data, flags, 4); /* Flags */
+    build_append_int_noprefix(table_data, 0, 4);     /* Reserved */
+}
+
+void build_srat_generic_initiator(GArray *table_data)
+{
+    GSList *gi_list, *list = acpi_generic_initiator_get_list();
+    for (gi_list = list; gi_list; gi_list = gi_list->next) {
+        AcpiGenericInitiator *gi = gi_list->data;
+        Object *o;
+        int count;
+
+        if (gi->node == MAX_NODES) {
+            continue;
+        }
+
+        o = object_resolve_path_type(gi->device, TYPE_VFIO_PCI_NOHOTPLUG, NULL);
+        if (!o) {
+            continue;
+        }
+
+        for (count = 0; count < gi->node_count; count++) {
+            PCIDeviceHandle dev_handle = {0};
+            PCIDevice *pci_dev = PCI_DEVICE(o);
+
+            dev_handle.bdf = pci_dev->devfn;
+            build_srat_generic_initiator_affinity(table_data,
+                                                  gi->node + count, &dev_handle,
+                                                  GEN_AFFINITY_ENABLED);
+        }
+    }
+    g_slist_free(list);
+}
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 6b674231c2..7337d8076b 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -58,6 +58,7 @@
 #include "migration/vmstate.h"
 #include "hw/acpi/ghes.h"
 #include "hw/acpi/viot.h"
+#include "hw/acpi/acpi-generic-initiator.h"
 
 #define ARM_SPI_BASE 32
 
@@ -558,6 +559,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         }
     }
 
+    build_srat_generic_initiator(table_data);
+
     if (ms->nvdimms_state->is_enabled) {
         nvdimm_build_srat(table_data);
     }
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a205c6b113..5e2a7c650a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -43,8 +43,6 @@
 #include "migration/blocker.h"
 #include "migration/qemu-file.h"
 
-#define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
-
 /* Protected by BQL */
 static KVMRouteChange vfio_route_change;
 
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index a2771b9ff3..74ac77a260 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -118,6 +118,8 @@ typedef struct VFIOMSIXInfo {
 #define TYPE_VFIO_PCI "vfio-pci"
 OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI)
 
+#define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
+
 struct VFIOPCIDevice {
     PCIDevice pdev;
     VFIODevice vbasedev;
diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
index e67e6e23b1..e8e2670309 100644
--- a/include/hw/acpi/acpi-generic-initiator.h
+++ b/include/hw/acpi/acpi-generic-initiator.h
@@ -27,4 +27,26 @@ typedef struct AcpiGenericInitiatorClass {
         ObjectClass parent_class;
 } AcpiGenericInitiatorClass;
 
+/*
+ * ACPI 6.5: Table 5-68 Flags - Generic Initiator
+ */
+typedef enum {
+    GEN_AFFINITY_NOFLAGS = 0,
+    GEN_AFFINITY_ENABLED = (1 << 0),
+    GEN_AFFINITY_ARCH_TRANS = (1 << 1),
+} GenericAffinityFlags;
+
+/*
+ * ACPI 6.5: Table 5-66 Device Handle - PCI
+ * Device Handle definition
+ */
+typedef struct PCIDeviceHandle {
+    uint16_t segment;
+    uint16_t bdf;
+    uint32_t res0;
+    uint64_t res1;
+} PCIDeviceHandle;
+
+void build_srat_generic_initiator(GArray *table_data);
+
 #endif
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-07 20:17 [PATCH v2 0/3] acpi: report numa nodes for device memory using GI ankita
  2023-10-07 20:17 ` [PATCH v2 1/3] qom: new object to associate device to numa node ankita
  2023-10-07 20:17 ` [PATCH v2 2/3] hw/acpi: Implement the SRAT GI affinity structure ankita
@ 2023-10-07 20:17 ` ankita
  2023-10-09 12:30     ` Jonathan Cameron
                     ` (2 more replies)
  2 siblings, 3 replies; 26+ messages in thread
From: ankita @ 2023-10-07 20:17 UTC (permalink / raw)
  To: ankita, jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell,
	ani, berrange, eduardo, imammedo, mst, eblake, armbru, david,
	gshan, Jonathan.Cameron
  Cc: aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

From: Ankit Agrawal <ankita@nvidia.com>

NVIDIA GPUs support the MIG (Multi-Instance GPU) feature [1], which allows
partitioning of the GPU device resources (including device memory) into
several (up to 8) isolated instances. Each partitioned memory range needs a
dedicated NUMA node to operate. The partitions are not fixed and can be
created/deleted at runtime.

Unfortunately, Linux does not provide a means to dynamically create/destroy
NUMA nodes, and implementing such a feature is not expected to be trivial.
The set of nodes that the OS discovers at boot time while parsing SRAT
remains fixed. So we utilize the GI Affinity structures, which allow an
association between nodes and devices. Multiple GI structures per BDF are
possible, allowing creation of multiple nodes by exposing a unique PXM in
each of these structures.

Introduce a new nvidia-acpi-generic-initiator object, which inherits from
the generic acpi-generic-initiator object, to allow a BDF to be associated
with more than one node.

An admin can provide a range of nodes using numa-node-start and
numa-node-count and link it to a device by providing its id. The following
sample creates 8 nodes and links them to the device dev0:

        -numa node,nodeid=2 \
        -numa node,nodeid=3 \
        -numa node,nodeid=4 \
        -numa node,nodeid=5 \
        -numa node,nodeid=6 \
        -numa node,nodeid=7 \
        -numa node,nodeid=8 \
        -numa node,nodeid=9 \
        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
        -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \
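
For comparison only (not part of this patch; the gi0..gi7 ids below are
illustrative), a similar association could presumably also be expressed with
eight instances of the base acpi-generic-initiator object from patch 1, one
per node:

        -object acpi-generic-initiator,id=gi0,device=dev0,node=2 \
        -object acpi-generic-initiator,id=gi1,device=dev0,node=3 \
        -object acpi-generic-initiator,id=gi2,device=dev0,node=4 \
        -object acpi-generic-initiator,id=gi3,device=dev0,node=5 \
        -object acpi-generic-initiator,id=gi4,device=dev0,node=6 \
        -object acpi-generic-initiator,id=gi5,device=dev0,node=7 \
        -object acpi-generic-initiator,id=gi6,device=dev0,node=8 \
        -object acpi-generic-initiator,id=gi7,device=dev0,node=9 \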

[1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 hw/acpi/acpi-generic-initiator.c         | 61 ++++++++++++++++++++++++
 include/hw/acpi/acpi-generic-initiator.h | 12 +++++
 qapi/qom.json                            | 24 +++++++++-
 3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
index 1ae79639be..8ef887c3a4 100644
--- a/hw/acpi/acpi-generic-initiator.c
+++ b/hw/acpi/acpi-generic-initiator.c
@@ -150,3 +150,64 @@ void build_srat_generic_initiator(GArray *table_data)
     }
     g_slist_free(list);
 }
+
+static void
+nvidia_acpi_generic_initiator_set_node_start(Object *obj, Visitor *v,
+                                             const char *name, void *opaque,
+                                             Error **errp)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+    uint32_t value;
+
+    if (!visit_type_uint32(v, name, &value, errp)) {
+        return;
+    }
+
+    if (value >= MAX_NODES) {
+        return;
+    }
+
+    gi->node = value;
+}
+
+static void
+nvidia_acpi_generic_initiator_set_node_count(Object *obj, Visitor *v,
+                                             const char *name, void *opaque,
+                                             Error **errp)
+{
+    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
+    uint32_t value;
+
+    if (!visit_type_uint32(v, name, &value, errp)) {
+        return;
+    }
+
+    gi->node_count = value;
+}
+
+static void nvidia_acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
+{
+    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP,
+                              "uint32", NULL,
+                              nvidia_acpi_generic_initiator_set_node_start,
+                              NULL, NULL);
+    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP,
+                              "uint32", NULL,
+                              nvidia_acpi_generic_initiator_set_node_count,
+                              NULL, NULL);
+}
+
+static const TypeInfo nvidia_acpi_generic_initiator_info = {
+    .parent = TYPE_ACPI_GENERIC_INITIATOR,
+    .name = TYPE_NVIDIA_ACPI_GENERIC_INITIATOR,
+    .instance_size = sizeof(NvidiaAcpiGenericInitiator),
+    .class_size = sizeof(NvidiaAcpiGenericInitiatorClass),
+    .class_init = nvidia_acpi_generic_initiator_class_init,
+};
+
+static void
+nvidia_acpi_generic_initiator_register_types(void)
+{
+    type_register_static(&nvidia_acpi_generic_initiator_info);
+}
+type_init(nvidia_acpi_generic_initiator_register_types);
diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
index e8e2670309..3e4cf42064 100644
--- a/include/hw/acpi/acpi-generic-initiator.h
+++ b/include/hw/acpi/acpi-generic-initiator.h
@@ -9,10 +9,14 @@
 #include "qom/object_interfaces.h"
 
 #define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
+#define TYPE_NVIDIA_ACPI_GENERIC_INITIATOR "nvidia-acpi-generic-initiator"
 
 #define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
 #define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
 
+#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP "numa-node-start"
+#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP "numa-node-count"
+
 typedef struct AcpiGenericInitiator {
     /* private */
     Object parent;
@@ -47,6 +51,14 @@ typedef struct PCIDeviceHandle {
     uint64_t res1;
 } PCIDeviceHandle;
 
+typedef struct NvidiaAcpiGenericInitiator {
+    AcpiGenericInitiator parent;
+} NvidiaAcpiGenericInitiator;
+
+typedef struct NvidiaAcpiGenericInitiatorClass {
+        AcpiGenericInitiatorClass parent_class;
+} NvidiaAcpiGenericInitiatorClass;
+
 void build_srat_generic_initiator(GArray *table_data);
 
 #endif
diff --git a/qapi/qom.json b/qapi/qom.json
index 86c87a161c..c29ad1388d 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -793,6 +793,24 @@
 { 'struct': 'AcpiGenericInitiatorProperties',
   'data': { 'device': 'str', 'node': 'uint32' } }
 
+##
+# @NvidiaAcpiGenericInitiatorProperties:
+#
+# Properties for nvidia-acpi-generic-initiator objects.
+#
+# @device: the ID of the device to be associated with the nodes
+#
+# @numa-node-start: the ID of the first numa node in the range
+#
+# @numa-node-count: the number of numa nodes associated with the device
+#
+# Since: 8.0
+##
+{ 'struct': 'NvidiaAcpiGenericInitiatorProperties',
+  'data': { 'device': 'str',
+            'numa-node-start': 'uint32',
+            'numa-node-count': 'uint32' } }
+
 ##
 # @RngProperties:
 #
@@ -962,7 +980,8 @@
     'tls-cipher-suites',
     { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
     { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
-    'acpi-generic-initiator'
+    'acpi-generic-initiator',
+    'nvidia-acpi-generic-initiator'
   ] }
 
 ##
@@ -1030,7 +1049,8 @@
       'tls-cipher-suites':          'TlsCredsProperties',
       'x-remote-object':            'RemoteObjectProperties',
       'x-vfio-user-server':         'VfioUserServerProperties',
-      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
+      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
+      'nvidia-acpi-generic-initiator':     'NvidiaAcpiGenericInitiatorProperties'
   } }
 
 ##
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] qom: new object to associate device to numa node
@ 2023-10-09 12:26     ` Jonathan Cameron
  0 siblings, 0 replies; 26+ messages in thread
From: Jonathan Cameron via @ 2023-10-09 12:26 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel, Dave Jiang

On Sun, 8 Oct 2023 01:47:38 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The CPU cache coherent device memory can be added as NUMA nodes
> distinct from the system memory nodes. These nodes are associated
> with the device and Qemu needs a way to maintain this link.

Hi Ankit,

I'm not sure I'm convinced of the approach to creating nodes for memory
usage (or whether that works in Linux on all NUMA ACPI archs), but I
am keen to see Generic Initiator support in QEMU. I'd also like to see it
done in a way that naturally extends to Generic Ports, which are very
similar (but don't hang memory off them! :)
Dave Jiang posted a PoC a while back for generic ports:
https://lore.kernel.org/qemu-devel/168185633821.899932.322047053764766056.stgit@djiang5-mobl3/

My concern with this approach is that it relies on a side effect of a Linux
implementation detail: that the infrastructure to bring up coherent memory
is all present even for a GI-only node (if it is, which I can't recall).
I'm also fairly sure we never tidied up the detail of going from the
GI to the device in Linux (because it's harder than a _PXM entry
for the device).  It requires stashing a better description than the BDF
before potentially doing re-enumeration so that we can rebuild the
association after that is done.

> 
> Introduce a new acpi-generic-initiator object to allow host admin
> provide the device and the corresponding NUMA node. Qemu maintain
> this association and use this object to build the requisite GI
> Affinity Structure.
> 
> The admin provides the id of the device and the NUMA node id such
> as in the following example.
> -device vfio-pci-nohotplug,host=<bdf>,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,device=dev0,node=2 \

This seems more different from the existing NUMA configuration than
necessary. The case you have here of multiple GIs per PCI device is, I hope,
the corner case, so I'd like to see something that is really simple for a
single device, single node.  Note that, whilst we don't do CXL 1.1 etc
in QEMU yet, all CXL accelerators before CXL 2.0 are pretty much expected
to present a GI SRAT entry + an SRAT memory entry if appropriate.
For CXL 2.0 and later, everything can be left to be discovered by the OS,
but those Generic Ports I mentioned are an important part of that.

Why not something like
-numa node,gi=dev0 \
-numa node,gi=dev0 \
etc or maybe even the slightly weird
-numa node,gi=dev0,gi=dev0,gi=dev0...

?

> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/acpi/acpi-generic-initiator.c         | 74 ++++++++++++++++++++++++
>  hw/acpi/meson.build                      |  1 +
>  include/hw/acpi/acpi-generic-initiator.h | 30 ++++++++++
>  qapi/qom.json                            | 20 ++++++-
>  4 files changed, 123 insertions(+), 2 deletions(-)
>  create mode 100644 hw/acpi/acpi-generic-initiator.c
>  create mode 100644 include/hw/acpi/acpi-generic-initiator.h
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> new file mode 100644
> index 0000000000..6406736090
> --- /dev/null
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/qdev-properties.h"
> +#include "qapi/error.h"
> +#include "qapi/visitor.h"
> +#include "qom/object_interfaces.h"
> +#include "qom/object.h"
> +#include "hw/qdev-core.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/vfio/pci.h"
> +#include "hw/pci/pci_device.h"
> +#include "sysemu/numa.h"
> +#include "hw/acpi/acpi-generic-initiator.h"
> +
> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator,
> +                   ACPI_GENERIC_INITIATOR, OBJECT,
> +                   { TYPE_USER_CREATABLE },
> +                   { NULL })
> +
> +OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
> +
> +static void acpi_generic_initiator_init(Object *obj)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    gi->device = NULL;
> +    gi->node = MAX_NODES;
> +    gi->node_count = 1;
> +}
> +
> +static void acpi_generic_initiator_finalize(Object *obj)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +
> +    g_free(gi->device);
> +}
> +
> +static void acpi_generic_initiator_set_device(Object *obj, const char *value,
> +                                              Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +
> +    gi->device = g_strdup(value);
> +}
> +
> +static void acpi_generic_initiator_set_node(Object *obj, Visitor *v,
> +                                            const char *name, void *opaque,
> +                                            Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint32_t value;
> +
> +    if (!visit_type_uint32(v, name, &value, errp)) {
> +        return;
> +    }
> +
> +    if (value >= MAX_NODES) {
> +        return;
> +    }
> +
> +    gi->node = value;
> +}
> +
> +static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> +{
> +    object_class_property_add_str(oc, ACPI_GENERIC_INITIATOR_DEVICE_PROP, NULL,
> +                                  acpi_generic_initiator_set_device);
> +    object_class_property_add(oc, ACPI_GENERIC_INITIATOR_NODE_PROP, "uint32",
> +                              NULL, acpi_generic_initiator_set_node, NULL,
> +                              NULL);
> +}
> diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> index fc1b952379..22252836ed 100644
> --- a/hw/acpi/meson.build
> +++ b/hw/acpi/meson.build
> @@ -5,6 +5,7 @@ acpi_ss.add(files(
>    'bios-linker-loader.c',
>    'core.c',
>    'utils.c',
> +  'acpi-generic-initiator.c',
>  ))
>  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
>  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
> diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> new file mode 100644
> index 0000000000..e67e6e23b1
> --- /dev/null
> +++ b/include/hw/acpi/acpi-generic-initiator.h
> @@ -0,0 +1,30 @@
> +#ifndef ACPI_GENERIC_INITIATOR_H
> +#define ACPI_GENERIC_INITIATOR_H
> +
> +#include "hw/mem/pc-dimm.h"
> +#include "hw/acpi/bios-linker-loader.h"
> +#include "qemu/uuid.h"
> +#include "hw/acpi/aml-build.h"
> +#include "qom/object.h"
> +#include "qom/object_interfaces.h"
> +
> +#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> +
> +#define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
> +#define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
> +
> +typedef struct AcpiGenericInitiator {
> +    /* private */
> +    Object parent;
> +
> +    /* public */
> +    char *device;
> +    uint32_t node;
> +    uint32_t node_count;
> +} AcpiGenericInitiator;
> +
> +typedef struct AcpiGenericInitiatorClass {
> +        ObjectClass parent_class;
> +} AcpiGenericInitiatorClass;
> +
> +#endif
> diff --git a/qapi/qom.json b/qapi/qom.json
> index fa3e88c8e6..86c87a161c 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -779,6 +779,20 @@
>  { 'struct': 'VfioUserServerProperties',
>    'data': { 'socket': 'SocketAddress', 'device': 'str' } }
>  
> +##
> +# @AcpiGenericInitiatorProperties:
> +#
> +# Properties for acpi-generic-initiator objects.
> +#
> +# @device: the ID of the device to be associated with the node
> +#
> +# @node: the ID of the numa node
> +#
> +# Since: 8.0
> +##
> +{ 'struct': 'AcpiGenericInitiatorProperties',
> +  'data': { 'device': 'str', 'node': 'uint32' } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -947,7 +961,8 @@
>      'tls-creds-x509',
>      'tls-cipher-suites',
>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
> -    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
> +    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> +    'acpi-generic-initiator'
>    ] }
>  
>  ##
> @@ -1014,7 +1029,8 @@
>        'tls-creds-x509':             'TlsCredsX509Properties',
>        'tls-cipher-suites':          'TlsCredsProperties',
>        'x-remote-object':            'RemoteObjectProperties',
> -      'x-vfio-user-server':         'VfioUserServerProperties'
> +      'x-vfio-user-server':         'VfioUserServerProperties',
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
>    } }
>  
>  ##



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
@ 2023-10-09 12:30     ` Jonathan Cameron
  0 siblings, 0 replies; 26+ messages in thread
From: Jonathan Cameron via @ 2023-10-09 12:30 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

On Sun, 8 Oct 2023 01:47:40 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA GPU's support MIG (Mult-Instance GPUs) feature [1], which allows
> partitioning of the GPU device resources (including device memory) into
> several (upto 8) isolated instances. Each of the partitioned memory needs
> a dedicated NUMA node to operate. The partitions are not fixed and they
> can be created/deleted at runtime.
> 
> Unfortunately Linux OS does not provide a means to dynamically create/destroy
> NUMA nodes and such feature implementation is not expected to be trivial. The
> nodes that OS discovers at the boot time while parsing SRAT remains fixed. So
> we utilize the GI Affinity structures that allows association between nodes
> and devices. Multiple GI structures per BDF is possible, allowing creation of
> multiple nodes by exposing unique PXM in each of these structures.
> 
> Introducing a new nvidia-acpi-generic-initiator object, which inherits from
> the generic acpi-generic-initiator object to allow a BDF to be associated with
> more than 1 nodes.
> 
> An admin can provide the range of nodes using numa-node-start and
> numa-node-count and link it to a device by providing its id. The following
> sample creates 8 nodes and link them to the device dev0:
> 
>         -numa node,nodeid=2 \
>         -numa node,nodeid=3 \
>         -numa node,nodeid=4 \
>         -numa node,nodeid=5 \
>         -numa node,nodeid=6 \
>         -numa node,nodeid=7 \
>         -numa node,nodeid=8 \
>         -numa node,nodeid=9 \
>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \

If you go this way, use an array of references to the numa nodes instead of
a start and a count. There is no obvious reason that I can see why they
should be contiguous.

I think it is simpler the other way around though - so have the numa nodes
point at the vfio-pci-nohotplug device.

I've not looked closely at the implementation, as I think we want to address
the fundamental approach first.

Jonathan


> 
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/acpi/acpi-generic-initiator.c         | 61 ++++++++++++++++++++++++
>  include/hw/acpi/acpi-generic-initiator.h | 12 +++++
>  qapi/qom.json                            | 24 +++++++++-
>  3 files changed, 95 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> index 1ae79639be..8ef887c3a4 100644
> --- a/hw/acpi/acpi-generic-initiator.c
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -150,3 +150,64 @@ void build_srat_generic_initiator(GArray *table_data)
>      }
>      g_slist_free(list);
>  }
> +
> +static void
> +nvidia_acpi_generic_initiator_set_node_start(Object *obj, Visitor *v,
> +                                             const char *name, void *opaque,
> +                                             Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint32_t value;
> +
> +    if (!visit_type_uint32(v, name, &value, errp)) {
> +        return;
> +    }
> +
> +    if (value >= MAX_NODES) {
> +        return;
> +    }
> +
> +    gi->node = value;
> +}
> +
> +static void
> +nvidia_acpi_generic_initiator_set_node_count(Object *obj, Visitor *v,
> +                                             const char *name, void *opaque,
> +                                             Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint32_t value;
> +
> +    if (!visit_type_uint32(v, name, &value, errp)) {
> +        return;
> +    }
> +
> +    gi->node_count = value;
> +}
> +
> +static void nvidia_acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> +{
> +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP,
> +                              "uint32", NULL,
> +                              nvidia_acpi_generic_initiator_set_node_start,
> +                              NULL, NULL);
> +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP,
> +                              "uint32", NULL,
> +                              nvidia_acpi_generic_initiator_set_node_count,
> +                              NULL, NULL);
> +}
> +
> +static const TypeInfo nvidia_acpi_generic_initiator_info = {
> +    .parent = TYPE_ACPI_GENERIC_INITIATOR,
> +    .name = TYPE_NVIDIA_ACPI_GENERIC_INITIATOR,
> +    .instance_size = sizeof(NvidiaAcpiGenericInitiator),
> +    .class_size = sizeof(NvidiaAcpiGenericInitiatorClass),
> +    .class_init = nvidia_acpi_generic_initiator_class_init,
> +};
> +
> +static void
> +nvidia_acpi_generic_initiator_register_types(void)
> +{
> +    type_register_static(&nvidia_acpi_generic_initiator_info);
> +}
> +type_init(nvidia_acpi_generic_initiator_register_types);
> diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> index e8e2670309..3e4cf42064 100644
> --- a/include/hw/acpi/acpi-generic-initiator.h
> +++ b/include/hw/acpi/acpi-generic-initiator.h
> @@ -9,10 +9,14 @@
>  #include "qom/object_interfaces.h"
>  
>  #define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> +#define TYPE_NVIDIA_ACPI_GENERIC_INITIATOR "nvidia-acpi-generic-initiator"
>  
>  #define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
>  #define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
>  
> +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP "numa-node-start"
> +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP "numa-node-count"
> +
>  typedef struct AcpiGenericInitiator {
>      /* private */
>      Object parent;
> @@ -47,6 +51,14 @@ typedef struct PCIDeviceHandle {
>      uint64_t res1;
>  } PCIDeviceHandle;
>  
> +typedef struct NvidiaAcpiGenericInitiator {
> +    AcpiGenericInitiator parent;
> +} NvidiaAcpiGenericInitiator;
> +
> +typedef struct NvidiaAcpiGenericInitiatorClass {
> +        AcpiGenericInitiatorClass parent_class;
> +} NvidiaAcpiGenericInitiatorClass;
> +
>  void build_srat_generic_initiator(GArray *table_data);
>  
>  #endif
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 86c87a161c..c29ad1388d 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -793,6 +793,24 @@
>  { 'struct': 'AcpiGenericInitiatorProperties',
>    'data': { 'device': 'str', 'node': 'uint32' } }
>  
> +##
> +# @NvidiaAcpiGenericInitiatorProperties:
> +#
> +# Properties for nvidia-acpi-generic-initiator objects.
> +#
> +# @device: the ID of the device to be associated with the nodes
> +#
> +# @numa-node-start: the ID of the numa node
> +#
> +# @numa-node-count: count of the numa nodes assocuated with the device
> +#
> +# Since: 8.0
> +##
> +{ 'struct': 'NvidiaAcpiGenericInitiatorProperties',
> +  'data': { 'device': 'str',
> +            'numa-node-start': 'uint32',
> +            'numa-node-count': 'uint32' } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -962,7 +980,8 @@
>      'tls-cipher-suites',
>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
>      { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> -    'acpi-generic-initiator'
> +    'acpi-generic-initiator',
> +    'nvidia-acpi-generic-initiator'
>    ] }
>  
>  ##
> @@ -1030,7 +1049,8 @@
>        'tls-cipher-suites':          'TlsCredsProperties',
>        'x-remote-object':            'RemoteObjectProperties',
>        'x-vfio-user-server':         'VfioUserServerProperties',
> -      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
> +      'nvidia-acpi-generic-initiator':     'NvidiaAcpiGenericInitiatorProperties'
>    } }
>  
>  ##



^ permalink raw reply	[flat|nested] 26+ messages in thread

>      uint64_t res1;
>  } PCIDeviceHandle;
>  
> +typedef struct NvidiaAcpiGenericInitiator {
> +    AcpiGenericInitiator parent;
> +} NvidiaAcpiGenericInitiator;
> +
> +typedef struct NvidiaAcpiGenericInitiatorClass {
> +        AcpiGenericInitiatorClass parent_class;
> +} NvidiaAcpiGenericInitiatorClass;
> +
>  void build_srat_generic_initiator(GArray *table_data);
>  
>  #endif
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 86c87a161c..c29ad1388d 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -793,6 +793,24 @@
>  { 'struct': 'AcpiGenericInitiatorProperties',
>    'data': { 'device': 'str', 'node': 'uint32' } }
>  
> +##
> +# @NvidiaAcpiGenericInitiatorProperties:
> +#
> +# Properties for nvidia-acpi-generic-initiator objects.
> +#
> +# @device: the ID of the device to be associated with the nodes
> +#
> +# @numa-node-start: the ID of the first numa node in the range
> +#
> +# @numa-node-count: count of the numa nodes associated with the device
> +#
> +# Since: 8.0
> +##
> +{ 'struct': 'NvidiaAcpiGenericInitiatorProperties',
> +  'data': { 'device': 'str',
> +            'numa-node-start': 'uint32',
> +            'numa-node-count': 'uint32' } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -962,7 +980,8 @@
>      'tls-cipher-suites',
>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
>      { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> -    'acpi-generic-initiator'
> +    'acpi-generic-initiator',
> +    'nvidia-acpi-generic-initiator'
>    ] }
>  
>  ##
> @@ -1030,7 +1049,8 @@
>        'tls-cipher-suites':          'TlsCredsProperties',
>        'x-remote-object':            'RemoteObjectProperties',
>        'x-vfio-user-server':         'VfioUserServerProperties',
> -      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
> +      'nvidia-acpi-generic-initiator':     'NvidiaAcpiGenericInitiatorProperties'
>    } }
>  
>  ##



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-09 12:30     ` Jonathan Cameron
  (?)
@ 2023-10-09 12:57     ` David Hildenbrand
  -1 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2023-10-09 12:57 UTC (permalink / raw)
  To: Jonathan Cameron, ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, gshan, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, dnigam, udhoke,
	qemu-arm, qemu-devel

On 09.10.23 14:30, Jonathan Cameron wrote:
> On Sun, 8 Oct 2023 01:47:40 +0530
> <ankita@nvidia.com> wrote:
> 
>> From: Ankit Agrawal <ankita@nvidia.com>
>>
>> NVIDIA GPUs support the MIG (Multi-Instance GPU) feature [1], which allows
>> partitioning of the GPU device resources (including device memory) into
>> several (up to 8) isolated instances. Each partition's memory needs
>> a dedicated NUMA node to operate. The partitions are not fixed and they
>> can be created/deleted at runtime.
>>
>> Unfortunately Linux does not provide a means to dynamically create/destroy
>> NUMA nodes, and implementing such a feature is not expected to be trivial. The
>> nodes that the OS discovers at boot time while parsing SRAT remain fixed. So
>> we utilize the GI Affinity structures that allow association between nodes
>> and devices. Multiple GI structures per BDF are possible, allowing creation of
>> multiple nodes by exposing a unique PXM in each of these structures.
>>
>> Introduce a new nvidia-acpi-generic-initiator object, which inherits from
>> the generic acpi-generic-initiator object to allow a BDF to be associated with
>> more than one node.
>>
>> An admin can provide the range of nodes using numa-node-start and
>> numa-node-count and link it to a device by providing its id. The following
>> sample creates 8 nodes and links them to the device dev0:
>>
>>          -numa node,nodeid=2 \
>>          -numa node,nodeid=3 \
>>          -numa node,nodeid=4 \
>>          -numa node,nodeid=5 \
>>          -numa node,nodeid=6 \
>>          -numa node,nodeid=7 \
>>          -numa node,nodeid=8 \
>>          -numa node,nodeid=9 \
>>          -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>>          -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \
> 
> If you go this way, use an array of references to the numa nodes instead of a start and number.
> There is no obvious reason why they should be contiguous that I can see.

Right, a uint16List should do.
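
For illustration, a minimal qom.json sketch of that shape (the property name
'nodes' is hypothetical, not part of the posted patch):

  { 'struct': 'NvidiaAcpiGenericInitiatorProperties',
    'data': { 'device': 'str',
              'nodes': ['uint16'] } }

The setter would then use visit_type_uint16List() instead of two uint32
properties.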


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-07 20:17 ` [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object ankita
  2023-10-09 12:30     ` Jonathan Cameron
@ 2023-10-09 21:16   ` Alex Williamson
  2023-10-17 14:00     ` Ankit Agrawal
  2023-10-13 13:17   ` Markus Armbruster
  2 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2023-10-09 21:16 UTC (permalink / raw)
  To: ankita
  Cc: jgg, clg, shannon.zhaosl, peter.maydell, ani, berrange, eduardo,
	imammedo, mst, eblake, armbru, david, gshan, Jonathan.Cameron,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

On Sun, 8 Oct 2023 01:47:40 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> NVIDIA GPUs support the MIG (Multi-Instance GPU) feature [1], which allows
> partitioning of the GPU device resources (including device memory) into
> several (up to 8) isolated instances. Each partition's memory needs
> a dedicated NUMA node to operate. The partitions are not fixed and they
> can be created/deleted at runtime.
> 
> Unfortunately Linux does not provide a means to dynamically create/destroy
> NUMA nodes, and implementing such a feature is not expected to be trivial. The
> nodes that the OS discovers at boot time while parsing SRAT remain fixed. So
> we utilize the GI Affinity structures that allow association between nodes
> and devices. Multiple GI structures per BDF are possible, allowing creation of
> multiple nodes by exposing a unique PXM in each of these structures.
> 
> Introduce a new nvidia-acpi-generic-initiator object, which inherits from
> the generic acpi-generic-initiator object to allow a BDF to be associated with
> more than one node.
> 
> An admin can provide the range of nodes using numa-node-start and
> numa-node-count and link it to a device by providing its id. The following
> sample creates 8 nodes and links them to the device dev0:
> 
>         -numa node,nodeid=2 \
>         -numa node,nodeid=3 \
>         -numa node,nodeid=4 \
>         -numa node,nodeid=5 \
>         -numa node,nodeid=6 \
>         -numa node,nodeid=7 \
>         -numa node,nodeid=8 \
>         -numa node,nodeid=9 \
>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \

Why didn't we just implement start and count in the base object (or a
list)? It seems like this gives the nvidia-acpi-generic-initiator two
different ways to set gi->node, either node= of the parent or
numa-node-start= here.  Once we expose the implicit node count in the
base object, I'm not sure of the purpose of this object.  I would have
thought it was for keying the build of the NVIDIA-specific _DSD, but that's
not implemented in this version.

I also don't see any programmatic means for management tools to know how
many nodes to create.  For example, what happens if there's a MIGv2 that
supports 16 partitions by default and makes use of the same vfio-pci
variant driver?  Thanks,

Alex

> 
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/acpi/acpi-generic-initiator.c         | 61 ++++++++++++++++++++++++
>  include/hw/acpi/acpi-generic-initiator.h | 12 +++++
>  qapi/qom.json                            | 24 +++++++++-
>  3 files changed, 95 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> index 1ae79639be..8ef887c3a4 100644
> --- a/hw/acpi/acpi-generic-initiator.c
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -150,3 +150,64 @@ void build_srat_generic_initiator(GArray *table_data)
>      }
>      g_slist_free(list);
>  }
> +
> +static void
> +nvidia_acpi_generic_initiator_set_node_start(Object *obj, Visitor *v,
> +                                             const char *name, void *opaque,
> +                                             Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint32_t value;
> +
> +    if (!visit_type_uint32(v, name, &value, errp)) {
> +        return;
> +    }
> +
> +    if (value >= MAX_NODES) {
> +        return;
> +    }
> +
> +    gi->node = value;
> +}
> +
> +static void
> +nvidia_acpi_generic_initiator_set_node_count(Object *obj, Visitor *v,
> +                                             const char *name, void *opaque,
> +                                             Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint32_t value;
> +
> +    if (!visit_type_uint32(v, name, &value, errp)) {
> +        return;
> +    }
> +
> +    gi->node_count = value;
> +}
> +
> +static void nvidia_acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> +{
> +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP,
> +                              "uint32", NULL,
> +                              nvidia_acpi_generic_initiator_set_node_start,
> +                              NULL, NULL);
> +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP,
> +                              "uint32", NULL,
> +                              nvidia_acpi_generic_initiator_set_node_count,
> +                              NULL, NULL);
> +}
> +
> +static const TypeInfo nvidia_acpi_generic_initiator_info = {
> +    .parent = TYPE_ACPI_GENERIC_INITIATOR,
> +    .name = TYPE_NVIDIA_ACPI_GENERIC_INITIATOR,
> +    .instance_size = sizeof(NvidiaAcpiGenericInitiator),
> +    .class_size = sizeof(NvidiaAcpiGenericInitiatorClass),
> +    .class_init = nvidia_acpi_generic_initiator_class_init,
> +};
> +
> +static void
> +nvidia_acpi_generic_initiator_register_types(void)
> +{
> +    type_register_static(&nvidia_acpi_generic_initiator_info);
> +}
> +type_init(nvidia_acpi_generic_initiator_register_types);
> diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> index e8e2670309..3e4cf42064 100644
> --- a/include/hw/acpi/acpi-generic-initiator.h
> +++ b/include/hw/acpi/acpi-generic-initiator.h
> @@ -9,10 +9,14 @@
>  #include "qom/object_interfaces.h"
>  
>  #define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> +#define TYPE_NVIDIA_ACPI_GENERIC_INITIATOR "nvidia-acpi-generic-initiator"
>  
>  #define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
>  #define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
>  
> +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP "numa-node-start"
> +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP "numa-node-count"
> +
>  typedef struct AcpiGenericInitiator {
>      /* private */
>      Object parent;
> @@ -47,6 +51,14 @@ typedef struct PCIDeviceHandle {
>      uint64_t res1;
>  } PCIDeviceHandle;
>  
> +typedef struct NvidiaAcpiGenericInitiator {
> +    AcpiGenericInitiator parent;
> +} NvidiaAcpiGenericInitiator;
> +
> +typedef struct NvidiaAcpiGenericInitiatorClass {
> +        AcpiGenericInitiatorClass parent_class;
> +} NvidiaAcpiGenericInitiatorClass;
> +
>  void build_srat_generic_initiator(GArray *table_data);
>  
>  #endif
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 86c87a161c..c29ad1388d 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -793,6 +793,24 @@
>  { 'struct': 'AcpiGenericInitiatorProperties',
>    'data': { 'device': 'str', 'node': 'uint32' } }
>  
> +##
> +# @NvidiaAcpiGenericInitiatorProperties:
> +#
> +# Properties for nvidia-acpi-generic-initiator objects.
> +#
> +# @device: the ID of the device to be associated with the nodes
> +#
> +# @numa-node-start: the ID of the first numa node in the range
> +#
> +# @numa-node-count: count of the numa nodes associated with the device
> +#
> +# Since: 8.0
> +##
> +{ 'struct': 'NvidiaAcpiGenericInitiatorProperties',
> +  'data': { 'device': 'str',
> +            'numa-node-start': 'uint32',
> +            'numa-node-count': 'uint32' } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -962,7 +980,8 @@
>      'tls-cipher-suites',
>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
>      { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> -    'acpi-generic-initiator'
> +    'acpi-generic-initiator',
> +    'nvidia-acpi-generic-initiator'
>    ] }
>  
>  ##
> @@ -1030,7 +1049,8 @@
>        'tls-cipher-suites':          'TlsCredsProperties',
>        'x-remote-object':            'RemoteObjectProperties',
>        'x-vfio-user-server':         'VfioUserServerProperties',
> -      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
> +      'nvidia-acpi-generic-initiator':     'NvidiaAcpiGenericInitiatorProperties'
>    } }
>  
>  ##



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] qom: new object to associate device to numa node
  2023-10-07 20:17 ` [PATCH v2 1/3] qom: new object to associate device to numa node ankita
  2023-10-09 12:26     ` Jonathan Cameron
@ 2023-10-09 21:16   ` Alex Williamson
  2023-10-13 13:16   ` Markus Armbruster
  2 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2023-10-09 21:16 UTC (permalink / raw)
  To: ankita
  Cc: jgg, clg, shannon.zhaosl, peter.maydell, ani, berrange, eduardo,
	imammedo, mst, eblake, armbru, david, gshan, Jonathan.Cameron,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

On Sun, 8 Oct 2023 01:47:38 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The CPU cache coherent device memory can be added as NUMA nodes
> distinct from the system memory nodes. These nodes are associated
> with the device and Qemu needs a way to maintain this link.
> 
> Introduce a new acpi-generic-initiator object to allow the host admin to
> provide the device and the corresponding NUMA node. Qemu maintains
> this association and uses this object to build the requisite GI
> Affinity Structure.
> 
> The admin provides the id of the device and the NUMA node id such
> as in the following example.
> -device vfio-pci-nohotplug,host=<bdf>,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,device=dev0,node=2 \
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/acpi/acpi-generic-initiator.c         | 74 ++++++++++++++++++++++++
>  hw/acpi/meson.build                      |  1 +
>  include/hw/acpi/acpi-generic-initiator.h | 30 ++++++++++
>  qapi/qom.json                            | 20 ++++++-
>  4 files changed, 123 insertions(+), 2 deletions(-)
>  create mode 100644 hw/acpi/acpi-generic-initiator.c
>  create mode 100644 include/hw/acpi/acpi-generic-initiator.h
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> new file mode 100644
> index 0000000000..6406736090
> --- /dev/null
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -0,0 +1,74 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include "qemu/osdep.h"
> +#include "hw/qdev-properties.h"
> +#include "qapi/error.h"
> +#include "qapi/visitor.h"
> +#include "qom/object_interfaces.h"
> +#include "qom/object.h"
> +#include "hw/qdev-core.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/vfio/pci.h"
> +#include "hw/pci/pci_device.h"
> +#include "sysemu/numa.h"
> +#include "hw/acpi/acpi-generic-initiator.h"
> +
> +OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator,
> +                   ACPI_GENERIC_INITIATOR, OBJECT,
> +                   { TYPE_USER_CREATABLE },
> +                   { NULL })
> +
> +OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
> +
> +static void acpi_generic_initiator_init(Object *obj)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    gi->device = NULL;
> +    gi->node = MAX_NODES;
> +    gi->node_count = 1;
> +}
> +
> +static void acpi_generic_initiator_finalize(Object *obj)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +
> +    g_free(gi->device);
> +}
> +
> +static void acpi_generic_initiator_set_device(Object *obj, const char *value,
> +                                              Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +
> +    gi->device = g_strdup(value);
> +}
> +
> +static void acpi_generic_initiator_set_node(Object *obj, Visitor *v,
> +                                            const char *name, void *opaque,
> +                                            Error **errp)
> +{
> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> +    uint32_t value;
> +
> +    if (!visit_type_uint32(v, name, &value, errp)) {
> +        return;
> +    }
> +
> +    if (value >= MAX_NODES) {

error_setg() here?  Thanks,

Alex
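
Something along these lines perhaps (a sketch only, not tested):

    if (value >= MAX_NODES) {
        error_setg(errp, "node id %" PRIu32 " out of range (max %d)",
                   value, MAX_NODES - 1);
        return;
    }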

> +        return;
> +    }
> +
> +    gi->node = value;
> +}
> +
> +static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> +{
> +    object_class_property_add_str(oc, ACPI_GENERIC_INITIATOR_DEVICE_PROP, NULL,
> +                                  acpi_generic_initiator_set_device);
> +    object_class_property_add(oc, ACPI_GENERIC_INITIATOR_NODE_PROP, "uint32",
> +                              NULL, acpi_generic_initiator_set_node, NULL,
> +                              NULL);
> +}
> diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> index fc1b952379..22252836ed 100644
> --- a/hw/acpi/meson.build
> +++ b/hw/acpi/meson.build
> @@ -5,6 +5,7 @@ acpi_ss.add(files(
>    'bios-linker-loader.c',
>    'core.c',
>    'utils.c',
> +  'acpi-generic-initiator.c',
>  ))
>  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
>  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
> diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> new file mode 100644
> index 0000000000..e67e6e23b1
> --- /dev/null
> +++ b/include/hw/acpi/acpi-generic-initiator.h
> @@ -0,0 +1,30 @@
> +#ifndef ACPI_GENERIC_INITIATOR_H
> +#define ACPI_GENERIC_INITIATOR_H
> +
> +#include "hw/mem/pc-dimm.h"
> +#include "hw/acpi/bios-linker-loader.h"
> +#include "qemu/uuid.h"
> +#include "hw/acpi/aml-build.h"
> +#include "qom/object.h"
> +#include "qom/object_interfaces.h"
> +
> +#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> +
> +#define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
> +#define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
> +
> +typedef struct AcpiGenericInitiator {
> +    /* private */
> +    Object parent;
> +
> +    /* public */
> +    char *device;
> +    uint32_t node;
> +    uint32_t node_count;
> +} AcpiGenericInitiator;
> +
> +typedef struct AcpiGenericInitiatorClass {
> +        ObjectClass parent_class;
> +} AcpiGenericInitiatorClass;
> +
> +#endif
> diff --git a/qapi/qom.json b/qapi/qom.json
> index fa3e88c8e6..86c87a161c 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -779,6 +779,20 @@
>  { 'struct': 'VfioUserServerProperties',
>    'data': { 'socket': 'SocketAddress', 'device': 'str' } }
>  
> +##
> +# @AcpiGenericInitiatorProperties:
> +#
> +# Properties for acpi-generic-initiator objects.
> +#
> +# @device: the ID of the device to be associated with the node
> +#
> +# @node: the ID of the numa node
> +#
> +# Since: 8.0
> +##
> +{ 'struct': 'AcpiGenericInitiatorProperties',
> +  'data': { 'device': 'str', 'node': 'uint32' } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -947,7 +961,8 @@
>      'tls-creds-x509',
>      'tls-cipher-suites',
>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
> -    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
> +    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> +    'acpi-generic-initiator'
>    ] }
>  
>  ##
> @@ -1014,7 +1029,8 @@
>        'tls-creds-x509':             'TlsCredsX509Properties',
>        'tls-cipher-suites':          'TlsCredsProperties',
>        'x-remote-object':            'RemoteObjectProperties',
> -      'x-vfio-user-server':         'VfioUserServerProperties'
> +      'x-vfio-user-server':         'VfioUserServerProperties',
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
>    } }
>  
>  ##



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/3] hw/acpi: Implement the SRAT GI affinity structure
  2023-10-07 20:17 ` [PATCH v2 2/3] hw/acpi: Implement the SRAT GI affinity structure ankita
@ 2023-10-09 21:16   ` Alex Williamson
  2023-10-17 13:51     ` Ankit Agrawal
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2023-10-09 21:16 UTC (permalink / raw)
  To: ankita
  Cc: jgg, clg, shannon.zhaosl, peter.maydell, ani, berrange, eduardo,
	imammedo, mst, eblake, armbru, david, gshan, Jonathan.Cameron,
	aniketa, cjia, kwankhede, targupta, vsethi, acurrid, dnigam,
	udhoke, qemu-arm, qemu-devel

On Sun, 8 Oct 2023 01:47:39 +0530
<ankita@nvidia.com> wrote:

> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The ACPI spec provides a scheme to associate "Generic Initiators" [1]
> (e.g. heterogeneous processors and accelerators, GPUs, and I/O devices with
> integrated compute or DMA engines) with Proximity Domains. This is
> achieved using the Generic Initiator Affinity Structure in SRAT. During bootup,
> the Linux kernel parses the ACPI SRAT to determine the PXM ids and creates a NUMA
> node for each unique PXM ID encountered. Qemu currently does not implement
> these structures while building SRAT.
> 
> Add GI structures while building the VM ACPI SRAT. The association between
> a device and its PXM is stored using the acpi-generic-initiator object. Look up
> all such objects and use them to build these structures.
> 
> The structure needs a PCI device handle [2] that consists of the device BDF.
> The vfio-pci-nohotplug device corresponding to the acpi-generic-initiator
> object is located to determine the BDF.
> 
> [1] ACPI Spec 6.5, Section 5.2.16.6
> [2] ACPI Spec 6.5, Table 5.66
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> ---
>  hw/acpi/acpi-generic-initiator.c         | 78 ++++++++++++++++++++++++
>  hw/arm/virt-acpi-build.c                 |  3 +
>  hw/vfio/pci.c                            |  2 -
>  hw/vfio/pci.h                            |  2 +
>  include/hw/acpi/acpi-generic-initiator.h | 22 +++++++
>  5 files changed, 105 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> index 6406736090..1ae79639be 100644
> --- a/hw/acpi/acpi-generic-initiator.c
> +++ b/hw/acpi/acpi-generic-initiator.c
> @@ -72,3 +72,81 @@ static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
>                                NULL, acpi_generic_initiator_set_node, NULL,
>                                NULL);
>  }
> +
> +static int acpi_generic_initiator_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_ACPI_GENERIC_INITIATOR)) {
> +        *list = g_slist_append(*list, ACPI_GENERIC_INITIATOR(obj));
> +    }
> +
> +    object_child_foreach(obj, acpi_generic_initiator_list, opaque);
> +    return 0;
> +}
> +
> +/*
> + * Identify Generic Initiator objects and link them into the list which is
> + * returned to the caller.
> + *
> + * Note: it is the caller's responsibility to free the list to avoid
> + * memory leak.
> + */
> +static GSList *acpi_generic_initiator_get_list(void)
> +{
> +    GSList *list = NULL;
> +
> +    object_child_foreach(object_get_root(), acpi_generic_initiator_list, &list);
> +    return list;
> +}
> +
> +/*
> + * ACPI spec, Revision 6.5
> + * 5.2.16.6 Generic Initiator Affinity Structure
> + */
> +static void build_srat_generic_initiator_affinity(GArray *table_data, int node,
> +                                                  PCIDeviceHandle *handle,
> +                                                  GenericAffinityFlags flags)
> +{
> +    build_append_int_noprefix(table_data, 5, 1);     /* Type */
> +    build_append_int_noprefix(table_data, 32, 1);    /* Length */
> +    build_append_int_noprefix(table_data, 0, 1);     /* Reserved */
> +    build_append_int_noprefix(table_data, 1, 1);     /* Device Handle Type */
> +    build_append_int_noprefix(table_data, node, 4);  /* Proximity Domain */
> +    build_append_int_noprefix(table_data, handle->segment, 2);
> +    build_append_int_noprefix(table_data, handle->bdf, 2);
> +    build_append_int_noprefix(table_data, handle->res0, 4);
> +    build_append_int_noprefix(table_data, handle->res1, 8);

Why are we storing reserved fields in the PCIDeviceHandle?  This
function is already specific to building a PCI Device Handle, so we
could just loop build_append_byte() with a fixed zero value here.
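
Roughly like this (sketch only; the 12 reserved bytes are per ACPI 6.5
Table 5-66):

    build_append_int_noprefix(table_data, handle->segment, 2); /* PCI Segment */
    build_append_int_noprefix(table_data, handle->bdf, 2);     /* PCI BDF */
    for (int i = 0; i < 12; i++) {
        build_append_byte(table_data, 0);                      /* Reserved */
    }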

> +    build_append_int_noprefix(table_data, flags, 4); /* Flags */
> +    build_append_int_noprefix(table_data, 0, 4);     /* Reserved */
> +}
> +
> +void build_srat_generic_initiator(GArray *table_data)
> +{
> +    GSList *gi_list, *list = acpi_generic_initiator_get_list();
> +    for (gi_list = list; gi_list; gi_list = gi_list->next) {
> +        AcpiGenericInitiator *gi = gi_list->data;
> +        Object *o;
> +        int count;
> +
> +        if (gi->node == MAX_NODES) {
> +            continue;
> +        }

Why do we have uninitialized AcpiGenericInitiator objects lingering?

> +
> +        o = object_resolve_path_type(gi->device, TYPE_VFIO_PCI_NOHOTPLUG, NULL);

TYPE_PCI_DEVICE?  Maybe you could check hotpluggable from the device
class, but certainly the generic code should not be dependent on being
a vfio-pci-nohotplug device.  The spec also supports an ACPI object
description, so should this be build_srat_generic_pci_initiator()?


> +        if (!o) {
> +            continue;
> +        }
> +
> +        for (count = 0; count < gi->node_count; count++) {
> +            PCIDeviceHandle dev_handle = {0};
> +            PCIDevice *pci_dev = PCI_DEVICE(o);
> +
> +            dev_handle.bdf = pci_dev->devfn;

Where does the bus part of the bdf get filled in?
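
devfn alone only carries the device/function bits.  I'd have expected
something like the following (untested sketch) to build the full BDF:

    dev_handle.bdf = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(pci_dev)),
                                   pci_dev->devfn);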

> +            build_srat_generic_initiator_affinity(table_data,
> +                                                  gi->node + count, &dev_handle,
> +                                                  GEN_AFFINITY_ENABLED);

Seems like the code that built the AcpiGenericInitiator object should
supply the flags.  In fact the flag GEN_AFFINITY_ENABLED might be a
better indicator to populate the SRAT with the GI than the node value.
Thanks,

Alex
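
In other words, if the object carried a flags field (hypothetical, not in
this patch), the loop could simply do:

    if (!(gi->flags & GEN_AFFINITY_ENABLED)) {
        continue;
    }

rather than testing gi->node against MAX_NODES.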

> +        }
> +    }
> +    g_slist_free(list);
> +}
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 6b674231c2..7337d8076b 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -58,6 +58,7 @@
>  #include "migration/vmstate.h"
>  #include "hw/acpi/ghes.h"
>  #include "hw/acpi/viot.h"
> +#include "hw/acpi/acpi-generic-initiator.h"
>  
>  #define ARM_SPI_BASE 32
>  
> @@ -558,6 +559,8 @@ build_srat(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>          }
>      }
>  
> +    build_srat_generic_initiator(table_data);
> +
>      if (ms->nvdimms_state->is_enabled) {
>          nvdimm_build_srat(table_data);
>      }
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a205c6b113..5e2a7c650a 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -43,8 +43,6 @@
>  #include "migration/blocker.h"
>  #include "migration/qemu-file.h"
>  
> -#define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
> -
>  /* Protected by BQL */
>  static KVMRouteChange vfio_route_change;
>  
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index a2771b9ff3..74ac77a260 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -118,6 +118,8 @@ typedef struct VFIOMSIXInfo {
>  #define TYPE_VFIO_PCI "vfio-pci"
>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI)
>  
> +#define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
> +
>  struct VFIOPCIDevice {
>      PCIDevice pdev;
>      VFIODevice vbasedev;
> diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> index e67e6e23b1..e8e2670309 100644
> --- a/include/hw/acpi/acpi-generic-initiator.h
> +++ b/include/hw/acpi/acpi-generic-initiator.h
> @@ -27,4 +27,26 @@ typedef struct AcpiGenericInitiatorClass {
>          ObjectClass parent_class;
>  } AcpiGenericInitiatorClass;
>  
> +/*
> + * ACPI 6.5: Table 5-68 Flags - Generic Initiator
> + */
> +typedef enum {
> +    GEN_AFFINITY_NOFLAGS = 0,
> +    GEN_AFFINITY_ENABLED = (1 << 0),
> +    GEN_AFFINITY_ARCH_TRANS = (1 << 1),
> +} GenericAffinityFlags;
> +
> +/*
> + * ACPI 6.5: Table 5-66 Device Handle - PCI
> + * Device Handle definition
> + */
> +typedef struct PCIDeviceHandle {
> +    uint16_t segment;
> +    uint16_t bdf;
> +    uint32_t res0;
> +    uint64_t res1;
> +} PCIDeviceHandle;
> +
> +void build_srat_generic_initiator(GArray *table_data);
> +
>  #endif



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-09 12:30     ` Jonathan Cameron
  (?)
  (?)
@ 2023-10-09 21:27     ` Alex Williamson
  2023-10-17 14:18       ` Ankit Agrawal
  -1 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2023-10-09 21:27 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ankita, jgg, clg, shannon.zhaosl, peter.maydell, ani, berrange,
	eduardo, imammedo, mst, eblake, armbru, david, gshan, aniketa,
	cjia, kwankhede, targupta, vsethi, acurrid, dnigam, udhoke,
	qemu-arm, qemu-devel

On Mon, 9 Oct 2023 13:30:48 +0100
Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote:

> On Sun, 8 Oct 2023 01:47:40 +0530
> <ankita@nvidia.com> wrote:
> 
> > From: Ankit Agrawal <ankita@nvidia.com>
> > 
> > NVIDIA GPUs support the MIG (Multi-Instance GPU) feature [1], which allows
> > partitioning of the GPU device resources (including device memory) into
> > several (up to 8) isolated instances. Each partition's memory needs
> > a dedicated NUMA node to operate. The partitions are not fixed and they
> > can be created/deleted at runtime.
> > 
> > Unfortunately Linux does not provide a means to dynamically create/destroy
> > NUMA nodes, and implementing such a feature is not expected to be trivial. The
> > nodes that the OS discovers at boot time while parsing SRAT remain fixed. So
> > we utilize the GI Affinity structures that allow association between nodes
> > and devices. Multiple GI structures per BDF are possible, allowing creation of
> > multiple nodes by exposing a unique PXM in each of these structures.
> > 
> > Introduce a new nvidia-acpi-generic-initiator object, which inherits from
> > the generic acpi-generic-initiator object to allow a BDF to be associated with
> > more than one node.
> > 
> > An admin can provide the range of nodes using numa-node-start and
> > numa-node-count and link it to a device by providing its id. The following
> > sample creates 8 nodes and links them to the device dev0:
> > 
> >         -numa node,nodeid=2 \
> >         -numa node,nodeid=3 \
> >         -numa node,nodeid=4 \
> >         -numa node,nodeid=5 \
> >         -numa node,nodeid=6 \
> >         -numa node,nodeid=7 \
> >         -numa node,nodeid=8 \
> >         -numa node,nodeid=9 \
> >         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> >         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 \  
> 
> If you go this way, use an array of references to the numa nodes instead of a start and number.
> There is no obvious reason why they should be contiguous that I can see.

Yup, I was looking for other places where we allow a list syntax; I only
found fds=a:b:...:z, which is also used for vhostfds=.  I didn't find a
property beyond the string type to hold this, though.
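
(For what it's worth, since -object accepts JSON these days, a list-valued
property can at least be written as, for example - the 'nodes' property name
is hypothetical:

    -object '{"qom-type":"nvidia-acpi-generic-initiator","id":"gi0","device":"dev0","nodes":[2,3,4,5,6,7,8,9]}'

but there's no obvious shorthand for it in the key=value syntax.)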

> I think it is simpler the other way around though - so have the numa nodes point at the
> vfio-pci-nohotplug device. 

Do you have a syntax you'd propose for this?  I'm having trouble seeing
how it makes things simpler.  Thanks,

Alex

> > [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu
> > 
> > Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> > ---
> >  hw/acpi/acpi-generic-initiator.c         | 61 ++++++++++++++++++++++++
> >  include/hw/acpi/acpi-generic-initiator.h | 12 +++++
> >  qapi/qom.json                            | 24 +++++++++-
> >  3 files changed, 95 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> > index 1ae79639be..8ef887c3a4 100644
> > --- a/hw/acpi/acpi-generic-initiator.c
> > +++ b/hw/acpi/acpi-generic-initiator.c
> > @@ -150,3 +150,64 @@ void build_srat_generic_initiator(GArray *table_data)
> >      }
> >      g_slist_free(list);
> >  }
> > +
> > +static void
> > +nvidia_acpi_generic_initiator_set_node_start(Object *obj, Visitor *v,
> > +                                             const char *name, void *opaque,
> > +                                             Error **errp)
> > +{
> > +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> > +    uint32_t value;
> > +
> > +    if (!visit_type_uint32(v, name, &value, errp)) {
> > +        return;
> > +    }
> > +
> > +    if (value >= MAX_NODES) {
> > +        return;
> > +    }
> > +
> > +    gi->node = value;
> > +}
> > +
> > +static void
> > +nvidia_acpi_generic_initiator_set_node_count(Object *obj, Visitor *v,
> > +                                             const char *name, void *opaque,
> > +                                             Error **errp)
> > +{
> > +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> > +    uint32_t value;
> > +
> > +    if (!visit_type_uint32(v, name, &value, errp)) {
> > +        return;
> > +    }
> > +
> > +    gi->node_count = value;
> > +}
> > +
> > +static void nvidia_acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> > +{
> > +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP,
> > +                              "uint32", NULL,
> > +                              nvidia_acpi_generic_initiator_set_node_start,
> > +                              NULL, NULL);
> > +    object_class_property_add(oc, NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP,
> > +                              "uint32", NULL,
> > +                              nvidia_acpi_generic_initiator_set_node_count,
> > +                              NULL, NULL);
> > +}
> > +
> > +static const TypeInfo nvidia_acpi_generic_initiator_info = {
> > +    .parent = TYPE_ACPI_GENERIC_INITIATOR,
> > +    .name = TYPE_NVIDIA_ACPI_GENERIC_INITIATOR,
> > +    .instance_size = sizeof(NvidiaAcpiGenericInitiator),
> > +    .class_size = sizeof(NvidiaAcpiGenericInitiatorClass),
> > +    .class_init = nvidia_acpi_generic_initiator_class_init,
> > +};
> > +
> > +static void
> > +nvidia_acpi_generic_initiator_register_types(void)
> > +{
> > +    type_register_static(&nvidia_acpi_generic_initiator_info);
> > +}
> > +type_init(nvidia_acpi_generic_initiator_register_types);
> > diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> > index e8e2670309..3e4cf42064 100644
> > --- a/include/hw/acpi/acpi-generic-initiator.h
> > +++ b/include/hw/acpi/acpi-generic-initiator.h
> > @@ -9,10 +9,14 @@
> >  #include "qom/object_interfaces.h"
> >  
> >  #define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> > +#define TYPE_NVIDIA_ACPI_GENERIC_INITIATOR "nvidia-acpi-generic-initiator"
> >  
> >  #define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
> >  #define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
> >  
> > +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_START_PROP "numa-node-start"
> > +#define NVIDIA_ACPI_GENERIC_INITIATOR_NODE_COUNT_PROP "numa-node-count"
> > +
> >  typedef struct AcpiGenericInitiator {
> >      /* private */
> >      Object parent;
> > @@ -47,6 +51,14 @@ typedef struct PCIDeviceHandle {
> >      uint64_t res1;
> >  } PCIDeviceHandle;
> >  
> > +typedef struct NvidiaAcpiGenericInitiator {
> > +    AcpiGenericInitiator parent;
> > +} NvidiaAcpiGenericInitiator;
> > +
> > +typedef struct NvidiaAcpiGenericInitiatorClass {
> > +        AcpiGenericInitiatorClass parent_class;
> > +} NvidiaAcpiGenericInitiatorClass;
> > +
> >  void build_srat_generic_initiator(GArray *table_data);
> >  
> >  #endif
> > diff --git a/qapi/qom.json b/qapi/qom.json
> > index 86c87a161c..c29ad1388d 100644
> > --- a/qapi/qom.json
> > +++ b/qapi/qom.json
> > @@ -793,6 +793,24 @@
> >  { 'struct': 'AcpiGenericInitiatorProperties',
> >    'data': { 'device': 'str', 'node': 'uint32' } }
> >  
> > +##
> > +# @NvidiaAcpiGenericInitiatorProperties:
> > +#
> > +# Properties for nvidia-acpi-generic-initiator objects.
> > +#
> > +# @device: the ID of the device to be associated with the nodes
> > +#
> > +# @numa-node-start: the ID of the first numa node in the range
> > +#
> > +# @numa-node-count: count of the numa nodes associated with the device
> > +#
> > +# Since: 8.0
> > +##
> > +{ 'struct': 'NvidiaAcpiGenericInitiatorProperties',
> > +  'data': { 'device': 'str',
> > +            'numa-node-start': 'uint32',
> > +            'numa-node-count': 'uint32' } }
> > +
> >  ##
> >  # @RngProperties:
> >  #
> > @@ -962,7 +980,8 @@
> >      'tls-cipher-suites',
> >      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
> >      { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> > -    'acpi-generic-initiator'
> > +    'acpi-generic-initiator',
> > +    'nvidia-acpi-generic-initiator'
> >    ] }
> >  
> >  ##
> > @@ -1030,7 +1049,8 @@
> >        'tls-cipher-suites':          'TlsCredsProperties',
> >        'x-remote-object':            'RemoteObjectProperties',
> >        'x-vfio-user-server':         'VfioUserServerProperties',
> > -      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
> > +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties',
> > +      'nvidia-acpi-generic-initiator':     'NvidiaAcpiGenericInitiatorProperties'
> >    } }
> >  
> >  ##  
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [PATCH v2 1/3] qom: new object to associate device to numa node
  2023-10-09 12:26     ` Jonathan Cameron
  (?)
@ 2023-10-11 17:37     ` Vikram Sethi
  2023-10-12  8:59         ` Jonathan Cameron
  -1 siblings, 1 reply; 26+ messages in thread
From: Vikram Sethi @ 2023-10-11 17:37 UTC (permalink / raw)
  To: Jonathan Cameron, Ankit Agrawal
  Cc: Jason Gunthorpe, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, berrange, eduardo, imammedo, mst, eblake,
	armbru, david, gshan, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm, qemu-devel,
	Dave Jiang, Shanker Donthineni

Hi Jonathan,

> -----Original Message-----
> From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Sent: Monday, October 9, 2023 7:27 AM
> To: Ankit Agrawal <ankita@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; alex.williamson@redhat.com;
> clg@redhat.com; shannon.zhaosl@gmail.com; peter.maydell@linaro.org;
> ani@anisinha.ca; berrange@redhat.com; eduardo@habkost.net;
> imammedo@redhat.com; mst@redhat.com; eblake@redhat.com;
> armbru@redhat.com; david@redhat.com; gshan@redhat.com; Aniket
> Agashe <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Andy Currid <acurrid@nvidia.com>;
> Dheeraj Nigam <dnigam@nvidia.com>; Uday Dhoke <udhoke@nvidia.com>;
> qemu-arm@nongnu.org; qemu-devel@nongnu.org; Dave Jiang
> <dave.jiang@intel.com>
> Subject: Re: [PATCH v2 1/3] qom: new object to associate device to numa
> node
> 
> 
> On Sun, 8 Oct 2023 01:47:38 +0530
> <ankita@nvidia.com> wrote:
> 
> > From: Ankit Agrawal <ankita@nvidia.com>
> >
> > The CPU cache coherent device memory can be added as NUMA nodes
> > distinct from the system memory nodes. These nodes are associated with
> > the device and Qemu needs a way to maintain this link.
> 
> Hi Ankit,
> 
> I'm not sure I'm convinced of the approach to creating nodes for memory
> usage (or whether that works in Linux on all NUMA ACPI archs), but I am
> keen to see Generic Initiator support in QEMU. I'd also like to see it done in a
> way that naturally extends to Generic Ports which are very similar (but don't
> hang memory off them! :) Dave Jiang posted a PoC a while back for generic
> ports.
> https://lore.kernel.org/qemu-devel/168185633821.899932.322047053764766056.stgit@djiang5-mobl3/
> 
> My concern with this approach is that it relies on a side effect of a Linux
> implementation detail: that the infrastructure to bring up coherent memory
> is all present even for a GI-only node (if it is; I can't recall).  I'm also
> fairly sure we never tidied up the detail of going from the GI to the device
> in Linux (because it's harder than a _PXM entry for the device).  It requires
> stashing a better description than the BDF before potentially doing
> re-enumeration so that we can rebuild the association after that is done.
> 

I'm not sure I understood the concern. Are you suggesting that the ACPI specification
somehow prohibits memory associated with a GI node in the same PXM? I.e., whether the GI is memory-less
or has memory isn't mandated by the spec, IIRC. It certainly seems perfectly normal
for an accelerator with memory to have a GI with memory, and for that memory to be associated with the same PXM.
So what about this patch is using a Linux implementation detail? Even if Linux wasn't currently supporting
that use case, it is something that would have been reasonable to add IMO. What am I missing?

> >
> > Introduce a new acpi-generic-initiator object to allow the host admin to
> > provide the device and the corresponding NUMA node. Qemu maintains this
> > association and uses this object to build the requisite GI Affinity
> > Structure.
> >
> > The admin provides the id of the device and the NUMA node id such as
> > in the following example.
> > -device vfio-pci-nohotplug,host=<bdf>,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> > -object acpi-generic-initiator,id=gi0,device=dev0,node=2 \
> 
> This diverges from the existing numa configuration more than necessary.
> The case you have here of multiple GIs per PCI device is, I hope, the corner
> case, so I'd like to see something that is really simple for single device,
> single node.  Note that, whilst we don't do CXL 1.1 etc in QEMU yet, all CXL
> accelerators before CXL 2.0 are pretty much expected to present a GI SRAT
> entry + SRAT memory entry if appropriate.
> For CXL 2.0 and later everything can be left to be discovered by the OS but
> those Generic Ports I mentioned are an important part of that.
> 
> Why not something like
> -numa node,gi=dev0 \
> -numa node,gi=dev0 \
> etc or maybe even the slightly weird
> -numa node,gi=dev0,gi=dev0,gi=dev0...
> 
> ?
> 
> >
> > Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
> > ---
> >  hw/acpi/acpi-generic-initiator.c         | 74 ++++++++++++++++++++++++
> >  hw/acpi/meson.build                      |  1 +
> >  include/hw/acpi/acpi-generic-initiator.h | 30 ++++++++++
> >  qapi/qom.json                            | 20 ++++++-
> >  4 files changed, 123 insertions(+), 2 deletions(-)
> >  create mode 100644 hw/acpi/acpi-generic-initiator.c
> >  create mode 100644 include/hw/acpi/acpi-generic-initiator.h
> >
> > diff --git a/hw/acpi/acpi-generic-initiator.c b/hw/acpi/acpi-generic-initiator.c
> > new file mode 100644
> > index 0000000000..6406736090
> > --- /dev/null
> > +++ b/hw/acpi/acpi-generic-initiator.c
> > @@ -0,0 +1,74 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "hw/qdev-properties.h"
> > +#include "qapi/error.h"
> > +#include "qapi/visitor.h"
> > +#include "qom/object_interfaces.h"
> > +#include "qom/object.h"
> > +#include "hw/qdev-core.h"
> > +#include "hw/vfio/vfio-common.h"
> > +#include "hw/vfio/pci.h"
> > +#include "hw/pci/pci_device.h"
> > +#include "sysemu/numa.h"
> > +#include "hw/acpi/acpi-generic-initiator.h"
> > +
> > +OBJECT_DEFINE_TYPE_WITH_INTERFACES(AcpiGenericInitiator, acpi_generic_initiator,
> > +                   ACPI_GENERIC_INITIATOR, OBJECT,
> > +                   { TYPE_USER_CREATABLE },
> > +                   { NULL })
> > +
> > +OBJECT_DECLARE_SIMPLE_TYPE(AcpiGenericInitiator, ACPI_GENERIC_INITIATOR)
> > +
> > +static void acpi_generic_initiator_init(Object *obj)
> > +{
> > +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> > +    gi->device = NULL;
> > +    gi->node = MAX_NODES;
> > +    gi->node_count = 1;
> > +}
> > +
> > +static void acpi_generic_initiator_finalize(Object *obj)
> > +{
> > +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> > +
> > +    g_free(gi->device);
> > +}
> > +
> > +static void acpi_generic_initiator_set_device(Object *obj, const char *value,
> > +                                              Error **errp)
> > +{
> > +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> > +
> > +    gi->device = g_strdup(value);
> > +}
> > +
> > +static void acpi_generic_initiator_set_node(Object *obj, Visitor *v,
> > +                                            const char *name, void *opaque,
> > +                                            Error **errp)
> > +{
> > +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
> > +    uint32_t value;
> > +
> > +    if (!visit_type_uint32(v, name, &value, errp)) {
> > +        return;
> > +    }
> > +
> > +    if (value >= MAX_NODES) {
> > +        return;
> > +    }
> > +
> > +    gi->node = value;
> > +}
> > +
> > +static void acpi_generic_initiator_class_init(ObjectClass *oc, void *data)
> > +{
> > +    object_class_property_add_str(oc, ACPI_GENERIC_INITIATOR_DEVICE_PROP, NULL,
> > +                                  acpi_generic_initiator_set_device);
> > +    object_class_property_add(oc, ACPI_GENERIC_INITIATOR_NODE_PROP, "uint32",
> > +                              NULL, acpi_generic_initiator_set_node, NULL,
> > +                              NULL);
> > +}
> > diff --git a/hw/acpi/meson.build b/hw/acpi/meson.build
> > index fc1b952379..22252836ed 100644
> > --- a/hw/acpi/meson.build
> > +++ b/hw/acpi/meson.build
> > @@ -5,6 +5,7 @@ acpi_ss.add(files(
> >    'bios-linker-loader.c',
> >    'core.c',
> >    'utils.c',
> > +  'acpi-generic-initiator.c',
> >  ))
> >  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_true: files('cpu.c', 'cpu_hotplug.c'))
> >  acpi_ss.add(when: 'CONFIG_ACPI_CPU_HOTPLUG', if_false: files('acpi-cpu-hotplug-stub.c'))
> > diff --git a/include/hw/acpi/acpi-generic-initiator.h b/include/hw/acpi/acpi-generic-initiator.h
> > new file mode 100644
> > index 0000000000..e67e6e23b1
> > --- /dev/null
> > +++ b/include/hw/acpi/acpi-generic-initiator.h
> > @@ -0,0 +1,30 @@
> > +#ifndef ACPI_GENERIC_INITIATOR_H
> > +#define ACPI_GENERIC_INITIATOR_H
> > +
> > +#include "hw/mem/pc-dimm.h"
> > +#include "hw/acpi/bios-linker-loader.h"
> > +#include "qemu/uuid.h"
> > +#include "hw/acpi/aml-build.h"
> > +#include "qom/object.h"
> > +#include "qom/object_interfaces.h"
> > +
> > +#define TYPE_ACPI_GENERIC_INITIATOR "acpi-generic-initiator"
> > +
> > +#define ACPI_GENERIC_INITIATOR_DEVICE_PROP "device"
> > +#define ACPI_GENERIC_INITIATOR_NODE_PROP "node"
> > +
> > +typedef struct AcpiGenericInitiator {
> > +    /* private */
> > +    Object parent;
> > +
> > +    /* public */
> > +    char *device;
> > +    uint32_t node;
> > +    uint32_t node_count;
> > +} AcpiGenericInitiator;
> > +
> > +typedef struct AcpiGenericInitiatorClass {
> > +        ObjectClass parent_class;
> > +} AcpiGenericInitiatorClass;
> > +
> > +#endif
> > diff --git a/qapi/qom.json b/qapi/qom.json
> > index fa3e88c8e6..86c87a161c 100644
> > --- a/qapi/qom.json
> > +++ b/qapi/qom.json
> > @@ -779,6 +779,20 @@
> >  { 'struct': 'VfioUserServerProperties',
> >    'data': { 'socket': 'SocketAddress', 'device': 'str' } }
> >
> > +##
> > +# @AcpiGenericInitiatorProperties:
> > +#
> > +# Properties for acpi-generic-initiator objects.
> > +#
> > +# @device: the ID of the device to be associated with the node
> > +#
> > +# @node: the ID of the numa node
> > +#
> > +# Since: 8.0
> > +##
> > +{ 'struct': 'AcpiGenericInitiatorProperties',
> > +  'data': { 'device': 'str', 'node': 'uint32' } }
> > +
> >  ##
> >  # @RngProperties:
> >  #
> > @@ -947,7 +961,8 @@
> >      'tls-creds-x509',
> >      'tls-cipher-suites',
> >      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
> > -    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
> > +    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> > +    'acpi-generic-initiator'
> >    ] }
> >
> >  ##
> > @@ -1014,7 +1029,8 @@
> >        'tls-creds-x509':             'TlsCredsX509Properties',
> >        'tls-cipher-suites':          'TlsCredsProperties',
> >        'x-remote-object':            'RemoteObjectProperties',
> > -      'x-vfio-user-server':         'VfioUserServerProperties'
> > +      'x-vfio-user-server':         'VfioUserServerProperties',
> > +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
> >    } }
> >
> >  ##



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] qom: new object to associate device to numa node
@ 2023-10-12  8:59         ` Jonathan Cameron
  0 siblings, 0 replies; 26+ messages in thread
From: Jonathan Cameron via @ 2023-10-12  8:59 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: Ankit Agrawal, Jason Gunthorpe, alex.williamson, clg,
	shannon.zhaosl, peter.maydell, ani, berrange, eduardo, imammedo,
	mst, eblake, armbru, david, gshan, Aniket Agashe, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU),
	Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm, qemu-devel,
	Dave Jiang, Shanker Donthineni

On Wed, 11 Oct 2023 17:37:11 +0000
Vikram Sethi <vsethi@nvidia.com> wrote:

> Hi Jonathan,
> 
> > -----Original Message-----
> > From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> > Sent: Monday, October 9, 2023 7:27 AM
> > To: Ankit Agrawal <ankita@nvidia.com>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>; alex.williamson@redhat.com;
> > clg@redhat.com; shannon.zhaosl@gmail.com; peter.maydell@linaro.org;
> > ani@anisinha.ca; berrange@redhat.com; eduardo@habkost.net;
> > imammedo@redhat.com; mst@redhat.com; eblake@redhat.com;
> > armbru@redhat.com; david@redhat.com; gshan@redhat.com; Aniket
> > Agashe <aniketa@nvidia.com>; Neo Jia <cjia@nvidia.com>; Kirti Wankhede
> > <kwankhede@nvidia.com>; Tarun Gupta (SW-GPU) <targupta@nvidia.com>;
> > Vikram Sethi <vsethi@nvidia.com>; Andy Currid <acurrid@nvidia.com>;
> > Dheeraj Nigam <dnigam@nvidia.com>; Uday Dhoke <udhoke@nvidia.com>;
> > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Dave Jiang
> > <dave.jiang@intel.com>
> > Subject: Re: [PATCH v2 1/3] qom: new object to associate device to numa
> > node
> > 
> > 
> > On Sun, 8 Oct 2023 01:47:38 +0530
> > <ankita@nvidia.com> wrote:
> >   
> > > From: Ankit Agrawal <ankita@nvidia.com>
> > >
> > > The CPU cache coherent device memory can be added as NUMA nodes
> > > distinct from the system memory nodes. These nodes are associated with
> > > the device and Qemu needs a way to maintain this link.  
> > 
> > Hi Ankit,
> > 
> > I'm not sure I'm convinced of the approach to creating nodes for memory
> > usage (or whether that works in Linux on all NUMA ACPI archs), but I am
> > keen to see Generic Initiator support in QEMU. I'd also like to see it done in a
> > way that naturally extends to Generic Ports which are very similar (but don't
> > hang memory off them! :) Dave Jiang posted a PoC a while back for generic
> > ports.
> > https://lore.kernel.org/qemu-devel/168185633821.899932.322047053764766056.stgit@djiang5-mobl3/
> > 
> > My concern with this approach is that it relies on a side effect of a Linux
> > implementation detail: the infrastructure to bring up coherent memory
> > is all present even for a GI-only node (if it is, which I can't recall). I'm also fairly
> > sure we never tidied up the detail of going from the GI to the device in Linux
> > (because it's harder than a _PXM entry for the device).  It requires stashing a
> > better description than the BDF before potentially doing re-enumeration, so
> > that we can rebuild the association after that is done.
> >   
> 
> I'm not sure I understood the concern. Are you suggesting that the ACPI specification
> somehow prohibits memory being associated with a GI node in the same PXM? Whether the GI is
> memory-less or has memory isn't mandated by the spec, IIRC. It certainly seems perfectly normal
> for an accelerator with memory to have a GI with memory, and for that memory to be associated with the same PXM.

Indeed, it's reasonable that a GI would have associated memory, but if it's
"normal memory" (i.e. coherent, not device-private memory accessed via a PCI BAR
etc.) then the expectation would be that the memory is in SRAT as a memory entry.
Which brings us back to the original question of whether zero-sized memory nodes
are fine.

> So what about this patch is using a Linux implementation detail? Even if Linux wasn't currently supporting
> that use case, it is something that would have been reasonable to add IMO. What am I missing?

Linux is careful to only bring up the infrastructure for specific types of
proximity node. It works its way through SRAT and sets the appropriate bitmap bits
to say which combination of PXM node types a given node is (CPU, Memory, GI, etc.).

After that walk is done it then brings up various infrastructure.
What I can't remember (you'll need to experiment) is whether there is anything not
brought up for a non-Memory node that you would need.  It might be fine, but that
doesn't mean it will remain fine.  Maybe we just need to make sure the documentation
/ comments in Linux cover this use case.  You are on your own for what other
OSes decide is valid here, as the specification does not mention this AFAIK.
If it does, then add a reference.

There is a non-trivial (potential) cost to enabling facilities on NUMA nodes that
will never make use of them - a bunch of longer searches etc. when looking
for memory.  For GIs we enable pretty much everything a CPU node uses.
That was controversial, though only well after support was already in - the
controversy being that it added costs to paths that didn't care about GIs.

Basically it boils down to this: using unexpected corners of specifications may
prove fragile.

For one thing, I'm doubtful that the NUMA description the kernel exposes (coming
from a subset of HMAT) will deal with this case.  I've not tried it though,
so you may be lucky.

Jonathan




^ permalink raw reply	[flat|nested] 26+ messages in thread


* Re: [PATCH v2 1/3] qom: new object to associate device to numa node
  2023-10-07 20:17 ` [PATCH v2 1/3] qom: new object to associate device to numa node ankita
  2023-10-09 12:26     ` Jonathan Cameron
  2023-10-09 21:16   ` Alex Williamson
@ 2023-10-13 13:16   ` Markus Armbruster
  2023-10-17 13:44     ` Ankit Agrawal
  2 siblings, 1 reply; 26+ messages in thread
From: Markus Armbruster @ 2023-10-13 13:16 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	Jonathan.Cameron, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, dnigam, udhoke, qemu-arm, qemu-devel

<ankita@nvidia.com> writes:

> From: Ankit Agrawal <ankita@nvidia.com>
>
> The CPU cache coherent device memory can be added as NUMA nodes
> distinct from the system memory nodes. These nodes are associated
> with the device and Qemu needs a way to maintain this link.
>
> Introduce a new acpi-generic-initiator object to allow the host admin
> to provide the device and the corresponding NUMA node. Qemu maintains
> this association and uses this object to build the requisite GI
> Affinity Structure.
>
> The admin provides the id of the device and the NUMA node id such
> as in the following example.
> -device vfio-pci-nohotplug,host=<bdf>,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> -object acpi-generic-initiator,id=gi0,device=dev0,node=2 \
>
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

[...]

> diff --git a/qapi/qom.json b/qapi/qom.json
> index fa3e88c8e6..86c87a161c 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -779,6 +779,20 @@
>  { 'struct': 'VfioUserServerProperties',
>    'data': { 'socket': 'SocketAddress', 'device': 'str' } }
>  
> +##
> +# @AcpiGenericInitiatorProperties:
> +#
> +# Properties for acpi-generic-initiator objects.
> +#
> +# @device: the ID of the device to be associated with the node
> +#
> +# @node: the ID of the numa node
> +#
> +# Since: 8.0

Since 8.2

> +##
> +{ 'struct': 'AcpiGenericInitiatorProperties',
> +  'data': { 'device': 'str', 'node': 'uint32' } }
> +
>  ##
>  # @RngProperties:
>  #
> @@ -947,7 +961,8 @@
>      'tls-creds-x509',
>      'tls-cipher-suites',
>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
> -    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
> +    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
> +    'acpi-generic-initiator'

Please keep the object types sorted alphabetically.

>    ] }
>  
>  ##
> @@ -1014,7 +1029,8 @@
>        'tls-creds-x509':             'TlsCredsX509Properties',
>        'tls-cipher-suites':          'TlsCredsProperties',
>        'x-remote-object':            'RemoteObjectProperties',
> -      'x-vfio-user-server':         'VfioUserServerProperties'
> +      'x-vfio-user-server':         'VfioUserServerProperties',
> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'

Likewise.

>    } }
>  
>  ##



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-07 20:17 ` [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object ankita
  2023-10-09 12:30     ` Jonathan Cameron
  2023-10-09 21:16   ` Alex Williamson
@ 2023-10-13 13:17   ` Markus Armbruster
  2 siblings, 0 replies; 26+ messages in thread
From: Markus Armbruster @ 2023-10-13 13:17 UTC (permalink / raw)
  To: ankita
  Cc: jgg, alex.williamson, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, david, gshan,
	Jonathan.Cameron, aniketa, cjia, kwankhede, targupta, vsethi,
	acurrid, dnigam, udhoke, qemu-arm, qemu-devel

Same comments as for PATCH 1.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/3] qom: new object to associate device to numa node
  2023-10-13 13:16   ` Markus Armbruster
@ 2023-10-17 13:44     ` Ankit Agrawal
  0 siblings, 0 replies; 26+ messages in thread
From: Ankit Agrawal @ 2023-10-17 13:44 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Jason Gunthorpe, alex.williamson, clg, shannon.zhaosl,
	peter.maydell, ani, berrange, eduardo, imammedo, mst, eblake,
	david, gshan, Jonathan.Cameron, Aniket Agashe, Neo Jia,
	Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm,
	qemu-devel

>> +static void acpi_generic_initiator_set_node(Object *obj, Visitor *v,
>> +                                            const char *name, void *opaque,
>> +                                            Error **errp)
>> +{
>> +    AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
>> +    uint32_t value;
>> +
>> +    if (!visit_type_uint32(v, name, &value, errp)) {
>> +        return;
>> +    }
>> +
>> +    if (value >= MAX_NODES) {
>
> error_setg() here?  Thanks,

Thanks for pointing out, will make the change in the next version.
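
A minimal sketch of how the corrected setter might look (assuming QEMU's
error_setg() from qapi/error.h; the exact error wording below is illustrative,
not taken from the series):

    static void acpi_generic_initiator_set_node(Object *obj, Visitor *v,
                                                const char *name, void *opaque,
                                                Error **errp)
    {
        AcpiGenericInitiator *gi = ACPI_GENERIC_INITIATOR(obj);
        uint32_t value;

        if (!visit_type_uint32(v, name, &value, errp)) {
            return;
        }

        if (value >= MAX_NODES) {
            /* Report the out-of-range value instead of failing silently */
            error_setg(errp, "%s: NUMA node id %" PRIu32 " out of range (max %d)",
                       name, value, MAX_NODES - 1);
            return;
        }

        gi->node = value;
    }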

>> +##
>> +# @AcpiGenericInitiatorProperties:
>> +#
>> +# Properties for acpi-generic-initiator objects.
>> +#
>> +# @device: the ID of the device to be associated with the node
>> +#
>> +# @node: the ID of the numa node
>> +#
>> +# Since: 8.0
>
> Since 8.2

Thanks, will make the change.

>>      'tls-creds-x509',
>>      'tls-cipher-suites',
>>      { 'name': 'x-remote-object', 'features': [ 'unstable' ] },
>> -    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] }
>> +    { 'name': 'x-vfio-user-server', 'features': [ 'unstable' ] },
>> +    'acpi-generic-initiator'
>
> Please keep the object types sorted alphabetically.

Ack.

>> @@ -1014,7 +1029,8 @@
>>        'tls-creds-x509':             'TlsCredsX509Properties',
>>        'tls-cipher-suites':          'TlsCredsProperties',
>>        'x-remote-object':            'RemoteObjectProperties',
>> -      'x-vfio-user-server':         'VfioUserServerProperties'
>> +      'x-vfio-user-server':         'VfioUserServerProperties',
>> +      'acpi-generic-initiator':     'AcpiGenericInitiatorProperties'
>
> Likewise.

Ack, will make the change.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 2/3] hw/acpi: Implement the SRAT GI affinity structure
  2023-10-09 21:16   ` Alex Williamson
@ 2023-10-17 13:51     ` Ankit Agrawal
  0 siblings, 0 replies; 26+ messages in thread
From: Ankit Agrawal @ 2023-10-17 13:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	Jonathan.Cameron, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm,
	qemu-devel

>> +static void build_srat_generic_initiator_affinity(GArray *table_data, int node,
>> +                                                  PCIDeviceHandle *handle,
>> +                                                  GenericAffinityFlags flags)
>> +{
>> +    build_append_int_noprefix(table_data, 5, 1);     /* Type */
>> +    build_append_int_noprefix(table_data, 32, 1);    /* Length */
>> +    build_append_int_noprefix(table_data, 0, 1);     /* Reserved */
>> +    build_append_int_noprefix(table_data, 1, 1);     /* Device Handle Type */
>> +    build_append_int_noprefix(table_data, node, 4);  /* Proximity Domain */
>> +    build_append_int_noprefix(table_data, handle->segment, 2);
>> +    build_append_int_noprefix(table_data, handle->bdf, 2);
>> +    build_append_int_noprefix(table_data, handle->res0, 4);
>> +    build_append_int_noprefix(table_data, handle->res1, 8);
>
> Why are we storing reserved fields in the PCIDeviceHandle?  This
> function is already specific to building a PCI Device Handle, so we
> could just loop build_append_byte() with a fixed zero value here.

Good point, will make the change.
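
One possible shape of the reworked helper, with the reserved portion of the PCI
Device Handle emitted directly as zero bytes (the signature here is an assumption
for illustration; the field sizes follow the 32-byte Generic Initiator Affinity
Structure layout):

    static void build_srat_generic_initiator_affinity(GArray *table_data, int node,
                                                      uint16_t segment, uint16_t bdf,
                                                      GenericAffinityFlags flags)
    {
        int i;

        build_append_int_noprefix(table_data, 5, 1);       /* Type */
        build_append_int_noprefix(table_data, 32, 1);      /* Length */
        build_append_int_noprefix(table_data, 0, 1);       /* Reserved */
        build_append_int_noprefix(table_data, 1, 1);       /* Device Handle Type: PCI */
        build_append_int_noprefix(table_data, node, 4);    /* Proximity Domain */
        build_append_int_noprefix(table_data, segment, 2); /* PCI Segment */
        build_append_int_noprefix(table_data, bdf, 2);     /* PCI BDF */
        for (i = 0; i < 12; i++) {
            build_append_byte(table_data, 0);              /* Device Handle reserved */
        }
        build_append_int_noprefix(table_data, flags, 4);   /* Flags */
        build_append_int_noprefix(table_data, 0, 4);       /* Reserved */
    }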


>> +void build_srat_generic_initiator(GArray *table_data)
>> +{
>> +    GSList *gi_list, *list = acpi_generic_initiator_get_list();
>> +    for (gi_list = list; gi_list; gi_list = gi_list->next) {
>> +        AcpiGenericInitiator *gi = gi_list->data;
>> +        Object *o;
>> +        int count;
>> +
>> +        if (gi->node == MAX_NODES) {
>> +            continue;
>> +        }
>
> Why do we have uninitialized AcpiGenericInitiator objects lingering?

Right, we don't need the check.

>> +
>> +        o = object_resolve_path_type(gi->device, TYPE_VFIO_PCI_NOHOTPLUG, NULL);
>
> TYPE_PCI_DEVICE?  Maybe you could check hotpluggable from the device
> class, but certainly the generic code should not be dependent on being
> a vfio-pci-nohotplug device.  

Understood.

> The spec also supports an ACPI object
> description, so should this be build_srat_generic_pci_initiator()?

Sure, makes sense.

>> +        if (!o) {
>> +            continue;
>> +        }
>> +
>> +        for (count = 0; count < gi->node_count; count++) {
>> +            PCIDeviceHandle dev_handle = {0};
>> +            PCIDevice *pci_dev = PCI_DEVICE(o);
>> +
>> +            dev_handle.bdf = pci_dev->devfn;
>
> Where does the bus part of the bdf get filled in?

Good catch, will add code to fold in the bus number.
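
A sketch of how the full BDF could be assembled, assuming QEMU's
pci_get_bus()/pci_bus_num() helpers and the PCI_BUILD_BDF() macro:

    PCIDevice *pci_dev = PCI_DEVICE(o);

    /* Combine the bus number with devfn rather than using devfn alone */
    dev_handle.bdf = PCI_BUILD_BDF(pci_bus_num(pci_get_bus(pci_dev)),
                                   pci_dev->devfn);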

>> +            build_srat_generic_initiator_affinity(table_data,
>> +                                                  gi->node + count, &dev_handle,
>> +                                                  GEN_AFFINITY_ENABLED);
>
> Seems like the code that built the AcpiGenericInitiator object should
> supply the flags.  In fact the flag GEN_AFFINITY_ENABLED might be a
> better indicator to populate the SRAT with the GI than the node value.

Yeah, sure.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-09 21:16   ` Alex Williamson
@ 2023-10-17 14:00     ` Ankit Agrawal
  2023-10-17 15:21       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Ankit Agrawal @ 2023-10-17 14:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	Jonathan.Cameron, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm,
	qemu-devel

>>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8
>
> Why didn't we just implement start and count in the base object (or a
> list)? It seems like this gives the nvidia-acpi-generic-initiator two
> different ways to set gi->node, either node= of the parent or
> numa-node-start= here.  Once we expose the implicit node count in the
> base object, I'm not sure the purpose of this object.  I would have
> thought it for keying the build of the NVIDIA specific _DSD, but that's
> not implemented in this version.

Agree, allowing a list of nodes to be provided to the acpi-generic-initiator
will remove the need for the nvidia-acpi-generic-initiator object. 
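
For illustration, one acpi-generic-initiator object per node would then cover the
8-node case without the NVIDIA-specific type (hypothetical command line, assuming
the base object keeps its single device/node pair):

        -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
        -object acpi-generic-initiator,id=gi0,device=dev0,node=2 \
        -object acpi-generic-initiator,id=gi1,device=dev0,node=3 \
        ...
        -object acpi-generic-initiator,id=gi7,device=dev0,node=9 \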

> I also don't see any programmatic means for management tools to know how
> many nodes to create.  For example what happens if there's a MIGv2 that
> supports 16 partitions by default and makes use of the same vfio-pci
> variant driver?  Thanks,

It is supposed to stay at 8 for all the G+H devices. Maybe this can be managed
through proper documentation in the user manual?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-09 21:27     ` Alex Williamson
@ 2023-10-17 14:18       ` Ankit Agrawal
  0 siblings, 0 replies; 26+ messages in thread
From: Ankit Agrawal @ 2023-10-17 14:18 UTC (permalink / raw)
  To: Alex Williamson, Jonathan Cameron
  Cc: Jason Gunthorpe, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	Aniket Agashe, Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm,
	qemu-devel

>> An admin can provide the range of nodes using numa-node-start and
>> numa-node-count and link it to a device by providing its id. The following
>> sample creates 8 nodes and link them to the device dev0:
>> 
>>         -numa node,nodeid=2 \
>>         -numa node,nodeid=3 \
>>         -numa node,nodeid=4 \
>>         -numa node,nodeid=5 \
>>         -numa node,nodeid=6 \
>>         -numa node,nodeid=7 \
>>         -numa node,nodeid=8 \
>>         -numa node,nodeid=9 \
>>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
>>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8 
>
> If you go this way, use an array of references to the numa nodes instead of a start and number.
> There is no obvious reason why they should be contiguous that I can see.

Right.

> I think it is simpler the other way around though - so have the numa nodes point at the
> vfio-pci-nohotplug device. 
>
> Why not something like
> -numa node,gi=dev0 \
> -numa node,gi=dev0 \

That could be one way to do this vs the other suggestion on this thread to provide
a node list through the acpi-generic-initiator object.

The only catch is that Qemu initializes the numa nodes before the vfio-pci device, which
could affect our ability to fail cleanly in case the user provides a wrong vfio-pci id.

I am fine with both approaches. Will implement the node list option if there aren't
any strong opinions.

>> If you go this way, use an array of references to the numa nodes instead of a start and number.
>> There is no obvious reason why they should be contiguous that I can see.
>
> Yup, I was looking for other places we allow a list syntax, I only
> found fds=a:b:...:z, which is also used for vhostfds=.  I didn't find a
> property beyond the string type to hold this though.

Maybe something like 'node=1-3,8-10'?

> Right, a uint16List should do.

Sure, thanks.
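
A rough sketch of where that could land, schema and invocation together (the
list-valued 'node' property and the JSON -object form are assumptions about the
eventual syntax, not the merged version):

    { 'struct': 'AcpiGenericInitiatorProperties',
      'data': { 'device': 'str', 'node': ['uint16'] } }

        -object '{"qom-type":"acpi-generic-initiator","id":"gi0","device":"dev0","node":[2,3,4,5,6,7,8,9]}' \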


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-17 14:00     ` Ankit Agrawal
@ 2023-10-17 15:21       ` Alex Williamson
  2023-10-17 15:28         ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2023-10-17 15:21 UTC (permalink / raw)
  To: Ankit Agrawal
  Cc: Jason Gunthorpe, clg, shannon.zhaosl, peter.maydell, ani,
	berrange, eduardo, imammedo, mst, eblake, armbru, david, gshan,
	Jonathan.Cameron, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm,
	qemu-devel, libvir-list, Laine Stump

On Tue, 17 Oct 2023 14:00:54 +0000
Ankit Agrawal <ankita@nvidia.com> wrote:

> >>         -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> >>         -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8  
> >
> > Why didn't we just implement start and count in the base object (or a
> > list)? It seems like this gives the nvidia-acpi-generic-initiator two
> > different ways to set gi->node, either node= of the parent or
> > numa-node-start= here.  Once we expose the implicit node count in the
> > base object, I'm not sure the purpose of this object.  I would have
> > thought it for keying the build of the NVIDIA specific _DSD, but that's
> > not implemented in this version.  
> 
> Agree, allowing a list of nodes to be provided to the acpi-generic-initiator
> will remove the need for the nvidia-acpi-generic-initiator object. 

And what happened to the _DSD?  Is it no longer needed?  Why?

> > I also don't see any programmatic means for management tools to know how
> > many nodes to create.  For example what happens if there's a MIGv2 that
> > supports 16 partitions by default and makes use of the same vfio-pci
> > variant driver?  Thanks,  
> 
> It is supposed to stay at 8 for all the G+H devices. Maybe this can be managed
> through proper documentation in the user manual?

I thought the intention here was that a management tool would
automatically configure the VM with these nodes and GI object in
support of the device.  Planning only for Grace-Hopper isn't looking
too far into the future and it's difficult to make software that can
reference a user manual.  This leads to a higher maintenance burden
where the management tool needs to recognize not only the driver, but
the device bound to the driver and update as new devices are released.
The management tool will never automatically support new devices without
making an assumption about the node configuration.

Do we therefore need some programmatic means for the kernel driver to
expose the node configuration to userspace?  What interfaces would
libvirt like to see here?  Is there an opportunity that this could
begin to define flavors or profiles for variant devices like we have
types for mdev devices where the node configuration would be
encompassed in a device profile?  Thanks,

Alex



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-17 15:21       ` Alex Williamson
@ 2023-10-17 15:28         ` Jason Gunthorpe
  2023-10-17 16:54           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2023-10-17 15:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ankit Agrawal, clg, shannon.zhaosl, peter.maydell, ani, berrange,
	eduardo, imammedo, mst, eblake, armbru, david, gshan,
	Jonathan.Cameron, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm,
	qemu-devel, libvir-list, Laine Stump

On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:

> Do we therefore need some programmatic means for the kernel driver to
> expose the node configuration to userspace?  What interfaces would
> libvirt like to see here?  Is there an opportunity that this could
> begin to define flavors or profiles for variant devices like we have
> types for mdev devices where the node configuration would be
> encompassed in a device profile? 

I don't think we should shift this mess into the kernel..

We have a wide range of things now that the orchestration must do in
order to prepare that are fairly device specific. I understand in K8S
configurations the preference is using operators (aka user space
drivers) to trigger these things.

Supplying a few extra qemu command line options seems minor compared
to all the profile and provisioning work that has to happen for other
device types.

Jason


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-17 15:28         ` Jason Gunthorpe
@ 2023-10-17 16:54           ` Alex Williamson
  2023-10-17 17:24             ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2023-10-17 16:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ankit Agrawal, clg, shannon.zhaosl, peter.maydell, ani, berrange,
	eduardo, imammedo, mst, eblake, armbru, david, gshan,
	Jonathan.Cameron, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm,
	qemu-devel, libvir-list, Laine Stump

On Tue, 17 Oct 2023 12:28:30 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> 
> > Do we therefore need some programmatic means for the kernel driver to
> > expose the node configuration to userspace?  What interfaces would
> > libvirt like to see here?  Is there an opportunity that this could
> > begin to define flavors or profiles for variant devices like we have
> > types for mdev devices where the node configuration would be
> > encompassed in a device profile?   
> 
> I don't think we should shift this mess into the kernel..
> 
> We have a wide range of things now that the orchestration must do in
> order to prepare that are fairly device specific. I understand in K8S
> configurations the preference is using operators (aka user space
> drivers) to trigger these things.
> 
> Supplying a few extra qemu command line options seems minor compared
> to all the profile and provisioning work that has to happen for other
> device types.

This seems to be a growing problem for things like mlx5-vfio-pci where
there's non-trivial device configuration necessary to enable migration
support.  It's not super clear to me how those devices are actually
expected to be used in practice with that configuration burden.

Are we simply saying here that it's implicit knowledge that the
orchestration must possess that when assigning devices exactly matching
10de:2342 or 10de:2345 when bound to the nvgrace-gpu-vfio-pci driver
that 8 additional NUMA nodes should be added to the VM and an ACPI
generic initiator object created linking those additional nodes to the
assigned GPU?

Is libvirt ok with that specification or are we simply going to bubble
this up as a user problem? Thanks,

Alex



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object
  2023-10-17 16:54           ` Alex Williamson
@ 2023-10-17 17:24             ` Jason Gunthorpe
  0 siblings, 0 replies; 26+ messages in thread
From: Jason Gunthorpe @ 2023-10-17 17:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ankit Agrawal, clg, shannon.zhaosl, peter.maydell, ani, berrange,
	eduardo, imammedo, mst, eblake, armbru, david, gshan,
	Jonathan.Cameron, Aniket Agashe, Neo Jia, Kirti Wankhede,
	Tarun Gupta (SW-GPU),
	Vikram Sethi, Andy Currid, Dheeraj Nigam, Uday Dhoke, qemu-arm,
	qemu-devel, libvir-list, Laine Stump

On Tue, Oct 17, 2023 at 10:54:19AM -0600, Alex Williamson wrote:
> On Tue, 17 Oct 2023 12:28:30 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> > 
> > > Do we therefore need some programmatic means for the kernel driver to
> > > expose the node configuration to userspace?  What interfaces would
> > > libvirt like to see here?  Is there an opportunity that this could
> > > begin to define flavors or profiles for variant devices like we have
> > > types for mdev devices where the node configuration would be
> > > encompassed in a device profile?   
> > 
> > I don't think we should shift this mess into the kernel..
> > 
> > We have a wide range of things now that the orchestration must do in
> > order to prepare that are fairly device specific. I understand in K8S
> > configurations the preference is using operators (aka user space
> > drivers) to trigger these things.
> > 
> > Supplying a few extra qemu command line options seems minor compared
> > to all the profile and provisioning work that has to happen for other
> > device types.
> 
> This seems to be a growing problem for things like mlx5-vfio-pci where
> there's non-trivial device configuration necessary to enable migration
> support.  It's not super clear to me how those devices are actually
> expected to be used in practice with that configuration burden.

Yes, it is the nature of the situation. There is lots and lots of
stuff in the background here. We can nibble at some things, but I
don't see a way to be completely free of a userspace driver providing
the orchestration piece for every device type.

Maybe someone who knows more about how the k8s stuff works can explain
more?

> Are we simply saying here that it's implicit knowledge that the
> orchestration must possess that when assigning devices exactly matching
> 10de:2342 or 10de:2345 when bound to the nvgrace-gpu-vfio-pci driver
> that 8 additional NUMA nodes should be added to the VM and an ACPI
> generic initiator object created linking those additional nodes to the
> assigned GPU?

What I'm trying to say is that orchestration should try to pull in a
userspace driver to provide the non-generic pieces.

But, it isn't clear to me what that driver is generically. Something
like this case is pretty stand alone, but mlx5 needs to interact with
the networking control plane to fully provision the PCI
function. Storage needs a different control plane. Few PCI devices are
so stand alone they can be provisioned without complicated help. 

Even things like IDXD need orchestration to sort of uniquely
understand how to spawn their SIOV functions.

I'm not sure I see a clear vision from the libvirt side of how all these
parts interact in the libvirt world, or if the answer is "use
openshift and the operators".

Jason


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2023-10-17 17:24 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-07 20:17 [PATCH v2 0/3] acpi: report numa nodes for device memory using GI ankita
2023-10-07 20:17 ` [PATCH v2 1/3] qom: new object to associate device to numa node ankita
2023-10-09 12:26   ` Jonathan Cameron via
2023-10-09 12:26     ` Jonathan Cameron
2023-10-11 17:37     ` Vikram Sethi
2023-10-12  8:59       ` Jonathan Cameron via
2023-10-12  8:59         ` Jonathan Cameron
2023-10-09 21:16   ` Alex Williamson
2023-10-13 13:16   ` Markus Armbruster
2023-10-17 13:44     ` Ankit Agrawal
2023-10-07 20:17 ` [PATCH v2 2/3] hw/acpi: Implement the SRAT GI affinity structure ankita
2023-10-09 21:16   ` Alex Williamson
2023-10-17 13:51     ` Ankit Agrawal
2023-10-07 20:17 ` [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object ankita
2023-10-09 12:30   ` Jonathan Cameron via
2023-10-09 12:30     ` Jonathan Cameron
2023-10-09 12:57     ` David Hildenbrand
2023-10-09 21:27     ` Alex Williamson
2023-10-17 14:18       ` Ankit Agrawal
2023-10-09 21:16   ` Alex Williamson
2023-10-17 14:00     ` Ankit Agrawal
2023-10-17 15:21       ` Alex Williamson
2023-10-17 15:28         ` Jason Gunthorpe
2023-10-17 16:54           ` Alex Williamson
2023-10-17 17:24             ` Jason Gunthorpe
2023-10-13 13:17   ` Markus Armbruster
