* [QEMU-devel][RFC PATCH 0/1] Introduce HostMemType for 'memory-backend-*'
@ 2024-01-01  7:53 Ho-Ren (Jack) Chuang
  2024-01-01  7:53 ` [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram' Ho-Ren (Jack) Chuang
  0 siblings, 1 reply; 19+ messages in thread
From: Ho-Ren (Jack) Chuang @ 2024-01-01  7:53 UTC (permalink / raw)
  To: Michael S. Tsirkin, Hao Xiang, Jonathan Cameron, Ben Widawsky,
	Gregory Price, Fan Ni, Ira Weiny, Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel
  Cc: Ho-Ren (Jack) Chuang, Ho-Ren (Jack) Chuang, linux-cxl

Heterogeneous memory setups are becoming popular. Today, we can run a server
system combining DRAM, pmem, HBM, and CXL-attached DDR. CXL-attached DDR
shares the same set of attributes as normal system DRAM but has higher
latency and lower bandwidth. With the rapid increase in CPU core counts on
today's server platforms, memory capacity and bandwidth become bottlenecks.
High-capacity memory devices are very expensive and deliver poor value on a
dollar-per-GB basis, and the number of memory channels per socket is limited,
so the total memory capacity per socket is effectively capped by cost.

For a cloud service provider, virtual machines are a fundamental service.
VM models have pre-set vCPU counts and memory capacities, and large-memory
VM models have a higher memory capacity per vCPU. Delivering VM instances
with the same vCPU and memory configuration on new-generation Intel/AMD
server platforms becomes challenging as CPU core counts rapidly increase.
With the help of CXL local memory expanders, we can install more DDR memory
devices on a socket and almost double the total memory capacity per socket
at a reasonable cost on new server platforms. Thus, we can continue to
deliver existing VM models. On top of that, low-cost, large-memory VM models
become a possibility.

CXL-attached memory (CXL type-3 device) can be used in exactly the same way
as system-DRAM but with somewhat degraded performance. QEMU is in the process
of supporting CXL virtualization. Currently, in QEMU, we can already create
virtualized CXL memory devices, and a guest OS running the latest Linux kernel
can successfully bring CXL memory online.
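
For reference, bringing the CXL memory online inside the guest is roughly the
following with the cxl/daxctl tools (the device name below is illustrative,
and the exact create-region flags depend on the decoder and memdev layout):

	cxl create-region ...
	daxctl reconfigure-device --mode=system-ram dax0.0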

We conducted benchmark testing on VMs with three setups:
1. VM with virtualized system-DRAM, backed by system-DRAM on
the physical host. No virtualized CXL memory.
2. VM with virtualized system-DRAM, backed by CXL-attached memory on
the physical host. No virtualized CXL memory.
3. VM with virtualized system-DRAM, backed by system-DRAM on
the physical host, and virtualized CXL memory backed by CXL-attached memory on
the physical host.
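
For reference, setup 2 can be expressed with today's options by binding the
backend to the host's CXL NUMA node, for example (node ID and size are
illustrative):

	-object memory-backend-ram,id=mem0,size=16G,policy=bind,host-nodes=2 \
	-numa node,nodeid=0,cpus=0-3,memdev=mem0 \

Setup 3 additionally exposes a virtualized CXL type-3 device, along the lines
of the example command line in patch 1/1.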

Benchmark 1: Intel Memory Latency Checker
Link: https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html
Guest VM idle latency for random access in nanoseconds:
- System-DRAM backed by system-DRAM on the host = 116.1
- CXL memory backed by CXL-attached DRAM on the host = 266.8
- System-DRAM backed by CXL-attached DRAM on the host = 269.7
From within the guest VM, read/write latency on memory backed by
host CXL-DRAM is about 2X that of memory backed by host system-DRAM.
We observe the same performance result regardless of whether memory is exposed
as virtualized system-DRAM or virtualized CXL memory. The driving factor for
performance is the memory backend (backing memory type: system-DRAM vs
CXL-DRAM) on the host, not the frontend memory type (virtualized system-DRAM
vs virtualized CXL-DRAM) exposed to the guest OS.

Benchmark 2: Redis memtier benchmark
Link: https://redis.com/blog/memtier_benchmark-a-high-throughput-benchmarking-tool-for-redis-memcached/
Guest VM Redis concurrent read latency in milliseconds:
- Key size = 800B, value size = 800B
                         P50         P99         P999
(1) System-DRAM only    13.43       40.95       243.71
(2) CXL-DRAM only       29.18       49.15       249.85
(3) Tiered memory       13.76       39.16       241.66

- Key size = 800B, value size = 70kB
                           P50        P99        P999
(1) System-DRAM only     342.01     630.78      925.69
(2) CXL-DRAM only        696.32     720.89     1007.61
(3) Tiered memory        610.30     671.74     1011.71

From within the guest VM, the Redis server is filled with a large number of
in-memory key-value pairs. Almost all memory is used inside the VM. We then
start a workload with concurrent read operations. For (3), we only read the
key-value pairs located in CXL-DRAM.

- The P50 latency for read operations is almost the same between the tiered
memory and system-DRAM-only setups when the value size is small (800 bytes).
The results are primarily dominated by the rest of the software stack, such
as communication and CPU caching; the read workload consistently hits the
key-value pairs stored in CXL memory. However, the P50 latency in the
CXL-only setup is more than 2X that of the other two setups. When the value
size is small, the latency of a read operation is mostly spent in the
software stack, so it appears critical to keep the guest Linux kernel on
system-DRAM-backed memory for good performance.

- The P50 latency for read operations is about 70% worse in the tiered
memory setup than in the system-DRAM-only setup when the value size is large
(70 KB). The CXL-only setup consistently exhibits poor performance. When the
value size is large, the latency of a read operation is mostly spent reading
the value from the Redis server; here, too, the read workload consistently
hits the key-value pairs stored in CXL memory.

Please note that in our experiment, the tiered memory system did not
promote/demote pages as expected. The tiered memory setup should perform
better as the Linux community gradually improves the page promotion/demotion
algorithms. The Linux kernel community has developed a tiered memory system
to better utilize the various types of DRAM-like memory.
The future of memory tiering: https://lwn.net/Articles/931421/


Having the guest kernel run on system-DRAM-backed memory and application
data on CXL-backed memory shows performance comparable to the topline
(system-DRAM only), and it is also a cost-effective solution. Moreover, in
the near future, users will be able to benefit from memory with different
characteristics. Enabling the tiered memory system in the guest OS seems to
be the right direction. To enable these scenarios end to end, we need some
plumbing work in the memory backend and CXL memory virtualization stacks.

- QEMU's memory backend object needs an option to automatically map guest
memory to a specified type of memory on the host. Taking CXL-attached memory
as an example, we can then create a virtualized CXL type-3 device as the
frontend and automatically map it to a CXL memory backend. This patchset
contains a prototype implementation that accomplishes this.
We introduce a new configuration option 'host-mem-type=', enabling users
to specify the type of memory from which they want to allocate. The argument
'cxlram' tells QEMU to automatically locate CXL-DRAM NUMA nodes on the host
and use them as the backend memory, which is convenient for users.
There is no existing Linux kernel API to explicitly allocate memory from
CXL-attached memory, so the prototype relies on the information exposed by
the dax kmem driver under the sysfs path
'/sys/bus/cxl/devices/region[X]/dax_region[X]/dax[X]/target_node'.
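
For illustration, the same information the prototype reads can be checked by
hand on the host; the region/dax names and the node ID below are just
examples:

	# cat /sys/bus/cxl/devices/region0/dax_region0/dax0.0/target_node
	2
	# numactl -H    # node 2 should show up as a CPU-less node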

- Kernel memory tiering uses the dax kmem driver's device probe path to
query ACPI for the CXL device attributes (latency, bandwidth) and calculates
its abstract distance. The abstract distance places the memory in the
correct tier. Although QEMU already provides the "-numa hmat-lb" option to
set memory latency/bandwidth attributes, we were not able to connect the
dots from end to end. After setting the attributes in QEMU, booting up the
VM, and creating devdax CXL devices, the guest kernel was not able to read
the correct memory attributes for the devdax devices. We are still debugging
that path, but we suspect that it's due to missing functionality in the CXL
virtualization support.
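
For reference, setting these attributes in QEMU looks roughly like the
following (targets and values are illustrative):

	-machine type=q35,hmat=on \
	-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=200 \
	-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 \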

- When creating two virtualized CXL type-3 devices and bringing them up by
using cxl and daxctl tools, we were not able to create the
2nd memory region/devdax device inside the VM. We are debugging this issue
but would appreciate feedback if others are also dealing with similar
challenges.

Ho-Ren (Jack) Chuang (1):
  backends/hostmem: qapi/qom: Add ObjectOptions for memory-backend-*
    called HostMemType

 backends/hostmem.c       | 184 +++++++++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |   1 +
 qapi/common.json         |  19 ++++
 qapi/qom.json            |   1 +
 qemu-options.hx          |   2 +-
 5 files changed, 206 insertions(+), 1 deletion(-)

-- 
Regards,
Hao Xiang and Ho-Ren (Jack) Chuang



* [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-01  7:53 [QEMU-devel][RFC PATCH 0/1] Introduce HostMemType for 'memory-backend-*' Ho-Ren (Jack) Chuang
@ 2024-01-01  7:53 ` Ho-Ren (Jack) Chuang
  2024-01-02 10:29   ` Philippe Mathieu-Daudé
                     ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Ho-Ren (Jack) Chuang @ 2024-01-01  7:53 UTC (permalink / raw)
  To: Michael S. Tsirkin, Hao Xiang, Jonathan Cameron, Ben Widawsky,
	Gregory Price, Fan Ni, Ira Weiny, Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel
  Cc: Ho-Ren (Jack) Chuang, Ho-Ren (Jack) Chuang, linux-cxl

Introduce a new configuration option 'host-mem-type=' in the
'-object memory-backend-ram', allowing users to specify
from which type of memory to allocate.

Users can specify 'cxlram' as an argument, and QEMU will then
automatically locate CXL RAM NUMA nodes and use them as the backend memory.
For example:
	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
	-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \

In v1, we plan to move most of the implementations to util and break down
this patch into different smaller patches.

Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
---
 backends/hostmem.c       | 184 +++++++++++++++++++++++++++++++++++++++
 include/sysemu/hostmem.h |   1 +
 qapi/common.json         |  19 ++++
 qapi/qom.json            |   1 +
 qemu-options.hx          |   2 +-
 5 files changed, 206 insertions(+), 1 deletion(-)

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 747e7838c0..3bede13879 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -44,6 +44,133 @@ host_memory_backend_get_name(HostMemoryBackend *backend)
     return object_get_canonical_path(OBJECT(backend));
 }
 
+#define FILE_LINE_LEN 256
+static int
+is_valid_node(const char *path) {
+    FILE *file = fopen(path, "r");
+    if (file == NULL) {
+        return -1;
+    }
+
+    char line[FILE_LINE_LEN];
+    if (fgets(line, sizeof(line), file) != NULL) {
+        int target_node = atoi(line);
+
+        if (target_node >= 0) {
+            fclose(file);
+            return target_node;
+        }
+    }
+
+    fclose(file);
+    return -1;
+}
+
+static int
+is_directory(const char *path) {
+    struct stat path_stat;
+    stat(path, &path_stat);
+    return S_ISDIR(path_stat.st_mode);
+}
+
+static int
+is_symlink(const char *path) {
+    struct stat path_stat;
+    if (lstat(path, &path_stat) == -1) {
+        return 0;
+    }
+    return S_ISLNK(path_stat.st_mode);
+}
+
+#define CXL_DEVICE_PATH "/sys/bus/cxl/devices/"
+#define REGION_PATH_LEN 307
+#define DAX_REGION_PATH_LEN 563
+#define DAX_PATH_LEN 819
+#define TARGET_FILE_PATH_LEN 831
+/*
+ * return: the number of valid numa node id found
+ */
+static int
+host_memory_backend_get_cxlram_nodes(int *valid_cxlram_nodes) {
+    DIR *base_dir = NULL, *region_dir = NULL, *dax_region_dir = NULL;
+    const char *base_dir_path = CXL_DEVICE_PATH;
+    struct dirent *entry;
+    int valid_node = 0, ret = 0;
+
+    base_dir = opendir(base_dir_path);
+    if (base_dir == NULL) {
+        return valid_node;
+    }
+
+    while ((entry = readdir(base_dir)) != NULL) {
+        char region_path[REGION_PATH_LEN];
+
+        ret = snprintf(region_path, sizeof(region_path), "%s%s",
+                                            base_dir_path, entry->d_name);
+        if (ret < 0 ||
+            !is_symlink(region_path) ||
+            strncmp(entry->d_name, "region", ARRAY_SIZE("region") - 1)) {
+            continue;
+        }
+
+        region_dir = opendir(region_path);
+        if (region_dir == NULL) {
+            goto region_exit;
+        }
+
+        while ((entry = readdir(region_dir)) != NULL) {
+            char dax_region_path[DAX_REGION_PATH_LEN];
+
+            ret = snprintf(dax_region_path, sizeof(dax_region_path), "%s/%s",
+                                                    region_path, entry->d_name);
+            if (ret < 0 ||
+                !is_directory(dax_region_path) ||
+                strncmp(entry->d_name, "dax_region",
+                            ARRAY_SIZE("dax_region") - 1)) {
+
+                continue;
+            }
+
+            dax_region_dir = opendir(dax_region_path);
+            if (dax_region_dir == NULL) {
+                goto dax_region_exit;
+            }
+
+            while ((entry = readdir(dax_region_dir)) != NULL) {
+                int target_node;
+                char dax_path[DAX_PATH_LEN];
+                char target_file_path[TARGET_FILE_PATH_LEN];
+                ret = snprintf(dax_path, sizeof(dax_path), "%s/%s",
+                                            dax_region_path, entry->d_name);
+                if (ret < 0 ||
+                    !is_directory(dax_path) ||
+                    strncmp(entry->d_name, "dax", ARRAY_SIZE("dax") - 1)) {
+                    continue;
+                }
+
+                ret = snprintf(target_file_path, sizeof(target_file_path),
+                                                    "%s/target_node", dax_path);
+                if (ret < 0) {
+                    continue;
+                }
+
+                target_node = is_valid_node(target_file_path);
+                if (target_node >= 0) {
+                    valid_cxlram_nodes[valid_node] = target_node;
+                    valid_node++;
+                }
+            }
+        }
+    }
+
+    closedir(dax_region_dir);
+dax_region_exit:
+    closedir(region_dir);
+region_exit:
+    closedir(base_dir);
+    return valid_node;
+}
+
 static void
 host_memory_backend_get_size(Object *obj, Visitor *v, const char *name,
                              void *opaque, Error **errp)
@@ -117,6 +244,12 @@ host_memory_backend_set_host_nodes(Object *obj, Visitor *v, const char *name,
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     uint16List *l, *host_nodes = NULL;
 
+    if (backend->host_mem_type == HOST_MEM_TYPE_CXLRAM) {
+        error_setg(errp,
+            "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+        return;
+    }
+
     visit_type_uint16List(v, name, &host_nodes, errp);
 
     for (l = host_nodes; l; l = l->next) {
@@ -150,6 +283,11 @@ host_memory_backend_set_policy(Object *obj, int policy, Error **errp)
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
     backend->policy = policy;
 
+    if (backend->host_mem_type == HOST_MEM_TYPE_CXLRAM) {
+        error_setg(errp,
+            "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+    }
+
 #ifndef CONFIG_NUMA
     if (policy != HOST_MEM_POLICY_DEFAULT) {
         error_setg(errp, "NUMA policies are not supported by this QEMU");
@@ -157,6 +295,46 @@ host_memory_backend_set_policy(Object *obj, int policy, Error **errp)
 #endif
 }
 
+static int
+host_memory_backend_get_host_mem_type(Object *obj, Error **errp G_GNUC_UNUSED)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+    return backend->host_mem_type;
+}
+
+static void
+host_memory_backend_set_host_mem_type(Object *obj, int host_mem_type, Error **errp)
+{
+    HostMemoryBackend *backend = MEMORY_BACKEND(obj);
+    backend->host_mem_type = host_mem_type;
+
+#ifndef CONFIG_NUMA
+    error_setg(errp, "NUMA node host memory types are not supported by this QEMU");
+#else
+    int i, valid_cxlram_nodes[MAX_NODES];
+
+    if (backend->policy > 0 ||
+        !bitmap_empty(backend->host_nodes, MAX_NODES)) {
+        error_setg(errp,
+            "'host-mem-type=' and 'host-nodes='/'policy=' are incompatible");
+        return;
+    }
+
+    if (host_memory_backend_get_cxlram_nodes(valid_cxlram_nodes) > 0) {
+        for (i = 0; i < MAX_NODES; i++) {
+            if (valid_cxlram_nodes[i] < 0) {
+                break;
+            }
+            bitmap_set(backend->host_nodes, valid_cxlram_nodes[i], 1);
+        }
+    } else {
+        error_setg(errp, "Cannot find CXL RAM on host");
+        return;
+    }
+    backend->policy = HOST_MEM_POLICY_BIND;
+#endif
+}
+
 static bool host_memory_backend_get_merge(Object *obj, Error **errp)
 {
     HostMemoryBackend *backend = MEMORY_BACKEND(obj);
@@ -536,6 +714,12 @@ host_memory_backend_class_init(ObjectClass *oc, void *data)
         host_memory_backend_get_share, host_memory_backend_set_share);
     object_class_property_set_description(oc, "share",
         "Mark the memory as private to QEMU or shared");
+    object_class_property_add_enum(oc, "host-mem-type", "HostMemType",
+        &HostMemType_lookup,
+        host_memory_backend_get_host_mem_type,
+        host_memory_backend_set_host_mem_type);
+    object_class_property_set_description(oc, "host-mem-type",
+        "Set the backend host memory type");
 #ifdef CONFIG_LINUX
     object_class_property_add_bool(oc, "reserve",
         host_memory_backend_get_reserve, host_memory_backend_set_reserve);
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 39326f1d4f..afeb9b71d1 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -70,6 +70,7 @@ struct HostMemoryBackend {
     ThreadContext *prealloc_context;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
     HostMemPolicy policy;
+    HostMemType host_mem_type;
 
     MemoryRegion mr;
 };
diff --git a/qapi/common.json b/qapi/common.json
index 6fed9cde1a..591fd73291 100644
--- a/qapi/common.json
+++ b/qapi/common.json
@@ -167,6 +167,25 @@
 { 'enum': 'HostMemPolicy',
   'data': [ 'default', 'preferred', 'bind', 'interleave' ] }
 
+##
+# @HostMemType:
+#
+# Automatically find a backend memory type on host.
+# Can be further extened to support other types such as cxlpmem, hbm.
+#
+# @none: do nothing (default).
+#
+# @cxlram: a CXL RAM backend on host.
+#
+# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same
+# time. HostMemType is used to automatically bind with one kind of
+# host memory types.
+#
+# Since: 8.3
+##
+{ 'enum': 'HostMemType',
+  'data': [ 'none', 'cxlram' ] }
+
 ##
 # @NetFilterDirection:
 #
diff --git a/qapi/qom.json b/qapi/qom.json
index 95516ba325..fa3bc29708 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -626,6 +626,7 @@
             '*host-nodes': ['uint16'],
             '*merge': 'bool',
             '*policy': 'HostMemPolicy',
+            '*host-mem-type': 'HostMemType',
             '*prealloc': 'bool',
             '*prealloc-threads': 'uint32',
             '*prealloc-context': 'str',
diff --git a/qemu-options.hx b/qemu-options.hx
index b66570ae00..39074c1aa0 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -5211,7 +5211,7 @@ SRST
         (``share=off``). For this use case, we need writable RAM instead
         of ROM, and want to also set ``rom=off``.
 
-    ``-object memory-backend-ram,id=id,merge=on|off,dump=on|off,share=on|off,prealloc=on|off,size=size,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
+    ``-object memory-backend-ram,id=id,merge=on|off,dump=on|off,share=on|off,prealloc=on|off,size=size,host-mem-type=cxlram,host-nodes=host-nodes,policy=default|preferred|bind|interleave``
         Creates a memory backend object, which can be used to back the
         guest RAM. Memory backend objects offer more control than the
         ``-m`` option that is traditionally used to define guest RAM.
-- 
Regards,
Hao Xiang and Ho-Ren (Jack) Chuang



* Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-01  7:53 ` [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram' Ho-Ren (Jack) Chuang
@ 2024-01-02 10:29   ` Philippe Mathieu-Daudé
  2024-01-02 13:03   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2024-01-02 10:29 UTC (permalink / raw)
  To: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Hao Xiang,
	Jonathan Cameron, Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel
  Cc: Ho-Ren (Jack) Chuang, linux-cxl

Hi Jack,

On 1/1/24 08:53, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
> 
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> 	-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
> 
> In v1, we plan to move most of the implementations to util and break down
> this patch into different smaller patches.
> 
> Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>
> ---
>   backends/hostmem.c       | 184 +++++++++++++++++++++++++++++++++++++++
>   include/sysemu/hostmem.h |   1 +
>   qapi/common.json         |  19 ++++
>   qapi/qom.json            |   1 +
>   qemu-options.hx          |   2 +-
>   5 files changed, 206 insertions(+), 1 deletion(-)
> 
> diff --git a/backends/hostmem.c b/backends/hostmem.c


> +#define CXL_DEVICE_PATH "/sys/bus/cxl/devices/"
> +#define REGION_PATH_LEN 307
> +#define DAX_REGION_PATH_LEN 563
> +#define DAX_PATH_LEN 819
> +#define TARGET_FILE_PATH_LEN 831

How do you get these numbers?

> diff --git a/qapi/common.json b/qapi/common.json
> index 6fed9cde1a..591fd73291 100644
> --- a/qapi/common.json
> +++ b/qapi/common.json
> @@ -167,6 +167,25 @@
>   { 'enum': 'HostMemPolicy',
>     'data': [ 'default', 'preferred', 'bind', 'interleave' ] }
>   
> +##
> +# @HostMemType:
> +#
> +# Automatically find a backend memory type on host.

This description is not clear (to me).

> +# Can be further extened to support other types such as cxlpmem, hbm.

Typo "extended" although I'm not sure it is helpful to mention it.

> +#
> +# @none: do nothing (default).

"do nothing" is confusing here, I'd drop it.

> +#
> +# @cxlram: a CXL RAM backend on host.
> +#
> +# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same
> +# time. HostMemType is used to automatically bind with one kind of
> +# host memory types.
> +#
> +# Since: 8.3

9.0

> +##
> +{ 'enum': 'HostMemType',
> +  'data': [ 'none', 'cxlram' ] }
> +
>   ##
>   # @NetFilterDirection:
>   #
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 95516ba325..fa3bc29708 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -626,6 +626,7 @@
>               '*host-nodes': ['uint16'],
>               '*merge': 'bool',
>               '*policy': 'HostMemPolicy',
> +            '*host-mem-type': 'HostMemType',

Missing documentation in MemoryBackendProperties.

>               '*prealloc': 'bool',
>               '*prealloc-threads': 'uint32',
>               '*prealloc-context': 'str',



* Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-01  7:53 ` [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram' Ho-Ren (Jack) Chuang
  2024-01-02 10:29   ` Philippe Mathieu-Daudé
@ 2024-01-02 13:03   ` David Hildenbrand
  2024-01-06  0:45     ` [External] " Hao Xiang
  2024-01-03 21:56   ` Gregory Price
  2024-01-12 15:32   ` Markus Armbruster
  3 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2024-01-02 13:03 UTC (permalink / raw)
  To: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Hao Xiang,
	Jonathan Cameron, Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	Igor Mammedov, Eric Blake, Markus Armbruster, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel
  Cc: Ho-Ren (Jack) Chuang, linux-cxl

On 01.01.24 08:53, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
> 
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> 	-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
> 

You can achieve the exact same thing already simply by using memory 
policies and detecting the node(s) before calling QEMU, no?

There has to be a good reason to add such a shortcut into QEMU, and it 
should be spelled out here.

-- 
Cheers,

David / dhildenb



* Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-01  7:53 ` [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram' Ho-Ren (Jack) Chuang
  2024-01-02 10:29   ` Philippe Mathieu-Daudé
  2024-01-02 13:03   ` David Hildenbrand
@ 2024-01-03 21:56   ` Gregory Price
  2024-01-06  5:59     ` [External] " Hao Xiang
  2024-01-12 15:32   ` Markus Armbruster
  3 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2024-01-03 21:56 UTC (permalink / raw)
  To: Ho-Ren (Jack) Chuang
  Cc: Michael S. Tsirkin, Hao Xiang, Jonathan Cameron, Ben Widawsky,
	Gregory Price, Fan Ni, Ira Weiny, Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Sun, Dec 31, 2023 at 11:53:15PM -0800, Ho-Ren (Jack) Chuang wrote:
> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
> 
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \

Stupid questions:

Why not just use `host-nodes` and pass in the numa node you want to
allocate from? Why should QEMU be made "CXL-aware" in the sense that
QEMU is responsible for figuring out what host node has CXL memory?

This feels like an "upper level software" operation (orchestration), rather
than something qemu should internally understand.

> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \

For a variety of performance reasons, this will not work the way you
want it to.  You are essentially telling QEMU to map the vmem0 into a
virtual cxl device, and now any memory accesses to that memory region
will end up going through the cxl-type3 device logic - which is an IO
path from the perspective of QEMU.

You may want to double check that your tests aren't using swap space in
the guest, because I found in my testing that linux would prefer to use
swap rather than attempt to use virtual cxl-type3 device memory because
of how god-awful slow it is (even if it is backed by DRAM).


Additionally, this configuration will not (should not) presently work
with VMX enabled.  Unless I missed some other update, QEMU w/ CXL memory
presently crashes when VMX is enabled for not-yet-debugged reasons.

Another possibility: You mapped this memory-backend into another numa
node explicitly and never onlined the memory via cxlcli.  I've done
this, and it works, but it's a "hidden feature" that probably should
not exist / be supported.



If I understand the goal here, it's to pass CXL-hosted DRAM through to
the guest in a way that the system can manage it according to its
performance attributes.

You're better off just using the `host-nodes` field of host-memory
and passing bandwidth/latency attributes though via `-numa hmat-lb`

In that scenario, the guest software doesn't even need to know CXL
exists at all, it can just read the attributes of the numa node
that QEMU created for it.

In the future to deal with proper dynamic capacity, we may need to
consider a different backend object altogether that allows sparse
allocations, and a virtual cxl device which pre-allocates the CFMW
can at least be registered to manage it.  I'm not quite sure how that
looks just yet.

For example: 1-socket, 4 CPU QEMU instance w/ 4GB on a cpu-node and 4GB
on a cpuless node.

qemu-system-x86_64 \
-nographic \
-accel kvm \
-machine type=q35,hmat=on \
-drive file=./myvm.qcow2,format=qcow2,index=0,media=disk,id=hd \
-m 8G,slots=4,maxmem=16G \
-smp cpus=4 \
-object memory-backend-ram,size=4G,id=ram-node0,numa=X \ <-- extend here
-object memory-backend-ram,size=4G,id=ram-node1,numa=Y \ <-- extend here
-numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \          <-- cpu node
-numa node,initiator=0,nodeid=1,memdev=ram-node1 \       <-- cpuless node
-netdev bridge,id=hn0,br=virbr0 \
-device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:77 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880

[root@fedora ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 3965 MB
node 0 free: 3611 MB
node 1 cpus:
node 1 size: 3986 MB
node 1 free: 3960 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

[root@fedora initiators]# cd /sys/devices/system/node/node1/access0/initiators
node0  read_bandwidth  read_latency  write_bandwidth  write_latency
[root@fedora initiators]# cat read_bandwidth
5
[root@fedora initiators]# cat read_latency
20


~Gregory


* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-02 13:03   ` David Hildenbrand
@ 2024-01-06  0:45     ` Hao Xiang
  0 siblings, 0 replies; 19+ messages in thread
From: Hao Xiang @ 2024-01-06  0:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	Igor Mammedov, Eric Blake, Markus Armbruster, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Tue, Jan 2, 2024 at 5:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.01.24 08:53, Ho-Ren (Jack) Chuang wrote:
> > Introduce a new configuration option 'host-mem-type=' in the
> > '-object memory-backend-ram', allowing users to specify
> > from which type of memory to allocate.
> >
> > Users can specify 'cxlram' as an argument, and QEMU will then
> > automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> > For example:
> >       -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> >       -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >       -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> >       -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> >       -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
> >
>
> You can achieve the exact same thing already simply by using memory
> policies and detecting the node(s) before calling QEMU, no?

Yes, I agree this can be done with memory policies by binding to the CXL
memory NUMA nodes on the host.
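
For example, an orchestrator could do something roughly like the following
before launching QEMU (the sysfs path and node ID are illustrative):

	CXL_NODE=$(cat /sys/bus/cxl/devices/region0/dax_region0/dax0.0/target_node)
	qemu-system-x86_64 ... \
	    -object memory-backend-ram,id=vmem0,size=19G,policy=bind,host-nodes=$CXL_NODE \
	    ...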

>
> There has to be a good reason to add such a shortcut into QEMU, and it
> should be spelled out here.

So our end goal here is to enable CXL memory in the guest VM and have
the guest kernel place the CXL memory in the correct memory tier
(slow tier) of the Linux kernel tiered memory system.

Here is what we observed:
* The kernel tiered memory system relies on the memory attributes
(read/write latency and bandwidth from ACPI) to distinguish the fast
and slow tiers.
* The kernel tiered memory system has two paths for recognizing a
memory tier: 1) during mm subsystem init (memory_tier_init) and 2) in
the kmem driver device probe (dev_dax_kmem_probe). Since the ACPI
subsystem is initialized after mm, reading the memory attributes from
ACPI can only be done in 2). CXL memory therefore has to be presented
as a devdax device, which can then be probed by the kmem driver in
the guest and recognized as the slow tier.

We do see that QEMU has the "-numa hmat-lb" option to set the memory
attributes for each of the VM's NUMA nodes. The problem is that those
attributes only apply to NUMA nodes created during guest kernel
initialization, whereas a CXL devdax device can only be created after
kernel initialization, and new NUMA nodes are created for the CXL
devdax devices. The guest kernel does not read the "-numa hmat-lb"
attributes for those CXL devdax devices.

So we thought that if we create an explicit CXL memory backend and
associate it with the virtual CXL type-3 frontend, we can pass the
CXL memory attributes from the host into the guest VM and avoid using
memory policies and "-numa hmat-lb", thus simplifying the
configuration. We are still figuring out exactly how to pass the
memory attributes from the CXL backend into the VM. There is probably
also a solution on the "-numa hmat-lb" side for the CXL devdax
devices, and we are looking into that as well.

>
> --
> Cheers,
>
> David / dhildenb
>


* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-03 21:56   ` Gregory Price
@ 2024-01-06  5:59     ` Hao Xiang
  2024-01-08 17:15       ` Gregory Price
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Xiang @ 2024-01-06  5:59 UTC (permalink / raw)
  To: Gregory Price
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Sun, Dec 31, 2023 at 11:53:15PM -0800, Ho-Ren (Jack) Chuang wrote:
> > Introduce a new configuration option 'host-mem-type=' in the
> > '-object memory-backend-ram', allowing users to specify
> > from which type of memory to allocate.
> >
> > Users can specify 'cxlram' as an argument, and QEMU will then
> > automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> > For example:
> >       -object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
>
> Stupid questions:
>
> Why not just use `host-nodes` and pass in the numa node you want to
> allocate from? Why should QEMU be made "CXL-aware" in the sense that
> QEMU is responsible for figuring out what host node has CXL memory?
>
> This feels like an "upper level software" operation (orchestration), rather
> than something qemu should internally understand.

I don't have a "big picture" and I am learning. Maybe we proposed
something not useful :-) I replied to the same question on a fork of
this thread.

>
> >       -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >       -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> >       -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
>
> For a variety of performance reasons, this will not work the way you
> want it to.  You are essentially telling QEMU to map the vmem0 into a
> virtual cxl device, and now any memory accesses to that memory region
> will end up going through the cxl-type3 device logic - which is an IO
> path from the perspective of QEMU.

I didn't understand exactly how the virtual cxl-type3 device works. I
thought it would go with the same "guest virtual address ->  guest
physical address -> host physical address" translation totally done by
CPU. But if it is going through an emulation path handled by virtual
cxl-type3, I agree the performance would be bad. Do you know why
accessing memory on a virtual cxl-type3 device can't go with the
nested page table translation?

>
> You may want to double check that your tests aren't using swap space in
> the guest, because I found in my testing that linux would prefer to use
> swap rather than attempt to use virtual cxl-type3 device memory because
> of how god-awful slow it is (even if it is backed by DRAM).

We didn't enable swap in our current test setup. I think there was a
kernel change that makes the mm page reclamation path demote pages to
CXL memory instead of swapping them out when memory tiering is
enabled. Did you try that? Swap is persistent storage; I would be very
surprised if virtual CXL memory is actually slower than swap.
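
On recent kernels, that reclaim-time demotion path is gated by a sysfs
knob, if I remember the path correctly:

	echo 1 > /sys/kernel/mm/numa/demotion_enabled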

>
>
> Additionally, this configuration will not (should not) presently work
> with VMX enabled.  Unless I missed some other update, QEMU w/ CXL memory
> presently crashes when VMX is enabled for not-yet-debugged reasons.

When we had a discussion with Intel, they told us to not use the KVM
option in QEMU while using virtual cxl type3 device. That's probably
related to the issue you described here? We enabled KVM though but
haven't seen the crash yet.

>
> Another possibility: You mapped this memory-backend into another numa
> node explicitly and never onlined the memory via cxlcli.  I've done
> this, and it works, but it's a "hidden feature" that probably should
> not exist / be supported.

I thought general purpose memory nodes are onlined by default?

>
>
>
> If I understand the goal here, it's to pass CXL-hosted DRAM through to
> the guest in a way that the system can manage it according to its
> performance attributes.

Yes.

>
> You're better off just using the `host-nodes` field of host-memory
> and passing bandwidth/latency attributes though via `-numa hmat-lb`

We tried this but it doesn't work from end to end right now. I
described the issue in another fork of this thread.

>
> In that scenario, the guest software doesn't even need to know CXL
> exists at all, it can just read the attributes of the numa node
> that QEMU created for it.

We thought about this before. But the current kernel implementation
requires a devdax device to be probed and recognized as a slow tier
(by reading the memory attributes). I don't think this can be done via
the path you described. Have you tried this before?

>
> In the future to deal with proper dynamic capacity, we may need to
> consider a different backend object altogether that allows sparse
> allocations, and a virtual cxl device which pre-allocates the CFMW
> can at least be registered to manage it.  I'm not quite sure how that
> looks just yet.

Are we talking about CXL memory pooling?

>
> For example: 1-socket, 4 CPU QEMU instance w/ 4GB on a cpu-node and 4GB
> on a cpuless node.
>
> qemu-system-x86_64 \
> -nographic \
> -accel kvm \
> -machine type=q35,hmat=on \
> -drive file=./myvm.qcow2,format=qcow2,index=0,media=disk,id=hd \
> -m 8G,slots=4,maxmem=16G \
> -smp cpus=4 \
> -object memory-backend-ram,size=4G,id=ram-node0,numa=X \ <-- extend here
> -object memory-backend-ram,size=4G,id=ram-node1,numa=Y \ <-- extend here
> -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \          <-- cpu node
> -numa node,initiator=0,nodeid=1,memdev=ram-node1 \       <-- cpuless node
> -netdev bridge,id=hn0,br=virbr0 \
> -device virtio-net-pci,netdev=hn0,id=nic1,mac=52:54:00:12:34:77 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>
> [root@fedora ~]# numactl -H
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3
> node 0 size: 3965 MB
> node 0 free: 3611 MB
> node 1 cpus:
> node 1 size: 3986 MB
> node 1 free: 3960 MB
> node distances:
> node   0   1
>   0:  10  20
>   1:  20  10
>
> [root@fedora initiators]# cd /sys/devices/system/node/node1/access0/initiators
> node0  read_bandwidth  read_latency  write_bandwidth  write_latency
> [root@fedora initiators]# cat read_bandwidth
> 5
> [root@fedora initiators]# cat read_latency
> 20
>
>
> ~Gregory


* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-06  5:59     ` [External] " Hao Xiang
@ 2024-01-08 17:15       ` Gregory Price
  2024-01-08 22:47         ` Hao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2024-01-08 17:15 UTC (permalink / raw)
  To: Hao Xiang
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > For a variety of performance reasons, this will not work the way you
> > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > virtual cxl device, and now any memory accesses to that memory region
> > will end up going through the cxl-type3 device logic - which is an IO
> > path from the perspective of QEMU.
> 
> I didn't understand exactly how the virtual cxl-type3 device works. I
> thought it would go with the same "guest virtual address ->  guest
> physical address -> host physical address" translation totally done by
> CPU. But if it is going through an emulation path handled by virtual
> cxl-type3, I agree the performance would be bad. Do you know why
> accessing memory on a virtual cxl-type3 device can't go with the
> nested page table translation?
>

Because a byte-access on CXL memory can have checks on it that must be
emulated by the virtual device, and because there are caching
implications that have to be emulated as well.

The cxl device you are using is an emulated CXL device - not a
virtualization interface.  Nuanced difference:  the emulated device has
to emulate *everything* that CXL device does.

What you want is passthrough / managed access to a real device -
virtualization.  This is not the way to accomplish that.  A better way
to accomplish that is to simply pass the memory through as a static numa
node as I described.

> 
> When we had a discussion with Intel, they told us to not use the KVM
> option in QEMU while using virtual cxl type3 device. That's probably
> related to the issue you described here? We enabled KVM though but
> haven't seen the crash yet.
>

The crash really only happens, IIRC, if code ends up hosted in that
memory.  I forget the exact scenario, but the working theory is it has
to do with the way instruction caches are managed with KVM and this
device.

> >
> > You're better off just using the `host-nodes` field of host-memory
> > and passing bandwidth/latency attributes though via `-numa hmat-lb`
> 
> We tried this but it doesn't work from end to end right now. I
> described the issue in another fork of this thread.
>
> >
> > In that scenario, the guest software doesn't even need to know CXL
> > exists at all, it can just read the attributes of the numa node
> > that QEMU created for it.
> 
> We thought about this before. But the current kernel implementation
> requires a devdax device to be probed and recognized as a slow tier
> (by reading the memory attributes). I don't think this can be done via
> the path you described. Have you tried this before?
>

Right, because the memory tiering component lumps the nodes together.

Better idea:  Fix the memory tiering component

I cc'd you on another patch line that is discussing something relevant
to this.

https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8

The point is: There's no need for this to be a dax device at all, there
is no need for the guest to even know what is providing the memory, or
for the guest to have any management access to the memory.  It just
wants the memory and the ability to tier it.

So we should fix the memory tiering component to work with this
workflow.

~Gregory


* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-08 17:15       ` Gregory Price
@ 2024-01-08 22:47         ` Hao Xiang
  2024-01-09  1:05           ` Hao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Xiang @ 2024-01-08 22:47 UTC (permalink / raw)
  To: Gregory Price
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > >
> > > For a variety of performance reasons, this will not work the way you
> > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > virtual cxl device, and now any memory accesses to that memory region
> > > will end up going through the cxl-type3 device logic - which is an IO
> > > path from the perspective of QEMU.
> >
> > I didn't understand exactly how the virtual cxl-type3 device works. I
> > thought it would go with the same "guest virtual address ->  guest
> > physical address -> host physical address" translation totally done by
> > CPU. But if it is going through an emulation path handled by virtual
> > cxl-type3, I agree the performance would be bad. Do you know why
> > accessing memory on a virtual cxl-type3 device can't go with the
> > nested page table translation?
> >
>
> Because a byte-access on CXL memory can have checks on it that must be
> emulated by the virtual device, and because there are caching
> implications that have to be emulated as well.

Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
CXL memory data path goes through them, the performance would be
pretty problematic. We have actually run Intel's Memory Latency
Checker benchmark from inside a guest VM with both system-DRAM and
virtual CXL-type3 configured. The idle latency on the virtual CXL
memory is 2X of system DRAM, which is on-par with the benchmark
running from a physical host. I need to debug this more to understand
why the latency is actually much better than I would expect now.

>
> The cxl device you are using is an emulated CXL device - not a
> virtualization interface.  Nuanced difference:  the emulated device has
> to emulate *everything* that CXL device does.
>
> What you want is passthrough / managed access to a real device -
> virtualization.  This is not the way to accomplish that.  A better way
> to accomplish that is to simply pass the memory through as a static numa
> node as I described.

That would work, too. But I think a kernel change is required to
establish the correct memory tiering if we go this route.

>
> >
> > When we had a discussion with Intel, they told us to not use the KVM
> > option in QEMU while using virtual cxl type3 device. That's probably
> > related to the issue you described here? We enabled KVM though but
> > haven't seen the crash yet.
> >
>
> The crash really only happens, IIRC, if code ends up hosted in that
> memory.  I forget the exact scenario, but the working theory is it has
> to do with the way instruction caches are managed with KVM and this
> device.
>
> > >
> > > You're better off just using the `host-nodes` field of host-memory
> > > and passing bandwidth/latency attributes though via `-numa hmat-lb`
> >
> > We tried this but it doesn't work from end to end right now. I
> > described the issue in another fork of this thread.
> >
> > >
> > > In that scenario, the guest software doesn't even need to know CXL
> > > exists at all, it can just read the attributes of the numa node
> > > that QEMU created for it.
> >
> > We thought about this before. But the current kernel implementation
> > requires a devdax device to be probed and recognized as a slow tier
> > (by reading the memory attributes). I don't think this can be done via
> > the path you described. Have you tried this before?
> >
>
> Right, because the memory tiering component lumps the nodes together.
>
> Better idea:  Fix the memory tiering component
>
> I cc'd you on another patch line that is discussing something relevant
> to this.
>
> https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
>
> The point is: There's no need for this to be a dax device at all, there
> is no need for the guest to even know what is providing the memory, or
> for the guest to have any management access to the memory.  It just
> wants the memory and the ability to tier it.
>
> So we should fix the memory tiering component to work with this
> workflow.

Agreed. We really don't need the devdax device at all. I thought that
choice was made due to the memory tiering concept being started with
pmem ... Let's continue this part of the discussion on the above
thread.

>
> ~Gregory


* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-08 22:47         ` Hao Xiang
@ 2024-01-09  1:05           ` Hao Xiang
  2024-01-09  1:13             ` Gregory Price
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Xiang @ 2024-01-09  1:05 UTC (permalink / raw)
  To: Gregory Price
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
>
> On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > > >
> > > > For a variety of performance reasons, this will not work the way you
> > > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > > virtual cxl device, and now any memory accesses to that memory region
> > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > path from the perspective of QEMU.
> > >
> > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > thought it would go with the same "guest virtual address ->  guest
> > > physical address -> host physical address" translation totally done by
> > > CPU. But if it is going through an emulation path handled by virtual
> > > cxl-type3, I agree the performance would be bad. Do you know why
> > > accessing memory on a virtual cxl-type3 device can't go with the
> > > nested page table translation?
> > >
> >
> > Because a byte-access on CXL memory can have checks on it that must be
> > emulated by the virtual device, and because there are caching
> > implications that have to be emulated as well.
>
> Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> CXL memory data path goes through them, the performance would be
> pretty problematic. We have actually run Intel's Memory Latency
> Checker benchmark from inside a guest VM with both system-DRAM and
> virtual CXL-type3 configured. The idle latency on the virtual CXL
> memory is 2X of system DRAM, which is on-par with the benchmark
> running from a physical host. I need to debug this more to understand
> why the latency is actually much better than I would expect now.

So we double-checked the benchmark results. What we see is that running
the Intel Memory Latency Checker from a guest VM with virtual CXL memory
and from a physical host with a CXL 1.1 memory expander gives the same
latency.

From guest VM: local socket system-DRAM latency is 117.0ns, local
socket CXL-DRAM latency is 269.4ns
From physical host: local socket system-DRAM latency is 113.6ns ,
local socket CXL-DRAM latency is 267.5ns

I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
while running the benchmark, but those two functions are never hit. We
used the virtual CXL configuration when launching QEMU, but the CXL
memory is present as a separate NUMA node and we are not creating
devdax devices. Does that make any difference?

>
> >
> > The cxl device you are using is an emulated CXL device - not a
> > virtualization interface.  Nuanced difference:  the emulated device has
> > to emulate *everything* that CXL device does.
> >
> > What you want is passthrough / managed access to a real device -
> > virtualization.  This is not the way to accomplish that.  A better way
> > to accomplish that is to simply pass the memory through as a static numa
> > node as I described.
>
> That would work, too. But I think a kernel change is required to
> establish the correct memory tiering if we go this route.
>
> >
> > >
> > > When we had a discussion with Intel, they told us to not use the KVM
> > > option in QEMU while using virtual cxl type3 device. That's probably
> > > related to the issue you described here? We enabled KVM though but
> > > haven't seen the crash yet.
> > >
> >
> > The crash really only happens, IIRC, if code ends up hosted in that
> > memory.  I forget the exact scenario, but the working theory is it has
> > to do with the way instruction caches are managed with KVM and this
> > device.
> >
> > > >
> > > > You're better off just using the `host-nodes` field of host-memory
> > > > and passing bandwidth/latency attributes though via `-numa hmat-lb`
> > >
> > > We tried this but it doesn't work from end to end right now. I
> > > described the issue in another fork of this thread.
> > >
> > > >
> > > > In that scenario, the guest software doesn't even need to know CXL
> > > > exists at all, it can just read the attributes of the numa node
> > > > that QEMU created for it.
> > >
> > > We thought about this before. But the current kernel implementation
> > > requires a devdax device to be probed and recognized as a slow tier
> > > (by reading the memory attributes). I don't think this can be done via
> > > the path you described. Have you tried this before?
> > >
> >
> > Right, because the memory tiering component lumps the nodes together.
> >
> > Better idea:  Fix the memory tiering component
> >
> > I cc'd you on another patch line that is discussing something relevant
> > to this.
> >
> > https://lore.kernel.org/linux-mm/87fs00njft.fsf@yhuang6-desk2.ccr.corp.intel.com/T/#m32d58f8cc607aec942995994a41b17ff711519c8
> >
> > The point is: There's no need for this to be a dax device at all, there
> > is no need for the guest to even know what is providing the memory, or
> > for the guest to have any management access to the memory.  It just
> > wants the memory and the ability to tier it.
> >
> > So we should fix the memory tiering component to work with this
> > workflow.
>
> Agreed. We really don't need the devdax device at all. I thought that
> choice was made due to the memory tiering concept being started with
> pmem ... Let's continue this part of the discussion on the above
> thread.
>
> >
> > ~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-09  1:05           ` Hao Xiang
@ 2024-01-09  1:13             ` Gregory Price
  2024-01-09 19:33               ` Hao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2024-01-09  1:13 UTC (permalink / raw)
  To: Hao Xiang
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
> >
> > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
> > >
> > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > > > >
> > > > > For a variety of performance reasons, this will not work the way you
> > > > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > > > virtual cxl device, and now any memory accesses to that memory region
> > > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > > path from the perspective of QEMU.
> > > >
> > > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > > thought it would go with the same "guest virtual address ->  guest
> > > > physical address -> host physical address" translation totally done by
> > > > CPU. But if it is going through an emulation path handled by virtual
> > > > cxl-type3, I agree the performance would be bad. Do you know why
> > > > accessing memory on a virtual cxl-type3 device can't go with the
> > > > nested page table translation?
> > > >
> > >
> > > Because a byte-access on CXL memory can have checks on it that must be
> > > emulated by the virtual device, and because there are caching
> > > implications that have to be emulated as well.
> >
> > Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> > CXL memory data path goes through them, the performance would be
> > pretty problematic. We have actually run Intel's Memory Latency
> > Checker benchmark from inside a guest VM with both system-DRAM and
> > virtual CXL-type3 configured. The idle latency on the virtual CXL
> > memory is 2X of system DRAM, which is on-par with the benchmark
> > running from a physical host. I need to debug this more to understand
> > why the latency is actually much better than I would expect now.
> 
> So we double checked on benchmark testing. What we see is that running
> Intel Memory Latency Checker from a guest VM with virtual CXL memory
> VS from a physical host with CXL1.1 memory expander has the same
> latency.
> 
> From guest VM: local socket system-DRAM latency is 117.0ns, local
> socket CXL-DRAM latency is 269.4ns
> From physical host: local socket system-DRAM latency is 113.6ns ,
> local socket CXL-DRAM latency is 267.5ns
> 
> I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
> while running the benchmark testing but those two functions are not
> ever hit. We used the virtual CXL configuration while launching QEMU
> but the CXL memory is present as a separate NUMA node and we are not
> creating devdax devices. Does that make any difference?
> 

Could you possibly share your full QEMU configuration and what OS/kernel
you are running inside the guest?

The only thing I'm surprised by is that the numa node appears without
requiring the driver to generate the NUMA node.  It's possible I missed
a QEMU update that allows this.

~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-09  1:13             ` Gregory Price
@ 2024-01-09 19:33               ` Hao Xiang
  2024-01-09 19:57                 ` Gregory Price
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Xiang @ 2024-01-09 19:33 UTC (permalink / raw)
  To: Gregory Price
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Mon, Jan 08, 2024 at 05:05:38PM -0800, Hao Xiang wrote:
> > On Mon, Jan 8, 2024 at 2:47 PM Hao Xiang <hao.xiang@bytedance.com> wrote:
> > >
> > > On Mon, Jan 8, 2024 at 9:15 AM Gregory Price <gregory.price@memverge.com> wrote:
> > > >
> > > > On Fri, Jan 05, 2024 at 09:59:19PM -0800, Hao Xiang wrote:
> > > > > On Wed, Jan 3, 2024 at 1:56 PM Gregory Price <gregory.price@memverge.com> wrote:
> > > > > >
> > > > > > For a variety of performance reasons, this will not work the way you
> > > > > > want it to.  You are essentially telling QEMU to map the vmem0 into a
> > > > > > virtual cxl device, and now any memory accesses to that memory region
> > > > > > will end up going through the cxl-type3 device logic - which is an IO
> > > > > > path from the perspective of QEMU.
> > > > >
> > > > > I didn't understand exactly how the virtual cxl-type3 device works. I
> > > > > thought it would go with the same "guest virtual address ->  guest
> > > > > physical address -> host physical address" translation totally done by
> > > > > CPU. But if it is going through an emulation path handled by virtual
> > > > > cxl-type3, I agree the performance would be bad. Do you know why
> > > > > accessing memory on a virtual cxl-type3 device can't go with the
> > > > > nested page table translation?
> > > > >
> > > >
> > > > Because a byte-access on CXL memory can have checks on it that must be
> > > > emulated by the virtual device, and because there are caching
> > > > implications that have to be emulated as well.
> > >
> > > Interesting. Now that I see the cxl_type3_read/cxl_type3_write. If the
> > > CXL memory data path goes through them, the performance would be
> > > pretty problematic. We have actually run Intel's Memory Latency
> > > Checker benchmark from inside a guest VM with both system-DRAM and
> > > virtual CXL-type3 configured. The idle latency on the virtual CXL
> > > memory is 2X of system DRAM, which is on-par with the benchmark
> > > running from a physical host. I need to debug this more to understand
> > > why the latency is actually much better than I would expect now.
> >
> > So we double checked on benchmark testing. What we see is that running
> > Intel Memory Latency Checker from a guest VM with virtual CXL memory
> > VS from a physical host with CXL1.1 memory expander has the same
> > latency.
> >
> > From guest VM: local socket system-DRAM latency is 117.0ns, local
> > socket CXL-DRAM latency is 269.4ns
> > From physical host: local socket system-DRAM latency is 113.6ns ,
> > local socket CXL-DRAM latency is 267.5ns
> >
> > I also set debugger breakpoints on cxl_type3_read/cxl_type3_write
> > while running the benchmark testing but those two functions are not
> > ever hit. We used the virtual CXL configuration while launching QEMU
> > but the CXL memory is present as a separate NUMA node and we are not
> > creating devdax devices. Does that make any difference?
> >
>
> Could you possibly share your full QEMU configuration and what OS/kernel
> you are running inside the guest?

Sounds like the technical details are explained in the other thread.
From what I understand now, if we don't go through a full CXL
setup, memory accesses don't go through the emulation path.

Here is our exact setup. The guest runs Linux kernel 6.6-rc2:

taskset --cpu-list 0-47,96-143 \
numactl -N 0 -m 0 ${QEMU} \
-M q35,cxl=on,hmat=on \
-m 64G \
-smp 8,sockets=1,cores=8,threads=1 \
-object memory-backend-ram,id=ram0,size=45G \
-numa node,memdev=ram0,cpus=0-7,nodeid=0 \
-msg timestamp=on -L /usr/share/seabios \
-enable-kvm \
-object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
-numa node,memdev=vmem0,nodeid=1 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=14 \
-numa dist,src=1,dst=0,val=14 \
-numa dist,src=1,dst=1,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-latency,latency=91 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-latency,latency=100 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=read-bandwidth,bandwidth=262100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-bandwidth,bandwidth=30000M \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=write-bandwidth,bandwidth=176100M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=write-bandwidth,bandwidth=30000M \
-drive file="${DISK_IMG}",format=qcow2 \
-device pci-bridge,chassis_nr=3,id=pci.3,bus=pcie.0,addr=0xd \
-netdev tap,id=vm-sk-tap22,ifname=tap22,script=/usr/local/etc/qemu-ifup,downscript=no \
-device virtio-net-pci,netdev=vm-sk-tap22,id=net0,mac=02:11:17:01:7e:33,bus=pci.3,addr=0x1,bootindex=3 \
-serial mon:stdio

>
> The only thing I'm surprised by is that the numa node appears without
> requiring the driver to generate the NUMA node.  It's possible I missed
> a QEMU update that allows this.
>
> ~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-09 19:33               ` Hao Xiang
@ 2024-01-09 19:57                 ` Gregory Price
  2024-01-09 21:27                   ` Hao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2024-01-09 19:57 UTC (permalink / raw)
  To: Hao Xiang
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Tue, Jan 09, 2024 at 11:33:04AM -0800, Hao Xiang wrote:
> On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote:
> 
> Sounds like the technical details are explained on the other thread.
> From what I understand now, if we don't go through a complex CXL
> setup, it wouldn't go through the emulation path.
> 
> Here is our exact setup. Guest runs Linux kernel 6.6rc2
> 
> taskset --cpu-list 0-47,96-143 \
> numactl -N 0 -m 0 ${QEMU} \
> -M q35,cxl=on,hmat=on \
> -m 64G \
> -smp 8,sockets=1,cores=8,threads=1 \
> -object memory-backend-ram,id=ram0,size=45G \
> -numa node,memdev=ram0,cpus=0-7,nodeid=0 \
> -msg timestamp=on -L /usr/share/seabios \
> -enable-kvm \
> -object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind
> \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> -numa node,memdev=vmem0,nodeid=1 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k

:] you did what i thought you did

-numa node,memdev=vmem0,nodeid=1

"""
Another possibility: You mapped this memory-backend into another numa
node explicitly and never onlined the memory via cxl-cli.  I've done
this, and it works, but it's a "hidden feature" that probably should
not exist / be supported.
"""

You're mapping vmem0 into an explicit numa node *and* into the type3
device.  You don't need to do both - and technically this shouldn't be
allowed.

With this configuration, you can go through the cxl-cli setup process
for the CXL device, and you'll find that you can create *another* node
(node 2 in this case) that maps to the same memory you mapped to node 1.


You can drop the cxl device objects here and the memory will still
come up the way you want it to.

If you drop this line:

-numa node,memdev=vmem0,nodeid=1

You have to use the CXL driver to instantiate the dax device and the
numa node, and at *that* point you will see the read/write functions
being called.
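
A minimal sketch of that bring-up from inside the guest (assuming the
ndctl/daxctl tools are installed; the decoder/memdev/dax names below are
illustrative and will match whatever your system enumerates):

  cxl create-region -m -d decoder0.0 -w 1 mem0        # decoder0.0 / mem0 are illustrative names
  daxctl reconfigure-device --mode=system-ram dax0.0  # onlines the resulting dax device as a NUMA node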

~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-09 19:57                 ` Gregory Price
@ 2024-01-09 21:27                   ` Hao Xiang
  2024-01-09 22:13                     ` Gregory Price
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Xiang @ 2024-01-09 21:27 UTC (permalink / raw)
  To: Gregory Price
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Ben Widawsky, Gregory Price, Fan Ni, Ira Weiny,
	Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
<gregory.price@memverge.com> wrote:
>
> On Tue, Jan 09, 2024 at 11:33:04AM -0800, Hao Xiang wrote:
> > On Mon, Jan 8, 2024 at 5:13 PM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > Sounds like the technical details are explained on the other thread.
> > From what I understand now, if we don't go through a complex CXL
> > setup, it wouldn't go through the emulation path.
> >
> > Here is our exact setup. Guest runs Linux kernel 6.6rc2
> >
> > taskset --cpu-list 0-47,96-143 \
> > numactl -N 0 -m 0 ${QEMU} \
> > -M q35,cxl=on,hmat=on \
> > -m 64G \
> > -smp 8,sockets=1,cores=8,threads=1 \
> > -object memory-backend-ram,id=ram0,size=45G \
> > -numa node,memdev=ram0,cpus=0-7,nodeid=0 \
> > -msg timestamp=on -L /usr/share/seabios \
> > -enable-kvm \
> > -object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind
> > \
> > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> > -numa node,memdev=vmem0,nodeid=1 \
> > -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k
>
> :] you did what i thought you did
>
> -numa node,memdev=vmem0,nodeid=1
>
> """
> Another possiblity: You mapped this memory-backend into another numa
> node explicitly and never onlined the memory via cxlcli.  I've done
> this, and it works, but it's a "hidden feature" that probably should
> not exist / be supported.
> """
>
> You're mapping vmem0 into an explicit numa node *and* into the type3
> device.  You don't need to do both - and technically this shouldn't be
> allowed.
>
> With this configuration, you can go thorugh the cxl-cli setup process
> for the CXL device, you'll find that you can create *another* node
> (node 2 in this case) that maps to the same memory you mapped to node1..
>
>
> You can drop the cxl devices objects in here and the memory will still
> come up the way you want it to.
>
> If you drop this line:
>
> -numa node,memdev=vmem0,nodeid=1

We tried this as well and it works after going through the cxlcli
process and created the devdax device. The problem is that without the
"nodeid=1" configuration, we cannot connect with the explicit per numa
node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the
lb information to VM's hmat.

From what I understand so far, the guest kernel will dynamically
create a numa node after a cxl devdax device is created. That means we
don't know the numa node until after VM boot. 2. QEMU can only
statically parse the lb information to the VM at boot time. How do we
connect these two things?

Assuming that the same issue applies to a physical server with CXL.
Were you able to see a host kernel getting the correct lb information
for a CXL devdax device?

>
> You have to use the CXL driver to instantiate the dax device and the
> numa node, and at *that* point you will see the read/write functions
> being called.
>
> ~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-09 21:27                   ` Hao Xiang
@ 2024-01-09 22:13                     ` Gregory Price
  2024-01-09 23:55                       ` Hao Xiang
  0 siblings, 1 reply; 19+ messages in thread
From: Gregory Price @ 2024-01-09 22:13 UTC (permalink / raw)
  To: Hao Xiang
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Gregory Price, Fan Ni, Ira Weiny, Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote:
> On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
> <gregory.price@memverge.com> wrote:
> >
> > If you drop this line:
> >
> > -numa node,memdev=vmem0,nodeid=1
> 
> We tried this as well and it works after going through the cxlcli
> process and created the devdax device. The problem is that without the
> "nodeid=1" configuration, we cannot connect with the explicit per numa
> node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
> code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the
> lb information to VM's hmat.
>

Yeah, this is what Jonathan was saying - right now there isn't a good
way (in QEMU) to pass the hmat/cdat stuff down through the device.
Needs to be plumbed out.

In the meantime: You should just straight up drop the cxl device from
your QEMU config.  It doesn't actually get you anything.

> From what I understand so far, the guest kernel will dynamically
> create a numa node after a cxl devdax device is created. That means we
> don't know the numa node until after VM boot. 2. QEMU can only
> statically parse the lb information to the VM at boot time. How do we
> connect these two things?

During boot, the kernel discovers all the memory regions exposed by the
BIOS. In this QEMU configuration you have defined:

region 0: CPU + DRAM node
region 1: DRAM-only node
region 2: CXL Fixed Memory Window (the last line of the cxl stuff)

The kernel reads this information on boot and reserves one numa node for
each of these regions.

The kernel then automatically brings up regions 0 and 1 as nodes 0 and 1
respectively.

Node 2 sits dormant until you go through the cxl-cli startup sequence.
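
You can sanity-check this from inside the guest (assuming numactl and
the cxl tool are installed in the guest image):

  numactl -H   # before bring-up, only nodes 0 and 1 are online
  cxl list -M  # lists the memdev backing the CXL fixed memory window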


What you're asking for is for the QEMU team to plumb hmat/cdat
information down through the type3 device.  I *think* that is presently
possible with a custom CDAT file - but Jonathan probably has more
details on that.  You'll have to go digging for answers on that one.


Now - even if you did that - the current state of the cxl-type3 device
is *not what you want* because your memory accesses will be routed
through the read/write functions in the emulated device.

What Jonathan and I discussed on the other thread is how you might go
about slimming this down to allow pass-through of the memory without the
need for all the fluff.  This is a non-trivial refactor of the existing
device, so I would not expect that any time soon.

At the end of the day, the quickest way to get there from here is to just
drop the cxl-related lines from your QEMU config and keep everything
else.
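
Roughly - a sketch based on your command line above, keeping the backend,
the node and the hmat-lb entries while dropping cxl=on plus the
pxb-cxl / cxl-rp / cxl-type3 / cxl-fmw lines:

  -M q35,hmat=on \
  -object memory-backend-ram,id=vmem0,size=19G,host-nodes=${HOST_CXL_NODE},policy=bind \
  -numa node,memdev=vmem0,nodeid=1 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=read-latency,latency=100 \
  ... (plus the matching write-latency / read-bandwidth / write-bandwidth entries)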

> 
> Assuming that the same issue applies to a physical server with CXL.
> Were you able to see a host kernel getting the correct lb information
> for a CXL devdax device?
> 

Yes, if you bring up a CXL device via cxl-cli on real hardware, the
subsequent numa node ends up in the "lower tier" of the memory-tiering
infrastructure.

~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-09 22:13                     ` Gregory Price
@ 2024-01-09 23:55                       ` Hao Xiang
  2024-01-10 14:31                           ` Jonathan Cameron via
  0 siblings, 1 reply; 19+ messages in thread
From: Hao Xiang @ 2024-01-09 23:55 UTC (permalink / raw)
  To: Gregory Price
  Cc: Ho-Ren (Jack) Chuang, Michael S. Tsirkin, Jonathan Cameron,
	Gregory Price, Fan Ni, Ira Weiny, Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Tue, Jan 9, 2024 at 2:13 PM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote:
> > On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
> > <gregory.price@memverge.com> wrote:
> > >
> > > If you drop this line:
> > >
> > > -numa node,memdev=vmem0,nodeid=1
> >
> > We tried this as well and it works after going through the cxlcli
> > process and created the devdax device. The problem is that without the
> > "nodeid=1" configuration, we cannot connect with the explicit per numa
> > node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
> > code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the
> > lb information to VM's hmat.
> >
>
> Yeah, this is what Jonathan was saying - right now there isn't a good
> way (in QEMU) to pass the hmat/cdat stuff down through the device.
> Needs to be plumbed out.
>
> In the meantime: You should just straight up drop the cxl device from
> your QEMU config.  It doesn't actually get you anything.
>
> > From what I understand so far, the guest kernel will dynamically
> > create a numa node after a cxl devdax device is created. That means we
> > don't know the numa node until after VM boot. 2. QEMU can only
> > statically parse the lb information to the VM at boot time. How do we
> > connect these two things?
>
> during boot, the kernel discovers all the memory regions exposed to
> bios. In this qemu configuration you have defined:
>
> region 0: CPU + DRAM node
> region 1: DRAM only node
> region 2: CXL Fixed Memory Window (the last line of the cxl stuff)
>
> The kernel reads this information on boot and reserves 1 numa node for
> each of these regions.
>
> The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1
> respectively.
>
> Node2 sits dormant until you go through the cxl-cli startup sequence.
>
>
> What you're asking for is for the QEMU team to plumb hmat/cdat
> information down through the type3 device.  I *think* that is presently
> possible with a custom CDAT file - but Jonathan probably has more
> details on that.  You'll have to go digging for answers on that one.

I think this is exactly what I was looking for. When we started with
the idea of having an explicit CXL memory backend, we wanted to
1) bind a virtual CXL device to an actual CXL memory node on the host, and
2) pass the latency/bandwidth information from the CXL backend into
the virtual CXL device.
I didn't have a concrete idea of how to do 2).
With the discussion here, I learned that the information is passed
via CDAT. I just looked into the virtual CXL code and found that
ct3_build_cdat_entries_for_mr() is the function that builds this
information. But the latency and bandwidth there are currently
hard-coded. I think it makes sense to have an explicit CXL memory
backend where QEMU can query the CXL memory attributes from the host
and pass that information from the CXL backend into the virtual CXL
type-3 device.
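
As a rough sketch of where those host-side numbers could come from
(assuming the host kernel exposes its HMAT data through the sysfs node
access class; the paths below are illustrative):

  # per-node performance as seen from the nearest initiator, read on the host
  cat /sys/devices/system/node/node${HOST_CXL_NODE}/access0/initiators/read_latency
  cat /sys/devices/system/node/node${HOST_CXL_NODE}/access0/initiators/read_bandwidth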

>
>
> Now - even if you did that - the current state of the cxl-type3 device
> is *not what you want* because your memory accesses will be routed
> through the read/write functions in the emulated device.
>
> What Jonathan and I discussed on the other thread is how you might go
> about slimming this down to allow pass-through of the memory without the
> need for all the fluff.  This is a non-trivial refactor of the existing
> device, so i would not expect that any time soon.
>
> At the end of the day, quickest way to get-there-from-here is to just
> drop the cxl related lines from your QEMU config, and keep everything
> else.

Agreed. We need the kernel to be capable of reading the memory
attributes from HMAT and establishing the correct memory tier for
system-DRAM (on a CPU-less numa node). Currently, system-DRAM is always
assumed to be in the fast tier.

>
> >
> > Assuming that the same issue applies to a physical server with CXL.
> > Were you able to see a host kernel getting the correct lb information
> > for a CXL devdax device?
> >
>
> Yes, if you bring up a CXL device via cxl-cli on real hardware, the
> subsequent numa node ends up in the "lower tier" of the memory-tiering
> infrastructure.
>
> ~Gregory

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [External] Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-09 23:55                       ` Hao Xiang
@ 2024-01-10 14:31                           ` Jonathan Cameron via
  0 siblings, 0 replies; 19+ messages in thread
From: Jonathan Cameron @ 2024-01-10 14:31 UTC (permalink / raw)
  To: Hao Xiang
  Cc: Gregory Price, Ho-Ren (Jack) Chuang, Michael S. Tsirkin,
	Gregory Price, Fan Ni, Ira Weiny, Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Markus Armbruster,
	Paolo Bonzini, Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

On Tue, 9 Jan 2024 15:55:46 -0800
Hao Xiang <hao.xiang@bytedance.com> wrote:

> On Tue, Jan 9, 2024 at 2:13 PM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > On Tue, Jan 09, 2024 at 01:27:28PM -0800, Hao Xiang wrote:  
> > > On Tue, Jan 9, 2024 at 11:58 AM Gregory Price
> > > <gregory.price@memverge.com> wrote:  
> > > >
> > > > If you drop this line:
> > > >
> > > > -numa node,memdev=vmem0,nodeid=1  
> > >
> > > We tried this as well and it works after going through the cxlcli
> > > process and created the devdax device. The problem is that without the
> > > "nodeid=1" configuration, we cannot connect with the explicit per numa
> > > node latency/bandwidth configuration "-numa hmat-lb". I glanced at the
> > > code in hw/numa.c, parse_numa_hmat_lb() looks like the one passing the
> > > lb information to VM's hmat.
> > >  
> >
> > Yeah, this is what Jonathan was saying - right now there isn't a good
> > way (in QEMU) to pass the hmat/cdat stuff down through the device.
> > Needs to be plumbed out.
> >
> > In the meantime: You should just straight up drop the cxl device from
> > your QEMU config.  It doesn't actually get you anything.
> >  
> > > From what I understand so far, the guest kernel will dynamically
> > > create a numa node after a cxl devdax device is created. That means we
> > > don't know the numa node until after VM boot. 2. QEMU can only
> > > statically parse the lb information to the VM at boot time. How do we
> > > connect these two things?  
> >
> > during boot, the kernel discovers all the memory regions exposed to
> > bios. In this qemu configuration you have defined:
> >
> > region 0: CPU + DRAM node
> > region 1: DRAM only node
> > region 2: CXL Fixed Memory Window (the last line of the cxl stuff)
> >
> > The kernel reads this information on boot and reserves 1 numa node for
> > each of these regions.
> >
> > The kernel then automatically brings up regions 0 and 1 in nodes 0 and 1
> > respectively.
> >
> > Node2 sits dormant until you go through the cxl-cli startup sequence.
> >
> >
> > What you're asking for is for the QEMU team to plumb hmat/cdat
> > information down through the type3 device.  I *think* that is presently
> > possible with a custom CDAT file - but Jonathan probably has more
> > details on that.  You'll have to go digging for answers on that one.  
> 
> I think this is exactly what I was looking for. When we started with
> the idea of having an explicit CXL memory backend, we wanted to
> 1) Bind a virtual CXL device to an actual CXL memory node on host.
> 2) Pass the latency/bandwidth information from the CXL backend into
> the virtual CXL device.
> I didn't have a concrete idea of how to do 2)
> With the discussion here, I learned that the information is passed
> from CDAT. Just looked into the virtual CXL code and found that
> ct3_build_cdat_entries_for_mr() is the function that builds this
> information. But the latency and bandwidth there are currently
> hard-coded. I think it makes sense to have an explicit CXL memory
> backend where QEMU can query the CXL memory attributes from the host
> and pass that information from the CXL backend into the virtual CXL
> type-3 device.

There is probably an argument for a memory backend to be able to take
perf numbers in general (I don't see this as being CXL specific) or for
us adding more parameters to the cxl device entry, but for now you can
inject a cdat file that presents whatever you like.  
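
If I remember the property name correctly, the type3 device takes a
cdat=<file> argument, so something like (custom.cdat standing in for a
CDAT table you generate with whatever latency/bandwidth values you want):

  -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0,cdat=custom.cdat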

What we are missing though is generic port creation, so even with
everything else in place it won't quite work yet. There was a hacky
patch for generic ports, but it's not upstream yet (or in my tree).

Usefully, there is work under review for adding generic initiators to
QEMU, most of which we can repurpose for GPs (generic ports).

Jonathan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram'
  2024-01-01  7:53 ` [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram' Ho-Ren (Jack) Chuang
                     ` (2 preceding siblings ...)
  2024-01-03 21:56   ` Gregory Price
@ 2024-01-12 15:32   ` Markus Armbruster
  3 siblings, 0 replies; 19+ messages in thread
From: Markus Armbruster @ 2024-01-12 15:32 UTC (permalink / raw)
  To: Ho-Ren (Jack) Chuang
  Cc: Michael S. Tsirkin, Hao Xiang, Jonathan Cameron, Ben Widawsky,
	Gregory Price, Fan Ni, Ira Weiny, Philippe Mathieu-Daudé,
	David Hildenbrand, Igor Mammedov, Eric Blake, Paolo Bonzini,
	Daniel P. Berrangé,
	Eduardo Habkost, qemu-devel, Ho-Ren (Jack) Chuang, linux-cxl

QAPI schema sanity review only.

"Ho-Ren (Jack) Chuang" <horenchuang@bytedance.com> writes:

> Introduce a new configuration option 'host-mem-type=' in the
> '-object memory-backend-ram', allowing users to specify
> from which type of memory to allocate.
>
> Users can specify 'cxlram' as an argument, and QEMU will then
> automatically locate CXL RAM NUMA nodes and use them as the backend memory.
> For example:
> 	-object memory-backend-ram,id=vmem0,size=19G,host-mem-type=cxlram \
> 	-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> 	-device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> 	-device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
> 	-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=19G,cxl-fmw.0.interleave-granularity=8k \
>
> In v1, we plan to move most of the implementations to util and break down
> this patch into different smaller patches.
>
> Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
> Signed-off-by: Hao Xiang <hao.xiang@bytedance.com>

[...]

> diff --git a/qapi/common.json b/qapi/common.json
> index 6fed9cde1a..591fd73291 100644
> --- a/qapi/common.json
> +++ b/qapi/common.json
> @@ -167,6 +167,25 @@
>  { 'enum': 'HostMemPolicy',
>    'data': [ 'default', 'preferred', 'bind', 'interleave' ] }
>  
> +##
> +# @HostMemType:
> +#
> +# Automatically find a backend memory type on host.

Types don't do, they are.  Suggest something like

   # Host memory types

> +# Can be further extened to support other types such as cxlpmem, hbm.

Doc comments are reference documentation for users of QMP.  This
sentence doesn't fit there.

> +#
> +# @none: do nothing (default).

Again, types don't do, they are.

(default) makes no sense here.  Types don't have defaults, optional
arguments do.  In this case, @host-mem-type below.

What kind of memory does this setting request?

> +#
> +# @cxlram: a CXL RAM backend on host.

Suggest spelling it cxl-ram.

> +#
> +# Note: HostMemType and HostMemPolicy/host-nodes cannot be set at the same
> +# time. HostMemType is used to automatically bind with one kind of
> +# host memory types.

I feel this note doesn't belong here.  Add it to the users of
HostMemType and HostMemPolicy instead.

> +#
> +# Since: 8.3
> +##
> +{ 'enum': 'HostMemType',
> +  'data': [ 'none', 'cxlram' ] }
> +

I guess you're adding this to common.json and not qom.json so it's next
to HostMemPolicy.  Okay.

>  ##
>  # @NetFilterDirection:
>  #
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 95516ba325..fa3bc29708 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -626,6 +626,7 @@
>              '*host-nodes': ['uint16'],
>              '*merge': 'bool',
>              '*policy': 'HostMemPolicy',
> +            '*host-mem-type': 'HostMemType',
>              '*prealloc': 'bool',
>              '*prealloc-threads': 'uint32',
>              '*prealloc-context': 'str',

Missing: doc comment update.

[...]


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-01-12 15:32 UTC | newest]

Thread overview: 19+ messages
2024-01-01  7:53 [QEMU-devel][RFC PATCH 0/1] Introduce HostMemType for 'memory-backend-*' Ho-Ren (Jack) Chuang
2024-01-01  7:53 ` [QEMU-devel][RFC PATCH 1/1] backends/hostmem: qapi/qom: Add an ObjectOption for memory-backend-* called HostMemType and its arg 'cxlram' Ho-Ren (Jack) Chuang
2024-01-02 10:29   ` Philippe Mathieu-Daudé
2024-01-02 13:03   ` David Hildenbrand
2024-01-06  0:45     ` [External] " Hao Xiang
2024-01-03 21:56   ` Gregory Price
2024-01-06  5:59     ` [External] " Hao Xiang
2024-01-08 17:15       ` Gregory Price
2024-01-08 22:47         ` Hao Xiang
2024-01-09  1:05           ` Hao Xiang
2024-01-09  1:13             ` Gregory Price
2024-01-09 19:33               ` Hao Xiang
2024-01-09 19:57                 ` Gregory Price
2024-01-09 21:27                   ` Hao Xiang
2024-01-09 22:13                     ` Gregory Price
2024-01-09 23:55                       ` Hao Xiang
2024-01-10 14:31                         ` Jonathan Cameron
2024-01-12 15:32   ` Markus Armbruster
