[PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT
@ 2023-07-21  1:29 Huang Ying
  2023-07-21  1:29 ` [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management Huang Ying
                   ` (4 more replies)
  0 siblings, 5 replies; 41+ messages in thread
From: Huang Ying @ 2023-07-21  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-cxl, nvdimm, linux-acpi,
	Huang Ying, Aneesh Kumar K . V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Jonathan Cameron, Michal Hocko, Yang Shi, Rafael J Wysocki

We have the explicit memory tiers framework to manage systems with
multiple types of memory, e.g., DRAM in DIMM slots and CXL memory
devices.  Where, same kind of memory devices will be grouped into
memory types, then put into memory tiers.  To describe the performance
of a memory type, abstract distance is defined.  Which is in direct
proportion to the memory latency and inversely proportional to the
memory bandwidth.  To keep the code as simple as possible, fixed
abstract distance is used in dax/kmem to describe slow memory such as
Optane DCPMM.

To support more memory types, in this series, we added the abstract
distance calculation algorithm management mechanism, provided a
algorithm implementation based on ACPI HMAT, and used the general
abstract distance calculation interface in dax/kmem driver.  So,
dax/kmem can support HBM (high bandwidth memory) in addition to the
original Optane DCPMM.

Changelog:

V1 (from RFC):

- Added some comments per Aneesh's comments, Thanks!

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-21  1:29 [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Huang Ying
@ 2023-07-21  1:29 ` Huang Ying
  2023-07-25  2:13   ` Alistair Popple
  2023-07-21  1:29 ` [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators() Huang Ying
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 41+ messages in thread
From: Huang Ying @ 2023-07-21  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-cxl, nvdimm, linux-acpi,
	Huang Ying, Aneesh Kumar K . V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Jonathan Cameron, Michal Hocko, Yang Shi, Rafael J Wysocki

The abstract distance may be calculated by various drivers, such as
ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
users and the providers, the abstract distance calculation algorithms
management mechanism is implemented in this patch.  It provides
interface for the providers to register the implementation, and
interface for the users.

Multiple algorithm implementations can cooperate via calculating
abstract distance for different memory nodes.  The preference of
algorithm implementations can be specified via
priority (notifier_block.priority).

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/memory-tiers.h | 19 ++++++++++++
 mm/memory-tiers.c            | 59 ++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index fc9647b1b4f9..c6429e624244 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -6,6 +6,7 @@
 #include <linux/nodemask.h>
 #include <linux/kref.h>
 #include <linux/mmzone.h>
+#include <linux/notifier.h>
 /*
  * Each tier cover a abstrace distance chunk size of 128
  */
@@ -36,6 +37,9 @@ struct memory_dev_type *alloc_memory_type(int adistance);
 void destroy_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
+int register_mt_adistance_algorithm(struct notifier_block *nb);
+int unregister_mt_adistance_algorithm(struct notifier_block *nb);
+int mt_calc_adistance(int node, int *adist);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -97,5 +101,20 @@ static inline bool node_is_toptier(int node)
 {
 	return true;
 }
+
+static inline int register_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int unregister_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int mt_calc_adistance(int node, int *adist)
+{
+	return NOTIFY_DONE;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index a516e303e304..1e55fbe2ad51 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@
 #include <linux/kobject.h>
 #include <linux/memory.h>
 #include <linux/memory-tiers.h>
+#include <linux/notifier.h>
 
 #include "internal.h"
 
@@ -105,6 +106,8 @@ static int top_tier_adistance;
 static struct demotion_nodes *node_demotion __read_mostly;
 #endif /* CONFIG_MIGRATION */
 
+static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
+
 static inline struct memory_tier *to_memory_tier(struct device *device)
 {
 	return container_of(device, struct memory_tier, dev);
@@ -592,6 +595,62 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+/**
+ * register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm
+ * @nb: The notifier block which describe the algorithm
+ *
+ * Return: 0 on success, errno on error.
+ *
+ * Every memory tiering abstract distance algorithm provider needs to
+ * register the algorithm with register_mt_adistance_algorithm().  To
+ * calculate the abstract distance for a specified memory node, the
+ * notifier function will be called unless some high priority
+ * algorithm has provided result.  The prototype of the notifier
+ * function is as follows,
+ *
+ *   int (*algorithm_notifier)(struct notifier_block *nb,
+ *                             unsigned long nid, void *data);
+ *
+ * Where "nid" specifies the memory node, "data" is the pointer to the
+ * returned abstract distance (that is, "int *adist").  If the
+ * algorithm provides the result, NOTIFY_STOP should be returned.
+ * Otherwise, return_value & %NOTIFY_STOP_MASK == 0 to allow the next
+ * algorithm in the chain to provide the result.
+ */
+int register_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&mt_adistance_algorithms, nb);
+}
+EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm);
+
+/**
+ * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract distance algorithm
+ * @nb: the notifier block which describe the algorithm
+ *
+ * Return: 0 on success, errno on error.
+ */
+int unregister_mt_adistance_algorithm(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm);
+
+/**
+ * mt_calc_adistance() - Calculate abstract distance with registered algorithms
+ * @node: the node to calculate abstract distance for
+ * @adist: the returned abstract distance
+ *
+ * Return: if return_value & %NOTIFY_STOP_MASK != 0, then some
+ * abstract distance algorithm provides the result, and return it via
+ * @adist.  Otherwise, no algorithm can provide the result and @adist
+ * will be kept as it is.
+ */
+int mt_calc_adistance(int node, int *adist)
+{
+	return blocking_notifier_call_chain(&mt_adistance_algorithms, node, adist);
+}
+EXPORT_SYMBOL_GPL(mt_calc_adistance);
+
 static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 					      unsigned long action, void *_arg)
 {
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators()
  2023-07-21  1:29 [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Huang Ying
  2023-07-21  1:29 ` [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management Huang Ying
@ 2023-07-21  1:29 ` Huang Ying
  2023-07-25  2:44   ` Alistair Popple
  2023-08-07 16:55   ` Jonathan Cameron
  2023-07-21  1:29 ` [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT Huang Ying
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 41+ messages in thread
From: Huang Ying @ 2023-07-21  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-cxl, nvdimm, linux-acpi,
	Huang Ying, Aneesh Kumar K . V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Jonathan Cameron, Michal Hocko, Yang Shi, Rafael J Wysocki

Previously, in hmat_register_target_initiators(), the performance
attributes are calculated and the corresponding sysfs links and files
are created too.  Which is called during memory onlining.

But now, to calculate the abstract distance of a memory target before
memory onlining, we need to calculate the performance attributes for
a memory target without creating sysfs links and files.

To do that, hmat_register_target_initiators() is refactored to make it
possible to calculate performance attributes separately.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/acpi/numa/hmat.c | 81 +++++++++++++++-------------------------
 1 file changed, 30 insertions(+), 51 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index bba268ecd802..2dee0098f1a9 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -582,28 +582,25 @@ static int initiators_to_nodemask(unsigned long *p_nodes)
 	return 0;
 }
 
-static void hmat_register_target_initiators(struct memory_target *target)
+static void hmat_update_target_attrs(struct memory_target *target,
+				     unsigned long *p_nodes, int access)
 {
-	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
 	struct memory_initiator *initiator;
-	unsigned int mem_nid, cpu_nid;
+	unsigned int cpu_nid;
 	struct memory_locality *loc = NULL;
 	u32 best = 0;
-	bool access0done = false;
 	int i;
 
-	mem_nid = pxm_to_node(target->memory_pxm);
+	bitmap_zero(p_nodes, MAX_NUMNODES);
 	/*
-	 * If the Address Range Structure provides a local processor pxm, link
+	 * If the Address Range Structure provides a local processor pxm, set
 	 * only that one. Otherwise, find the best performance attributes and
-	 * register all initiators that match.
+	 * collect all initiators that match.
 	 */
 	if (target->processor_pxm != PXM_INVAL) {
 		cpu_nid = pxm_to_node(target->processor_pxm);
-		register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
-		access0done = true;
-		if (node_state(cpu_nid, N_CPU)) {
-			register_memory_node_under_compute_node(mem_nid, cpu_nid, 1);
+		if (access == 0 || node_state(cpu_nid, N_CPU)) {
+			set_bit(target->processor_pxm, p_nodes);
 			return;
 		}
 	}
@@ -617,47 +614,10 @@ static void hmat_register_target_initiators(struct memory_target *target)
 	 * We'll also use the sorting to prime the candidate nodes with known
 	 * initiators.
 	 */
-	bitmap_zero(p_nodes, MAX_NUMNODES);
 	list_sort(NULL, &initiators, initiator_cmp);
 	if (initiators_to_nodemask(p_nodes) < 0)
 		return;
 
-	if (!access0done) {
-		for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
-			loc = localities_types[i];
-			if (!loc)
-				continue;
-
-			best = 0;
-			list_for_each_entry(initiator, &initiators, node) {
-				u32 value;
-
-				if (!test_bit(initiator->processor_pxm, p_nodes))
-					continue;
-
-				value = hmat_initiator_perf(target, initiator,
-							    loc->hmat_loc);
-				if (hmat_update_best(loc->hmat_loc->data_type, value, &best))
-					bitmap_clear(p_nodes, 0, initiator->processor_pxm);
-				if (value != best)
-					clear_bit(initiator->processor_pxm, p_nodes);
-			}
-			if (best)
-				hmat_update_target_access(target, loc->hmat_loc->data_type,
-							  best, 0);
-		}
-
-		for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
-			cpu_nid = pxm_to_node(i);
-			register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
-		}
-	}
-
-	/* Access 1 ignores Generic Initiators */
-	bitmap_zero(p_nodes, MAX_NUMNODES);
-	if (initiators_to_nodemask(p_nodes) < 0)
-		return;
-
 	for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
 		loc = localities_types[i];
 		if (!loc)
@@ -667,7 +627,7 @@ static void hmat_register_target_initiators(struct memory_target *target)
 		list_for_each_entry(initiator, &initiators, node) {
 			u32 value;
 
-			if (!initiator->has_cpu) {
+			if (access == 1 && !initiator->has_cpu) {
 				clear_bit(initiator->processor_pxm, p_nodes);
 				continue;
 			}
@@ -681,14 +641,33 @@ static void hmat_register_target_initiators(struct memory_target *target)
 				clear_bit(initiator->processor_pxm, p_nodes);
 		}
 		if (best)
-			hmat_update_target_access(target, loc->hmat_loc->data_type, best, 1);
+			hmat_update_target_access(target, loc->hmat_loc->data_type, best, access);
 	}
+}
+
+static void __hmat_register_target_initiators(struct memory_target *target,
+					      unsigned long *p_nodes,
+					      int access)
+{
+	unsigned int mem_nid, cpu_nid;
+	int i;
+
+	mem_nid = pxm_to_node(target->memory_pxm);
+	hmat_update_target_attrs(target, p_nodes, access);
 	for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
 		cpu_nid = pxm_to_node(i);
-		register_memory_node_under_compute_node(mem_nid, cpu_nid, 1);
+		register_memory_node_under_compute_node(mem_nid, cpu_nid, access);
 	}
 }
 
+static void hmat_register_target_initiators(struct memory_target *target)
+{
+	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
+
+	__hmat_register_target_initiators(target, p_nodes, 0);
+	__hmat_register_target_initiators(target, p_nodes, 1);
+}
+
 static void hmat_register_target_cache(struct memory_target *target)
 {
 	unsigned mem_nid = pxm_to_node(target->memory_pxm);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT
  2023-07-21  1:29 [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Huang Ying
  2023-07-21  1:29 ` [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management Huang Ying
  2023-07-21  1:29 ` [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators() Huang Ying
@ 2023-07-21  1:29 ` Huang Ying
  2023-07-25  2:45   ` Alistair Popple
  2023-07-21  1:29 ` [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface Huang Ying
  2023-07-21  4:15 ` [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Alistair Popple
  4 siblings, 1 reply; 41+ messages in thread
From: Huang Ying @ 2023-07-21  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-cxl, nvdimm, linux-acpi,
	Huang Ying, Aneesh Kumar K . V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Jonathan Cameron, Michal Hocko, Yang Shi, Rafael J Wysocki

A memory tiering abstract distance calculation algorithm based on ACPI
HMAT is implemented.  The basic idea is as follows.

The performance attributes of system default DRAM nodes are recorded
as the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.
Then, the ratio of the abstract distance of a memory node (target) to
MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
attributes of the node to that of the default DRAM nodes.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/acpi/numa/hmat.c     | 138 ++++++++++++++++++++++++++++++++++-
 include/linux/memory-tiers.h |   2 +
 mm/memory-tiers.c            |   2 +-
 3 files changed, 140 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 2dee0098f1a9..306a912090f0 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -24,6 +24,7 @@
 #include <linux/node.h>
 #include <linux/sysfs.h>
 #include <linux/dax.h>
+#include <linux/memory-tiers.h>
 
 static u8 hmat_revision;
 static int hmat_disable __initdata;
@@ -759,6 +760,137 @@ static int hmat_callback(struct notifier_block *self,
 	return NOTIFY_OK;
 }
 
+static int hmat_adistance_disabled;
+static struct node_hmem_attrs default_dram_attrs;
+
+static void dump_hmem_attrs(struct node_hmem_attrs *attrs)
+{
+	pr_cont("read_latency: %u, write_latency: %u, read_bandwidth: %u, write_bandwidth: %u\n",
+		attrs->read_latency, attrs->write_latency,
+		attrs->read_bandwidth, attrs->write_bandwidth);
+}
+
+static void disable_hmat_adistance_algorithm(void)
+{
+	hmat_adistance_disabled = true;
+}
+
+static int hmat_init_default_dram_attrs(void)
+{
+	struct memory_target *target;
+	struct node_hmem_attrs *attrs;
+	int nid, pxm;
+	int nid_dram = NUMA_NO_NODE;
+
+	if (default_dram_attrs.read_latency +
+	    default_dram_attrs.write_latency != 0)
+		return 0;
+
+	if (!default_dram_type)
+		return -EIO;
+
+	for_each_node_mask(nid, default_dram_type->nodes) {
+		pxm = node_to_pxm(nid);
+		target = find_mem_target(pxm);
+		if (!target)
+			continue;
+		attrs = &target->hmem_attrs[1];
+		if (nid_dram == NUMA_NO_NODE) {
+			if (attrs->read_latency + attrs->write_latency == 0 ||
+			    attrs->read_bandwidth + attrs->write_bandwidth == 0) {
+				pr_info("hmat: invalid hmem attrs for default DRAM node: %d,\n",
+					nid);
+				pr_info("  ");
+				dump_hmem_attrs(attrs);
+				pr_info("  disable hmat based abstract distance algorithm.\n");
+				disable_hmat_adistance_algorithm();
+				return -EIO;
+			}
+			nid_dram = nid;
+			default_dram_attrs = *attrs;
+			continue;
+		}
+
+		/*
+		 * The performance of all default DRAM nodes is expected
+		 * to be same (that is, the variation is less than 10%).
+		 * And it will be used as base to calculate the abstract
+		 * distance of other memory nodes.
+		 */
+		if (abs(attrs->read_latency - default_dram_attrs.read_latency) * 10 >
+		    default_dram_attrs.read_latency ||
+		    abs(attrs->write_latency - default_dram_attrs.write_latency) * 10 >
+		    default_dram_attrs.write_latency ||
+		    abs(attrs->read_bandwidth - default_dram_attrs.read_bandwidth) * 10 >
+		    default_dram_attrs.read_bandwidth) {
+			pr_info("hmat: hmem attrs for DRAM nodes mismatch.\n");
+			pr_info("  node %d:", nid_dram);
+			dump_hmem_attrs(&default_dram_attrs);
+			pr_info("  node %d:", nid);
+			dump_hmem_attrs(attrs);
+			pr_info("  disable hmat based abstract distance algorithm.\n");
+			disable_hmat_adistance_algorithm();
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
+static int hmat_calculate_adistance(struct notifier_block *self,
+				    unsigned long nid, void *data)
+{
+	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
+	struct memory_target *target;
+	struct node_hmem_attrs *attrs;
+	int *adist = data;
+	int pxm;
+
+	if (hmat_adistance_disabled)
+		return NOTIFY_OK;
+
+	pxm = node_to_pxm(nid);
+	target = find_mem_target(pxm);
+	if (!target)
+		return NOTIFY_OK;
+
+	if (hmat_init_default_dram_attrs())
+		return NOTIFY_OK;
+
+	mutex_lock(&target_lock);
+	hmat_update_target_attrs(target, p_nodes, 1);
+	mutex_unlock(&target_lock);
+
+	attrs = &target->hmem_attrs[1];
+
+	if (attrs->read_latency + attrs->write_latency == 0 ||
+	    attrs->read_bandwidth + attrs->write_bandwidth == 0)
+		return NOTIFY_OK;
+
+	/*
+	 * The abstract distance of a memory node is in direct
+	 * proportion to its memory latency (read + write) and
+	 * inversely proportional to its memory bandwidth (read +
+	 * write).  The abstract distance, memory latency, and memory
+	 * bandwidth of the default DRAM nodes are used as the base.
+	 */
+	*adist = MEMTIER_ADISTANCE_DRAM *
+		(attrs->read_latency + attrs->write_latency) /
+		(default_dram_attrs.read_latency +
+		 default_dram_attrs.write_latency) *
+		(default_dram_attrs.read_bandwidth +
+		 default_dram_attrs.write_bandwidth) /
+		(attrs->read_bandwidth + attrs->write_bandwidth);
+
+	return NOTIFY_STOP;
+}
+
+static __meminitdata struct notifier_block hmat_adist_nb =
+{
+	.notifier_call = hmat_calculate_adistance,
+	.priority = 100,
+};
+
 static __init void hmat_free_structures(void)
 {
 	struct memory_target *target, *tnext;
@@ -801,6 +933,7 @@ static __init int hmat_init(void)
 	struct acpi_table_header *tbl;
 	enum acpi_hmat_type i;
 	acpi_status status;
+	int usage;
 
 	if (srat_disabled() || hmat_disable)
 		return 0;
@@ -841,8 +974,11 @@ static __init int hmat_init(void)
 	hmat_register_targets();
 
 	/* Keep the table and structures if the notifier may use them */
-	if (!hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
+	usage = !hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI);
+	usage += !register_mt_adistance_algorithm(&hmat_adist_nb);
+	if (usage)
 		return 0;
+
 out_put:
 	hmat_free_structures();
 	acpi_put_table(tbl);
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index c6429e624244..9377239c8d34 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -33,6 +33,7 @@ struct memory_dev_type {
 
 #ifdef CONFIG_NUMA
 extern bool numa_demotion_enabled;
+extern struct memory_dev_type *default_dram_type;
 struct memory_dev_type *alloc_memory_type(int adistance);
 void destroy_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
@@ -64,6 +65,7 @@ static inline bool node_is_toptier(int node)
 #else
 
 #define numa_demotion_enabled	false
+#define default_dram_type	NULL
 /*
  * CONFIG_NUMA implementation returns non NULL error.
  */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 1e55fbe2ad51..9a734ef2edfb 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -37,7 +37,7 @@ struct node_memory_type_map {
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
-static struct memory_dev_type *default_dram_type;
+struct memory_dev_type *default_dram_type;
 
 static struct bus_type memory_tier_subsys = {
 	.name = "memory_tiering",
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface
  2023-07-21  1:29 [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Huang Ying
                   ` (2 preceding siblings ...)
  2023-07-21  1:29 ` [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT Huang Ying
@ 2023-07-21  1:29 ` Huang Ying
  2023-07-25  3:11   ` Alistair Popple
  2023-07-21  4:15 ` [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Alistair Popple
  4 siblings, 1 reply; 41+ messages in thread
From: Huang Ying @ 2023-07-21  1:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, linux-cxl, nvdimm, linux-acpi,
	Huang Ying, Aneesh Kumar K . V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Jonathan Cameron, Michal Hocko, Yang Shi, Rafael J Wysocki

Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
used for slow memory type in kmem driver.  This limits the usage of
kmem driver, for example, it cannot be used for HBM (high bandwidth
memory).

So, we use the general abstract distance calculation mechanism in kmem
drivers to get more accurate abstract distance on systems with proper
support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
fallback only.

Now, multiple memory types may be managed by kmem.  These memory types
are put into the "kmem_memory_types" list and protected by
kmem_memory_type_lock.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/dax/kmem.c           | 54 +++++++++++++++++++++++++++---------
 include/linux/memory-tiers.h |  2 ++
 mm/memory-tiers.c            |  2 +-
 3 files changed, 44 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 898ca9505754..837165037231 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -49,14 +49,40 @@ struct dax_kmem_data {
 	struct resource *res[];
 };
 
-static struct memory_dev_type *dax_slowmem_type;
+static DEFINE_MUTEX(kmem_memory_type_lock);
+static LIST_HEAD(kmem_memory_types);
+
+static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
+{
+	bool found = false;
+	struct memory_dev_type *mtype;
+
+	mutex_lock(&kmem_memory_type_lock);
+	list_for_each_entry(mtype, &kmem_memory_types, list) {
+		if (mtype->adistance == adist) {
+			found = true;
+			break;
+		}
+	}
+	if (!found) {
+		mtype = alloc_memory_type(adist);
+		if (!IS_ERR(mtype))
+			list_add(&mtype->list, &kmem_memory_types);
+	}
+	mutex_unlock(&kmem_memory_type_lock);
+
+	return mtype;
+}
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
 	unsigned long total_len = 0;
 	struct dax_kmem_data *data;
+	struct memory_dev_type *mtype;
 	int i, rc, mapped = 0;
 	int numa_node;
+	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
 
 	/*
 	 * Ensure good NUMA information for the persistent memory.
@@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		return -EINVAL;
 	}
 
+	mt_calc_adistance(numa_node, &adist);
+	mtype = kmem_find_alloc_memorty_type(adist);
+	if (IS_ERR(mtype))
+		return PTR_ERR(mtype);
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct range range;
 
@@ -88,7 +119,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		return -EINVAL;
 	}
 
-	init_node_memory_type(numa_node, dax_slowmem_type);
+	init_node_memory_type(numa_node, mtype);
 
 	rc = -ENOMEM;
 	data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
@@ -167,7 +198,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 err_res_name:
 	kfree(data);
 err_dax_kmem_data:
-	clear_node_memory_type(numa_node, dax_slowmem_type);
+	clear_node_memory_type(numa_node, mtype);
 	return rc;
 }
 
@@ -219,7 +250,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 		 * for that. This implies this reference will be around
 		 * till next reboot.
 		 */
-		clear_node_memory_type(node, dax_slowmem_type);
+		clear_node_memory_type(node, NULL);
 	}
 }
 #else
@@ -251,12 +282,6 @@ static int __init dax_kmem_init(void)
 	if (!kmem_name)
 		return -ENOMEM;
 
-	dax_slowmem_type = alloc_memory_type(MEMTIER_DEFAULT_DAX_ADISTANCE);
-	if (IS_ERR(dax_slowmem_type)) {
-		rc = PTR_ERR(dax_slowmem_type);
-		goto err_dax_slowmem_type;
-	}
-
 	rc = dax_driver_register(&device_dax_kmem_driver);
 	if (rc)
 		goto error_dax_driver;
@@ -264,18 +289,21 @@ static int __init dax_kmem_init(void)
 	return rc;
 
 error_dax_driver:
-	destroy_memory_type(dax_slowmem_type);
-err_dax_slowmem_type:
 	kfree_const(kmem_name);
 	return rc;
 }
 
 static void __exit dax_kmem_exit(void)
 {
+	struct memory_dev_type *mtype, *mtn;
+
 	dax_driver_unregister(&device_dax_kmem_driver);
 	if (!any_hotremove_failed)
 		kfree_const(kmem_name);
-	destroy_memory_type(dax_slowmem_type);
+	list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
+		list_del(&mtype->list);
+		destroy_memory_type(mtype);
+	}
 }
 
 MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 9377239c8d34..aca22220cb5c 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -24,6 +24,8 @@ struct memory_tier;
 struct memory_dev_type {
 	/* list of memory types that are part of same tier as this type */
 	struct list_head tier_sibiling;
+	/* list of memory types that are managed by one driver */
+	struct list_head list;
 	/* abstract distance for this specific memory type */
 	int adistance;
 	/* Nodes of same abstract distance */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 9a734ef2edfb..38005c60fa2d 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -581,7 +581,7 @@ EXPORT_SYMBOL_GPL(init_node_memory_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 {
 	mutex_lock(&memory_tier_lock);
-	if (node_memory_types[node].memtype == memtype)
+	if (node_memory_types[node].memtype == memtype || !memtype)
 		node_memory_types[node].map_count--;
 	/*
 	 * If we umapped all the attached devices to this node,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT
  2023-07-21  1:29 [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Huang Ying
                   ` (3 preceding siblings ...)
  2023-07-21  1:29 ` [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface Huang Ying
@ 2023-07-21  4:15 ` Alistair Popple
  2023-07-24 17:58   ` Andrew Morton
  4 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-07-21  4:15 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


Thanks for this Huang, I had been hoping to take a look at it this week
but have run out of time. I'm keen to do some testing with it as well.

Hopefully next week...

Huang Ying <ying.huang@intel.com> writes:

> We have the explicit memory tiers framework to manage systems with
> multiple types of memory, e.g., DRAM in DIMM slots and CXL memory
> devices.  Where, same kind of memory devices will be grouped into
> memory types, then put into memory tiers.  To describe the performance
> of a memory type, abstract distance is defined.  Which is in direct
> proportion to the memory latency and inversely proportional to the
> memory bandwidth.  To keep the code as simple as possible, fixed
> abstract distance is used in dax/kmem to describe slow memory such as
> Optane DCPMM.
>
> To support more memory types, in this series, we added the abstract
> distance calculation algorithm management mechanism, provided a
> algorithm implementation based on ACPI HMAT, and used the general
> abstract distance calculation interface in dax/kmem driver.  So,
> dax/kmem can support HBM (high bandwidth memory) in addition to the
> original Optane DCPMM.
>
> Changelog:
>
> V1 (from RFC):
>
> - Added some comments per Aneesh's comments, Thanks!
>
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT
  2023-07-21  4:15 ` [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Alistair Popple
@ 2023-07-24 17:58   ` Andrew Morton
  2023-08-01  2:35     ` Bharata B Rao
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Morton @ 2023-07-24 17:58 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Huang Ying, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki

On Fri, 21 Jul 2023 14:15:31 +1000 Alistair Popple <apopple@nvidia.com> wrote:

> Thanks for this Huang, I had been hoping to take a look at it this week
> but have run out of time. I'm keen to do some testing with it as well.

Thanks.  I'll queue this in mm-unstable for some testing.  Detailed
review and testing would be appreciated.

I made some adjustments to handle the renaming of destroy_memory_type()
to put_memory_type()
(https://lkml.kernel.org/r/20230706063905.543800-1-linmiaohe@huawei.com)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-21  1:29 ` [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management Huang Ying
@ 2023-07-25  2:13   ` Alistair Popple
  2023-07-25  3:14     ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-07-25  2:13 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


Huang Ying <ying.huang@intel.com> writes:

> The abstract distance may be calculated by various drivers, such as
> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
> users and the providers, the abstract distance calculation algorithms
> management mechanism is implemented in this patch.  It provides
> interface for the providers to register the implementation, and
> interface for the users.

I wonder if we need this level of decoupling though? It seems to me like
it would be simpler and better for drivers to calculate the abstract
distance directly themselves by calling the desired algorithm (eg. ACPI
HMAT) and pass this when creating the nodes rather than having a
notifier chain.

At the moment it seems we've only identified two possible algorithms
(ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
of those to fallback to the other based on priority, so why not just
have drivers call the correct algorithm directly?

> Multiple algorithm implementations can cooperate via calculating
> abstract distance for different memory nodes.  The preference of
> algorithm implementations can be specified via
> priority (notifier_block.priority).

How/what decides the priority though? That seems like something better
decided by a device driver than the algorithm driver IMHO.

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/memory-tiers.h | 19 ++++++++++++
>  mm/memory-tiers.c            | 59 ++++++++++++++++++++++++++++++++++++
>  2 files changed, 78 insertions(+)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index fc9647b1b4f9..c6429e624244 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -6,6 +6,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/kref.h>
>  #include <linux/mmzone.h>
> +#include <linux/notifier.h>
>  /*
>   * Each tier cover a abstrace distance chunk size of 128
>   */
> @@ -36,6 +37,9 @@ struct memory_dev_type *alloc_memory_type(int adistance);
>  void destroy_memory_type(struct memory_dev_type *memtype);
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
>  void clear_node_memory_type(int node, struct memory_dev_type *memtype);
> +int register_mt_adistance_algorithm(struct notifier_block *nb);
> +int unregister_mt_adistance_algorithm(struct notifier_block *nb);
> +int mt_calc_adistance(int node, int *adist);
>  #ifdef CONFIG_MIGRATION
>  int next_demotion_node(int node);
>  void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
> @@ -97,5 +101,20 @@ static inline bool node_is_toptier(int node)
>  {
>  	return true;
>  }
> +
> +static inline int register_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +	return 0;
> +}
> +
> +static inline int unregister_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +	return 0;
> +}
> +
> +static inline int mt_calc_adistance(int node, int *adist)
> +{
> +	return NOTIFY_DONE;
> +}
>  #endif	/* CONFIG_NUMA */
>  #endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index a516e303e304..1e55fbe2ad51 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -5,6 +5,7 @@
>  #include <linux/kobject.h>
>  #include <linux/memory.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/notifier.h>
>  
>  #include "internal.h"
>  
> @@ -105,6 +106,8 @@ static int top_tier_adistance;
>  static struct demotion_nodes *node_demotion __read_mostly;
>  #endif /* CONFIG_MIGRATION */
>  
> +static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
> +
>  static inline struct memory_tier *to_memory_tier(struct device *device)
>  {
>  	return container_of(device, struct memory_tier, dev);
> @@ -592,6 +595,62 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>  }
>  EXPORT_SYMBOL_GPL(clear_node_memory_type);
>  
> +/**
> + * register_mt_adistance_algorithm() - Register memory tiering abstract distance algorithm
> + * @nb: The notifier block which describe the algorithm
> + *
> + * Return: 0 on success, errno on error.
> + *
> + * Every memory tiering abstract distance algorithm provider needs to
> + * register the algorithm with register_mt_adistance_algorithm().  To
> + * calculate the abstract distance for a specified memory node, the
> + * notifier function will be called unless some high priority
> + * algorithm has provided result.  The prototype of the notifier
> + * function is as follows,
> + *
> + *   int (*algorithm_notifier)(struct notifier_block *nb,
> + *                             unsigned long nid, void *data);
> + *
> + * Where "nid" specifies the memory node, "data" is the pointer to the
> + * returned abstract distance (that is, "int *adist").  If the
> + * algorithm provides the result, NOTIFY_STOP should be returned.
> + * Otherwise, return_value & %NOTIFY_STOP_MASK == 0 to allow the next
> + * algorithm in the chain to provide the result.
> + */
> +int register_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&mt_adistance_algorithms, nb);
> +}
> +EXPORT_SYMBOL_GPL(register_mt_adistance_algorithm);
> +
> +/**
> + * unregister_mt_adistance_algorithm() - Unregister memory tiering abstract distance algorithm
> + * @nb: the notifier block which describe the algorithm
> + *
> + * Return: 0 on success, errno on error.
> + */
> +int unregister_mt_adistance_algorithm(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_unregister(&mt_adistance_algorithms, nb);
> +}
> +EXPORT_SYMBOL_GPL(unregister_mt_adistance_algorithm);
> +
> +/**
> + * mt_calc_adistance() - Calculate abstract distance with registered algorithms
> + * @node: the node to calculate abstract distance for
> + * @adist: the returned abstract distance
> + *
> + * Return: if return_value & %NOTIFY_STOP_MASK != 0, then some
> + * abstract distance algorithm provides the result, and return it via
> + * @adist.  Otherwise, no algorithm can provide the result and @adist
> + * will be kept as it is.
> + */
> +int mt_calc_adistance(int node, int *adist)
> +{
> +	return blocking_notifier_call_chain(&mt_adistance_algorithms, node, adist);
> +}
> +EXPORT_SYMBOL_GPL(mt_calc_adistance);
> +
>  static int __meminit memtier_hotplug_callback(struct notifier_block *self,
>  					      unsigned long action, void *_arg)
>  {


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators()
  2023-07-21  1:29 ` [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators() Huang Ying
@ 2023-07-25  2:44   ` Alistair Popple
  2023-08-07 16:55   ` Jonathan Cameron
  1 sibling, 0 replies; 41+ messages in thread
From: Alistair Popple @ 2023-07-25  2:44 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


Huang Ying <ying.huang@intel.com> writes:

> Previously, in hmat_register_target_initiators(), the performance
> attributes are calculated and the corresponding sysfs links and files
> are created too.  Which is called during memory onlining.
>
> But now, to calculate the abstract distance of a memory target before
> memory onlining, we need to calculate the performance attributes for
> a memory target without creating sysfs links and files.
>
> To do that, hmat_register_target_initiators() is refactored to make it
> possible to calculate performance attributes separately.

The refactor looks good and I have run the whole series on a system with
some hmat data so:

Reviewed-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Alistair Popple <apopple@nvidia.com>

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/acpi/numa/hmat.c | 81 +++++++++++++++-------------------------
>  1 file changed, 30 insertions(+), 51 deletions(-)
>
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index bba268ecd802..2dee0098f1a9 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -582,28 +582,25 @@ static int initiators_to_nodemask(unsigned long *p_nodes)
>  	return 0;
>  }
>  
> -static void hmat_register_target_initiators(struct memory_target *target)
> +static void hmat_update_target_attrs(struct memory_target *target,
> +				     unsigned long *p_nodes, int access)
>  {
> -	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
>  	struct memory_initiator *initiator;
> -	unsigned int mem_nid, cpu_nid;
> +	unsigned int cpu_nid;
>  	struct memory_locality *loc = NULL;
>  	u32 best = 0;
> -	bool access0done = false;
>  	int i;
>  
> -	mem_nid = pxm_to_node(target->memory_pxm);
> +	bitmap_zero(p_nodes, MAX_NUMNODES);
>  	/*
> -	 * If the Address Range Structure provides a local processor pxm, link
> +	 * If the Address Range Structure provides a local processor pxm, set
>  	 * only that one. Otherwise, find the best performance attributes and
> -	 * register all initiators that match.
> +	 * collect all initiators that match.
>  	 */
>  	if (target->processor_pxm != PXM_INVAL) {
>  		cpu_nid = pxm_to_node(target->processor_pxm);
> -		register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
> -		access0done = true;
> -		if (node_state(cpu_nid, N_CPU)) {
> -			register_memory_node_under_compute_node(mem_nid, cpu_nid, 1);
> +		if (access == 0 || node_state(cpu_nid, N_CPU)) {
> +			set_bit(target->processor_pxm, p_nodes);
>  			return;
>  		}
>  	}
> @@ -617,47 +614,10 @@ static void hmat_register_target_initiators(struct memory_target *target)
>  	 * We'll also use the sorting to prime the candidate nodes with known
>  	 * initiators.
>  	 */
> -	bitmap_zero(p_nodes, MAX_NUMNODES);
>  	list_sort(NULL, &initiators, initiator_cmp);
>  	if (initiators_to_nodemask(p_nodes) < 0)
>  		return;
>  
> -	if (!access0done) {
> -		for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
> -			loc = localities_types[i];
> -			if (!loc)
> -				continue;
> -
> -			best = 0;
> -			list_for_each_entry(initiator, &initiators, node) {
> -				u32 value;
> -
> -				if (!test_bit(initiator->processor_pxm, p_nodes))
> -					continue;
> -
> -				value = hmat_initiator_perf(target, initiator,
> -							    loc->hmat_loc);
> -				if (hmat_update_best(loc->hmat_loc->data_type, value, &best))
> -					bitmap_clear(p_nodes, 0, initiator->processor_pxm);
> -				if (value != best)
> -					clear_bit(initiator->processor_pxm, p_nodes);
> -			}
> -			if (best)
> -				hmat_update_target_access(target, loc->hmat_loc->data_type,
> -							  best, 0);
> -		}
> -
> -		for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
> -			cpu_nid = pxm_to_node(i);
> -			register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
> -		}
> -	}
> -
> -	/* Access 1 ignores Generic Initiators */
> -	bitmap_zero(p_nodes, MAX_NUMNODES);
> -	if (initiators_to_nodemask(p_nodes) < 0)
> -		return;
> -
>  	for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
>  		loc = localities_types[i];
>  		if (!loc)
> @@ -667,7 +627,7 @@ static void hmat_register_target_initiators(struct memory_target *target)
>  		list_for_each_entry(initiator, &initiators, node) {
>  			u32 value;
>  
> -			if (!initiator->has_cpu) {
> +			if (access == 1 && !initiator->has_cpu) {
>  				clear_bit(initiator->processor_pxm, p_nodes);
>  				continue;
>  			}
> @@ -681,14 +641,33 @@ static void hmat_register_target_initiators(struct memory_target *target)
>  				clear_bit(initiator->processor_pxm, p_nodes);
>  		}
>  		if (best)
> -			hmat_update_target_access(target, loc->hmat_loc->data_type, best, 1);
> +			hmat_update_target_access(target, loc->hmat_loc->data_type, best, access);
>  	}
> +}
> +
> +static void __hmat_register_target_initiators(struct memory_target *target,
> +					      unsigned long *p_nodes,
> +					      int access)
> +{
> +	unsigned int mem_nid, cpu_nid;
> +	int i;
> +
> +	mem_nid = pxm_to_node(target->memory_pxm);
> +	hmat_update_target_attrs(target, p_nodes, access);
>  	for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
>  		cpu_nid = pxm_to_node(i);
> -		register_memory_node_under_compute_node(mem_nid, cpu_nid, 1);
> +		register_memory_node_under_compute_node(mem_nid, cpu_nid, access);
>  	}
>  }
>  
> +static void hmat_register_target_initiators(struct memory_target *target)
> +{
> +	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
> +
> +	__hmat_register_target_initiators(target, p_nodes, 0);
> +	__hmat_register_target_initiators(target, p_nodes, 1);
> +}
> +
>  static void hmat_register_target_cache(struct memory_target *target)
>  {
>  	unsigned mem_nid = pxm_to_node(target->memory_pxm);


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT
  2023-07-21  1:29 ` [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT Huang Ying
@ 2023-07-25  2:45   ` Alistair Popple
  2023-07-25  6:47     ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-07-25  2:45 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


Huang Ying <ying.huang@intel.com> writes:

> A memory tiering abstract distance calculation algorithm based on ACPI
> HMAT is implemented.  The basic idea is as follows.
>
> The performance attributes of system default DRAM nodes are recorded
> as the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.
> Then, the ratio of the abstract distance of a memory node (target) to
> MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
> attributes of the node to that of the default DRAM nodes.

The problem I encountered here with the calculations is that HBM memory
ended up in a lower-tiered node which isn't what I wanted (at least when
that HBM is attached to a GPU say).

I suspect this is because the calculations are based on the CPU
point-of-view (access1) which still sees lower bandwidth to remote HBM
than local DRAM, even though the remote GPU has higher bandwidth access
to that memory. Perhaps we need to be considering access0 as well?
Ie. HBM directly attached to a generic initiator should be in a higher
tier regardless of CPU access characteristics?

That said I'm not entirely convinced the HMAT tables I'm testing against
are accurate/complete.

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/acpi/numa/hmat.c     | 138 ++++++++++++++++++++++++++++++++++-
>  include/linux/memory-tiers.h |   2 +
>  mm/memory-tiers.c            |   2 +-
>  3 files changed, 140 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 2dee0098f1a9..306a912090f0 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -24,6 +24,7 @@
>  #include <linux/node.h>
>  #include <linux/sysfs.h>
>  #include <linux/dax.h>
> +#include <linux/memory-tiers.h>
>  
>  static u8 hmat_revision;
>  static int hmat_disable __initdata;
> @@ -759,6 +760,137 @@ static int hmat_callback(struct notifier_block *self,
>  	return NOTIFY_OK;
>  }
>  
> +static int hmat_adistance_disabled;
> +static struct node_hmem_attrs default_dram_attrs;
> +
> +static void dump_hmem_attrs(struct node_hmem_attrs *attrs)
> +{
> +	pr_cont("read_latency: %u, write_latency: %u, read_bandwidth: %u, write_bandwidth: %u\n",
> +		attrs->read_latency, attrs->write_latency,
> +		attrs->read_bandwidth, attrs->write_bandwidth);
> +}
> +
> +static void disable_hmat_adistance_algorithm(void)
> +{
> +	hmat_adistance_disabled = true;
> +}
> +
> +static int hmat_init_default_dram_attrs(void)
> +{
> +	struct memory_target *target;
> +	struct node_hmem_attrs *attrs;
> +	int nid, pxm;
> +	int nid_dram = NUMA_NO_NODE;
> +
> +	if (default_dram_attrs.read_latency +
> +	    default_dram_attrs.write_latency != 0)
> +		return 0;
> +
> +	if (!default_dram_type)
> +		return -EIO;
> +
> +	for_each_node_mask(nid, default_dram_type->nodes) {
> +		pxm = node_to_pxm(nid);
> +		target = find_mem_target(pxm);
> +		if (!target)
> +			continue;
> +		attrs = &target->hmem_attrs[1];
> +		if (nid_dram == NUMA_NO_NODE) {
> +			if (attrs->read_latency + attrs->write_latency == 0 ||
> +			    attrs->read_bandwidth + attrs->write_bandwidth == 0) {
> +				pr_info("hmat: invalid hmem attrs for default DRAM node: %d,\n",
> +					nid);
> +				pr_info("  ");
> +				dump_hmem_attrs(attrs);
> +				pr_info("  disable hmat based abstract distance algorithm.\n");
> +				disable_hmat_adistance_algorithm();
> +				return -EIO;
> +			}
> +			nid_dram = nid;
> +			default_dram_attrs = *attrs;
> +			continue;
> +		}
> +
> +		/*
> +		 * The performance of all default DRAM nodes is expected
> +		 * to be same (that is, the variation is less than 10%).
> +		 * And it will be used as base to calculate the abstract
> +		 * distance of other memory nodes.
> +		 */
> +		if (abs(attrs->read_latency - default_dram_attrs.read_latency) * 10 >
> +		    default_dram_attrs.read_latency ||
> +		    abs(attrs->write_latency - default_dram_attrs.write_latency) * 10 >
> +		    default_dram_attrs.write_latency ||
> +		    abs(attrs->read_bandwidth - default_dram_attrs.read_bandwidth) * 10 >
> +		    default_dram_attrs.read_bandwidth) {
> +			pr_info("hmat: hmem attrs for DRAM nodes mismatch.\n");
> +			pr_info("  node %d:", nid_dram);
> +			dump_hmem_attrs(&default_dram_attrs);
> +			pr_info("  node %d:", nid);
> +			dump_hmem_attrs(attrs);
> +			pr_info("  disable hmat based abstract distance algorithm.\n");
> +			disable_hmat_adistance_algorithm();
> +			return -EIO;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static int hmat_calculate_adistance(struct notifier_block *self,
> +				    unsigned long nid, void *data)
> +{
> +	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
> +	struct memory_target *target;
> +	struct node_hmem_attrs *attrs;
> +	int *adist = data;
> +	int pxm;
> +
> +	if (hmat_adistance_disabled)
> +		return NOTIFY_OK;
> +
> +	pxm = node_to_pxm(nid);
> +	target = find_mem_target(pxm);
> +	if (!target)
> +		return NOTIFY_OK;
> +
> +	if (hmat_init_default_dram_attrs())
> +		return NOTIFY_OK;
> +
> +	mutex_lock(&target_lock);
> +	hmat_update_target_attrs(target, p_nodes, 1);
> +	mutex_unlock(&target_lock);
> +
> +	attrs = &target->hmem_attrs[1];
> +
> +	if (attrs->read_latency + attrs->write_latency == 0 ||
> +	    attrs->read_bandwidth + attrs->write_bandwidth == 0)
> +		return NOTIFY_OK;
> +
> +	/*
> +	 * The abstract distance of a memory node is in direct
> +	 * proportion to its memory latency (read + write) and
> +	 * inversely proportional to its memory bandwidth (read +
> +	 * write).  The abstract distance, memory latency, and memory
> +	 * bandwidth of the default DRAM nodes are used as the base.
> +	 */
> +	*adist = MEMTIER_ADISTANCE_DRAM *
> +		(attrs->read_latency + attrs->write_latency) /
> +		(default_dram_attrs.read_latency +
> +		 default_dram_attrs.write_latency) *
> +		(default_dram_attrs.read_bandwidth +
> +		 default_dram_attrs.write_bandwidth) /
> +		(attrs->read_bandwidth + attrs->write_bandwidth);
> +
> +	return NOTIFY_STOP;
> +}
> +
> +static __meminitdata struct notifier_block hmat_adist_nb =
> +{
> +	.notifier_call = hmat_calculate_adistance,
> +	.priority = 100,
> +};
> +
>  static __init void hmat_free_structures(void)
>  {
>  	struct memory_target *target, *tnext;
> @@ -801,6 +933,7 @@ static __init int hmat_init(void)
>  	struct acpi_table_header *tbl;
>  	enum acpi_hmat_type i;
>  	acpi_status status;
> +	int usage;
>  
>  	if (srat_disabled() || hmat_disable)
>  		return 0;
> @@ -841,8 +974,11 @@ static __init int hmat_init(void)
>  	hmat_register_targets();
>  
>  	/* Keep the table and structures if the notifier may use them */
> -	if (!hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
> +	usage = !hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI);
> +	usage += !register_mt_adistance_algorithm(&hmat_adist_nb);
> +	if (usage)
>  		return 0;
> +
>  out_put:
>  	hmat_free_structures();
>  	acpi_put_table(tbl);
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index c6429e624244..9377239c8d34 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -33,6 +33,7 @@ struct memory_dev_type {
>  
>  #ifdef CONFIG_NUMA
>  extern bool numa_demotion_enabled;
> +extern struct memory_dev_type *default_dram_type;
>  struct memory_dev_type *alloc_memory_type(int adistance);
>  void destroy_memory_type(struct memory_dev_type *memtype);
>  void init_node_memory_type(int node, struct memory_dev_type *default_type);
> @@ -64,6 +65,7 @@ static inline bool node_is_toptier(int node)
>  #else
>  
>  #define numa_demotion_enabled	false
> +#define default_dram_type	NULL
>  /*
>   * CONFIG_NUMA implementation returns non NULL error.
>   */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 1e55fbe2ad51..9a734ef2edfb 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -37,7 +37,7 @@ struct node_memory_type_map {
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
> -static struct memory_dev_type *default_dram_type;
> +struct memory_dev_type *default_dram_type;
>  
>  static struct bus_type memory_tier_subsys = {
>  	.name = "memory_tiering",


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface
  2023-07-21  1:29 ` [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface Huang Ying
@ 2023-07-25  3:11   ` Alistair Popple
  2023-07-25  7:02     ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-07-25  3:11 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


Huang Ying <ying.huang@intel.com> writes:

> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
> used for slow memory type in kmem driver.  This limits the usage of
> kmem driver, for example, it cannot be used for HBM (high bandwidth
> memory).
>
> So, we use the general abstract distance calculation mechanism in kmem
> drivers to get more accurate abstract distance on systems with proper
> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
> fallback only.
>
> Now, multiple memory types may be managed by kmem.  These memory types
> are put into the "kmem_memory_types" list and protected by
> kmem_memory_type_lock.

See below but I wonder if kmem_memory_types could be a common helper
rather than kdax specific?

> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/dax/kmem.c           | 54 +++++++++++++++++++++++++++---------
>  include/linux/memory-tiers.h |  2 ++
>  mm/memory-tiers.c            |  2 +-
>  3 files changed, 44 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 898ca9505754..837165037231 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>  	struct resource *res[];
>  };
>  
> -static struct memory_dev_type *dax_slowmem_type;
> +static DEFINE_MUTEX(kmem_memory_type_lock);
> +static LIST_HEAD(kmem_memory_types);
> +
> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
> +{
> +	bool found = false;
> +	struct memory_dev_type *mtype;
> +
> +	mutex_lock(&kmem_memory_type_lock);
> +	list_for_each_entry(mtype, &kmem_memory_types, list) {
> +		if (mtype->adistance == adist) {
> +			found = true;
> +			break;
> +		}
> +	}
> +	if (!found) {
> +		mtype = alloc_memory_type(adist);
> +		if (!IS_ERR(mtype))
> +			list_add(&mtype->list, &kmem_memory_types);
> +	}
> +	mutex_unlock(&kmem_memory_type_lock);
> +
> +	return mtype;
> +}
> +
>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  {
>  	struct device *dev = &dev_dax->dev;
>  	unsigned long total_len = 0;
>  	struct dax_kmem_data *data;
> +	struct memory_dev_type *mtype;
>  	int i, rc, mapped = 0;
>  	int numa_node;
> +	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>  
>  	/*
>  	 * Ensure good NUMA information for the persistent memory.
> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		return -EINVAL;
>  	}
>  
> +	mt_calc_adistance(numa_node, &adist);
> +	mtype = kmem_find_alloc_memorty_type(adist);
> +	if (IS_ERR(mtype))
> +		return PTR_ERR(mtype);
> +

I wrote my own quick and dirty module to test this and wrote basically
the same code sequence.

I notice your using a list of memory types here though. I think it would
be nice to have a common helper that other users could call to do the
mt_calc_adistance() / kmem_find_alloc_memory_type() /
init_node_memory_type() sequence and cleanup as my naive approach would
result in a new memory_dev_type per device even though adist might be
the same. A common helper would make it easy to de-dup those.

>  	for (i = 0; i < dev_dax->nr_range; i++) {
>  		struct range range;
>  
> @@ -88,7 +119,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		return -EINVAL;
>  	}
>  
> -	init_node_memory_type(numa_node, dax_slowmem_type);
> +	init_node_memory_type(numa_node, mtype);
>  
>  	rc = -ENOMEM;
>  	data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
> @@ -167,7 +198,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  err_res_name:
>  	kfree(data);
>  err_dax_kmem_data:
> -	clear_node_memory_type(numa_node, dax_slowmem_type);
> +	clear_node_memory_type(numa_node, mtype);
>  	return rc;
>  }
>  
> @@ -219,7 +250,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>  		 * for that. This implies this reference will be around
>  		 * till next reboot.
>  		 */
> -		clear_node_memory_type(node, dax_slowmem_type);
> +		clear_node_memory_type(node, NULL);
>  	}
>  }
>  #else
> @@ -251,12 +282,6 @@ static int __init dax_kmem_init(void)
>  	if (!kmem_name)
>  		return -ENOMEM;
>  
> -	dax_slowmem_type = alloc_memory_type(MEMTIER_DEFAULT_DAX_ADISTANCE);
> -	if (IS_ERR(dax_slowmem_type)) {
> -		rc = PTR_ERR(dax_slowmem_type);
> -		goto err_dax_slowmem_type;
> -	}
> -
>  	rc = dax_driver_register(&device_dax_kmem_driver);
>  	if (rc)
>  		goto error_dax_driver;
> @@ -264,18 +289,21 @@ static int __init dax_kmem_init(void)
>  	return rc;
>  
>  error_dax_driver:
> -	destroy_memory_type(dax_slowmem_type);
> -err_dax_slowmem_type:
>  	kfree_const(kmem_name);
>  	return rc;
>  }
>  
>  static void __exit dax_kmem_exit(void)
>  {
> +	struct memory_dev_type *mtype, *mtn;
> +
>  	dax_driver_unregister(&device_dax_kmem_driver);
>  	if (!any_hotremove_failed)
>  		kfree_const(kmem_name);
> -	destroy_memory_type(dax_slowmem_type);
> +	list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
> +		list_del(&mtype->list);
> +		destroy_memory_type(mtype);
> +	}
>  }
>  
>  MODULE_AUTHOR("Intel Corporation");
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 9377239c8d34..aca22220cb5c 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -24,6 +24,8 @@ struct memory_tier;
>  struct memory_dev_type {
>  	/* list of memory types that are part of same tier as this type */
>  	struct list_head tier_sibiling;
> +	/* list of memory types that are managed by one driver */
> +	struct list_head list;
>  	/* abstract distance for this specific memory type */
>  	int adistance;
>  	/* Nodes of same abstract distance */
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 9a734ef2edfb..38005c60fa2d 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -581,7 +581,7 @@ EXPORT_SYMBOL_GPL(init_node_memory_type);
>  void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>  {
>  	mutex_lock(&memory_tier_lock);
> -	if (node_memory_types[node].memtype == memtype)
> +	if (node_memory_types[node].memtype == memtype || !memtype)
>  		node_memory_types[node].map_count--;
>  	/*
>  	 * If we umapped all the attached devices to this node,


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-25  2:13   ` Alistair Popple
@ 2023-07-25  3:14     ` Huang, Ying
  2023-07-25  8:26       ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-07-25  3:14 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang

Hi, Alistair,

Thanks a lot for comments!

Alistair Popple <apopple@nvidia.com> writes:

> Huang Ying <ying.huang@intel.com> writes:
>
>> The abstract distance may be calculated by various drivers, such as
>> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
>> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
>> users and the providers, the abstract distance calculation algorithms
>> management mechanism is implemented in this patch.  It provides
>> interface for the providers to register the implementation, and
>> interface for the users.
>
> I wonder if we need this level of decoupling though? It seems to me like
> it would be simpler and better for drivers to calculate the abstract
> distance directly themselves by calling the desired algorithm (eg. ACPI
> HMAT) and pass this when creating the nodes rather than having a
> notifier chain.

Per my understanding, ACPI HMAT and memory device drivers (such as
dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
good to call functions across subsystems directly.  So, I think it's
better to use a general subsystem: memory-tier.c to decouple them.  If
it turns out that a notifier chain is unnecessary, we can use some
function pointers instead.

> At the moment it seems we've only identified two possible algorithms
> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
> of those to fallback to the other based on priority, so why not just
> have drivers call the correct algorithm directly?

For example, we have a system with PMEM (persistent memory, Optane
DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
types of the device and call corresponding algorithms.  The other way
(suggested by this series) is to make dax/kmem call a notifier chain,
then CXL CDAT or ACPI HMAT can identify the type of device and calculate
the distance if the type is correct for them.  I don't think that it's
good to make dax/kem to know every possible types of memory devices.

>> Multiple algorithm implementations can cooperate via calculating
>> abstract distance for different memory nodes.  The preference of
>> algorithm implementations can be specified via
>> priority (notifier_block.priority).
>
> How/what decides the priority though? That seems like something better
> decided by a device driver than the algorithm driver IMHO.

Do we need the memory device driver specific priority?  Or we just share
a common priority?  For example, the priority of CXL CDAT is always
higher than that of ACPI HMAT?  Or architecture specific?

And, I don't think that we are forced to use the general notifier chain
interface in all memory device drivers.  If the memory device driver has
better understanding of the memory device, it can use other way to
determine abstract distance.  For example, a CXL memory device driver
can identify abstract distance by itself.  While other memory device drivers
can use the general notifier chain interface at the same time.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT
  2023-07-25  2:45   ` Alistair Popple
@ 2023-07-25  6:47     ` Huang, Ying
  2023-08-21 11:53       ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-07-25  6:47 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki

Alistair Popple <apopple@nvidia.com> writes:

> Huang Ying <ying.huang@intel.com> writes:
>
>> A memory tiering abstract distance calculation algorithm based on ACPI
>> HMAT is implemented.  The basic idea is as follows.
>>
>> The performance attributes of system default DRAM nodes are recorded
>> as the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.
>> Then, the ratio of the abstract distance of a memory node (target) to
>> MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
>> attributes of the node to that of the default DRAM nodes.
>
> The problem I encountered here with the calculations is that HBM memory
> ended up in a lower-tiered node which isn't what I wanted (at least when
> that HBM is attached to a GPU say).

I have tested the series on a server machine with HBM (pure HBM, not
attached to a GPU).  Where, HBM is placed in a higher tier than DRAM.

> I suspect this is because the calculations are based on the CPU
> point-of-view (access1) which still sees lower bandwidth to remote HBM
> than local DRAM, even though the remote GPU has higher bandwidth access
> to that memory. Perhaps we need to be considering access0 as well?
> Ie. HBM directly attached to a generic initiator should be in a higher
> tier regardless of CPU access characteristics?

What's your requirements for memory tiers on the machine?  I guess you
want to put GPU attache HBM in a higher tier and put DRAM in a lower
tier.  So, cold HBM pages can be demoted to DRAM when there are memory
pressure on HBM?  This sounds reasonable from GPU point of view.

The above requirements may be satisfied via calculating abstract
distance based on access0 (or combined with access1).  But I suspect
this will be a general solution.  I guess that any memory devices that
are used mainly by the memory initiators other than CPUs want to put
themselves in a higher memory tier than DRAM, regardless of its
access0.

One solution is to put GPU HBM in the highest memory tier (with smallest
abstract distance) always in GPU device driver regardless its HMAT
performance attributes.  Is it possible?

> That said I'm not entirely convinced the HMAT tables I'm testing against
> are accurate/complete.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface
  2023-07-25  3:11   ` Alistair Popple
@ 2023-07-25  7:02     ` Huang, Ying
  2023-08-21 12:03       ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-07-25  7:02 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki

Alistair Popple <apopple@nvidia.com> writes:

> Huang Ying <ying.huang@intel.com> writes:
>
>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>> used for slow memory type in kmem driver.  This limits the usage of
>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>> memory).
>>
>> So, we use the general abstract distance calculation mechanism in kmem
>> drivers to get more accurate abstract distance on systems with proper
>> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>> fallback only.
>>
>> Now, multiple memory types may be managed by kmem.  These memory types
>> are put into the "kmem_memory_types" list and protected by
>> kmem_memory_type_lock.
>
> See below but I wonder if kmem_memory_types could be a common helper
> rather than kdax specific?
>
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> Cc: Wei Xu <weixugc@google.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
>> ---
>>  drivers/dax/kmem.c           | 54 +++++++++++++++++++++++++++---------
>>  include/linux/memory-tiers.h |  2 ++
>>  mm/memory-tiers.c            |  2 +-
>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 898ca9505754..837165037231 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>  	struct resource *res[];
>>  };
>>  
>> -static struct memory_dev_type *dax_slowmem_type;
>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>> +static LIST_HEAD(kmem_memory_types);
>> +
>> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
>> +{
>> +	bool found = false;
>> +	struct memory_dev_type *mtype;
>> +
>> +	mutex_lock(&kmem_memory_type_lock);
>> +	list_for_each_entry(mtype, &kmem_memory_types, list) {
>> +		if (mtype->adistance == adist) {
>> +			found = true;
>> +			break;
>> +		}
>> +	}
>> +	if (!found) {
>> +		mtype = alloc_memory_type(adist);
>> +		if (!IS_ERR(mtype))
>> +			list_add(&mtype->list, &kmem_memory_types);
>> +	}
>> +	mutex_unlock(&kmem_memory_type_lock);
>> +
>> +	return mtype;
>> +}
>> +
>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  {
>>  	struct device *dev = &dev_dax->dev;
>>  	unsigned long total_len = 0;
>>  	struct dax_kmem_data *data;
>> +	struct memory_dev_type *mtype;
>>  	int i, rc, mapped = 0;
>>  	int numa_node;
>> +	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>  
>>  	/*
>>  	 * Ensure good NUMA information for the persistent memory.
>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  		return -EINVAL;
>>  	}
>>  
>> +	mt_calc_adistance(numa_node, &adist);
>> +	mtype = kmem_find_alloc_memorty_type(adist);
>> +	if (IS_ERR(mtype))
>> +		return PTR_ERR(mtype);
>> +
>
> I wrote my own quick and dirty module to test this and wrote basically
> the same code sequence.
>
> I notice your using a list of memory types here though. I think it would
> be nice to have a common helper that other users could call to do the
> mt_calc_adistance() / kmem_find_alloc_memory_type() /
> init_node_memory_type() sequence and cleanup as my naive approach would
> result in a new memory_dev_type per device even though adist might be
> the same. A common helper would make it easy to de-dup those.

If it's useful, we can move kmem_find_alloc_memory_type() to
memory-tier.c after some revision.  But I tend to move it after we have
the second user.  What do you think about that?

--
Best Regards,
Huang, Ying

>>  	for (i = 0; i < dev_dax->nr_range; i++) {
>>  		struct range range;
>>  
>> @@ -88,7 +119,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  		return -EINVAL;
>>  	}
>>  
>> -	init_node_memory_type(numa_node, dax_slowmem_type);
>> +	init_node_memory_type(numa_node, mtype);
>>  
>>  	rc = -ENOMEM;
>>  	data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
>> @@ -167,7 +198,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>  err_res_name:
>>  	kfree(data);
>>  err_dax_kmem_data:
>> -	clear_node_memory_type(numa_node, dax_slowmem_type);
>> +	clear_node_memory_type(numa_node, mtype);
>>  	return rc;
>>  }
>>  
>> @@ -219,7 +250,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
>>  		 * for that. This implies this reference will be around
>>  		 * till next reboot.
>>  		 */
>> -		clear_node_memory_type(node, dax_slowmem_type);
>> +		clear_node_memory_type(node, NULL);
>>  	}
>>  }
>>  #else
>> @@ -251,12 +282,6 @@ static int __init dax_kmem_init(void)
>>  	if (!kmem_name)
>>  		return -ENOMEM;
>>  
>> -	dax_slowmem_type = alloc_memory_type(MEMTIER_DEFAULT_DAX_ADISTANCE);
>> -	if (IS_ERR(dax_slowmem_type)) {
>> -		rc = PTR_ERR(dax_slowmem_type);
>> -		goto err_dax_slowmem_type;
>> -	}
>> -
>>  	rc = dax_driver_register(&device_dax_kmem_driver);
>>  	if (rc)
>>  		goto error_dax_driver;
>> @@ -264,18 +289,21 @@ static int __init dax_kmem_init(void)
>>  	return rc;
>>  
>>  error_dax_driver:
>> -	destroy_memory_type(dax_slowmem_type);
>> -err_dax_slowmem_type:
>>  	kfree_const(kmem_name);
>>  	return rc;
>>  }
>>  
>>  static void __exit dax_kmem_exit(void)
>>  {
>> +	struct memory_dev_type *mtype, *mtn;
>> +
>>  	dax_driver_unregister(&device_dax_kmem_driver);
>>  	if (!any_hotremove_failed)
>>  		kfree_const(kmem_name);
>> -	destroy_memory_type(dax_slowmem_type);
>> +	list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
>> +		list_del(&mtype->list);
>> +		destroy_memory_type(mtype);
>> +	}
>>  }
>>  
>>  MODULE_AUTHOR("Intel Corporation");
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index 9377239c8d34..aca22220cb5c 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -24,6 +24,8 @@ struct memory_tier;
>>  struct memory_dev_type {
>>  	/* list of memory types that are part of same tier as this type */
>>  	struct list_head tier_sibiling;
>> +	/* list of memory types that are managed by one driver */
>> +	struct list_head list;
>>  	/* abstract distance for this specific memory type */
>>  	int adistance;
>>  	/* Nodes of same abstract distance */
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index 9a734ef2edfb..38005c60fa2d 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -581,7 +581,7 @@ EXPORT_SYMBOL_GPL(init_node_memory_type);
>>  void clear_node_memory_type(int node, struct memory_dev_type *memtype)
>>  {
>>  	mutex_lock(&memory_tier_lock);
>> -	if (node_memory_types[node].memtype == memtype)
>> +	if (node_memory_types[node].memtype == memtype || !memtype)
>>  		node_memory_types[node].map_count--;
>>  	/*
>>  	 * If we umapped all the attached devices to this node,

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-25  3:14     ` Huang, Ying
@ 2023-07-25  8:26       ` Alistair Popple
  2023-07-26  7:33         ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-07-25  8:26 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang


"Huang, Ying" <ying.huang@intel.com> writes:

> Hi, Alistair,
>
> Thanks a lot for comments!
>
> Alistair Popple <apopple@nvidia.com> writes:
>
>> Huang Ying <ying.huang@intel.com> writes:
>>
>>> The abstract distance may be calculated by various drivers, such as
>>> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
>>> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
>>> users and the providers, the abstract distance calculation algorithms
>>> management mechanism is implemented in this patch.  It provides
>>> interface for the providers to register the implementation, and
>>> interface for the users.
>>
>> I wonder if we need this level of decoupling though? It seems to me like
>> it would be simpler and better for drivers to calculate the abstract
>> distance directly themselves by calling the desired algorithm (eg. ACPI
>> HMAT) and pass this when creating the nodes rather than having a
>> notifier chain.
>
> Per my understanding, ACPI HMAT and memory device drivers (such as
> dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
> good to call functions across subsystems directly.  So, I think it's
> better to use a general subsystem: memory-tier.c to decouple them.  If
> it turns out that a notifier chain is unnecessary, we can use some
> function pointers instead.
>
>> At the moment it seems we've only identified two possible algorithms
>> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
>> of those to fallback to the other based on priority, so why not just
>> have drivers call the correct algorithm directly?
>
> For example, we have a system with PMEM (persistent memory, Optane
> DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
> via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
> and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
> types of the device and call corresponding algorithms.

Yes, that is what I was thinking.

> The other way (suggested by this series) is to make dax/kmem call a
> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
> device and calculate the distance if the type is correct for them.  I
> don't think that it's good to make dax/kem to know every possible
> types of memory devices.

Do we expect there to be lots of different types of memory devices
sharing a common dax/kmem driver though? Must admit I'm coming from a
GPU background where we'd expect each type of device to have it's own
driver anyway so wasn't expecting different types of memory devices to
be handled by the same driver.

>>> Multiple algorithm implementations can cooperate via calculating
>>> abstract distance for different memory nodes.  The preference of
>>> algorithm implementations can be specified via
>>> priority (notifier_block.priority).
>>
>> How/what decides the priority though? That seems like something better
>> decided by a device driver than the algorithm driver IMHO.
>
> Do we need the memory device driver specific priority?  Or we just share
> a common priority?  For example, the priority of CXL CDAT is always
> higher than that of ACPI HMAT?  Or architecture specific?

Ok, thanks. Having read the above I think the priority is
unimportant. Algorithms can either decide to return a distance and
NOTIFY_STOP_MASK if they can calculate a distance or NOTIFY_DONE if they
can't for a specific device.

> And, I don't think that we are forced to use the general notifier
> chain interface in all memory device drivers.  If the memory device
> driver has better understanding of the memory device, it can use other
> way to determine abstract distance.  For example, a CXL memory device
> driver can identify abstract distance by itself.  While other memory
> device drivers can use the general notifier chain interface at the
> same time.

Whilst I think personally I would find that flexibility useful I am
concerned it means every driver will just end up divining it's own
distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
would kind of defeat the purpose of it all then.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-25  8:26       ` Alistair Popple
@ 2023-07-26  7:33         ` Huang, Ying
  2023-07-27  3:42           ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-07-26  7:33 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Hi, Alistair,
>>
>> Thanks a lot for comments!
>>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> Huang Ying <ying.huang@intel.com> writes:
>>>
>>>> The abstract distance may be calculated by various drivers, such as
>>>> ACPI HMAT, CXL CDAT, etc.  While it may be used by various code which
>>>> hot-add memory node, such as dax/kmem etc.  To decouple the algorithm
>>>> users and the providers, the abstract distance calculation algorithms
>>>> management mechanism is implemented in this patch.  It provides
>>>> interface for the providers to register the implementation, and
>>>> interface for the users.
>>>
>>> I wonder if we need this level of decoupling though? It seems to me like
>>> it would be simpler and better for drivers to calculate the abstract
>>> distance directly themselves by calling the desired algorithm (eg. ACPI
>>> HMAT) and pass this when creating the nodes rather than having a
>>> notifier chain.
>>
>> Per my understanding, ACPI HMAT and memory device drivers (such as
>> dax/kmem) may belong to different subsystems (ACPI vs. dax).  It's not
>> good to call functions across subsystems directly.  So, I think it's
>> better to use a general subsystem: memory-tier.c to decouple them.  If
>> it turns out that a notifier chain is unnecessary, we can use some
>> function pointers instead.
>>
>>> At the moment it seems we've only identified two possible algorithms
>>> (ACPI HMAT and CXL CDAT) and I don't think it would make sense for one
>>> of those to fallback to the other based on priority, so why not just
>>> have drivers call the correct algorithm directly?
>>
>> For example, we have a system with PMEM (persistent memory, Optane
>> DCPMM, or AEP, or something else) in DIMM slots and CXL.mem connected
>> via CXL link to a remote memory pool.  We will need ACPI HMAT for PMEM
>> and CXL CDAT for CXL.mem.  One way is to make dax/kmem identify the
>> types of the device and call corresponding algorithms.
>
> Yes, that is what I was thinking.
>
>> The other way (suggested by this series) is to make dax/kmem call a
>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>> device and calculate the distance if the type is correct for them.  I
>> don't think that it's good to make dax/kem to know every possible
>> types of memory devices.
>
> Do we expect there to be lots of different types of memory devices
> sharing a common dax/kmem driver though? Must admit I'm coming from a
> GPU background where we'd expect each type of device to have it's own
> driver anyway so wasn't expecting different types of memory devices to
> be handled by the same driver.

Now, dax/kmem.c is used for

- PMEM (Optane DCPMM, or AEP)
- CXL.mem
- HBM (attached to CPU)

I understand that for a CXL GPU driver it's OK to call some CXL CDAT
helper to identify the abstract distance of memory attached to GPU.
Because there's no cross-subsystem function calls.  But it looks not
very clean to call from dax/kmem.c to CXL CDAT because it's a
cross-subsystem function call.

>>>> Multiple algorithm implementations can cooperate via calculating
>>>> abstract distance for different memory nodes.  The preference of
>>>> algorithm implementations can be specified via
>>>> priority (notifier_block.priority).
>>>
>>> How/what decides the priority though? That seems like something better
>>> decided by a device driver than the algorithm driver IMHO.
>>
>> Do we need the memory device driver specific priority?  Or we just share
>> a common priority?  For example, the priority of CXL CDAT is always
>> higher than that of ACPI HMAT?  Or architecture specific?
>
> Ok, thanks. Having read the above I think the priority is
> unimportant. Algorithms can either decide to return a distance and
> NOTIFY_STOP_MASK if they can calculate a distance or NOTIFY_DONE if they
> can't for a specific device.

Yes.  In most cases, there are no overlaps among algorithms.

>> And, I don't think that we are forced to use the general notifier
>> chain interface in all memory device drivers.  If the memory device
>> driver has better understanding of the memory device, it can use other
>> way to determine abstract distance.  For example, a CXL memory device
>> driver can identify abstract distance by itself.  While other memory
>> device drivers can use the general notifier chain interface at the
>> same time.
>
> Whilst I think personally I would find that flexibility useful I am
> concerned it means every driver will just end up divining it's own
> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
> would kind of defeat the purpose of it all then.

But we have no way to enforce that too.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-26  7:33         ` Huang, Ying
@ 2023-07-27  3:42           ` Alistair Popple
  2023-07-27  4:02             ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-07-27  3:42 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang


"Huang, Ying" <ying.huang@intel.com> writes:

>>> The other way (suggested by this series) is to make dax/kmem call a
>>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>>> device and calculate the distance if the type is correct for them.  I
>>> don't think that it's good to make dax/kem to know every possible
>>> types of memory devices.
>>
>> Do we expect there to be lots of different types of memory devices
>> sharing a common dax/kmem driver though? Must admit I'm coming from a
>> GPU background where we'd expect each type of device to have it's own
>> driver anyway so wasn't expecting different types of memory devices to
>> be handled by the same driver.
>
> Now, dax/kmem.c is used for
>
> - PMEM (Optane DCPMM, or AEP)
> - CXL.mem
> - HBM (attached to CPU)

Thanks a lot for the background! I will admit to having a faily narrow
focus here.

>>> And, I don't think that we are forced to use the general notifier
>>> chain interface in all memory device drivers.  If the memory device
>>> driver has better understanding of the memory device, it can use other
>>> way to determine abstract distance.  For example, a CXL memory device
>>> driver can identify abstract distance by itself.  While other memory
>>> device drivers can use the general notifier chain interface at the
>>> same time.
>>
>> Whilst I think personally I would find that flexibility useful I am
>> concerned it means every driver will just end up divining it's own
>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>> would kind of defeat the purpose of it all then.
>
> But we have no way to enforce that too.

Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
we can influence it. If drivers can easily ignore the notifier chain and
do their own thing that's what will happen.

>>> While other memory device drivers can use the general notifier chain
>>> interface at the same time.

How would that work in practice though? The abstract distance as far as
I can tell doesn't have any meaning other than establishing preferences
for memory demotion order. Therefore all calculations are relative to
the rest of the calculations on the system. So if a driver does it's own
thing how does it choose a sensible distance? IHMO the value here is in
coordinating all that through a standard interface, whether that is HMAT
or something else.

 - Alistair

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-27  3:42           ` Alistair Popple
@ 2023-07-27  4:02             ` Huang, Ying
  2023-07-27  4:07               ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-07-27  4:02 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>>>> The other way (suggested by this series) is to make dax/kmem call a
>>>> notifier chain, then CXL CDAT or ACPI HMAT can identify the type of
>>>> device and calculate the distance if the type is correct for them.  I
>>>> don't think that it's good to make dax/kem to know every possible
>>>> types of memory devices.
>>>
>>> Do we expect there to be lots of different types of memory devices
>>> sharing a common dax/kmem driver though? Must admit I'm coming from a
>>> GPU background where we'd expect each type of device to have it's own
>>> driver anyway so wasn't expecting different types of memory devices to
>>> be handled by the same driver.
>>
>> Now, dax/kmem.c is used for
>>
>> - PMEM (Optane DCPMM, or AEP)
>> - CXL.mem
>> - HBM (attached to CPU)
>
> Thanks a lot for the background! I will admit to having a faily narrow
> focus here.
>
>>>> And, I don't think that we are forced to use the general notifier
>>>> chain interface in all memory device drivers.  If the memory device
>>>> driver has better understanding of the memory device, it can use other
>>>> way to determine abstract distance.  For example, a CXL memory device
>>>> driver can identify abstract distance by itself.  While other memory
>>>> device drivers can use the general notifier chain interface at the
>>>> same time.
>>>
>>> Whilst I think personally I would find that flexibility useful I am
>>> concerned it means every driver will just end up divining it's own
>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>> would kind of defeat the purpose of it all then.
>>
>> But we have no way to enforce that too.
>
> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
> we can influence it. If drivers can easily ignore the notifier chain and
> do their own thing that's what will happen.

IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
general interface we provided.  Anyway, we should try to make HMAT/CDAT
works well, so drivers want to use them :-)

>>>> While other memory device drivers can use the general notifier chain
>>>> interface at the same time.
>
> How would that work in practice though? The abstract distance as far as
> I can tell doesn't have any meaning other than establishing preferences
> for memory demotion order. Therefore all calculations are relative to
> the rest of the calculations on the system. So if a driver does it's own
> thing how does it choose a sensible distance? IHMO the value here is in
> coordinating all that through a standard interface, whether that is HMAT
> or something else.

Only if different algorithms follow the same basic principle.  For
example, the abstract distance of default DRAM nodes are fixed
(MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
in linear direct proportion to the memory latency and inversely
proportional to the memory bandwidth.  Use the memory latency and
bandwidth of default DRAM nodes as base.

HMAT and CDAT report the raw memory latency and bandwidth.  If there are
some other methods to report the raw memory latency and bandwidth, we
can use them too.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-27  4:02             ` Huang, Ying
@ 2023-07-27  4:07               ` Alistair Popple
  2023-07-27  5:41                 ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-07-27  4:07 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>>>> And, I don't think that we are forced to use the general notifier
>>>>> chain interface in all memory device drivers.  If the memory device
>>>>> driver has better understanding of the memory device, it can use other
>>>>> way to determine abstract distance.  For example, a CXL memory device
>>>>> driver can identify abstract distance by itself.  While other memory
>>>>> device drivers can use the general notifier chain interface at the
>>>>> same time.
>>>>
>>>> Whilst I think personally I would find that flexibility useful I am
>>>> concerned it means every driver will just end up divining it's own
>>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>>> would kind of defeat the purpose of it all then.
>>>
>>> But we have no way to enforce that too.
>>
>> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
>> we can influence it. If drivers can easily ignore the notifier chain and
>> do their own thing that's what will happen.
>
> IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
> general interface we provided.  Anyway, we should try to make HMAT/CDAT
> works well, so drivers want to use them :-)

Exactly :-)

>>>>> While other memory device drivers can use the general notifier chain
>>>>> interface at the same time.
>>
>> How would that work in practice though? The abstract distance as far as
>> I can tell doesn't have any meaning other than establishing preferences
>> for memory demotion order. Therefore all calculations are relative to
>> the rest of the calculations on the system. So if a driver does it's own
>> thing how does it choose a sensible distance? IHMO the value here is in
>> coordinating all that through a standard interface, whether that is HMAT
>> or something else.
>
> Only if different algorithms follow the same basic principle.  For
> example, the abstract distance of default DRAM nodes are fixed
> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
> in linear direct proportion to the memory latency and inversely
> proportional to the memory bandwidth.  Use the memory latency and
> bandwidth of default DRAM nodes as base.
>
> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
> some other methods to report the raw memory latency and bandwidth, we
> can use them too.

Argh! So we could address my concerns by having drivers feed
latency/bandwidth numbers into a standard calculation algorithm right?
Ie. Rather than having drivers calculate abstract distance themselves we
have the notifier chains return the raw performance data from which the
abstract distance is derived.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-27  4:07               ` Alistair Popple
@ 2023-07-27  5:41                 ` Huang, Ying
  2023-07-28  1:20                   ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-07-27  5:41 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>>>> And, I don't think that we are forced to use the general notifier
>>>>>> chain interface in all memory device drivers.  If the memory device
>>>>>> driver has better understanding of the memory device, it can use other
>>>>>> way to determine abstract distance.  For example, a CXL memory device
>>>>>> driver can identify abstract distance by itself.  While other memory
>>>>>> device drivers can use the general notifier chain interface at the
>>>>>> same time.
>>>>>
>>>>> Whilst I think personally I would find that flexibility useful I am
>>>>> concerned it means every driver will just end up divining it's own
>>>>> distance rather than ensuring data in HMAT/CDAT/etc. is correct. That
>>>>> would kind of defeat the purpose of it all then.
>>>>
>>>> But we have no way to enforce that too.
>>>
>>> Enforce that HMAT/CDAT/etc. is correct? Agree we can't enforce it, but
>>> we can influence it. If drivers can easily ignore the notifier chain and
>>> do their own thing that's what will happen.
>>
>> IMHO, both enforce HMAT/CDAT/etc is correct and enforce drivers to use
>> general interface we provided.  Anyway, we should try to make HMAT/CDAT
>> works well, so drivers want to use them :-)
>
> Exactly :-)
>
>>>>>> While other memory device drivers can use the general notifier chain
>>>>>> interface at the same time.
>>>
>>> How would that work in practice though? The abstract distance as far as
>>> I can tell doesn't have any meaning other than establishing preferences
>>> for memory demotion order. Therefore all calculations are relative to
>>> the rest of the calculations on the system. So if a driver does it's own
>>> thing how does it choose a sensible distance? IHMO the value here is in
>>> coordinating all that through a standard interface, whether that is HMAT
>>> or something else.
>>
>> Only if different algorithms follow the same basic principle.  For
>> example, the abstract distance of default DRAM nodes are fixed
>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>> in linear direct proportion to the memory latency and inversely
>> proportional to the memory bandwidth.  Use the memory latency and
>> bandwidth of default DRAM nodes as base.
>>
>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>> some other methods to report the raw memory latency and bandwidth, we
>> can use them too.
>
> Argh! So we could address my concerns by having drivers feed
> latency/bandwidth numbers into a standard calculation algorithm right?
> Ie. Rather than having drivers calculate abstract distance themselves we
> have the notifier chains return the raw performance data from which the
> abstract distance is derived.

Now, memory device drivers only need a general interface to get the
abstract distance from the NUMA node ID.  In the future, if they need
more interfaces, we can add them.  For example, the interface you
suggested above.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-27  5:41                 ` Huang, Ying
@ 2023-07-28  1:20                   ` Alistair Popple
  2023-08-11  3:51                     ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-07-28  1:20 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>> interface at the same time.
>>>>
>>>> How would that work in practice though? The abstract distance as far as
>>>> I can tell doesn't have any meaning other than establishing preferences
>>>> for memory demotion order. Therefore all calculations are relative to
>>>> the rest of the calculations on the system. So if a driver does it's own
>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>> coordinating all that through a standard interface, whether that is HMAT
>>>> or something else.
>>>
>>> Only if different algorithms follow the same basic principle.  For
>>> example, the abstract distance of default DRAM nodes are fixed
>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>> in linear direct proportion to the memory latency and inversely
>>> proportional to the memory bandwidth.  Use the memory latency and
>>> bandwidth of default DRAM nodes as base.
>>>
>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>> some other methods to report the raw memory latency and bandwidth, we
>>> can use them too.
>>
>> Argh! So we could address my concerns by having drivers feed
>> latency/bandwidth numbers into a standard calculation algorithm right?
>> Ie. Rather than having drivers calculate abstract distance themselves we
>> have the notifier chains return the raw performance data from which the
>> abstract distance is derived.
>
> Now, memory device drivers only need a general interface to get the
> abstract distance from the NUMA node ID.  In the future, if they need
> more interfaces, we can add them.  For example, the interface you
> suggested above.

Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
distance, it's a meaningless number. The only reason they care about it
is so they can pass it to alloc_memory_type():

struct memory_dev_type *alloc_memory_type(int adistance)

Instead alloc_memory_type() should be taking bandwidth/latency numbers
and the calculation of abstract distance should be done there. That
resovles the issues about how drivers are supposed to devine adistance
and also means that when CDAT is added we don't have to duplicate the
calculation code.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT
  2023-07-24 17:58   ` Andrew Morton
@ 2023-08-01  2:35     ` Bharata B Rao
  2023-08-11  6:26       ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Bharata B Rao @ 2023-08-01  2:35 UTC (permalink / raw)
  To: Andrew Morton, Alistair Popple
  Cc: Huang Ying, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki

On 24-Jul-23 11:28 PM, Andrew Morton wrote:
> On Fri, 21 Jul 2023 14:15:31 +1000 Alistair Popple <apopple@nvidia.com> wrote:
> 
>> Thanks for this Huang, I had been hoping to take a look at it this week
>> but have run out of time. I'm keen to do some testing with it as well.
> 
> Thanks.  I'll queue this in mm-unstable for some testing.  Detailed
> review and testing would be appreciated.

I gave this series a try on a 2P system with 2 CXL cards. I don't trust the
bandwidth and latency numbers reported by HMAT here, but FWIW, this patchset
puts the CXL nodes on a lower tier than DRAM nodes.

Regards,
Bharata.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators()
  2023-07-21  1:29 ` [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators() Huang Ying
  2023-07-25  2:44   ` Alistair Popple
@ 2023-08-07 16:55   ` Jonathan Cameron
  2023-08-11  1:13     ` Huang, Ying
  1 sibling, 1 reply; 41+ messages in thread
From: Jonathan Cameron @ 2023-08-07 16:55 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Michal Hocko, Yang Shi, Rafael J Wysocki

On Fri, 21 Jul 2023 09:29:30 +0800
Huang Ying <ying.huang@intel.com> wrote:

> Previously, in hmat_register_target_initiators(), the performance
> attributes are calculated and the corresponding sysfs links and files
> are created too.  Which is called during memory onlining.
> 
> But now, to calculate the abstract distance of a memory target before
> memory onlining, we need to calculate the performance attributes for
> a memory target without creating sysfs links and files.
> 
> To do that, hmat_register_target_initiators() is refactored to make it
> possible to calculate performance attributes separately.
> 
> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Davidlohr Bueso <dave@stgolabs.net>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>

Unfortunately I don't think I still have the tables I used to test the
generic initiator and won't get time to generate them all again in
next few weeks.  So just a superficial review for now.
I 'think' the cleanup looks good but the original code was rather fiddly
so I'm not 100% sure nothing is missed.

One comment inline on the fact the list is now sorted twice.


> ---
>  drivers/acpi/numa/hmat.c | 81 +++++++++++++++-------------------------
>  1 file changed, 30 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index bba268ecd802..2dee0098f1a9 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -582,28 +582,25 @@ static int initiators_to_nodemask(unsigned long *p_nodes)
>  	return 0;
>  }
>  
> -static void hmat_register_target_initiators(struct memory_target *target)
> +static void hmat_update_target_attrs(struct memory_target *target,
> +				     unsigned long *p_nodes, int access)
>  {
> -	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
>  	struct memory_initiator *initiator;
> -	unsigned int mem_nid, cpu_nid;
> +	unsigned int cpu_nid;
>  	struct memory_locality *loc = NULL;
>  	u32 best = 0;
> -	bool access0done = false;
>  	int i;
>  
> -	mem_nid = pxm_to_node(target->memory_pxm);
> +	bitmap_zero(p_nodes, MAX_NUMNODES);
>  	/*
> -	 * If the Address Range Structure provides a local processor pxm, link
> +	 * If the Address Range Structure provides a local processor pxm, set
>  	 * only that one. Otherwise, find the best performance attributes and
> -	 * register all initiators that match.
> +	 * collect all initiators that match.
>  	 */
>  	if (target->processor_pxm != PXM_INVAL) {
>  		cpu_nid = pxm_to_node(target->processor_pxm);
> -		register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
> -		access0done = true;
> -		if (node_state(cpu_nid, N_CPU)) {
> -			register_memory_node_under_compute_node(mem_nid, cpu_nid, 1);
> +		if (access == 0 || node_state(cpu_nid, N_CPU)) {
> +			set_bit(target->processor_pxm, p_nodes);
>  			return;
>  		}
>  	}
> @@ -617,47 +614,10 @@ static void hmat_register_target_initiators(struct memory_target *target)
>  	 * We'll also use the sorting to prime the candidate nodes with known
>  	 * initiators.
>  	 */
> -	bitmap_zero(p_nodes, MAX_NUMNODES);
>  	list_sort(NULL, &initiators, initiator_cmp);
>  	if (initiators_to_nodemask(p_nodes) < 0)
>  		return;

One result of this refactor is that a few things run twice, that previously only ran once
like this list_sort()
Not necessarily a problem though as probably fairly cheap.

>  
> -	if (!access0done) {
> -		for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
> -			loc = localities_types[i];
> -			if (!loc)
> -				continue;
> -
> -			best = 0;
> -			list_for_each_entry(initiator, &initiators, node) {
> -				u32 value;
> -
> -				if (!test_bit(initiator->processor_pxm, p_nodes))
> -					continue;
> -
> -				value = hmat_initiator_perf(target, initiator,
> -							    loc->hmat_loc);
> -				if (hmat_update_best(loc->hmat_loc->data_type, value, &best))
> -					bitmap_clear(p_nodes, 0, initiator->processor_pxm);
> -				if (value != best)
> -					clear_bit(initiator->processor_pxm, p_nodes);
> -			}
> -			if (best)
> -				hmat_update_target_access(target, loc->hmat_loc->data_type,
> -							  best, 0);
> -		}
> -
> -		for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
> -			cpu_nid = pxm_to_node(i);
> -			register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
> -		}
> -	}
> -
> -	/* Access 1 ignores Generic Initiators */
> -	bitmap_zero(p_nodes, MAX_NUMNODES);
> -	if (initiators_to_nodemask(p_nodes) < 0)
> -		return;
> -
>  	for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
>  		loc = localities_types[i];
>  		if (!loc)
> @@ -667,7 +627,7 @@ static void hmat_register_target_initiators(struct memory_target *target)
>  		list_for_each_entry(initiator, &initiators, node) {
>  			u32 value;
>  
> -			if (!initiator->has_cpu) {
> +			if (access == 1 && !initiator->has_cpu) {
>  				clear_bit(initiator->processor_pxm, p_nodes);
>  				continue;
>  			}
> @@ -681,14 +641,33 @@ static void hmat_register_target_initiators(struct memory_target *target)
>  				clear_bit(initiator->processor_pxm, p_nodes);
>  		}
>  		if (best)
> -			hmat_update_target_access(target, loc->hmat_loc->data_type, best, 1);
> +			hmat_update_target_access(target, loc->hmat_loc->data_type, best, access);
>  	}
> +}
> +
> +static void __hmat_register_target_initiators(struct memory_target *target,
> +					      unsigned long *p_nodes,
> +					      int access)
> +{
> +	unsigned int mem_nid, cpu_nid;
> +	int i;
> +
> +	mem_nid = pxm_to_node(target->memory_pxm);
> +	hmat_update_target_attrs(target, p_nodes, access);
>  	for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
>  		cpu_nid = pxm_to_node(i);
> -		register_memory_node_under_compute_node(mem_nid, cpu_nid, 1);
> +		register_memory_node_under_compute_node(mem_nid, cpu_nid, access);
>  	}
>  }
>  
> +static void hmat_register_target_initiators(struct memory_target *target)
> +{
> +	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
> +
> +	__hmat_register_target_initiators(target, p_nodes, 0);
> +	__hmat_register_target_initiators(target, p_nodes, 1);
> +}
> +
>  static void hmat_register_target_cache(struct memory_target *target)
>  {
>  	unsigned mem_nid = pxm_to_node(target->memory_pxm);


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators()
  2023-08-07 16:55   ` Jonathan Cameron
@ 2023-08-11  1:13     ` Huang, Ying
  0 siblings, 0 replies; 41+ messages in thread
From: Huang, Ying @ 2023-08-11  1:13 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Michal Hocko, Yang Shi, Rafael J Wysocki

Hi, Jonathan,

Thanks for review!

Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:

> On Fri, 21 Jul 2023 09:29:30 +0800
> Huang Ying <ying.huang@intel.com> wrote:
>
>> Previously, in hmat_register_target_initiators(), the performance
>> attributes are calculated and the corresponding sysfs links and files
>> are created too.  Which is called during memory onlining.
>> 
>> But now, to calculate the abstract distance of a memory target before
>> memory onlining, we need to calculate the performance attributes for
>> a memory target without creating sysfs links and files.
>> 
>> To do that, hmat_register_target_initiators() is refactored to make it
>> possible to calculate performance attributes separately.
>> 
>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> Cc: Wei Xu <weixugc@google.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Yang Shi <shy828301@gmail.com>
>> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
>
> Unfortunately I don't think I still have the tables I used to test the
> generic initiator and won't get time to generate them all again in
> next few weeks.  So just a superficial review for now.
> I 'think' the cleanup looks good but the original code was rather fiddly
> so I'm not 100% sure nothing is missed.
>
> One comment inline on the fact the list is now sorted twice.
>
>
>> ---
>>  drivers/acpi/numa/hmat.c | 81 +++++++++++++++-------------------------
>>  1 file changed, 30 insertions(+), 51 deletions(-)
>> 
>> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> index bba268ecd802..2dee0098f1a9 100644
>> --- a/drivers/acpi/numa/hmat.c
>> +++ b/drivers/acpi/numa/hmat.c
>> @@ -582,28 +582,25 @@ static int initiators_to_nodemask(unsigned long *p_nodes)
>>  	return 0;
>>  }
>>  
>> -static void hmat_register_target_initiators(struct memory_target *target)
>> +static void hmat_update_target_attrs(struct memory_target *target,
>> +				     unsigned long *p_nodes, int access)
>>  {
>> -	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
>>  	struct memory_initiator *initiator;
>> -	unsigned int mem_nid, cpu_nid;
>> +	unsigned int cpu_nid;
>>  	struct memory_locality *loc = NULL;
>>  	u32 best = 0;
>> -	bool access0done = false;
>>  	int i;
>>  
>> -	mem_nid = pxm_to_node(target->memory_pxm);
>> +	bitmap_zero(p_nodes, MAX_NUMNODES);
>>  	/*
>> -	 * If the Address Range Structure provides a local processor pxm, link
>> +	 * If the Address Range Structure provides a local processor pxm, set
>>  	 * only that one. Otherwise, find the best performance attributes and
>> -	 * register all initiators that match.
>> +	 * collect all initiators that match.
>>  	 */
>>  	if (target->processor_pxm != PXM_INVAL) {
>>  		cpu_nid = pxm_to_node(target->processor_pxm);
>> -		register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
>> -		access0done = true;
>> -		if (node_state(cpu_nid, N_CPU)) {
>> -			register_memory_node_under_compute_node(mem_nid, cpu_nid, 1);
>> +		if (access == 0 || node_state(cpu_nid, N_CPU)) {
>> +			set_bit(target->processor_pxm, p_nodes);
>>  			return;
>>  		}
>>  	}
>> @@ -617,47 +614,10 @@ static void hmat_register_target_initiators(struct memory_target *target)
>>  	 * We'll also use the sorting to prime the candidate nodes with known
>>  	 * initiators.
>>  	 */
>> -	bitmap_zero(p_nodes, MAX_NUMNODES);
>>  	list_sort(NULL, &initiators, initiator_cmp);
>>  	if (initiators_to_nodemask(p_nodes) < 0)
>>  		return;
>
> One result of this refactor is that a few things run twice, that previously only ran once
> like this list_sort()
> Not necessarily a problem though as probably fairly cheap.

Yes.  The original code sorts once for each target.  But it appears that
it's unnecessary too.  We can sort the initiators list when adding new
item to it in alloc_memory_initiator().  If necessary, I can add an
additional patch to do that.  But as you said, it may be unnecessary
because the sort should be fairly cheap.

--
Best Regards,
Huang, Ying

>>  
>> -	if (!access0done) {
>> -		for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
>> -			loc = localities_types[i];
>> -			if (!loc)
>> -				continue;
>> -
>> -			best = 0;
>> -			list_for_each_entry(initiator, &initiators, node) {
>> -				u32 value;
>> -
>> -				if (!test_bit(initiator->processor_pxm, p_nodes))
>> -					continue;
>> -
>> -				value = hmat_initiator_perf(target, initiator,
>> -							    loc->hmat_loc);
>> -				if (hmat_update_best(loc->hmat_loc->data_type, value, &best))
>> -					bitmap_clear(p_nodes, 0, initiator->processor_pxm);
>> -				if (value != best)
>> -					clear_bit(initiator->processor_pxm, p_nodes);
>> -			}
>> -			if (best)
>> -				hmat_update_target_access(target, loc->hmat_loc->data_type,
>> -							  best, 0);
>> -		}
>> -
>> -		for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
>> -			cpu_nid = pxm_to_node(i);
>> -			register_memory_node_under_compute_node(mem_nid, cpu_nid, 0);
>> -		}
>> -	}
>> -
>> -	/* Access 1 ignores Generic Initiators */
>> -	bitmap_zero(p_nodes, MAX_NUMNODES);
>> -	if (initiators_to_nodemask(p_nodes) < 0)
>> -		return;
>> -
>>  	for (i = WRITE_LATENCY; i <= READ_BANDWIDTH; i++) {
>>  		loc = localities_types[i];
>>  		if (!loc)
>> @@ -667,7 +627,7 @@ static void hmat_register_target_initiators(struct memory_target *target)
>>  		list_for_each_entry(initiator, &initiators, node) {
>>  			u32 value;
>>  
>> -			if (!initiator->has_cpu) {
>> +			if (access == 1 && !initiator->has_cpu) {
>>  				clear_bit(initiator->processor_pxm, p_nodes);
>>  				continue;
>>  			}
>> @@ -681,14 +641,33 @@ static void hmat_register_target_initiators(struct memory_target *target)
>>  				clear_bit(initiator->processor_pxm, p_nodes);
>>  		}
>>  		if (best)
>> -			hmat_update_target_access(target, loc->hmat_loc->data_type, best, 1);
>> +			hmat_update_target_access(target, loc->hmat_loc->data_type, best, access);
>>  	}
>> +}
>> +
>> +static void __hmat_register_target_initiators(struct memory_target *target,
>> +					      unsigned long *p_nodes,
>> +					      int access)
>> +{
>> +	unsigned int mem_nid, cpu_nid;
>> +	int i;
>> +
>> +	mem_nid = pxm_to_node(target->memory_pxm);
>> +	hmat_update_target_attrs(target, p_nodes, access);
>>  	for_each_set_bit(i, p_nodes, MAX_NUMNODES) {
>>  		cpu_nid = pxm_to_node(i);
>> -		register_memory_node_under_compute_node(mem_nid, cpu_nid, 1);
>> +		register_memory_node_under_compute_node(mem_nid, cpu_nid, access);
>>  	}
>>  }
>>  
>> +static void hmat_register_target_initiators(struct memory_target *target)
>> +{
>> +	static DECLARE_BITMAP(p_nodes, MAX_NUMNODES);
>> +
>> +	__hmat_register_target_initiators(target, p_nodes, 0);
>> +	__hmat_register_target_initiators(target, p_nodes, 1);
>> +}
>> +
>>  static void hmat_register_target_cache(struct memory_target *target)
>>  {
>>  	unsigned mem_nid = pxm_to_node(target->memory_pxm);

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-07-28  1:20                   ` Alistair Popple
@ 2023-08-11  3:51                     ` Huang, Ying
  2023-08-21 11:26                       ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-08-11  3:51 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang

Hi, Alistair,

Sorry for late response.  Just come back from vacation.

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>
>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>> interface at the same time.
>>>>>
>>>>> How would that work in practice though? The abstract distance as far as
>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>> or something else.
>>>>
>>>> Only if different algorithms follow the same basic principle.  For
>>>> example, the abstract distance of default DRAM nodes are fixed
>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>> in linear direct proportion to the memory latency and inversely
>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>> bandwidth of default DRAM nodes as base.
>>>>
>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>> some other methods to report the raw memory latency and bandwidth, we
>>>> can use them too.
>>>
>>> Argh! So we could address my concerns by having drivers feed
>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>> have the notifier chains return the raw performance data from which the
>>> abstract distance is derived.
>>
>> Now, memory device drivers only need a general interface to get the
>> abstract distance from the NUMA node ID.  In the future, if they need
>> more interfaces, we can add them.  For example, the interface you
>> suggested above.
>
> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
> distance, it's a meaningless number. The only reason they care about it
> is so they can pass it to alloc_memory_type():
>
> struct memory_dev_type *alloc_memory_type(int adistance)
>
> Instead alloc_memory_type() should be taking bandwidth/latency numbers
> and the calculation of abstract distance should be done there. That
> resovles the issues about how drivers are supposed to devine adistance
> and also means that when CDAT is added we don't have to duplicate the
> calculation code.

In the current design, the abstract distance is the key concept of
memory types and memory tiers.  And it is used as interface to allocate
memory types.  This provides more flexibility than some other interfaces
(e.g. read/write bandwidth/latency).  For example, in current
dax/kmem.c, if HMAT isn't available in the system, the default abstract
distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
to support some systems now.  On a system without HMAT/CDAT, it's
possible to calculate abstract distance from ACPI SLIT, although this is
quite limited.  I'm not sure whether all systems will provide read/write
bandwith/latency data for all memory devices.

HMAT and CDAT or some other mechanisms may provide the read/write
bandwidth/latency data to be used to calculate abstract distance.  For
them, we can provide a shared implementation in mm/memory-tiers.c to map
from read/write bandwith/latency to the abstract distance.  Can this
solve your concerns about the consistency among algorithms?  If so, we
can do that when we add the second algorithm that needs that.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT
  2023-08-01  2:35     ` Bharata B Rao
@ 2023-08-11  6:26       ` Huang, Ying
  2023-08-11  7:49         ` Bharata B Rao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-08-11  6:26 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: Andrew Morton, Alistair Popple, linux-mm, linux-kernel,
	linux-cxl, nvdimm, linux-acpi, Aneesh Kumar K . V, Wei Xu,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Jonathan Cameron, Michal Hocko, Yang Shi, Rafael J Wysocki

Hi, Rao,

Bharata B Rao <bharata@amd.com> writes:

> On 24-Jul-23 11:28 PM, Andrew Morton wrote:
>> On Fri, 21 Jul 2023 14:15:31 +1000 Alistair Popple <apopple@nvidia.com> wrote:
>> 
>>> Thanks for this Huang, I had been hoping to take a look at it this week
>>> but have run out of time. I'm keen to do some testing with it as well.
>> 
>> Thanks.  I'll queue this in mm-unstable for some testing.  Detailed
>> review and testing would be appreciated.
>
> I gave this series a try on a 2P system with 2 CXL cards. I don't trust the
> bandwidth and latency numbers reported by HMAT here, but FWIW, this patchset
> puts the CXL nodes on a lower tier than DRAM nodes.

Thank you very much!

Can I add your "Tested-by" for the series?

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT
  2023-08-11  6:26       ` Huang, Ying
@ 2023-08-11  7:49         ` Bharata B Rao
  0 siblings, 0 replies; 41+ messages in thread
From: Bharata B Rao @ 2023-08-11  7:49 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, Alistair Popple, linux-mm, linux-kernel,
	linux-cxl, nvdimm, linux-acpi, Aneesh Kumar K . V, Wei Xu,
	Dan Williams, Dave Hansen, Davidlohr Bueso, Johannes Weiner,
	Jonathan Cameron, Michal Hocko, Yang Shi, Rafael J Wysocki


On 11-Aug-23 11:56 AM, Huang, Ying wrote:
> Hi, Rao,
> 
> Bharata B Rao <bharata@amd.com> writes:
> 
>> On 24-Jul-23 11:28 PM, Andrew Morton wrote:
>>> On Fri, 21 Jul 2023 14:15:31 +1000 Alistair Popple <apopple@nvidia.com> wrote:
>>>
>>>> Thanks for this Huang, I had been hoping to take a look at it this week
>>>> but have run out of time. I'm keen to do some testing with it as well.
>>>
>>> Thanks.  I'll queue this in mm-unstable for some testing.  Detailed
>>> review and testing would be appreciated.
>>
>> I gave this series a try on a 2P system with 2 CXL cards. I don't trust the
>> bandwidth and latency numbers reported by HMAT here, but FWIW, this patchset
>> puts the CXL nodes on a lower tier than DRAM nodes.
> 
> Thank you very much!
> 
> Can I add your "Tested-by" for the series?

Yes if the above test qualifies for it, please go ahead.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-08-11  3:51                     ` Huang, Ying
@ 2023-08-21 11:26                       ` Alistair Popple
  2023-08-21 22:50                         ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-08-21 11:26 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang


"Huang, Ying" <ying.huang@intel.com> writes:

> Hi, Alistair,
>
> Sorry for late response.  Just come back from vacation.

Ditto for this response :-)

I see Andrew has taken this into mm-unstable though, so my bad for not
getting around to following all this up sooner.

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>
>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>> interface at the same time.
>>>>>>
>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>> or something else.
>>>>>
>>>>> Only if different algorithms follow the same basic principle.  For
>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>> in linear direct proportion to the memory latency and inversely
>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>> bandwidth of default DRAM nodes as base.
>>>>>
>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>> can use them too.
>>>>
>>>> Argh! So we could address my concerns by having drivers feed
>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>> have the notifier chains return the raw performance data from which the
>>>> abstract distance is derived.
>>>
>>> Now, memory device drivers only need a general interface to get the
>>> abstract distance from the NUMA node ID.  In the future, if they need
>>> more interfaces, we can add them.  For example, the interface you
>>> suggested above.
>>
>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>> distance, it's a meaningless number. The only reason they care about it
>> is so they can pass it to alloc_memory_type():
>>
>> struct memory_dev_type *alloc_memory_type(int adistance)
>>
>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>> and the calculation of abstract distance should be done there. That
>> resovles the issues about how drivers are supposed to devine adistance
>> and also means that when CDAT is added we don't have to duplicate the
>> calculation code.
>
> In the current design, the abstract distance is the key concept of
> memory types and memory tiers.  And it is used as interface to allocate
> memory types.  This provides more flexibility than some other interfaces
> (e.g. read/write bandwidth/latency).  For example, in current
> dax/kmem.c, if HMAT isn't available in the system, the default abstract
> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
> to support some systems now.  On a system without HMAT/CDAT, it's
> possible to calculate abstract distance from ACPI SLIT, although this is
> quite limited.  I'm not sure whether all systems will provide read/write
> bandwith/latency data for all memory devices.
>
> HMAT and CDAT or some other mechanisms may provide the read/write
> bandwidth/latency data to be used to calculate abstract distance.  For
> them, we can provide a shared implementation in mm/memory-tiers.c to map
> from read/write bandwith/latency to the abstract distance.  Can this
> solve your concerns about the consistency among algorithms?  If so, we
> can do that when we add the second algorithm that needs that.

I guess it would address my concerns if we did that now. I don't see why
we need to wait for a second implementation for that though - the whole
series seems to be built around adding a framework for supporting
multiple algorithms even though only one exists. So I think we should
support that fully, or simplfy the whole thing and just assume the only
thing that exists is HMAT and get rid of the general interface until a
second algorithm comes along.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT
  2023-07-25  6:47     ` Huang, Ying
@ 2023-08-21 11:53       ` Alistair Popple
  2023-08-21 23:28         ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-08-21 11:53 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> Huang Ying <ying.huang@intel.com> writes:
>>
>>> A memory tiering abstract distance calculation algorithm based on ACPI
>>> HMAT is implemented.  The basic idea is as follows.
>>>
>>> The performance attributes of system default DRAM nodes are recorded
>>> as the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.
>>> Then, the ratio of the abstract distance of a memory node (target) to
>>> MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
>>> attributes of the node to that of the default DRAM nodes.
>>
>> The problem I encountered here with the calculations is that HBM memory
>> ended up in a lower-tiered node which isn't what I wanted (at least when
>> that HBM is attached to a GPU say).
>
> I have tested the series on a server machine with HBM (pure HBM, not
> attached to a GPU).  Where, HBM is placed in a higher tier than DRAM.

Good to know.

>> I suspect this is because the calculations are based on the CPU
>> point-of-view (access1) which still sees lower bandwidth to remote HBM
>> than local DRAM, even though the remote GPU has higher bandwidth access
>> to that memory. Perhaps we need to be considering access0 as well?
>> Ie. HBM directly attached to a generic initiator should be in a higher
>> tier regardless of CPU access characteristics?
>
> What's your requirements for memory tiers on the machine?  I guess you
> want to put GPU attache HBM in a higher tier and put DRAM in a lower
> tier.  So, cold HBM pages can be demoted to DRAM when there are memory
> pressure on HBM?  This sounds reasonable from GPU point of view.

Yes, that is what I would like to implement.

> The above requirements may be satisfied via calculating abstract
> distance based on access0 (or combined with access1).  But I suspect
> this will be a general solution.  I guess that any memory devices that
> are used mainly by the memory initiators other than CPUs want to put
> themselves in a higher memory tier than DRAM, regardless of its
> access0.

Right. I'm still figuring out how ACPI HMAT fits together but that
sounds reasonable.

> One solution is to put GPU HBM in the highest memory tier (with smallest
> abstract distance) always in GPU device driver regardless its HMAT
> performance attributes.  Is it possible?

It's certainly possible and easy enough to do, although I think it would
be good to provide upper and lower bounds for HMAT derived adistances to
make that easier. It does make me wonder what the point of HMAT is if we
have to ignore it in some scenarios though. But perhaps I need to dig
deeper into the GPU values to figure out how it can be applied correctly
there.

>> That said I'm not entirely convinced the HMAT tables I'm testing against
>> are accurate/complete.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface
  2023-07-25  7:02     ` Huang, Ying
@ 2023-08-21 12:03       ` Alistair Popple
  2023-08-21 23:33         ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-08-21 12:03 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> Huang Ying <ying.huang@intel.com> writes:
>>
>>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>>> used for slow memory type in kmem driver.  This limits the usage of
>>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>>> memory).
>>>
>>> So, we use the general abstract distance calculation mechanism in kmem
>>> drivers to get more accurate abstract distance on systems with proper
>>> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>>> fallback only.
>>>
>>> Now, multiple memory types may be managed by kmem.  These memory types
>>> are put into the "kmem_memory_types" list and protected by
>>> kmem_memory_type_lock.
>>
>> See below but I wonder if kmem_memory_types could be a common helper
>> rather than kdax specific?
>>
>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> Cc: Wei Xu <weixugc@google.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Cc: Yang Shi <shy828301@gmail.com>
>>> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
>>> ---
>>>  drivers/dax/kmem.c           | 54 +++++++++++++++++++++++++++---------
>>>  include/linux/memory-tiers.h |  2 ++
>>>  mm/memory-tiers.c            |  2 +-
>>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>> index 898ca9505754..837165037231 100644
>>> --- a/drivers/dax/kmem.c
>>> +++ b/drivers/dax/kmem.c
>>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>>  	struct resource *res[];
>>>  };
>>>  
>>> -static struct memory_dev_type *dax_slowmem_type;
>>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>>> +static LIST_HEAD(kmem_memory_types);
>>> +
>>> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
>>> +{
>>> +	bool found = false;
>>> +	struct memory_dev_type *mtype;
>>> +
>>> +	mutex_lock(&kmem_memory_type_lock);
>>> +	list_for_each_entry(mtype, &kmem_memory_types, list) {
>>> +		if (mtype->adistance == adist) {
>>> +			found = true;
>>> +			break;
>>> +		}
>>> +	}
>>> +	if (!found) {
>>> +		mtype = alloc_memory_type(adist);
>>> +		if (!IS_ERR(mtype))
>>> +			list_add(&mtype->list, &kmem_memory_types);
>>> +	}
>>> +	mutex_unlock(&kmem_memory_type_lock);
>>> +
>>> +	return mtype;
>>> +}
>>> +
>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>  {
>>>  	struct device *dev = &dev_dax->dev;
>>>  	unsigned long total_len = 0;
>>>  	struct dax_kmem_data *data;
>>> +	struct memory_dev_type *mtype;
>>>  	int i, rc, mapped = 0;
>>>  	int numa_node;
>>> +	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>>  
>>>  	/*
>>>  	 * Ensure good NUMA information for the persistent memory.
>>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>  		return -EINVAL;
>>>  	}
>>>  
>>> +	mt_calc_adistance(numa_node, &adist);
>>> +	mtype = kmem_find_alloc_memorty_type(adist);
>>> +	if (IS_ERR(mtype))
>>> +		return PTR_ERR(mtype);
>>> +
>>
>> I wrote my own quick and dirty module to test this and wrote basically
>> the same code sequence.
>>
>> I notice your using a list of memory types here though. I think it would
>> be nice to have a common helper that other users could call to do the
>> mt_calc_adistance() / kmem_find_alloc_memory_type() /
>> init_node_memory_type() sequence and cleanup as my naive approach would
>> result in a new memory_dev_type per device even though adist might be
>> the same. A common helper would make it easy to de-dup those.
>
> If it's useful, we can move kmem_find_alloc_memory_type() to
> memory-tier.c after some revision.  But I tend to move it after we have
> the second user.  What do you think about that?

Usually I would agree, but this series already introduces a general
interface for calculating adist even though there's only one user and
implementation. So if we're going to add a general interface I think it
would be better to make it more usable now rather than after variations
of it have been cut and pasted into other drivers.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-08-21 11:26                       ` Alistair Popple
@ 2023-08-21 22:50                         ` Huang, Ying
  2023-08-21 23:52                           ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-08-21 22:50 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Hi, Alistair,
>>
>> Sorry for late response.  Just come back from vacation.
>
> Ditto for this response :-)
>
> I see Andrew has taken this into mm-unstable though, so my bad for not
> getting around to following all this up sooner.
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>
>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>
>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>
>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>> interface at the same time.
>>>>>>>
>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>> or something else.
>>>>>>
>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>
>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>> can use them too.
>>>>>
>>>>> Argh! So we could address my concerns by having drivers feed
>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>> have the notifier chains return the raw performance data from which the
>>>>> abstract distance is derived.
>>>>
>>>> Now, memory device drivers only need a general interface to get the
>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>> more interfaces, we can add them.  For example, the interface you
>>>> suggested above.
>>>
>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>> distance, it's a meaningless number. The only reason they care about it
>>> is so they can pass it to alloc_memory_type():
>>>
>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>
>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>> and the calculation of abstract distance should be done there. That
>>> resovles the issues about how drivers are supposed to devine adistance
>>> and also means that when CDAT is added we don't have to duplicate the
>>> calculation code.
>>
>> In the current design, the abstract distance is the key concept of
>> memory types and memory tiers.  And it is used as interface to allocate
>> memory types.  This provides more flexibility than some other interfaces
>> (e.g. read/write bandwidth/latency).  For example, in current
>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
>> to support some systems now.  On a system without HMAT/CDAT, it's
>> possible to calculate abstract distance from ACPI SLIT, although this is
>> quite limited.  I'm not sure whether all systems will provide read/write
>> bandwith/latency data for all memory devices.
>>
>> HMAT and CDAT or some other mechanisms may provide the read/write
>> bandwidth/latency data to be used to calculate abstract distance.  For
>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>> from read/write bandwith/latency to the abstract distance.  Can this
>> solve your concerns about the consistency among algorithms?  If so, we
>> can do that when we add the second algorithm that needs that.
>
> I guess it would address my concerns if we did that now. I don't see why
> we need to wait for a second implementation for that though - the whole
> series seems to be built around adding a framework for supporting
> multiple algorithms even though only one exists. So I think we should
> support that fully, or simplfy the whole thing and just assume the only
> thing that exists is HMAT and get rid of the general interface until a
> second algorithm comes along.

We will need a general interface even for one algorithm implementation.
Because it's not good to make a dax subsystem driver (dax/kmem) to
depend on a ACPI subsystem driver (acpi/hmat).  We need some general
interface at subsystem level (memory tier here) between them.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT
  2023-08-21 11:53       ` Alistair Popple
@ 2023-08-21 23:28         ` Huang, Ying
  0 siblings, 0 replies; 41+ messages in thread
From: Huang, Ying @ 2023-08-21 23:28 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> Huang Ying <ying.huang@intel.com> writes:
>>>
>>>> A memory tiering abstract distance calculation algorithm based on ACPI
>>>> HMAT is implemented.  The basic idea is as follows.
>>>>
>>>> The performance attributes of system default DRAM nodes are recorded
>>>> as the base line.  Whose abstract distance is MEMTIER_ADISTANCE_DRAM.
>>>> Then, the ratio of the abstract distance of a memory node (target) to
>>>> MEMTIER_ADISTANCE_DRAM is scaled based on the ratio of the performance
>>>> attributes of the node to that of the default DRAM nodes.
>>>
>>> The problem I encountered here with the calculations is that HBM memory
>>> ended up in a lower-tiered node which isn't what I wanted (at least when
>>> that HBM is attached to a GPU say).
>>
>> I have tested the series on a server machine with HBM (pure HBM, not
>> attached to a GPU).  Where, HBM is placed in a higher tier than DRAM.
>
> Good to know.
>
>>> I suspect this is because the calculations are based on the CPU
>>> point-of-view (access1) which still sees lower bandwidth to remote HBM
>>> than local DRAM, even though the remote GPU has higher bandwidth access
>>> to that memory. Perhaps we need to be considering access0 as well?
>>> Ie. HBM directly attached to a generic initiator should be in a higher
>>> tier regardless of CPU access characteristics?
>>
>> What's your requirements for memory tiers on the machine?  I guess you
>> want to put GPU attache HBM in a higher tier and put DRAM in a lower
>> tier.  So, cold HBM pages can be demoted to DRAM when there are memory
>> pressure on HBM?  This sounds reasonable from GPU point of view.
>
> Yes, that is what I would like to implement.
>
>> The above requirements may be satisfied via calculating abstract
>> distance based on access0 (or combined with access1).  But I suspect
>> this will be a general solution.  I guess that any memory devices that
>> are used mainly by the memory initiators other than CPUs want to put
>> themselves in a higher memory tier than DRAM, regardless of its
>> access0.
>
> Right. I'm still figuring out how ACPI HMAT fits together but that
> sounds reasonable.
>
>> One solution is to put GPU HBM in the highest memory tier (with smallest
>> abstract distance) always in GPU device driver regardless its HMAT
>> performance attributes.  Is it possible?
>
> It's certainly possible and easy enough to do, although I think it would
> be good to provide upper and lower bounds for HMAT derived adistances to
> make that easier. It does make me wonder what the point of HMAT is if we
> have to ignore it in some scenarios though. But perhaps I need to dig
> deeper into the GPU values to figure out how it can be applied correctly
> there.

In the original design (page 11 of [1]),

[1] https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

the default memory tier hierarchy is based on the performance from CPU
point of view.  Then the abstract distance of a memory type (e.g., GPU
HBM) can be adjusted via a sysfs knob
(<memory_type>/abstract_distance_offset) based on the requirements of
GPU.

That's another possible solution.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface
  2023-08-21 12:03       ` Alistair Popple
@ 2023-08-21 23:33         ` Huang, Ying
  2023-08-22  7:36           ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-08-21 23:33 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> Huang Ying <ying.huang@intel.com> writes:
>>>
>>>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>>>> used for slow memory type in kmem driver.  This limits the usage of
>>>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>>>> memory).
>>>>
>>>> So, we use the general abstract distance calculation mechanism in kmem
>>>> drivers to get more accurate abstract distance on systems with proper
>>>> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>>>> fallback only.
>>>>
>>>> Now, multiple memory types may be managed by kmem.  These memory types
>>>> are put into the "kmem_memory_types" list and protected by
>>>> kmem_memory_type_lock.
>>>
>>> See below but I wonder if kmem_memory_types could be a common helper
>>> rather than kdax specific?
>>>
>>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>> Cc: Wei Xu <weixugc@google.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>> Cc: Yang Shi <shy828301@gmail.com>
>>>> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
>>>> ---
>>>>  drivers/dax/kmem.c           | 54 +++++++++++++++++++++++++++---------
>>>>  include/linux/memory-tiers.h |  2 ++
>>>>  mm/memory-tiers.c            |  2 +-
>>>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>> index 898ca9505754..837165037231 100644
>>>> --- a/drivers/dax/kmem.c
>>>> +++ b/drivers/dax/kmem.c
>>>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>>>  	struct resource *res[];
>>>>  };
>>>>  
>>>> -static struct memory_dev_type *dax_slowmem_type;
>>>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>>>> +static LIST_HEAD(kmem_memory_types);
>>>> +
>>>> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
>>>> +{
>>>> +	bool found = false;
>>>> +	struct memory_dev_type *mtype;
>>>> +
>>>> +	mutex_lock(&kmem_memory_type_lock);
>>>> +	list_for_each_entry(mtype, &kmem_memory_types, list) {
>>>> +		if (mtype->adistance == adist) {
>>>> +			found = true;
>>>> +			break;
>>>> +		}
>>>> +	}
>>>> +	if (!found) {
>>>> +		mtype = alloc_memory_type(adist);
>>>> +		if (!IS_ERR(mtype))
>>>> +			list_add(&mtype->list, &kmem_memory_types);
>>>> +	}
>>>> +	mutex_unlock(&kmem_memory_type_lock);
>>>> +
>>>> +	return mtype;
>>>> +}
>>>> +
>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>  {
>>>>  	struct device *dev = &dev_dax->dev;
>>>>  	unsigned long total_len = 0;
>>>>  	struct dax_kmem_data *data;
>>>> +	struct memory_dev_type *mtype;
>>>>  	int i, rc, mapped = 0;
>>>>  	int numa_node;
>>>> +	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>>>  
>>>>  	/*
>>>>  	 * Ensure good NUMA information for the persistent memory.
>>>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>  		return -EINVAL;
>>>>  	}
>>>>  
>>>> +	mt_calc_adistance(numa_node, &adist);
>>>> +	mtype = kmem_find_alloc_memorty_type(adist);
>>>> +	if (IS_ERR(mtype))
>>>> +		return PTR_ERR(mtype);
>>>> +
>>>
>>> I wrote my own quick and dirty module to test this and wrote basically
>>> the same code sequence.
>>>
>>> I notice your using a list of memory types here though. I think it would
>>> be nice to have a common helper that other users could call to do the
>>> mt_calc_adistance() / kmem_find_alloc_memory_type() /
>>> init_node_memory_type() sequence and cleanup as my naive approach would
>>> result in a new memory_dev_type per device even though adist might be
>>> the same. A common helper would make it easy to de-dup those.
>>
>> If it's useful, we can move kmem_find_alloc_memory_type() to
>> memory-tier.c after some revision.  But I tend to move it after we have
>> the second user.  What do you think about that?
>
> Usually I would agree, but this series already introduces a general
> interface for calculating adist even though there's only one user and
> implementation. So if we're going to add a general interface I think it
> would be better to make it more usable now rather than after variations
> of it have been cut and pasted into other drivers.

In general, I would like to introduce complexity when necessary.  So, we
can discuss the necessity of the general interface firstly.  We can do
that in [1/4] of the series.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-08-21 22:50                         ` Huang, Ying
@ 2023-08-21 23:52                           ` Alistair Popple
  2023-08-22  0:58                             ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-08-21 23:52 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Hi, Alistair,
>>>
>>> Sorry for late response.  Just come back from vacation.
>>
>> Ditto for this response :-)
>>
>> I see Andrew has taken this into mm-unstable though, so my bad for not
>> getting around to following all this up sooner.
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>
>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>> interface at the same time.
>>>>>>>>
>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>> or something else.
>>>>>>>
>>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>
>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>> can use them too.
>>>>>>
>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>> have the notifier chains return the raw performance data from which the
>>>>>> abstract distance is derived.
>>>>>
>>>>> Now, memory device drivers only need a general interface to get the
>>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>>> more interfaces, we can add them.  For example, the interface you
>>>>> suggested above.
>>>>
>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>> distance, it's a meaningless number. The only reason they care about it
>>>> is so they can pass it to alloc_memory_type():
>>>>
>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>
>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>> and the calculation of abstract distance should be done there. That
>>>> resovles the issues about how drivers are supposed to devine adistance
>>>> and also means that when CDAT is added we don't have to duplicate the
>>>> calculation code.
>>>
>>> In the current design, the abstract distance is the key concept of
>>> memory types and memory tiers.  And it is used as interface to allocate
>>> memory types.  This provides more flexibility than some other interfaces
>>> (e.g. read/write bandwidth/latency).  For example, in current
>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
>>> to support some systems now.  On a system without HMAT/CDAT, it's
>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>> quite limited.  I'm not sure whether all systems will provide read/write
>>> bandwith/latency data for all memory devices.
>>>
>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>> bandwidth/latency data to be used to calculate abstract distance.  For
>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>> from read/write bandwith/latency to the abstract distance.  Can this
>>> solve your concerns about the consistency among algorithms?  If so, we
>>> can do that when we add the second algorithm that needs that.
>>
>> I guess it would address my concerns if we did that now. I don't see why
>> we need to wait for a second implementation for that though - the whole
>> series seems to be built around adding a framework for supporting
>> multiple algorithms even though only one exists. So I think we should
>> support that fully, or simplfy the whole thing and just assume the only
>> thing that exists is HMAT and get rid of the general interface until a
>> second algorithm comes along.
>
> We will need a general interface even for one algorithm implementation.
> Because it's not good to make a dax subsystem driver (dax/kmem) to
> depend on a ACPI subsystem driver (acpi/hmat).  We need some general
> interface at subsystem level (memory tier here) between them.

I don't understand this argument. For a single algorithm it would be
simpler to just define acpi_hmat_calculate_adistance() and a static
inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
a layer of indirection through notifier blocks. That breaks any
dependency on ACPI and there's plenty of precedent for this approach in
the kernel already.

Thanks,
Alistar.

> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-08-21 23:52                           ` Alistair Popple
@ 2023-08-22  0:58                             ` Huang, Ying
  2023-08-22  7:11                               ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-08-22  0:58 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Hi, Alistair,
>>>>
>>>> Sorry for late response.  Just come back from vacation.
>>>
>>> Ditto for this response :-)
>>>
>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>> getting around to following all this up sooner.
>>>
>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>
>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>
>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>
>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>
>>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>>
>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>> interface at the same time.
>>>>>>>>>
>>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>> or something else.
>>>>>>>>
>>>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>>
>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>> can use them too.
>>>>>>>
>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>> abstract distance is derived.
>>>>>>
>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>>>> more interfaces, we can add them.  For example, the interface you
>>>>>> suggested above.
>>>>>
>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>> is so they can pass it to alloc_memory_type():
>>>>>
>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>
>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>> and the calculation of abstract distance should be done there. That
>>>>> resovles the issues about how drivers are supposed to devine adistance
>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>> calculation code.
>>>>
>>>> In the current design, the abstract distance is the key concept of
>>>> memory types and memory tiers.  And it is used as interface to allocate
>>>> memory types.  This provides more flexibility than some other interfaces
>>>> (e.g. read/write bandwidth/latency).  For example, in current
>>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
>>>> to support some systems now.  On a system without HMAT/CDAT, it's
>>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>>> quite limited.  I'm not sure whether all systems will provide read/write
>>>> bandwith/latency data for all memory devices.
>>>>
>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>> bandwidth/latency data to be used to calculate abstract distance.  For
>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>> from read/write bandwith/latency to the abstract distance.  Can this
>>>> solve your concerns about the consistency among algorithms?  If so, we
>>>> can do that when we add the second algorithm that needs that.
>>>
>>> I guess it would address my concerns if we did that now. I don't see why
>>> we need to wait for a second implementation for that though - the whole
>>> series seems to be built around adding a framework for supporting
>>> multiple algorithms even though only one exists. So I think we should
>>> support that fully, or simplfy the whole thing and just assume the only
>>> thing that exists is HMAT and get rid of the general interface until a
>>> second algorithm comes along.
>>
>> We will need a general interface even for one algorithm implementation.
>> Because it's not good to make a dax subsystem driver (dax/kmem) to
>> depend on a ACPI subsystem driver (acpi/hmat).  We need some general
>> interface at subsystem level (memory tier here) between them.
>
> I don't understand this argument. For a single algorithm it would be
> simpler to just define acpi_hmat_calculate_adistance() and a static
> inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
> a layer of indirection through notifier blocks. That breaks any
> dependency on ACPI and there's plenty of precedent for this approach in
> the kernel already.

ACPI is a subsystem, so it's OK for dax/kmem to depends on CONFIG_ACPI.
But HMAT is a driver of ACPI subsystem (controlled via
CONFIG_ACPI_HMAT).  It's not good for a driver of DAX subsystem
(dax/kmem) to depend on a *driver* of ACPI subsystem.

Yes.  Technically, there's no hard wall to prevent this.  But I think
that a good design should make drivers depends on subsystems or drivers
of the same subsystem, NOT drivers of other subsystems.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-08-22  0:58                             ` Huang, Ying
@ 2023-08-22  7:11                               ` Alistair Popple
  2023-08-23  5:56                                 ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-08-22  7:11 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> Hi, Alistair,
>>>>>
>>>>> Sorry for late response.  Just come back from vacation.
>>>>
>>>> Ditto for this response :-)
>>>>
>>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>>> getting around to following all this up sooner.
>>>>
>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>
>>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>>
>>>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>>>
>>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>>> interface at the same time.
>>>>>>>>>>
>>>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>>> or something else.
>>>>>>>>>
>>>>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>>>
>>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>>> can use them too.
>>>>>>>>
>>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>>> abstract distance is derived.
>>>>>>>
>>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>>>>> more interfaces, we can add them.  For example, the interface you
>>>>>>> suggested above.
>>>>>>
>>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>>> is so they can pass it to alloc_memory_type():
>>>>>>
>>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>>
>>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>>> and the calculation of abstract distance should be done there. That
>>>>>> resovles the issues about how drivers are supposed to devine adistance
>>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>>> calculation code.
>>>>>
>>>>> In the current design, the abstract distance is the key concept of
>>>>> memory types and memory tiers.  And it is used as interface to allocate
>>>>> memory types.  This provides more flexibility than some other interfaces
>>>>> (e.g. read/write bandwidth/latency).  For example, in current
>>>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>>>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
>>>>> to support some systems now.  On a system without HMAT/CDAT, it's
>>>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>>>> quite limited.  I'm not sure whether all systems will provide read/write
>>>>> bandwith/latency data for all memory devices.
>>>>>
>>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>>> bandwidth/latency data to be used to calculate abstract distance.  For
>>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>>> from read/write bandwith/latency to the abstract distance.  Can this
>>>>> solve your concerns about the consistency among algorithms?  If so, we
>>>>> can do that when we add the second algorithm that needs that.
>>>>
>>>> I guess it would address my concerns if we did that now. I don't see why
>>>> we need to wait for a second implementation for that though - the whole
>>>> series seems to be built around adding a framework for supporting
>>>> multiple algorithms even though only one exists. So I think we should
>>>> support that fully, or simplfy the whole thing and just assume the only
>>>> thing that exists is HMAT and get rid of the general interface until a
>>>> second algorithm comes along.
>>>
>>> We will need a general interface even for one algorithm implementation.
>>> Because it's not good to make a dax subsystem driver (dax/kmem) to
>>> depend on a ACPI subsystem driver (acpi/hmat).  We need some general
>>> interface at subsystem level (memory tier here) between them.
>>
>> I don't understand this argument. For a single algorithm it would be
>> simpler to just define acpi_hmat_calculate_adistance() and a static
>> inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
>> a layer of indirection through notifier blocks. That breaks any
>> dependency on ACPI and there's plenty of precedent for this approach in
>> the kernel already.
>
> ACPI is a subsystem, so it's OK for dax/kmem to depends on CONFIG_ACPI.
> But HMAT is a driver of ACPI subsystem (controlled via
> CONFIG_ACPI_HMAT).  It's not good for a driver of DAX subsystem
> (dax/kmem) to depend on a *driver* of ACPI subsystem.
>
> Yes.  Technically, there's no hard wall to prevent this.  But I think
> that a good design should make drivers depends on subsystems or drivers
> of the same subsystem, NOT drivers of other subsystems.

Thanks, I wasn't really thinking of HMAT as an ACPI driver. I understand
where you're coming from but I really don't see the problem with using a
static inline. It doesn't create dependencies (you could still use
dax/kmem without ACPI) and results in smaller and easier to follow code.

IMHO it's far more obvious that a call to acpi_hmat_calcaulte_adist()
returns either a default if ACPI HMAT isn't configured or a calculated
value than it is to figure out what notifiers may or may not be
registered at runtime and what priority they may be called in from
mt_calc_adistance().

It appears you think that is a bad design, but I don't understand
why. What does this approach give us that a simpler approach wouldn't?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface
  2023-08-21 23:33         ` Huang, Ying
@ 2023-08-22  7:36           ` Alistair Popple
  2023-08-23  2:13             ` Huang, Ying
  0 siblings, 1 reply; 41+ messages in thread
From: Alistair Popple @ 2023-08-22  7:36 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>> Huang Ying <ying.huang@intel.com> writes:
>>>>
>>>>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>>>>> used for slow memory type in kmem driver.  This limits the usage of
>>>>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>>>>> memory).
>>>>>
>>>>> So, we use the general abstract distance calculation mechanism in kmem
>>>>> drivers to get more accurate abstract distance on systems with proper
>>>>> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>>>>> fallback only.
>>>>>
>>>>> Now, multiple memory types may be managed by kmem.  These memory types
>>>>> are put into the "kmem_memory_types" list and protected by
>>>>> kmem_memory_type_lock.
>>>>
>>>> See below but I wonder if kmem_memory_types could be a common helper
>>>> rather than kdax specific?
>>>>
>>>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> Cc: Wei Xu <weixugc@google.com>
>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>>>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>>> Cc: Yang Shi <shy828301@gmail.com>
>>>>> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
>>>>> ---
>>>>>  drivers/dax/kmem.c           | 54 +++++++++++++++++++++++++++---------
>>>>>  include/linux/memory-tiers.h |  2 ++
>>>>>  mm/memory-tiers.c            |  2 +-
>>>>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>> index 898ca9505754..837165037231 100644
>>>>> --- a/drivers/dax/kmem.c
>>>>> +++ b/drivers/dax/kmem.c
>>>>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>>>>  	struct resource *res[];
>>>>>  };
>>>>>  
>>>>> -static struct memory_dev_type *dax_slowmem_type;
>>>>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>>>>> +static LIST_HEAD(kmem_memory_types);
>>>>> +
>>>>> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
>>>>> +{
>>>>> +	bool found = false;
>>>>> +	struct memory_dev_type *mtype;
>>>>> +
>>>>> +	mutex_lock(&kmem_memory_type_lock);
>>>>> +	list_for_each_entry(mtype, &kmem_memory_types, list) {
>>>>> +		if (mtype->adistance == adist) {
>>>>> +			found = true;
>>>>> +			break;
>>>>> +		}
>>>>> +	}
>>>>> +	if (!found) {
>>>>> +		mtype = alloc_memory_type(adist);
>>>>> +		if (!IS_ERR(mtype))
>>>>> +			list_add(&mtype->list, &kmem_memory_types);
>>>>> +	}
>>>>> +	mutex_unlock(&kmem_memory_type_lock);
>>>>> +
>>>>> +	return mtype;
>>>>> +}
>>>>> +
>>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>  {
>>>>>  	struct device *dev = &dev_dax->dev;
>>>>>  	unsigned long total_len = 0;
>>>>>  	struct dax_kmem_data *data;
>>>>> +	struct memory_dev_type *mtype;
>>>>>  	int i, rc, mapped = 0;
>>>>>  	int numa_node;
>>>>> +	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>>>>  
>>>>>  	/*
>>>>>  	 * Ensure good NUMA information for the persistent memory.
>>>>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>  		return -EINVAL;
>>>>>  	}
>>>>>  
>>>>> +	mt_calc_adistance(numa_node, &adist);
>>>>> +	mtype = kmem_find_alloc_memorty_type(adist);
>>>>> +	if (IS_ERR(mtype))
>>>>> +		return PTR_ERR(mtype);
>>>>> +
>>>>
>>>> I wrote my own quick and dirty module to test this and wrote basically
>>>> the same code sequence.
>>>>
>>>> I notice your using a list of memory types here though. I think it would
>>>> be nice to have a common helper that other users could call to do the
>>>> mt_calc_adistance() / kmem_find_alloc_memory_type() /
>>>> init_node_memory_type() sequence and cleanup as my naive approach would
>>>> result in a new memory_dev_type per device even though adist might be
>>>> the same. A common helper would make it easy to de-dup those.
>>>
>>> If it's useful, we can move kmem_find_alloc_memory_type() to
>>> memory-tier.c after some revision.  But I tend to move it after we have
>>> the second user.  What do you think about that?
>>
>> Usually I would agree, but this series already introduces a general
>> interface for calculating adist even though there's only one user and
>> implementation. So if we're going to add a general interface I think it
>> would be better to make it more usable now rather than after variations
>> of it have been cut and pasted into other drivers.
>
> In general, I would like to introduce complexity when necessary.  So, we
> can discuss the necessity of the general interface firstly.  We can do
> that in [1/4] of the series.

Do we need one memory_dev_type per adistance or per adistance+device?

If IUC correctly I think it's the former. Logically that means
memory_dev_types should be managed by the memory-tiering subsystem
because they are system wide rather than driver specific resources. That
we need to add the list field to struct memory_dev_type specifically for
use by dax/kmem supports that idea.

Also I'm not sure why you consider moving the
kmem_memory_types/kmem_find_alloc_memory_type()/etc. functions into
mm/memory-tiers.c to add complexity. Isn't it just moving code around or
am I missing some other subtlety that makes this hard? I really think
logically memory-tiering.c is where management of the various
memory_dev_types belongs.

Thanks.
Alistair

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface
  2023-08-22  7:36           ` Alistair Popple
@ 2023-08-23  2:13             ` Huang, Ying
  2023-08-25  6:00               ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-08-23  2:13 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>
>>>>> Huang Ying <ying.huang@intel.com> writes:
>>>>>
>>>>>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>>>>>> used for slow memory type in kmem driver.  This limits the usage of
>>>>>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>>>>>> memory).
>>>>>>
>>>>>> So, we use the general abstract distance calculation mechanism in kmem
>>>>>> drivers to get more accurate abstract distance on systems with proper
>>>>>> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>>>>>> fallback only.
>>>>>>
>>>>>> Now, multiple memory types may be managed by kmem.  These memory types
>>>>>> are put into the "kmem_memory_types" list and protected by
>>>>>> kmem_memory_type_lock.
>>>>>
>>>>> See below but I wonder if kmem_memory_types could be a common helper
>>>>> rather than kdax specific?
>>>>>
>>>>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>>>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>> Cc: Wei Xu <weixugc@google.com>
>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>>>>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>>>> Cc: Yang Shi <shy828301@gmail.com>
>>>>>> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
>>>>>> ---
>>>>>>  drivers/dax/kmem.c           | 54 +++++++++++++++++++++++++++---------
>>>>>>  include/linux/memory-tiers.h |  2 ++
>>>>>>  mm/memory-tiers.c            |  2 +-
>>>>>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>>> index 898ca9505754..837165037231 100644
>>>>>> --- a/drivers/dax/kmem.c
>>>>>> +++ b/drivers/dax/kmem.c
>>>>>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>>>>>  	struct resource *res[];
>>>>>>  };
>>>>>>  
>>>>>> -static struct memory_dev_type *dax_slowmem_type;
>>>>>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>>>>>> +static LIST_HEAD(kmem_memory_types);
>>>>>> +
>>>>>> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
>>>>>> +{
>>>>>> +	bool found = false;
>>>>>> +	struct memory_dev_type *mtype;
>>>>>> +
>>>>>> +	mutex_lock(&kmem_memory_type_lock);
>>>>>> +	list_for_each_entry(mtype, &kmem_memory_types, list) {
>>>>>> +		if (mtype->adistance == adist) {
>>>>>> +			found = true;
>>>>>> +			break;
>>>>>> +		}
>>>>>> +	}
>>>>>> +	if (!found) {
>>>>>> +		mtype = alloc_memory_type(adist);
>>>>>> +		if (!IS_ERR(mtype))
>>>>>> +			list_add(&mtype->list, &kmem_memory_types);
>>>>>> +	}
>>>>>> +	mutex_unlock(&kmem_memory_type_lock);
>>>>>> +
>>>>>> +	return mtype;
>>>>>> +}
>>>>>> +
>>>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>  {
>>>>>>  	struct device *dev = &dev_dax->dev;
>>>>>>  	unsigned long total_len = 0;
>>>>>>  	struct dax_kmem_data *data;
>>>>>> +	struct memory_dev_type *mtype;
>>>>>>  	int i, rc, mapped = 0;
>>>>>>  	int numa_node;
>>>>>> +	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>>>>>  
>>>>>>  	/*
>>>>>>  	 * Ensure good NUMA information for the persistent memory.
>>>>>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>  		return -EINVAL;
>>>>>>  	}
>>>>>>  
>>>>>> +	mt_calc_adistance(numa_node, &adist);
>>>>>> +	mtype = kmem_find_alloc_memorty_type(adist);
>>>>>> +	if (IS_ERR(mtype))
>>>>>> +		return PTR_ERR(mtype);
>>>>>> +
>>>>>
>>>>> I wrote my own quick and dirty module to test this and wrote basically
>>>>> the same code sequence.
>>>>>
>>>>> I notice your using a list of memory types here though. I think it would
>>>>> be nice to have a common helper that other users could call to do the
>>>>> mt_calc_adistance() / kmem_find_alloc_memory_type() /
>>>>> init_node_memory_type() sequence and cleanup as my naive approach would
>>>>> result in a new memory_dev_type per device even though adist might be
>>>>> the same. A common helper would make it easy to de-dup those.
>>>>
>>>> If it's useful, we can move kmem_find_alloc_memory_type() to
>>>> memory-tier.c after some revision.  But I tend to move it after we have
>>>> the second user.  What do you think about that?
>>>
>>> Usually I would agree, but this series already introduces a general
>>> interface for calculating adist even though there's only one user and
>>> implementation. So if we're going to add a general interface I think it
>>> would be better to make it more usable now rather than after variations
>>> of it have been cut and pasted into other drivers.
>>
>> In general, I would like to introduce complexity when necessary.  So, we
>> can discuss the necessity of the general interface firstly.  We can do
>> that in [1/4] of the series.
>
> Do we need one memory_dev_type per adistance or per adistance+device?
>
> If IUC correctly I think it's the former. Logically that means
> memory_dev_types should be managed by the memory-tiering subsystem
> because they are system wide rather than driver specific resources. That
> we need to add the list field to struct memory_dev_type specifically for
> use by dax/kmem supports that idea.

In the original design (page 9/10/11 of [1]), memory_dev_type (Memory
Type) is driver specific.

[1] https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

Memory devices with same adistance but was enumerated by different
drivers will be put in different memory_dev_type.  As a not-so-realistic
example, even if the PMEM devices in DIMM slots and the CXL.mem devices
in CXL slots have same adistance, their memory_dev_type will be
different.  Because the memory devices enumerated by the same driver
should be managed in the same way, but memory devices enumerated by
different drivers may be not.  For example, if we adjust the adistance
of the PMEM devices to put them in the lower memory tier via
not-yet-implemented abstract_distance_offset sysfs knob, we may not want
to adjust it of CXL.mem devices at the same time.

> Also I'm not sure why you consider moving the
> kmem_memory_types/kmem_find_alloc_memory_type()/etc. functions into
> mm/memory-tiers.c to add complexity. Isn't it just moving code around or
> am I missing some other subtlety that makes this hard? I really think
> logically memory-tiering.c is where management of the various
> memory_dev_types belongs.

IMHO, it depends on whether these functions are shared by at least 2
drivers.  If so, we can put them in mm/memory-tiers.c.  Otherwise, we
should keep them in the driver.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-08-22  7:11                               ` Alistair Popple
@ 2023-08-23  5:56                                 ` Huang, Ying
  2023-08-25  5:41                                   ` Alistair Popple
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Ying @ 2023-08-23  5:56 UTC (permalink / raw)
  To: Alistair Popple
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang

Alistair Popple <apopple@nvidia.com> writes:

> "Huang, Ying" <ying.huang@intel.com> writes:
>
>> Alistair Popple <apopple@nvidia.com> writes:
>>
>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>
>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>
>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>
>>>>>> Hi, Alistair,
>>>>>>
>>>>>> Sorry for late response.  Just come back from vacation.
>>>>>
>>>>> Ditto for this response :-)
>>>>>
>>>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>>>> getting around to following all this up sooner.
>>>>>
>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>
>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>
>>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>>
>>>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>>>
>>>>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>>>>
>>>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>>>> interface at the same time.
>>>>>>>>>>>
>>>>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>>>> or something else.
>>>>>>>>>>
>>>>>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>>>>
>>>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>>>> can use them too.
>>>>>>>>>
>>>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>>>> abstract distance is derived.
>>>>>>>>
>>>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>>>>>> more interfaces, we can add them.  For example, the interface you
>>>>>>>> suggested above.
>>>>>>>
>>>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>>>> is so they can pass it to alloc_memory_type():
>>>>>>>
>>>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>>>
>>>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>>>> and the calculation of abstract distance should be done there. That
>>>>>>> resovles the issues about how drivers are supposed to devine adistance
>>>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>>>> calculation code.
>>>>>>
>>>>>> In the current design, the abstract distance is the key concept of
>>>>>> memory types and memory tiers.  And it is used as interface to allocate
>>>>>> memory types.  This provides more flexibility than some other interfaces
>>>>>> (e.g. read/write bandwidth/latency).  For example, in current
>>>>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>>>>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
>>>>>> to support some systems now.  On a system without HMAT/CDAT, it's
>>>>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>>>>> quite limited.  I'm not sure whether all systems will provide read/write
>>>>>> bandwith/latency data for all memory devices.
>>>>>>
>>>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>>>> bandwidth/latency data to be used to calculate abstract distance.  For
>>>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>>>> from read/write bandwith/latency to the abstract distance.  Can this
>>>>>> solve your concerns about the consistency among algorithms?  If so, we
>>>>>> can do that when we add the second algorithm that needs that.
>>>>>
>>>>> I guess it would address my concerns if we did that now. I don't see why
>>>>> we need to wait for a second implementation for that though - the whole
>>>>> series seems to be built around adding a framework for supporting
>>>>> multiple algorithms even though only one exists. So I think we should
>>>>> support that fully, or simplfy the whole thing and just assume the only
>>>>> thing that exists is HMAT and get rid of the general interface until a
>>>>> second algorithm comes along.
>>>>
>>>> We will need a general interface even for one algorithm implementation.
>>>> Because it's not good to make a dax subsystem driver (dax/kmem) to
>>>> depend on a ACPI subsystem driver (acpi/hmat).  We need some general
>>>> interface at subsystem level (memory tier here) between them.
>>>
>>> I don't understand this argument. For a single algorithm it would be
>>> simpler to just define acpi_hmat_calculate_adistance() and a static
>>> inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
>>> a layer of indirection through notifier blocks. That breaks any
>>> dependency on ACPI and there's plenty of precedent for this approach in
>>> the kernel already.
>>
>> ACPI is a subsystem, so it's OK for dax/kmem to depends on CONFIG_ACPI.
>> But HMAT is a driver of ACPI subsystem (controlled via
>> CONFIG_ACPI_HMAT).  It's not good for a driver of DAX subsystem
>> (dax/kmem) to depend on a *driver* of ACPI subsystem.
>>
>> Yes.  Technically, there's no hard wall to prevent this.  But I think
>> that a good design should make drivers depends on subsystems or drivers
>> of the same subsystem, NOT drivers of other subsystems.
>
> Thanks, I wasn't really thinking of HMAT as an ACPI driver. I understand
> where you're coming from but I really don't see the problem with using a
> static inline. It doesn't create dependencies (you could still use
> dax/kmem without ACPI) and results in smaller and easier to follow code.
>
> IMHO it's far more obvious that a call to acpi_hmat_calcaulte_adist()
> returns either a default if ACPI HMAT isn't configured or a calculated
> value than it is to figure out what notifiers may or may not be
> registered at runtime and what priority they may be called in from
> mt_calc_adistance().
>
> It appears you think that is a bad design, but I don't understand
> why. What does this approach give us that a simpler approach wouldn't?

Think about all these again.  Finally I admit you are right.  The
general interface is better mainly if there are multiple implementations
of the interface.

In this series, we provide just one implementation: HMAT.  And, the
second one: CDAT will be implemented soon.  And, CDAT will use the same
method to translate from read/write bandwidth/latency to adistance.  So,
I suggest to:

- Keep the general interface (and notifier chain), for HMAT and soon
  available CDAT

- Move the code to translate from read/write bandwidth/latency to
  adistance to memory-tiers.c.  Which is used by HMAT now and will be
  used by CDAT soon.  And it can be used by other drivers.

What do you think about that?

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management
  2023-08-23  5:56                                 ` Huang, Ying
@ 2023-08-25  5:41                                   ` Alistair Popple
  0 siblings, 0 replies; 41+ messages in thread
From: Alistair Popple @ 2023-08-25  5:41 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki, Dave Jiang


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>
>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> Hi, Alistair,
>>>>>>>
>>>>>>> Sorry for late response.  Just come back from vacation.
>>>>>>
>>>>>> Ditto for this response :-)
>>>>>>
>>>>>> I see Andrew has taken this into mm-unstable though, so my bad for not
>>>>>> getting around to following all this up sooner.
>>>>>>
>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>
>>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>>
>>>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>>>
>>>>>>>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>>>>>>>
>>>>>>>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>>>>>>>
>>>>>>>>>>>>>>> While other memory device drivers can use the general notifier chain
>>>>>>>>>>>>>>> interface at the same time.
>>>>>>>>>>>>
>>>>>>>>>>>> How would that work in practice though? The abstract distance as far as
>>>>>>>>>>>> I can tell doesn't have any meaning other than establishing preferences
>>>>>>>>>>>> for memory demotion order. Therefore all calculations are relative to
>>>>>>>>>>>> the rest of the calculations on the system. So if a driver does it's own
>>>>>>>>>>>> thing how does it choose a sensible distance? IHMO the value here is in
>>>>>>>>>>>> coordinating all that through a standard interface, whether that is HMAT
>>>>>>>>>>>> or something else.
>>>>>>>>>>>
>>>>>>>>>>> Only if different algorithms follow the same basic principle.  For
>>>>>>>>>>> example, the abstract distance of default DRAM nodes are fixed
>>>>>>>>>>> (MEMTIER_ADISTANCE_DRAM).  The abstract distance of the memory device is
>>>>>>>>>>> in linear direct proportion to the memory latency and inversely
>>>>>>>>>>> proportional to the memory bandwidth.  Use the memory latency and
>>>>>>>>>>> bandwidth of default DRAM nodes as base.
>>>>>>>>>>>
>>>>>>>>>>> HMAT and CDAT report the raw memory latency and bandwidth.  If there are
>>>>>>>>>>> some other methods to report the raw memory latency and bandwidth, we
>>>>>>>>>>> can use them too.
>>>>>>>>>>
>>>>>>>>>> Argh! So we could address my concerns by having drivers feed
>>>>>>>>>> latency/bandwidth numbers into a standard calculation algorithm right?
>>>>>>>>>> Ie. Rather than having drivers calculate abstract distance themselves we
>>>>>>>>>> have the notifier chains return the raw performance data from which the
>>>>>>>>>> abstract distance is derived.
>>>>>>>>>
>>>>>>>>> Now, memory device drivers only need a general interface to get the
>>>>>>>>> abstract distance from the NUMA node ID.  In the future, if they need
>>>>>>>>> more interfaces, we can add them.  For example, the interface you
>>>>>>>>> suggested above.
>>>>>>>>
>>>>>>>> Huh? Memory device drivers (ie. dax/kmem.c) don't care about abstract
>>>>>>>> distance, it's a meaningless number. The only reason they care about it
>>>>>>>> is so they can pass it to alloc_memory_type():
>>>>>>>>
>>>>>>>> struct memory_dev_type *alloc_memory_type(int adistance)
>>>>>>>>
>>>>>>>> Instead alloc_memory_type() should be taking bandwidth/latency numbers
>>>>>>>> and the calculation of abstract distance should be done there. That
>>>>>>>> resovles the issues about how drivers are supposed to devine adistance
>>>>>>>> and also means that when CDAT is added we don't have to duplicate the
>>>>>>>> calculation code.
>>>>>>>
>>>>>>> In the current design, the abstract distance is the key concept of
>>>>>>> memory types and memory tiers.  And it is used as interface to allocate
>>>>>>> memory types.  This provides more flexibility than some other interfaces
>>>>>>> (e.g. read/write bandwidth/latency).  For example, in current
>>>>>>> dax/kmem.c, if HMAT isn't available in the system, the default abstract
>>>>>>> distance: MEMTIER_DEFAULT_DAX_ADISTANCE is used.  This is still useful
>>>>>>> to support some systems now.  On a system without HMAT/CDAT, it's
>>>>>>> possible to calculate abstract distance from ACPI SLIT, although this is
>>>>>>> quite limited.  I'm not sure whether all systems will provide read/write
>>>>>>> bandwith/latency data for all memory devices.
>>>>>>>
>>>>>>> HMAT and CDAT or some other mechanisms may provide the read/write
>>>>>>> bandwidth/latency data to be used to calculate abstract distance.  For
>>>>>>> them, we can provide a shared implementation in mm/memory-tiers.c to map
>>>>>>> from read/write bandwith/latency to the abstract distance.  Can this
>>>>>>> solve your concerns about the consistency among algorithms?  If so, we
>>>>>>> can do that when we add the second algorithm that needs that.
>>>>>>
>>>>>> I guess it would address my concerns if we did that now. I don't see why
>>>>>> we need to wait for a second implementation for that though - the whole
>>>>>> series seems to be built around adding a framework for supporting
>>>>>> multiple algorithms even though only one exists. So I think we should
>>>>>> support that fully, or simplfy the whole thing and just assume the only
>>>>>> thing that exists is HMAT and get rid of the general interface until a
>>>>>> second algorithm comes along.
>>>>>
>>>>> We will need a general interface even for one algorithm implementation.
>>>>> Because it's not good to make a dax subsystem driver (dax/kmem) to
>>>>> depend on a ACPI subsystem driver (acpi/hmat).  We need some general
>>>>> interface at subsystem level (memory tier here) between them.
>>>>
>>>> I don't understand this argument. For a single algorithm it would be
>>>> simpler to just define acpi_hmat_calculate_adistance() and a static
>>>> inline version of it that returns -ENOENT when !CONFIG_ACPI than adding
>>>> a layer of indirection through notifier blocks. That breaks any
>>>> dependency on ACPI and there's plenty of precedent for this approach in
>>>> the kernel already.
>>>
>>> ACPI is a subsystem, so it's OK for dax/kmem to depends on CONFIG_ACPI.
>>> But HMAT is a driver of ACPI subsystem (controlled via
>>> CONFIG_ACPI_HMAT).  It's not good for a driver of DAX subsystem
>>> (dax/kmem) to depend on a *driver* of ACPI subsystem.
>>>
>>> Yes.  Technically, there's no hard wall to prevent this.  But I think
>>> that a good design should make drivers depends on subsystems or drivers
>>> of the same subsystem, NOT drivers of other subsystems.
>>
>> Thanks, I wasn't really thinking of HMAT as an ACPI driver. I understand
>> where you're coming from but I really don't see the problem with using a
>> static inline. It doesn't create dependencies (you could still use
>> dax/kmem without ACPI) and results in smaller and easier to follow code.
>>
>> IMHO it's far more obvious that a call to acpi_hmat_calcaulte_adist()
>> returns either a default if ACPI HMAT isn't configured or a calculated
>> value than it is to figure out what notifiers may or may not be
>> registered at runtime and what priority they may be called in from
>> mt_calc_adistance().
>>
>> It appears you think that is a bad design, but I don't understand
>> why. What does this approach give us that a simpler approach wouldn't?
>
> Think about all these again.  Finally I admit you are right.  The
> general interface is better mainly if there are multiple implementations
> of the interface.
>
> In this series, we provide just one implementation: HMAT.  And, the
> second one: CDAT will be implemented soon.  And, CDAT will use the same
> method to translate from read/write bandwidth/latency to adistance.  So,
> I suggest to:
>
> - Keep the general interface (and notifier chain), for HMAT and soon
>   available CDAT
>
> - Move the code to translate from read/write bandwidth/latency to
>   adistance to memory-tiers.c.  Which is used by HMAT now and will be
>   used by CDAT soon.  And it can be used by other drivers.
>
> What do you think about that?

That sounds great. I had kinda assumed CDAT was around the corner, and
was "the other driver" I was thinking of hence the suggestion to make it
a bit more general (or alternatively ignore that and make it specific,
but you've already done the work to make it general so happy to keep
it).

BTW for what it's worth I still think it might be best to have the
notifiers return bandwidth/latency numbers. In practice though that
would actually make my life harder so I'm happy to see where the above
takes us.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface
  2023-08-23  2:13             ` Huang, Ying
@ 2023-08-25  6:00               ` Alistair Popple
  0 siblings, 0 replies; 41+ messages in thread
From: Alistair Popple @ 2023-08-25  6:00 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, linux-cxl, nvdimm,
	linux-acpi, Aneesh Kumar K  . V, Wei Xu, Dan Williams,
	Dave Hansen, Davidlohr Bueso, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Yang Shi, Rafael J Wysocki


"Huang, Ying" <ying.huang@intel.com> writes:

> Alistair Popple <apopple@nvidia.com> writes:
>
>> "Huang, Ying" <ying.huang@intel.com> writes:
>>
>>> Alistair Popple <apopple@nvidia.com> writes:
>>>
>>>> "Huang, Ying" <ying.huang@intel.com> writes:
>>>>
>>>>> Alistair Popple <apopple@nvidia.com> writes:
>>>>>
>>>>>> Huang Ying <ying.huang@intel.com> writes:
>>>>>>
>>>>>>> Previously, a fixed abstract distance MEMTIER_DEFAULT_DAX_ADISTANCE is
>>>>>>> used for slow memory type in kmem driver.  This limits the usage of
>>>>>>> kmem driver, for example, it cannot be used for HBM (high bandwidth
>>>>>>> memory).
>>>>>>>
>>>>>>> So, we use the general abstract distance calculation mechanism in kmem
>>>>>>> drivers to get more accurate abstract distance on systems with proper
>>>>>>> support.  The original MEMTIER_DEFAULT_DAX_ADISTANCE is used as
>>>>>>> fallback only.
>>>>>>>
>>>>>>> Now, multiple memory types may be managed by kmem.  These memory types
>>>>>>> are put into the "kmem_memory_types" list and protected by
>>>>>>> kmem_memory_type_lock.
>>>>>>
>>>>>> See below but I wonder if kmem_memory_types could be a common helper
>>>>>> rather than kdax specific?
>>>>>>
>>>>>>> Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
>>>>>>> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>> Cc: Wei Xu <weixugc@google.com>
>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>>>>> Cc: Dave Hansen <dave.hansen@intel.com>
>>>>>>> Cc: Davidlohr Bueso <dave@stgolabs.net>
>>>>>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>>>>> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>>>> Cc: Michal Hocko <mhocko@kernel.org>
>>>>>>> Cc: Yang Shi <shy828301@gmail.com>
>>>>>>> Cc: Rafael J Wysocki <rafael.j.wysocki@intel.com>
>>>>>>> ---
>>>>>>>  drivers/dax/kmem.c           | 54 +++++++++++++++++++++++++++---------
>>>>>>>  include/linux/memory-tiers.h |  2 ++
>>>>>>>  mm/memory-tiers.c            |  2 +-
>>>>>>>  3 files changed, 44 insertions(+), 14 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>>>> index 898ca9505754..837165037231 100644
>>>>>>> --- a/drivers/dax/kmem.c
>>>>>>> +++ b/drivers/dax/kmem.c
>>>>>>> @@ -49,14 +49,40 @@ struct dax_kmem_data {
>>>>>>>  	struct resource *res[];
>>>>>>>  };
>>>>>>>  
>>>>>>> -static struct memory_dev_type *dax_slowmem_type;
>>>>>>> +static DEFINE_MUTEX(kmem_memory_type_lock);
>>>>>>> +static LIST_HEAD(kmem_memory_types);
>>>>>>> +
>>>>>>> +static struct memory_dev_type *kmem_find_alloc_memorty_type(int adist)
>>>>>>> +{
>>>>>>> +	bool found = false;
>>>>>>> +	struct memory_dev_type *mtype;
>>>>>>> +
>>>>>>> +	mutex_lock(&kmem_memory_type_lock);
>>>>>>> +	list_for_each_entry(mtype, &kmem_memory_types, list) {
>>>>>>> +		if (mtype->adistance == adist) {
>>>>>>> +			found = true;
>>>>>>> +			break;
>>>>>>> +		}
>>>>>>> +	}
>>>>>>> +	if (!found) {
>>>>>>> +		mtype = alloc_memory_type(adist);
>>>>>>> +		if (!IS_ERR(mtype))
>>>>>>> +			list_add(&mtype->list, &kmem_memory_types);
>>>>>>> +	}
>>>>>>> +	mutex_unlock(&kmem_memory_type_lock);
>>>>>>> +
>>>>>>> +	return mtype;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>>  {
>>>>>>>  	struct device *dev = &dev_dax->dev;
>>>>>>>  	unsigned long total_len = 0;
>>>>>>>  	struct dax_kmem_data *data;
>>>>>>> +	struct memory_dev_type *mtype;
>>>>>>>  	int i, rc, mapped = 0;
>>>>>>>  	int numa_node;
>>>>>>> +	int adist = MEMTIER_DEFAULT_DAX_ADISTANCE;
>>>>>>>  
>>>>>>>  	/*
>>>>>>>  	 * Ensure good NUMA information for the persistent memory.
>>>>>>> @@ -71,6 +97,11 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>>>  		return -EINVAL;
>>>>>>>  	}
>>>>>>>  
>>>>>>> +	mt_calc_adistance(numa_node, &adist);
>>>>>>> +	mtype = kmem_find_alloc_memorty_type(adist);
>>>>>>> +	if (IS_ERR(mtype))
>>>>>>> +		return PTR_ERR(mtype);
>>>>>>> +
>>>>>>
>>>>>> I wrote my own quick and dirty module to test this and wrote basically
>>>>>> the same code sequence.
>>>>>>
>>>>>> I notice your using a list of memory types here though. I think it would
>>>>>> be nice to have a common helper that other users could call to do the
>>>>>> mt_calc_adistance() / kmem_find_alloc_memory_type() /
>>>>>> init_node_memory_type() sequence and cleanup as my naive approach would
>>>>>> result in a new memory_dev_type per device even though adist might be
>>>>>> the same. A common helper would make it easy to de-dup those.
>>>>>
>>>>> If it's useful, we can move kmem_find_alloc_memory_type() to
>>>>> memory-tier.c after some revision.  But I tend to move it after we have
>>>>> the second user.  What do you think about that?
>>>>
>>>> Usually I would agree, but this series already introduces a general
>>>> interface for calculating adist even though there's only one user and
>>>> implementation. So if we're going to add a general interface I think it
>>>> would be better to make it more usable now rather than after variations
>>>> of it have been cut and pasted into other drivers.
>>>
>>> In general, I would like to introduce complexity when necessary.  So, we
>>> can discuss the necessity of the general interface firstly.  We can do
>>> that in [1/4] of the series.
>>
>> Do we need one memory_dev_type per adistance or per adistance+device?
>>
>> If IUC correctly I think it's the former. Logically that means
>> memory_dev_types should be managed by the memory-tiering subsystem
>> because they are system wide rather than driver specific resources. That
>> we need to add the list field to struct memory_dev_type specifically for
>> use by dax/kmem supports that idea.
>
> In the original design (page 9/10/11 of [1]), memory_dev_type (Memory
> Type) is driver specific.

Oh fair enough. I was making these comments based on the incorrect
understanding that these were a global rather than driver specific
resource. Thanks for correcting that!

>> Also I'm not sure why you consider moving the
>> kmem_memory_types/kmem_find_alloc_memory_type()/etc. functions into
>> mm/memory-tiers.c to add complexity. Isn't it just moving code around or
>> am I missing some other subtlety that makes this hard? I really think
>> logically memory-tiering.c is where management of the various
>> memory_dev_types belongs.
>
> IMHO, it depends on whether these functions are shared by at least 2
> drivers.  If so, we can put them in mm/memory-tiers.c.  Otherwise, we
> should keep them in the driver.

Ok. Not sure I entirely agree because I suspect it would still make the
code clearer even for a single user. But generally you're correct and as
these memory_dev_type's are *supposed* to be driver specific (rather
than one per adistance) I don't think it's such a big issue.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-08-25  6:05 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-21  1:29 [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Huang Ying
2023-07-21  1:29 ` [PATCH RESEND 1/4] memory tiering: add abstract distance calculation algorithms management Huang Ying
2023-07-25  2:13   ` Alistair Popple
2023-07-25  3:14     ` Huang, Ying
2023-07-25  8:26       ` Alistair Popple
2023-07-26  7:33         ` Huang, Ying
2023-07-27  3:42           ` Alistair Popple
2023-07-27  4:02             ` Huang, Ying
2023-07-27  4:07               ` Alistair Popple
2023-07-27  5:41                 ` Huang, Ying
2023-07-28  1:20                   ` Alistair Popple
2023-08-11  3:51                     ` Huang, Ying
2023-08-21 11:26                       ` Alistair Popple
2023-08-21 22:50                         ` Huang, Ying
2023-08-21 23:52                           ` Alistair Popple
2023-08-22  0:58                             ` Huang, Ying
2023-08-22  7:11                               ` Alistair Popple
2023-08-23  5:56                                 ` Huang, Ying
2023-08-25  5:41                                   ` Alistair Popple
2023-07-21  1:29 ` [PATCH RESEND 2/4] acpi, hmat: refactor hmat_register_target_initiators() Huang Ying
2023-07-25  2:44   ` Alistair Popple
2023-08-07 16:55   ` Jonathan Cameron
2023-08-11  1:13     ` Huang, Ying
2023-07-21  1:29 ` [PATCH RESEND 3/4] acpi, hmat: calculate abstract distance with HMAT Huang Ying
2023-07-25  2:45   ` Alistair Popple
2023-07-25  6:47     ` Huang, Ying
2023-08-21 11:53       ` Alistair Popple
2023-08-21 23:28         ` Huang, Ying
2023-07-21  1:29 ` [PATCH RESEND 4/4] dax, kmem: calculate abstract distance with general interface Huang Ying
2023-07-25  3:11   ` Alistair Popple
2023-07-25  7:02     ` Huang, Ying
2023-08-21 12:03       ` Alistair Popple
2023-08-21 23:33         ` Huang, Ying
2023-08-22  7:36           ` Alistair Popple
2023-08-23  2:13             ` Huang, Ying
2023-08-25  6:00               ` Alistair Popple
2023-07-21  4:15 ` [PATCH RESEND 0/4] memory tiering: calculate abstract distance based on ACPI HMAT Alistair Popple
2023-07-24 17:58   ` Andrew Morton
2023-08-01  2:35     ` Bharata B Rao
2023-08-11  6:26       ` Huang, Ying
2023-08-11  7:49         ` Bharata B Rao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.