* [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
@ 2023-10-09 20:42 Gregory Price
  2023-10-09 20:42 ` [RFC PATCH v2 1/3] mm/memory-tiers: change mutex to rw semaphore Gregory Price
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Gregory Price @ 2023-10-09 20:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-cxl, akpm, sthanneeru, ying.huang, gregory.price

v2: change memtier mutex to semaphore
    add source-node relative weighting
    add remaining mempolicy integration code

= v2 Notes

Developed in collaboration with the original authors to deconflict
similar efforts to extend mempolicy to take weights directly.

== Mutex to Semaphore change:

The memory tiering subsystem is extended in this patch set to have
externally available information (weights), and therefore additional
controls need to be added to ensure values are not changed (or tiers
changed/added/removed) during various calculations.

Since it is expected that many threads will be accessing this data
during allocations, a mutex is not appropriate.

Since write-updates (weight changes, hotplug events) are rare events,
a simple rw semaphore is sufficient.

== Source-node relative weighting:

Tiers can now be weighted differently based on the node requesting
the weight.  For example, CPU nodes 0 and 1 may have different weights
for the same CXL memory tier because the number of NUMA hops differs
(or because of any other physical topological difference resulting in
different effective latency or bandwidth values).

1. Set weights for DDR (tier4) and CXL (tier22) tiers.
   echo source_node:weight > /path/to/interleave_weight

# Set tier4 weight from node 0 to 85
echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
# Set tier4 weight from node 1 to 65
echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
# Set tier22 weight from node 0 to 15
echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
# Set tier22 weight from node 1 to 10
echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

== Mempolicy integration

Two new functions have been added to memory-tiers.c
* memtier_get_node_weight
  - Get the effective weight for a given node
* memtier_get_total_weight
  - Get the "total effective weight" for a given nodemask.

These functions are used by the following functions in mempolicy:
* interleave_nodes
* offset_il_nodes
* alloc_pages_bulk_array_interleave

The weight values are used to determine how many pages should be
allocated per-node as interleave rounds occur.

To avoid holding the memtier semaphore for long periods of time
(e.g. during the calls that actually allocate pages), there is
a small race condition during bulk allocation between calculating
the total weight of a node mask and fetching each individual
node weight - but this is managed by simply detecting the over/under
allocation conditions and handling them accordingly.
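
As a rough illustration (a userspace sketch of the bookkeeping, not the
kernel code; the weights below are made-up example values), the
per-round split works out to:

/*
 * Sketch: distribute nr_pages across the nodes of a policy in
 * proportion to their effective weights.
 */
#include <stdio.h>

int main(void)
{
    unsigned int weight[] = { 85, 15 };  /* hypothetical per-node weights */
    unsigned long nr_pages = 512;
    unsigned long rounds, delta, node_pages;
    unsigned int total = 0;
    int i;

    for (i = 0; i < 2; i++)
        total += weight[i];

    rounds = nr_pages / total;           /* complete interleave rounds */
    delta  = nr_pages % total;           /* remainder, spread in node order */

    for (i = 0; i < 2; i++) {
        node_pages = (unsigned long)weight[i] * rounds;
        if (delta >= weight[i]) {
            node_pages += weight[i];
            delta -= weight[i];
        } else {
            node_pages += delta;
            delta = 0;
        }
        printf("node%d: %lu pages\n", i, node_pages);
    }
    return 0;
}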

~Gregory

=== original RFC ====

From: Ravi Shankar <ravis.opensrc@micron.com>

Hello,

The current interleave policy operates by interleaving page requests
among nodes defined in the memory policy. To accommodate the
introduction of memory tiers for various memory types (e.g., DDR, CXL,
HBM, PMEM, etc.), a mechanism is needed for interleaving page requests
across these memory types or tiers.

This can be achieved by implementing an interleaving method that
considers the tier weights.
The tier weight will determine the proportion of nodes to select from
those specified in the memory policy.
A tier weight can be assigned to each memory type within the system.
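For example, with a DDR tier weight of 85 and a CXL tier weight of 15,
roughly 85 of every 100 interleaved pages would be placed on DDR-tier
nodes and 15 on CXL-tier nodes, with each tier's share divided uniformly
across its nodes.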

Hasan Al Maruf had put forth a proposal for interleaving between two
tiers, namely the top tier and the low tier. However, this patch was
not adopted due to constraints on the number of available tiers.

https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

New proposed changes:

1. Introduce a sysfs entry to allow setting the interleave weight for each
memory tier.
2. Give each tier a default weight of 1, indicating a standard 1:1
proportion.
3. Distribute the weight of a tier uniformly across all of its nodes.
4. Modify the existing interleaving algorithm to support multi-tier
interleaving based on tier weights.

This is in line with Huang, Ying's presentation at LPC22, slide 16:
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

We observed a significant increase (165%) in bandwidth utilization
with the newly proposed multi-tier interleaving compared to the
traditional 1:1 interleaving approach between DDR and CXL tier nodes,
where 85% of the interleaved allocations go to the DDR tier and 15% to
the CXL tier, measured with the MLC -w2 option.

Usage Example:

1. Set weights for DDR (tier4) and CXL (tier22) tiers.
echo 85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
echo 15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

2. Interleave between DDR (tier4, node-0) and CXL (tier22, node-1) using numactl:
numactl -i0,1 mlc --loaded_latency W2

Gregory Price (3):
  mm/memory-tiers: change mutex to rw semaphore
  mm/memory-tiers: Introduce sysfs for tier interleave weights
  mm/mempolicy: modify interleave mempolicy to use memtier weights

 include/linux/memory-tiers.h |  16 ++++
 include/linux/mempolicy.h    |   3 +
 mm/memory-tiers.c            | 179 +++++++++++++++++++++++++++++++----
 mm/mempolicy.c               | 148 +++++++++++++++++++++++------
 4 files changed, 297 insertions(+), 49 deletions(-)

-- 
2.39.1



* [RFC PATCH v2 1/3] mm/memory-tiers: change mutex to rw semaphore
  2023-10-09 20:42 [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Gregory Price
@ 2023-10-09 20:42 ` Gregory Price
  2023-10-09 20:42 ` [RFC PATCH v2 2/3] mm/memory-tiers: Introduce sysfs for tier interleave weights Gregory Price
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2023-10-09 20:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-cxl, akpm, sthanneeru, ying.huang, gregory.price

Tiers will have externally readable information, such as weights,
which may change at runtime. This information is expected to be
used by task threads during memory allocation so it cannot be
protected by hard mutual exclusion.

To support this, change the tiering mutex to a rw semaphore.
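
For intuition only, the intended access pattern is the classic
many-readers/rare-writer split, sketched here with the userspace
pthread analogue rather than the kernel API:

#include <pthread.h>

static pthread_rwlock_t tier_lock = PTHREAD_RWLOCK_INITIALIZER;
static unsigned char weights[64];   /* illustrative stand-in for tier data */

/* Allocation paths: many threads may read concurrently. */
unsigned char read_weight(int node)
{
    unsigned char w;

    pthread_rwlock_rdlock(&tier_lock);
    w = weights[node];
    pthread_rwlock_unlock(&tier_lock);
    return w;
}

/* Rare updates (sysfs writes, hotplug): exclusive access. */
void set_weight(int node, unsigned char w)
{
    pthread_rwlock_wrlock(&tier_lock);
    weights[node] = w;
    pthread_rwlock_unlock(&tier_lock);
}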

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/memory-tiers.c | 39 ++++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 37a4f59d9585..0a3241a2cadc 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -5,6 +5,7 @@
 #include <linux/kobject.h>
 #include <linux/memory.h>
 #include <linux/memory-tiers.h>
+#include <linux/rwsem.h>
 
 #include "internal.h"
 
@@ -33,7 +34,7 @@ struct node_memory_type_map {
 	int map_count;
 };
 
-static DEFINE_MUTEX(memory_tier_lock);
+static DECLARE_RWSEM(memory_tier_sem);
 static LIST_HEAD(memory_tiers);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 static struct memory_dev_type *default_dram_type;
@@ -137,10 +138,10 @@ static ssize_t nodelist_show(struct device *dev,
 	int ret;
 	nodemask_t nmask;
 
-	mutex_lock(&memory_tier_lock);
+	down_read(&memory_tier_sem);
 	nmask = get_memtier_nodemask(to_memory_tier(dev));
 	ret = sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&nmask));
-	mutex_unlock(&memory_tier_lock);
+	up_read(&memory_tier_sem);
 	return ret;
 }
 static DEVICE_ATTR_RO(nodelist);
@@ -167,7 +168,7 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
 	int adistance = memtype->adistance;
 	unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
 
-	lockdep_assert_held_once(&memory_tier_lock);
+	lockdep_assert_held_write(&memory_tier_sem);
 
 	adistance = round_down(adistance, memtier_adistance_chunk_size);
 	/*
@@ -230,12 +231,12 @@ static struct memory_tier *__node_get_memory_tier(int node)
 	if (!pgdat)
 		return NULL;
 	/*
-	 * Since we hold memory_tier_lock, we can avoid
+	 * Since we hold memory_tier_sem, we can avoid
 	 * RCU read locks when accessing the details. No
 	 * parallel updates are possible here.
 	 */
 	return rcu_dereference_check(pgdat->memtier,
-				     lockdep_is_held(&memory_tier_lock));
+				     lockdep_is_held(&memory_tier_sem));
 }
 
 #ifdef CONFIG_MIGRATION
@@ -335,7 +336,7 @@ static void disable_all_demotion_targets(void)
 	for_each_node_state(node, N_MEMORY) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
 		/*
-		 * We are holding memory_tier_lock, it is safe
+		 * We are holding memory_tier_sem, it is safe
 		 * to access pgda->memtier.
 		 */
 		memtier = __node_get_memory_tier(node);
@@ -364,7 +365,7 @@ static void establish_demotion_targets(void)
 	int distance, best_distance;
 	nodemask_t tier_nodes, lower_tier;
 
-	lockdep_assert_held_once(&memory_tier_lock);
+	lockdep_assert_held_write(&memory_tier_sem);
 
 	if (!node_demotion)
 		return;
@@ -479,7 +480,7 @@ static struct memory_tier *set_node_memory_tier(int node)
 	pg_data_t *pgdat = NODE_DATA(node);
 
 
-	lockdep_assert_held_once(&memory_tier_lock);
+	lockdep_assert_held_write(&memory_tier_sem);
 
 	if (!node_state(node, N_MEMORY))
 		return ERR_PTR(-EINVAL);
@@ -569,15 +570,15 @@ EXPORT_SYMBOL_GPL(put_memory_type);
 void init_node_memory_type(int node, struct memory_dev_type *memtype)
 {
 
-	mutex_lock(&memory_tier_lock);
+	down_write(&memory_tier_sem);
 	__init_node_memory_type(node, memtype);
-	mutex_unlock(&memory_tier_lock);
+	up_write(&memory_tier_sem);
 }
 EXPORT_SYMBOL_GPL(init_node_memory_type);
 
 void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 {
-	mutex_lock(&memory_tier_lock);
+	down_write(&memory_tier_sem);
 	if (node_memory_types[node].memtype == memtype)
 		node_memory_types[node].map_count--;
 	/*
@@ -588,7 +589,7 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 		node_memory_types[node].memtype = NULL;
 		put_memory_type(memtype);
 	}
-	mutex_unlock(&memory_tier_lock);
+	up_write(&memory_tier_sem);
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
@@ -607,17 +608,17 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
 
 	switch (action) {
 	case MEM_OFFLINE:
-		mutex_lock(&memory_tier_lock);
+		down_write(&memory_tier_sem);
 		if (clear_node_memory_tier(arg->status_change_nid))
 			establish_demotion_targets();
-		mutex_unlock(&memory_tier_lock);
+		up_write(&memory_tier_sem);
 		break;
 	case MEM_ONLINE:
-		mutex_lock(&memory_tier_lock);
+		down_write(&memory_tier_sem);
 		memtier = set_node_memory_tier(arg->status_change_nid);
 		if (!IS_ERR(memtier))
 			establish_demotion_targets();
-		mutex_unlock(&memory_tier_lock);
+		up_write(&memory_tier_sem);
 		break;
 	}
 
@@ -638,7 +639,7 @@ static int __init memory_tier_init(void)
 				GFP_KERNEL);
 	WARN_ON(!node_demotion);
 #endif
-	mutex_lock(&memory_tier_lock);
+	down_write(&memory_tier_sem);
 	/*
 	 * For now we can have 4 faster memory tiers with smaller adistance
 	 * than default DRAM tier.
@@ -661,7 +662,7 @@ static int __init memory_tier_init(void)
 			break;
 	}
 	establish_demotion_targets();
-	mutex_unlock(&memory_tier_lock);
+	up_write(&memory_tier_sem);
 
 	hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
 	return 0;
-- 
2.39.1



* [RFC PATCH v2 2/3] mm/memory-tiers: Introduce sysfs for tier interleave weights
  2023-10-09 20:42 [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Gregory Price
  2023-10-09 20:42 ` [RFC PATCH v2 1/3] mm/memory-tiers: change mutex to rw semaphore Gregory Price
@ 2023-10-09 20:42 ` Gregory Price
  2023-10-09 20:42 ` [RFC PATCH v2 3/3] mm/mempolicy: modify interleave mempolicy to use memtier weights Gregory Price
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2023-10-09 20:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-cxl, akpm, sthanneeru, ying.huang,
	gregory.price, Ravi Jonnalagadda

Allocating pages across tiers is accomplished by provisioning
interleave weights for each tier, with the distribution based on
these weight values.

Weights are relative to the node requesting them (i.e. the weight
for tier2 from node0 may be different than the weight for tier2
from node1).  This allows CPU-bound tasks to have more precise
control over the distribution of memory.

To represent this, tiers are captured as an array of weights,
where the index is the source node.

tier->interleave_weight[source_node] = weight;

Weights are set with the following sysfs mechanism:

Set tier4 weight from node 0 to 85
echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight

By default, all tiers will have a weight of 1 for all source nodes,
which maintains the default interleave behavior.

Weights are effectively aligned (up) to the number of nodes in the
operating nodemask (i.e. (policy_nodes & tier_nodes)) to simplify
the allocation logic and to avoid having to hold the tiering
semaphore for a long period of time during bulk allocation.

Weights apply to a tier, not each node in the tier.  The weight is
split between the nodes in that tier, similar to hardware interleaving.
However, when the task defines a nodemask that splits a tier's nodes,
the weight will be split between the remaining nodes - retaining the
overall weight of the tier.
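
In other words, each node's effective weight is the tier weight divided
by the number of that tier's nodes present in the nodemask, rounded up.
A quick sketch with made-up numbers:

#include <stdio.h>

int main(void)
{
    unsigned int tier_weight = 85;  /* tier weight, as seen from the requesting node */
    unsigned int tier_nodes = 2;    /* nodes of this tier present in the nodemask */
    unsigned int node_weight;

    /* split the tier weight across its in-mask nodes, rounding up */
    node_weight = tier_weight / tier_nodes;
    if (tier_weight % tier_nodes)
        node_weight++;

    printf("effective per-node weight: %u\n", node_weight);  /* prints 43 */
    return 0;
}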

Signed-off-by: Srinivasulu Thanneeru <sthanneeru@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Co-developed-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 include/linux/memory-tiers.h |  16 ++++
 mm/memory-tiers.c            | 140 ++++++++++++++++++++++++++++++++++-
 2 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 437441cdf78f..a000b9745543 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -19,6 +19,8 @@
  */
 #define MEMTIER_ADISTANCE_DRAM	((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
 
+#define MAX_TIER_INTERLEAVE_WEIGHT 100
+
 struct memory_tier;
 struct memory_dev_type {
 	/* list of memory types that are part of same tier as this type */
@@ -36,6 +38,9 @@ struct memory_dev_type *alloc_memory_type(int adistance);
 void put_memory_type(struct memory_dev_type *memtype);
 void init_node_memory_type(int node, struct memory_dev_type *default_type);
 void clear_node_memory_type(int node, struct memory_dev_type *memtype);
+unsigned char memtier_get_node_weight(int from_node, int target_node,
+				      nodemask_t *pol_nodes);
+unsigned int memtier_get_total_weight(int from_node, nodemask_t *pol_nodes);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -97,5 +102,16 @@ static inline bool node_is_toptier(int node)
 {
 	return true;
 }
+
+unsigned char memtier_get_node_weight(int from_node, int target_node,
+				      nodemask_t *pol_nodes)
+{
+	return 0;
+}
+
+unsigned int memtier_get_total_weight(int from_node, nodemask_t *pol_nodes)
+{
+	return 0;
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0a3241a2cadc..37fc4b3f69a4 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -14,6 +14,11 @@ struct memory_tier {
 	struct list_head list;
 	/* list of all memory types part of this tier */
 	struct list_head memory_types;
+	/*
+	 * By default all tiers will have weight as 1, which means they
+	 * follow default standard allocation.
+	 */
+	unsigned char interleave_weight[MAX_NUMNODES];
 	/*
 	 * start value of abstract distance. memory tier maps
 	 * an abstract distance  range,
@@ -146,8 +151,72 @@ static ssize_t nodelist_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(nodelist);
 
+static ssize_t interleave_weight_show(struct device *dev,
+				      struct device_attribute *attr,
+				      char *buf)
+{
+	int ret = 0;
+	struct memory_tier *tier = to_memory_tier(dev);
+	int node;
+	int count = 0;
+
+	down_read(&memory_tier_sem);
+	for_each_online_node(node) {
+		if (count > 0)
+			ret += sysfs_emit_at(buf, ret, ",");
+		ret += sysfs_emit_at(buf, ret, "%d:%d", node, tier->interleave_weight[node]);
+		count++;
+	}
+	up_read(&memory_tier_sem);
+	sysfs_emit_at(buf, ret++, "\n");
+
+	return ret;
+}
+
+static ssize_t interleave_weight_store(struct device *dev,
+				       struct device_attribute *attr,
+				       const char *buf, size_t size)
+{
+	unsigned char weight;
+	int from_node;
+	char *delim;
+	int ret;
+	struct memory_tier *tier;
+
+	delim = strchr(buf, ':');
+	if (!delim)
+		return -EINVAL;
+	delim[0] = '\0';
+
+	ret = kstrtou32(buf, 10, &from_node);
+	if (ret)
+		return ret;
+
+	if (from_node >= MAX_NUMNODES || !node_online(from_node))
+		return -EINVAL;
+
+	ret = kstrtou8(delim+1, 0, &weight);
+	if (ret)
+		return ret;
+
+	if (weight > MAX_TIER_INTERLEAVE_WEIGHT)
+		return -EINVAL;
+
+	down_write(&memory_tier_sem);
+	tier = to_memory_tier(dev);
+	if (tier)
+		tier->interleave_weight[from_node] = weight;
+	else
+		ret = -ENODEV;
+	up_write(&memory_tier_sem);
+
+	return ret ? ret : size;
+}
+static DEVICE_ATTR_RW(interleave_weight);
+
 static struct attribute *memtier_dev_attrs[] = {
 	&dev_attr_nodelist.attr,
+	&dev_attr_interleave_weight.attr,
 	NULL
 };
 
@@ -239,6 +308,72 @@ static struct memory_tier *__node_get_memory_tier(int node)
 				     lockdep_is_held(&memory_tier_sem));
 }
 
+unsigned char memtier_get_node_weight(int from_node, int target_node,
+				      nodemask_t *pol_nodes)
+{
+	struct memory_tier *tier;
+	unsigned char tier_weight, node_weight = 1;
+	int tier_nodes;
+	nodemask_t tier_nmask, tier_and_pol;
+
+	/*
+	 * If the lock is already held, revert to a low weight temporarily
+	 * This should revert any interleave behavior to basic interleave
+	 * this only happens if weights are being updated or during init
+	 */
+	if (!down_read_trylock(&memory_tier_sem))
+		return 1;
+
+	tier = __node_get_memory_tier(target_node);
+	if (tier) {
+		tier_nmask = get_memtier_nodemask(tier);
+		nodes_and(tier_and_pol, tier_nmask, *pol_nodes);
+		tier_nodes = nodes_weight(tier_and_pol);
+		tier_weight = tier->interleave_weight[from_node];
+		node_weight = tier_weight / tier_nodes;
+		node_weight += (tier_weight % tier_nodes) ? 1 : 0;
+	}
+	up_read(&memory_tier_sem);
+	return node_weight;
+}
+
+unsigned int memtier_get_total_weight(int from_node, nodemask_t *pol_nodes)
+{
+	unsigned int weight = 0;
+	struct memory_tier *tier;
+	unsigned int min = nodes_weight(*pol_nodes);
+	int node;
+	nodemask_t tier_nmask, tier_and_pol;
+	int tier_nodes;
+	unsigned int tier_weight;
+
+	/*
+	 * If the lock is already held, revert to a low weight temporarily
+	 * This should revert any interleave behavior to basic interleave
+	 * this only happens if weights are being updated or during init
+	 */
+	if (!down_read_trylock(&memory_tier_sem))
+		return nodes_weight(*pol_nodes);
+
+	for_each_node_mask(node, *pol_nodes) {
+		tier = __node_get_memory_tier(node);
+		if (!tier) {
+			weight += 1;
+			continue;
+		}
+		tier_nmask = get_memtier_nodemask(tier);
+		nodes_and(tier_and_pol, tier_nmask, *pol_nodes);
+		tier_nodes = nodes_weight(tier_and_pol);
+		/* divide node weight by number of nodes, take ceil */
+		tier_weight = tier->interleave_weight[from_node];
+		weight += tier_weight / tier_nodes;
+		weight += (tier_weight % tier_nodes) ? 1 : 0;
+	}
+	up_read(&memory_tier_sem);
+
+	return weight >= min ? weight : min;
+}
+
 #ifdef CONFIG_MIGRATION
 bool node_is_toptier(int node)
 {
@@ -490,8 +625,11 @@ static struct memory_tier *set_node_memory_tier(int node)
 	memtype = node_memory_types[node].memtype;
 	node_set(node, memtype->nodes);
 	memtier = find_create_memory_tier(memtype);
-	if (!IS_ERR(memtier))
+	if (!IS_ERR(memtier)) {
 		rcu_assign_pointer(pgdat->memtier, memtier);
+		memset(memtier->interleave_weight, 1,
+		       sizeof(memtier->interleave_weight));
+	}
 	return memtier;
 }
 
-- 
2.39.1



* [RFC PATCH v2 3/3] mm/mempolicy: modify interleave mempolicy to use memtier weights
  2023-10-09 20:42 [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Gregory Price
  2023-10-09 20:42 ` [RFC PATCH v2 1/3] mm/memory-tiers: change mutex to rw semaphore Gregory Price
  2023-10-09 20:42 ` [RFC PATCH v2 2/3] mm/memory-tiers: Introduce sysfs for tier interleave weights Gregory Price
@ 2023-10-09 20:42 ` Gregory Price
  2023-10-11 21:15 ` [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Matthew Wilcox
  2023-10-16  7:57 ` Huang, Ying
  4 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2023-10-09 20:42 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-cxl, akpm, sthanneeru, ying.huang, gregory.price

The memory-tier subsystem implements interleave weighting
for tiers for the purpose of bandwidth optimization.  Each
tier may contain multiple numa nodes, and each tier may have
different weights in relation to each compute node ("from node").

The mempolicy MPOL_INTERLEAVE utilizes the memory-tier subsystem
functions to implement weighted tiering.  By default, since all
tiers default to a weight of 1, the original interleave behavior
is retained.

The mempolicy nodemask does not have to be inclusive of all nodes
in each respective memory tier, though this may lead to a more
complicated calculation in terms of how memory is distributed.

Examples

Weight settings:
echo 0:4 > memory_tier4/interleave_weight
echo 1:3 > memory_tier4/interleave_weight
echo 0:2 > memory_tier22/interleave_weight
echo 1:1 > memory_tier22/interleave_weight

Results:
Tier 1 (memory_tier4):  Nodes(0,1), Weights(4,3) <- from nodes(0,1) respectively
Tier 2 (memory_tier22): Nodes(2,3), Weights(2,1) <- from nodes(0,1) respectively

Task A:
   cpunode:  0
   nodemask: [0,1]
   weights:  [4]
   allocation result: [0,0,1,1, repeat]
   Notice how weight is split between the nodes

Task B:
   cpunode:  0
   nodemask: [0,2]
   weights:  [4,2]
   allocation result: [0,0,0,0,2,2, repeat]
   Notice how weights are not split, each node
   has the entire weight of the respective tier applied

Task C:
   cpunode: 1
   nodemask: [1,3]
   weights:  [3,1]
   allocation result: [1,1,1,3, repeat]
   Notice the weights differ based on cpunode

Task D:
   cpunode: 0
   nodemask: [0,1,2]
   weights:  [4,2]
   allocation result: [0,0,1,1,2,2]
   Notice how tier1 splits the weight between nodes 0 and 1
   but tier 2 has the entire weight applied to node 2

Task E:
   cpunode:  1
   nodemask: [0,1]
   weights:  [3]
   allocation result: [0,0,1,1]
   Notice how the weight is aligned up to an effective 4, because
   weights are aligned to the number of nodes in the tier.
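
For completeness, tasks opt in exactly as before; the weights come
entirely from the tier sysfs settings and no new syscall interface is
involved.  A minimal userspace sketch (existing set_mempolicy(2)
interface, link with -lnuma, error handling trimmed):

#include <numaif.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* interleave across nodes 0 and 1; per-node weights come from the tiers */
    unsigned long nodemask = (1UL << 0) | (1UL << 1);

    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8))
        return 1;

    void *buf = malloc(64UL << 20);
    if (!buf)
        return 1;
    /* touching the memory faults pages in, distributed per the weights */
    memset(buf, 0, 64UL << 20);
    return 0;
}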

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 include/linux/mempolicy.h |   3 +
 mm/mempolicy.c            | 148 ++++++++++++++++++++++++++++++--------
 2 files changed, 122 insertions(+), 29 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..ad57fdfdb57a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -48,6 +48,9 @@ struct mempolicy {
 	nodemask_t nodes;	/* interleave/bind/perfer */
 	int home_node;		/* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */
 
+	/* weighted interleave settings */
+	unsigned char cur_weight;
+
 	union {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f1b00d6ac7ee..131e6e56b2de 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -102,6 +102,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -300,6 +301,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->cur_weight = 0;
 
 	return policy;
 }
@@ -334,6 +336,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 		tmp = *nodes;
 
 	pol->nodes = tmp;
+	pol->cur_weight = 0;
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -881,8 +884,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && new->mode == MPOL_INTERLEAVE) {
 		current->il_prev = MAX_NUMNODES-1;
+		new->cur_weight = 0;
+	}
+
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -1901,12 +1907,23 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
-	unsigned next;
+	unsigned int next;
+	unsigned char next_weight;
 	struct task_struct *me = current;
 
 	next = next_node_in(me->il_prev, policy->nodes);
-	if (next < MAX_NUMNODES)
+	if (!policy->cur_weight) {
+		/* If the node is set, at least 1 allocation is required */
+		next_weight = memtier_get_node_weight(numa_node_id(), next,
+						      &policy->nodes);
+
+		policy->cur_weight = next_weight ? next_weight : 1;
+	}
+
+	policy->cur_weight--;
+	if (next < MAX_NUMNODES && !policy->cur_weight)
 		me->il_prev = next;
+
 	return next;
 }
 
@@ -1965,25 +1982,37 @@ unsigned int mempolicy_slab_node(void)
 static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 {
 	nodemask_t nodemask = pol->nodes;
-	unsigned int target, nnodes;
-	int i;
+	unsigned int target, nnodes, il_weight;
+	unsigned char weight;
 	int nid;
+	int cur_node = numa_node_id();
+
 	/*
 	 * The barrier will stabilize the nodemask in a register or on
 	 * the stack so that it will stop changing under the code.
 	 *
 	 * Between first_node() and next_node(), pol->nodes could be changed
 	 * by other threads. So we put pol->nodes in a local stack.
+	 *
+	 * Additionally, place the cur_node on the stack in case of a migration
 	 */
 	barrier();
 
 	nnodes = nodes_weight(nodemask);
 	if (!nnodes)
-		return numa_node_id();
-	target = (unsigned int)n % nnodes;
+		return cur_node;
+
+	il_weight = memtier_get_total_weight(cur_node, &nodemask);
+	target = (unsigned int)n % il_weight;
 	nid = first_node(nodemask);
-	for (i = 0; i < target; i++)
-		nid = next_node(nid, nodemask);
+	while (target) {
+		weight = memtier_get_node_weight(cur_node, nid, &nodemask);
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+
 	return nid;
 }
 
@@ -2317,32 +2346,93 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
 {
-	int nodes;
-	unsigned long nr_pages_per_node;
-	int delta;
-	int i;
-	unsigned long nr_allocated;
+	struct task_struct *me = current;
 	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	unsigned char weight;
+	unsigned long il_weight;
+	unsigned long req_pages = nr_pages;
+	int nnodes, node, prev_node;
+	int cur_node = numa_node_id();
+	int i;
 
-	nodes = nodes_weight(pol->nodes);
-	nr_pages_per_node = nr_pages / nodes;
-	delta = nr_pages - nodes * nr_pages_per_node;
-
-	for (i = 0; i < nodes; i++) {
-		if (delta) {
-			nr_allocated = __alloc_pages_bulk(gfp,
-					interleave_nodes(pol), NULL,
-					nr_pages_per_node + 1, NULL,
-					page_array);
-			delta--;
-		} else {
-			nr_allocated = __alloc_pages_bulk(gfp,
-					interleave_nodes(pol), NULL,
-					nr_pages_per_node, NULL, page_array);
+	prev_node = me->il_prev;
+	nnodes = nodes_weight(pol->nodes);
+	/* Continue allocating from most recent node */
+	if (pol->cur_weight) {
+		node = next_node_in(prev_node, pol->nodes);
+		node_pages = pol->cur_weight;
+		if (node_pages > nr_pages)
+			node_pages = nr_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (req_pages <= pol->cur_weight) {
+			pol->cur_weight -= req_pages;
+			return total_allocated;
 		}
+		/* Otherwise we adjust req_pages down, and continue from there */
+		req_pages -= pol->cur_weight;
+		pol->cur_weight = 0;
+		prev_node = node;
+	}
 
+	/*
+	 * The memtier lock is not held during allocation, if weights change
+	 * there may be edge-cases (over/under-allocation) to handle.
+	 */
+try_again:
+	il_weight = memtier_get_total_weight(cur_node, &pol->nodes);
+	rounds = req_pages / il_weight;
+	delta = req_pages % il_weight;
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, pol->nodes);
+		weight = memtier_get_node_weight(cur_node, node, &pol->nodes);
+		node_pages = weight * rounds;
+		if (delta > weight) {
+			node_pages += weight;
+			delta -= weight;
+		} else if (delta) {
+			node_pages += delta;
+			delta = 0;
+		}
+		/* The number of requested pages may not hit every node */
+		if (!node_pages)
+			break;
+		/* If an over-allocation would occur, floor it */
+		if (node_pages + total_allocated > nr_pages) {
+			node_pages = nr_pages - total_allocated;
+			delta = 0;
+		}
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
 		page_array += nr_allocated;
 		total_allocated += nr_allocated;
+		prev_node = node;
+	}
+
+	/* If an under-allocation would occur, apply interleave again */
+	if (total_allocated != nr_pages)
+		goto try_again;
+
+	/*
+	 * Finally, we need to update me->il_prev and pol->cur_weight
+	 * if there were overflow pages, but not equivalent to the node
+	 * weight, set the cur_weight to node_weight - delta and the
+	 * me->il_prev to the previous node. Otherwise if it was perfect
+	 * we can simply set il_prev to node and cur_weight to 0
+	 */
+	if (node_pages) {
+		me->il_prev = prev_node;
+		node_pages %= weight;
+		pol->cur_weight = weight - node_pages;
+	} else {
+		me->il_prev = node;
+		pol->cur_weight = 0;
 	}
 
 	return total_allocated;
-- 
2.39.1



* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
  2023-10-11 21:15 ` [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Matthew Wilcox
@ 2023-10-10  1:07   ` Gregory Price
  0 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2023-10-10  1:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm,
	sthanneeru, ying.huang

On Wed, Oct 11, 2023 at 10:15:02PM +0100, Matthew Wilcox wrote:
> On Mon, Oct 09, 2023 at 04:42:56PM -0400, Gregory Price wrote:
> > == Mutex to Semaphore change:
> > 
> > The memory tiering subsystem is extended in this patch set to have
> > externally available information (weights), and therefore additional
> > controls need to be added to ensure values are not changed (or tiers
> > changed/added/removed) during various calculations.
> > 
> > Since it is expected that many threads will be accessing this data
> > during allocations, a mutex is not appropriate.
> > 
> > Since write-updates (weight changes, hotplug events) are rare events,
> > a simple rw semaphore is sufficient.
> 
> Given how you're using it, wouldn't the existing RCU mechanism be
> better than converting this to an rwsem?
> 

... yes, and a smarter person would have just done that first :P

derp derp, thanks, I'll update.

~Gregory


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
  2023-10-09 20:42 [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Gregory Price
                   ` (2 preceding siblings ...)
  2023-10-09 20:42 ` [RFC PATCH v2 3/3] mm/mempolicy: modify interleave mempolicy to use memtier weights Gregory Price
@ 2023-10-11 21:15 ` Matthew Wilcox
  2023-10-10  1:07   ` Gregory Price
  2023-10-16  7:57 ` Huang, Ying
  4 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2023-10-11 21:15 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-kernel, linux-cxl, akpm, sthanneeru, ying.huang,
	gregory.price

On Mon, Oct 09, 2023 at 04:42:56PM -0400, Gregory Price wrote:
> == Mutex to Semaphore change:
> 
> The memory tiering subsystem is extended in this patch set to have
> externally available information (weights), and therefore additional
> controls need to be added to ensure values are not changed (or tiers
> changed/added/removed) during various calculations.
> 
> Since it is expected that many threads will be accessing this data
> during allocations, a mutex is not appropriate.
> 
> Since write-updates (weight changes, hotplug events) are rare events,
> a simple rw semaphore is sufficient.

Given how you're using it, wouldn't the existing RCU mechanism be
better than converting this to an rwsem?


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
  2023-10-09 20:42 [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Gregory Price
                   ` (3 preceding siblings ...)
  2023-10-11 21:15 ` [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Matthew Wilcox
@ 2023-10-16  7:57 ` Huang, Ying
  2023-10-17  1:28   ` Gregory Price
  4 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2023-10-16  7:57 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-kernel, linux-cxl, akpm, sthanneeru, gregory.price

Gregory Price <gourry.memverge@gmail.com> writes:

> v2: change memtier mutex to semaphore
>     add source-node relative weighting
>     add remaining mempolicy integration code
>
> = v2 Notes
>
> Developed in colaboration with original authors to deconflict
> similar efforts to extend mempolicy to take weights directly.
>
> == Mutex to Semaphore change:
>
> The memory tiering subsystem is extended in this patch set to have
> externally available information (weights), and therefore additional
> controls need to be added to ensure values are not changed (or tiers
> changed/added/removed) during various calculations.
>
> Since it is expected that many threads will be accessing this data
> during allocations, a mutex is not appropriate.

IIUC, this is a change for performance.  If so, please show some
performance data.

> Since write-updates (weight changes, hotplug events) are rare events,
> a simple rw semaphore is sufficient.
>
> == Source-node relative weighting:
>
> Tiers can now be weighted differently based on the node requesting
> the weight.  For example CPU-Nodes 0 and 1 may have different weights
> for the same CXL memory tier, because topologically the number of
> NUMA hops is greater (or any other physical topological difference
> resulting in different effective latency or bandwidth values)
>
> 1. Set weights for DDR (tier4) and CXL(teir22) tiers.
>    echo source_node:weight > /path/to/interleave_weight

If source_node is considered, why not consider target_node too?  On a
system with only 1 tier (DRAM), do you want weighted interleaving among
NUMA nodes?  If so, why tie weighted interleaving with memory tiers?
Why not just introduce weighted interleaving for NUMA nodes?

> # Set tier4 weight from node 0 to 85
> echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> # Set tier4 weight from node 1 to 65
> echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> # Set tier22 weight from node 0 to 15
> echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
> # Set tier22 weight from node 1 to 10
> echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

--
Best Regards,
Huang, Ying


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
  2023-10-16  7:57 ` Huang, Ying
@ 2023-10-17  1:28   ` Gregory Price
  2023-10-18  8:29     ` Huang, Ying
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Price @ 2023-10-17  1:28 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm, sthanneeru

On Mon, Oct 16, 2023 at 03:57:52PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > == Mutex to Semaphore change:
> >
> > Since it is expected that many threads will be accessing this data
> > during allocations, a mutex is not appropriate.
> 
> IIUC, this is a change for performance.  If so, please show some
> performance data.
>

This change will be dropped in v3 in favor of the existing
RCU mechanism in memory-tiers.c as pointed out by Matthew.

> > == Source-node relative weighting:
> >
> > 1. Set weights for DDR (tier4) and CXL(teir22) tiers.
> >    echo source_node:weight > /path/to/interleave_weight
> 
> If source_node is considered, why not consider target_node too?  On a
> system with only 1 tier (DRAM), do you want weighted interleaving among
> NUMA nodes?  If so, why tie weighted interleaving with memory tiers?
> Why not just introduce weighted interleaving for NUMA nodes?
>

The short answer: Practicality and ease-of-use.

The long answer: We have been discussing how to make this more flexible.

Personally, I agree with you.  If Task A is on Socket 0, the weight on
Socket 0 DRAM should not be the same as the weight on Socket 1 DRAM.
However, right now, DRAM nodes are lumped into the same tier together,
resulting in them having the same weight.

If you scrollback through the list, you'll find an RFC I posted for
set_mempolicy2 which implements weighted interleave in mm/mempolicy.
However, mm/mempolicy is extremely `current-centric` at the moment,
so that makes changing weights at runtime (in response to a hotplug
event, for example) very difficult.

I still think there is room to extend set_mempolicy to allow
task-defined weights to take preference over tier defined weights.

We have discussed adding the following features to memory-tiers:

1) breaking up tiers to allow 1 tier per node, as opposed to defaulting
   to lumping all nodes of a similar quality into the same tier

2) enabling movement of nodes between tiers (for the purpose of
   reconfiguring due to hotplug and other situations)

For users that require fine-grained control over each individual node,
this would allow for weights to be applied per-node, because a
node=tier. For the majority of use cases, it would allow clumping of
nodes into tiers based on physical topology and performance class, and
then allow for the general weighting to apply.  This seems like the most
obvious use-case that a majority of users would use, and also the
easiest to set-up in the short term.

That said, there are probably 3 or 4 different ways/places to implement
this feature.  The question is what is the clear and obvious way?
I don't have a definitive answer for that, hence the RFC.

There are at least 5 proposals that i know of at the moment

1) mempolicy
2) memory-tiers
3) memory-block interleaving? (weighting among blocks inside a node)
   Maybe relevant if Dynamic Capacity devices arrive, but it seems
   like the wrong place to do this.
4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
5) "just do it in hardware"

> > # Set tier4 weight from node 0 to 85
> > echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> > # Set tier4 weight from node 1 to 65
> > echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
> > # Set tier22 weight from node 0 to 15
> > echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
> > # Set tier22 weight from node 1 to 10
> > echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
> 
> --
> Best Regards,
> Huang, Ying


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
  2023-10-18  8:29     ` Huang, Ying
@ 2023-10-17  2:52       ` Gregory Price
       [not found]         ` <87edhrunvp.fsf@yhuang6-desk2.ccr.corp.intel.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Price @ 2023-10-17  2:52 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm, sthanneeru

On Wed, Oct 18, 2023 at 04:29:02PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > There are at least 5 proposals that i know of at the moment
> >
> > 1) mempolicy
> > 2) memory-tiers
> > 3) memory-block interleaving? (weighting among blocks inside a node)
> >    Maybe relevant if Dynamic Capacity devices arrive, but it seems
> >    like the wrong place to do this.
> > 4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
> > 5) "just do it in hardware"
> 
> It may be easier to start with the use case.  What are the practical use
> cases in your mind that cannot be satisfied with a simple per-memory-tier
> weight?  Can you compare the memory layout with different proposals?
>

Before I delve in, one clarifying question:  When you asked whether
weights should be part of node or memory-tiers, i took that to mean
whether it should be part of mempolicy or memory-tiers.

Were you suggesting that weights should actually be part of
drivers/base/node.c?

Because I had not considered that, and this seems reasonable, easy to
implement, and would not require tying mempolicy.c to memory-tiers.c



Beyond this, i think there's been 3 imagined use cases (now, including
this).

a)
numactl --weighted-interleave=Node:weight,0:16,1:4,...

b)
echo weight > /sys/.../memory-tiers/memtier/access0/interleave_weight
numactl --interleave=0,1

c)
echo weight > /sys/bus/node/node0/access0/interleave_weight
numactl --interleave=0,1

d)
options b or c, but with --weighted-interleave=0,1 instead
this requires libnuma changes to pick up, but it retains --interleave
as-is to avoid user confusion.

The downside of an approach like A (which was my original approach) is
that the weights cannot really change should a node be hotplugged. Tasks
would need to detect this and change the policy themselves.  That's not
a good solution.

However in both B and C's design, weights can be rebalanced in response
to any number of events.  Ultimately B and C are equivalent, but
the placement in nodes is cleaner and more intuitive.  If memory-tiers
wants to use/change this information, there's nothing that prevents it.

Assuming this is your meaning, I agree and I will pivot to this.

~Gregory


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
       [not found]         ` <87edhrunvp.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2023-10-18  2:47           ` Gregory Price
       [not found]             ` <87fs25g6w3.fsf@yhuang6-desk2.ccr.corp.intel.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Price @ 2023-10-18  2:47 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm,
	sthanneeru, Aneesh Kumar K.V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Tim Chen, Yang Shi

On Thu, Oct 19, 2023 at 02:28:42PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> > Were you suggesting that weights should actually be part of
> > drivers/base/node.c?
> 
> Yes.  drivers/base/node.c vs. memory tiers.
>

Then yes I agree this can and probably should be placed there,
especially since I see accessor details are now being exposed at that
level, which can be used to auto-generate weights (assuming HMAT/CDAT
data exposed by devices is actually accurate).

> > Assuming this is your meaning, I agree and I will pivot to this.
> 
> Can you give a not-so-abstract example?  For example, on a system with
> node 0, 1, 2, 3, memory tiers 4 (0, 1), 22 (2, 3), ....  A workload runs
> on CPU of node 0, ...., interleaves memory on node 0, 1, ...  Then
> compare the different behavior (including memory bandwidth) with node
> and memory-tier based solution.

ah, I see.

Example 1: A single-socket system with multiple CXL memory devices
===
CPU Node: node0
CXL Nodes: node1, node2

Bandwidth attributes (in theory):
node0 - 8 channels - ~307GB/s
node1 - x16 link - 64GB/s
node2 - x8 link - 32GB/s

In a system like this, the optimal distribution of memory on an
interleave for maximizing bandwidth is about 76%/16%/8%.

For the sake of simplicity:  --weighted-interleave=0:76,1:16,2:8
but realistically we could make the weights sysfs values in the nodes.
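
Those percentages are just each node's share of the aggregate
bandwidth; a back-of-the-envelope sketch (nothing in the patch set
computes this yet):

#include <stdio.h>

int main(void)
{
    double bw[] = { 307.0, 64.0, 32.0 };  /* GB/s for node0, node1, node2 */
    double total = 0.0;
    int i;

    for (i = 0; i < 3; i++)
        total += bw[i];

    for (i = 0; i < 3; i++)               /* prints ~76%, ~16%, ~8% */
        printf("node%d weight ~= %.0f%%\n", i, 100.0 * bw[i] / total);
    return 0;
}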

Regardless of the mechanism to engage this, the most effective way to
capture this in the system is by applying weights to nodes, not tiers.
If done in tiers, each node would be assigned to its own tier, making
the mechanism equivalent. So you might as well simplify the whole thing
and chop the memtier component out.

Is this configuration realistic? *shrug* - technically possible. And in
fact most hardware or driver based interleaving mechanisms would not
really be able to manage an interleave region across these nodes, at
least not without placing the x16 driver in x8 mode, or just having the
wrong distribution %'s.



Example 2: A dual-socket system with 1 CXL device per socket
===
CPU Nodes: node0, node1
CXL Nodes: node2, node3 (on sockets 0 and 1 respectively)

Bandwidth Attributes (in theory):
nodes 0 & 1 - 8 channels - ~307GB/s ea.
nodes 2 & 3 - x16 link - 64GB/s ea.

This is similar to example #1, but with one difference:  A task running
on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
This is because on access to nodes 1 and 3, the cross-socket link (UPI,
or whatever AMD calls it) becomes a bandwidth chokepoint.

So from the perspective of node 0, the "real total" available bandwidth
is about 307GB/s + 64GB/s + (41.6GB/s * UPI links) in the case of Intel,
so the best result you could get is around 307+64+164=535GB/s if you
have the full 4 links.

You'd want to distribute the cross-socket traffic in proportion to the
UPI bandwidth, not to the remote nodes' total bandwidth.

This leaves us with weights of:

node0 - 57%
node1 - 26%
node2 - 12%
node3 - 5%

Again, naturally nodes are the place to carry the weights here. In this
scenario, placing it in memory-tiers would require that 1 tier per node
existed.


Example 3: A single-socket system with 2 CXL devices
===
Different than example 1: Both devices are the same.

CPU Node: node0
CXL Nodes: node1, node2

Bandwidth attributes (in theory):
node0 - 8 channels - ~307GB/s
node1/2 - x16 link - 64GB/s

In this case you want the weights to be: 70/15/15% respectively

Caveat: A user may, in fact, use the CXL driver to implement a single
node which naturally interleaves the 2 devices. In this case it's the
same as a 1-socket, 1-device setup which is trivially 1-node-per-tier,
and therefore weights should live with nodes.

In the case of a single memory tier, you could simply make this 70/30.

However, and this is the argument against placing it in the
memory-tier: the user is free to hack off any of the chosen numa nodes
via mempolicy, which makes the aggregated weight meaningless.

Example:  --weighted-interleave=0,1

Under my current code, if I set the weights to 70/30 in the memory-tiers
code, the result is that node1 inherits the full 30% defined in the
tier, which leads to a suboptimal distribution.  What you actually want
in this case is about 83/17% split.
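
Put differently, the weights would have to be renormalized over
whatever nodemask the task actually uses.  A sketch using the bandwidth
numbers from this example, with node2 dropped from the mask:

#include <stdio.h>

int main(void)
{
    double bw[] = { 307.0, 64.0, 64.0 }; /* node0 (DRAM), node1/node2 (CXL) */
    int in_mask[] = { 1, 1, 0 };         /* node2 hacked off via mempolicy */
    double total = 0.0;
    int i;

    for (i = 0; i < 3; i++)
        if (in_mask[i])
            total += bw[i];

    for (i = 0; i < 3; i++)              /* prints ~83% and ~17% */
        if (in_mask[i])
            printf("node%d weight ~= %.0f%%\n", i, 100.0 * bw[i] / total);
    return 0;
}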

However, this also presents an issue for placing the weights in the
nodes:  A node weight is meaningless outside the context of the active
nodemask.  If I have 2 nodes and I set their weights to 70/30, and I
hack off node1, I can't have 70% of all memory go to node0; I have to
send 100% of the memory to node0 - making the weight functionally
meaningless.

So this would imply a single global weight set on the nodes is ALSO a
bad solution, and instead it would be better to extend set_mempolicy
to have a --weighted-interleave option that reads HMAT/CDAT provided
bandwidth data and generates the weights for the selected nodes as
part of the policy.

The downside of this is that if the HMAT/CDAT data is garbage, the
policy becomes garbage.  To mitigate this, we should consider allowing
userland to override those values explicitly for the purpose of weighted
interleave should the HMAT/CDAT information be garbage/lies.

Another downside to this is that nodemask changes require recalculation
of weights, which may introduce some racy conditions, but that can
probably be managed.


Should we carry weights in node, memtier, or mempolicy?
===
The current analysis suggests that carrying it in memory-tiers would
simply require memory-tiers to create 1 tier per node - which may or
may not be consistent with other goals of the memtier subsystem.

The advantage of placing a weight in the node is that the "effective"
weight in the context of the entire system can be calculated at the
time nodes are created.  If, at interleave time, the interface required
a node+nodemask, then it's probably preferable to forgo manual
weighting and simply calculate based on HMAT/CDAT data.

The downside of placing it in nodes is that mempolicy is free to set the
interleave set to some combination of nodes, and this would prevent any
nodes created after process launch from being used in the interleave set
unless the software detected the hotplug event.  I don't know how much
of a real concern this is, but it is a limitation.

The other option is to add --weighted-interleave, but have mempolicy
generate the weights based on node-provided CDAT/HMAT data (or
overrides), which keeps almost everything inside of mempolicy except
for a couple of interfaces to drivers/base/node.c that allow querying
of that data.



Summarize:
===
The weights are actually a function of bandwidth, and can probably be
calculated on the fly - rather than being manually set. However, we may
want to consider allowing the bandwidth attributes exposed by CDAT/HMAT
to be overridden should the user discover they are functionally
incorrect. (For reference: I have seen this myself, where a device
published 5GB/s but actually achieves 22GB/s.)

For reference, they are presently RO:

static DEVICE_ATTR_RO(property)
ACCESS_ATTR(read_bandwidth);
ACCESS_ATTR(read_latency);
ACCESS_ATTR(write_bandwidth);
ACCESS_ATTR(write_latency);
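
For reference, userspace can already read these per-node access
attributes; a sketch (path per Documentation/admin-guide/mm/numaperf.rst,
node/access class chosen arbitrarily here):

#include <stdio.h>

int main(void)
{
    const char *path =
        "/sys/devices/system/node/node0/access0/initiators/read_bandwidth";
    unsigned long bw = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return 1;
    if (fscanf(f, "%lu", &bw) == 1)
        printf("node0 read_bandwidth: %lu\n", bw); /* see numaperf.rst for units */
    fclose(f);
    return 0;
}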


~Gregory


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
  2023-10-17  1:28   ` Gregory Price
@ 2023-10-18  8:29     ` Huang, Ying
  2023-10-17  2:52       ` Gregory Price
  0 siblings, 1 reply; 15+ messages in thread
From: Huang, Ying @ 2023-10-18  8:29 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm, sthanneeru

Gregory Price <gregory.price@memverge.com> writes:

> On Mon, Oct 16, 2023 at 03:57:52PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > == Mutex to Semaphore change:
>> >
>> > Since it is expected that many threads will be accessing this data
>> > during allocations, a mutex is not appropriate.
>> 
>> IIUC, this is a change for performance.  If so, please show some
>> performance data.
>>
>
> This change will be dropped in v3 in favor of the existing
> RCU mechanism in memory-tiers.c as pointed out by Matthew.
>
>> > == Source-node relative weighting:
>> >
>> > 1. Set weights for DDR (tier4) and CXL(teir22) tiers.
>> >    echo source_node:weight > /path/to/interleave_weight
>> 
>> If source_node is considered, why not consider target_node too?  On a
>> system with only 1 tier (DRAM), do you want weighted interleaving among
>> NUMA nodes?  If so, why tie weighted interleaving with memory tiers?
>> Why not just introduce weighted interleaving for NUMA nodes?
>>
>
> The short answer: Practicality and ease-of-use.
>
> The long answer: We have been discussing how to make this more flexible..
>
> Personally, I agree with you.  If Task A is on Socket 0, the weight on
> Socket 0 DRAM should not be the same as the weight on Socket 1 DRAM.
> However, right now, DRAM nodes are lumped into the same tier together,
> resulting in them having the same weight.
>
> If you scrollback through the list, you'll find an RFC I posted for
> set_mempolicy2 which implements weighted interleave in mm/mempolicy.
> However, mm/mempolicy is extremely `current-centric` at the moment,
> so that makes changing weights at runtime (in response to a hotplug
> event, for example) very difficult.
>
> I still think there is room to extend set_mempolicy to allow
> task-defined weights to take preference over tier defined weights.
>
> We have discussed adding the following features to memory-tiers:
>
> 1) breaking up tiers to allow 1 tier per node, as opposed to defaulting
>    to lumping all nodes of a simlar quality into the same tier
>
> 2) enabling movemnet of nodes between tiers (for the purpose of
>    reconfiguring due to hotplug and other situations)
>
> For users that require fine-grained control over each individual node,
> this would allow for weights to be applied per-node, because a
> node=tier. For the majority of use cases, it would allow clumping of
> nodes into tiers based on physical topology and performance class, and
> then allow for the general weighting to apply.  This seems like the most
> obvious use-case that a majority of users would use, and also the
> easiest to set-up in the short term.
>
> That said, there are probably 3 or 4 different ways/places to implement
> this feature.  The question is what is the clear and obvious way?
> I don't have a definitive answer for that, hence the RFC.
>
> There are at least 5 proposals that i know of at the moment
>
> 1) mempolicy
> 2) memory-tiers
> 3) memory-block interleaving? (weighting among blocks inside a node)
>    Maybe relevant if Dynamic Capacity devices arrive, but it seems
>    like the wrong place to do this.
> 4) multi-device nodes (e.g. cxl create-region ... mem0 mem1...)
> 5) "just do it in hardware"

It may be easier to start with the use case.  What are the practical use
cases in your mind that cannot be satisfied with a simple per-memory-tier
weight?  Can you compare the memory layout with different proposals?

>> > # Set tier4 weight from node 0 to 85
>> > echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
>> > # Set tier4 weight from node 1 to 65
>> > echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
>> > # Set tier22 weight from node 0 to 15
>> > echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
>> > # Set tier22 weight from node 1 to 10
>> > echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

--
Best Regards,
Huang, Ying


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
       [not found]             ` <87fs25g6w3.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2023-10-19 13:26               ` Gregory Price
       [not found]                 ` <87ttqidr7v.fsf@yhuang6-desk2.ccr.corp.intel.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Price @ 2023-10-19 13:26 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm,
	sthanneeru, Aneesh Kumar K.V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Tim Chen, Yang Shi

On Fri, Oct 20, 2023 at 02:11:40PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> >
[...snip...]
> > Example 2: A dual-socket system with 1 CXL device per socket
> > ===
> > CPU Nodes: node0, node1
> > CXL Nodes: node2, node3 (on sockets 0 and 1 respective)
> >
[...snip...]
> > This is similar to example #1, but with one difference:  A task running
> > on node 0 should not treat nodes 0 and 1 the same, nor nodes 2 and 3.
[...snip...]
> > This leaves us with weights of:
> >
> > node0 - 57%
> > node1 - 26%
> > node2 - 12%
> > node3 - 5%
> >
> 
> Does the workload run on CPU of node 0 only?  This appears unreasonable.

Depends.  If a user explicitly launches with `numactl --cpunodebind=0`
then yes, you can force a task (and all its children) to run on node0.

If a workload is multi-threaded enough to run on both sockets, then you
are right that you'd want to limit cross-socket traffic by binding
individual threads to nodes on their own socket - if that is feasible
at all (it may not be).

But at that point, we're getting into the area of numa-aware software.
That's a bit beyond the scope of this work - which is to enable a
coarse-grained interleaving solution that can easily be accessed with
something like `numactl --interleave` or `numactl --weighted-interleave`.

> If the memory bandwidth requirement of the workload is so large that CXL
> is used to expand bandwidth, why not run workload on CPU of node 1 and
> use the full memory bandwidth of node 1?

Settings are NOT one size fits all.  You can certainly come up with another
scenario in which these weights are not optimal.

If we're running enough threads that we need multiple sockets to run
them concurrently, then the memory distribution weights become much more
complex.  Without more precise control over task placement and
preventing task migration, you can't really get an "optimal" placement.

What I'm really saying is "task placement is a better predictor of
performance than memory placement".  However, user software would need
to implement a pseudo-scheduler and explicit data placement to be fully
optimized.  Beyond that, there is only so much we can do from a
`numactl` perspective.

tl;dr: We can't get a perfect system here, because getting a best case
for all possible scenarios is probably an undecidable problem.  You will
always be able to generate an example wherein the system is not optimal.

> 
> If the workload run on CPU of node 0 and node 1, then the cross-socket
> traffic should be minimized if possible.  That is, threads/processes on
> node 0 should interleave memory of node 0 and node 2, while that on node
> 1 should interleave memory of node 1 and node 3.

This can be done with set_mempolicy() with MPOL_INTERLEAVE, setting the
nodemask to what you describe.  Those tasks also need to prevent
themselves from being migrated.  But this can absolutely be done; a
rough sketch follows.
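
As a minimal user-space sketch (not part of the patch set), assuming the
node numbering from the example above (node0 = socket-0 CPUs/DRAM,
node2 = CXL attached to socket 0), and using only interfaces that exist
today - plain, unweighted MPOL_INTERLEAVE plus libnuma's CPU binding;
build with -lnuma:

#include <numa.h>      /* numa_available(), numa_run_on_node() */
#include <numaif.h>    /* set_mempolicy(), MPOL_INTERLEAVE */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /* Interleave future allocations across node 0 (local DRAM)
         * and node 2 (local CXL) only. */
        unsigned long nodemask = (1UL << 0) | (1UL << 2);

        if (numa_available() < 0) {
                fprintf(stderr, "NUMA not supported\n");
                return EXIT_FAILURE;
        }

        /* Keep this task (and its children) on node 0's CPUs so it
         * does not migrate to the other socket. */
        if (numa_run_on_node(0) != 0) {
                perror("numa_run_on_node");
                return EXIT_FAILURE;
        }

        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          sizeof(nodemask) * 8) != 0) {
                perror("set_mempolicy");
                return EXIT_FAILURE;
        }

        /* ... allocate and touch memory here; pages round-robin
         * between nodes 0 and 2 ... */
        return EXIT_SUCCESS;
}

The weighted variant discussed in this RFC would keep essentially the
same policy setup and apply the tier weights on the kernel side.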

In this scenario, the weights need to be recalculated based on the
bandwidth of the nodes in the mempolicy nodemask, which is what I
described in the last email.
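
For instance (a sketch; the bandwidth figures are the rough, illustrative
numbers used later in this thread, not measured values):

/* Recompute interleave weights over a restricted nodemask.
 * A socket-0 task interleaving over {node0, node2} only:
 * ~300GB/s effective local DRAM, ~64GB/s local CXL. */
#include <stdio.h>

int main(void)
{
        const char *node[] = { "node0 (local DRAM)", "node2 (local CXL)" };
        double bw[] = { 300.0, 64.0 };
        double total = bw[0] + bw[1];

        for (int i = 0; i < 2; i++)
                printf("%s: weight ~%.0f%%\n",
                       node[i], 100.0 * bw[i] / total);

        /* Prints roughly 82% / 18% - quite different from the 57/26/12/5
         * split quoted earlier for the full four-node mask. */
        return 0;
}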

~Gregory


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
       [not found]                 ` <87ttqidr7v.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2023-10-24 15:32                   ` Gregory Price
       [not found]                     ` <87lebrec82.fsf@yhuang6-desk2.ccr.corp.intel.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Price @ 2023-10-24 15:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm,
	sthanneeru, Aneesh Kumar K.V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Tim Chen, Yang Shi

On Mon, Oct 23, 2023 at 10:09:56AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > Depends.  If a user explicitly launches with `numactl --cpunodebind=0`
> > then yes, you can force a task (and all its children) to run on node0.
> 
> IIUC, in your example, the `numactl` command line will be
> 
>   numactl --cpunodebind=0 --weighted-interleave=0,1,2,3
> 
> That is, the CPU is restricted to node 0, while memory is distributed to
> all nodes.  This doesn't sound reasonable to me.
> 

It being reasonable isn't really relevant. You can do this today with
normal interleave:

numactl --cpunodebind=0 --interleave=0,1,2,3

The only difference between this method and that is the application of
weights.  Doesn't seem reasonable to lock users out of doing it.

> 
> IMHO, we should keep things as simple as possible, and only add
> complexity if necessary.
>

Not allowing it is more complicated than allowing it.


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
       [not found]                     ` <87lebrec82.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2023-10-25 19:51                       ` Gregory Price
       [not found]                         ` <87a5s0df6p.fsf@yhuang6-desk2.ccr.corp.intel.com>
  0 siblings, 1 reply; 15+ messages in thread
From: Gregory Price @ 2023-10-25 19:51 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm,
	sthanneeru, Aneesh Kumar K.V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Tim Chen, Yang Shi

On Wed, Oct 25, 2023 at 09:13:01AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Mon, Oct 23, 2023 at 10:09:56AM +0800, Huang, Ying wrote:
> >> Gregory Price <gregory.price@memverge.com> writes:
> >> 
> >> > Depends.  If a user explicitly launches with `numactl --cpunodebind=0`
> >> > then yes, you can force a task (and all its children) to run on node0.
> >> 
> >> IIUC, in your example, the `numactl` command line will be
> >> 
> >>   numactl --cpunodebind=0 --weighted-interleave=0,1,2,3
> >> 
> >> That is, the CPU is restricted to node 0, while memory is distributed to
> >> all nodes.  This doesn't sound reasonable to me.
> >> 
> >
> > It being reasonable isn't really relevant. You can do this today with
> > normal interleave:
> >
> > numactl --cpunodebind=0 --interleave=0,1,2,3
> >
> > The only difference between this method and that is the application of
> > weights.  Doesn't seem reasonable to lock users out of doing it.
> 
> Do you have a real use case?
> 

I don't, but this is how mempolicy and numactl presently work.  You can
do this today with the current kernel.  I'm simply extending it to
include weights.

~Gregory


* Re: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
       [not found]                         ` <87a5s0df6p.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2023-10-30  4:19                           ` Gregory Price
  0 siblings, 0 replies; 15+ messages in thread
From: Gregory Price @ 2023-10-30  4:19 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-kernel, linux-cxl, akpm,
	sthanneeru, Aneesh Kumar K.V, Wei Xu, Alistair Popple,
	Dan Williams, Dave Hansen, Johannes Weiner, Jonathan Cameron,
	Michal Hocko, Tim Chen, Yang Shi

On Mon, Oct 30, 2023 at 10:20:14AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> The extension adds complexity to the kernel code and changes the kernel
> ABI.  So, IMHO, we need a real-life use case to prove the added
> complexity is necessary.
> 
> For example, in [1], Johannes showed the use case to support to add
> per-memory-tier interleave weight.
> 
> [1] https://lore.kernel.org/all/20220607171949.85796-1-hannes@cmpxchg.org/
> 
> --
> Best Regards,
> Huang, Ying

Sorry, I misunderstood your question.

The use case is the same as the N:M interleave strategy between tiers,
and in fact the proposal for weights was directly inspired by the patch
you posted. We're searching for the best way to implement weights.

We've discussed placing these weights in:

1) mempolicy :
   https://lore.kernel.org/linux-cxl/20230914235457.482710-1-gregory.price@memverge.com/

2) tiers
   https://lore.kernel.org/linux-cxl/20231009204259.875232-1-gregory.price@memverge.com/

and now
3) the nodes themselves
   RFC not posted yet

The use case is exactly the same as in the patch you posted: enabling
optimal distribution of memory to maximize memory bandwidth usage.

The use case is straightforward.  Consider a machine with the following
numa nodes:

1) Socket 0 - DRAM - ~400GB/s bandwidth local, less cross-socket
2) Socket 1 - DRAM - ~400GB/s bandwidth local, less cross-socket
3) CXL Memory Attached to Socket 0 with ~64GB/s per link.
4) CXL Memory Attached to Socket 1 with ~64GB/s per link.

The goal is to enable mempolicy to implement weighted interleave such
that a thread running on socket 0 can effectively spread its memory
across each numa node (or some subset thereof) in a way that maximizes
its bandwidth usage across the various devices.

For example, let's consider a system with only 1 & 2 (2 sockets w/ DRAM).

On an Intel system with UPI, the "effective" bandwidth available to a
task on Socket 0 is not 800GB/s; it's about 450-500GB/s, split roughly
300/200 between the sockets (you never get the full amount, and UPI
limits cross-socket bandwidth).

Today `numactl --interleave` will split your memory 50:50 between
sockets, which is just blatantly suboptimal.  In this case you would
prefer a 3:2 distribution (literally weights of 3 and 2 respectively).

The extension to CXL then becomes obvious, as each individual node,
relative to the task's CPU placement, has a different optimal weight.
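
As a rough illustration of the end result - a user-space sketch of one
plausible weighted round-robin, not the kernel implementation - using
the illustrative 57/26/12/5 weights from earlier in this thread:

/* Sketch: distribute allocations across 4 nodes by weight. */
#include <stdio.h>

#define NR_NODES 4

int main(void)
{
        int weight[NR_NODES] = { 57, 26, 12, 5 };  /* node0..node3 */
        long pages[NR_NODES] = { 0 };
        long total_pages = 10000;
        int node = 0, used = 0;

        for (long i = 0; i < total_pages; i++) {
                pages[node]++;
                /* After 'weight' pages on this node, move to the next. */
                if (++used >= weight[node]) {
                        used = 0;
                        node = (node + 1) % NR_NODES;
                }
        }

        for (int n = 0; n < NR_NODES; n++)
                printf("node%d: %ld pages (~%ld%%)\n",
                       n, pages[n], pages[n] * 100 / total_pages);
        return 0;
}

A task pinned to socket 1 would want the mirror-image weights, per the
point above about CPU placement.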


Of course the question becomes "what if a task uses more threads than a
single socket has to offer", and the answer there is essentially the
same as the answer today: the process must become "numa-aware" to make
the best use of the available resources.

However, for software capable of exhausting bandwidth from a single
socket (which on Intel takes about 16-20 threads with certain access
patterns), a weighted-interleave system provided via some interface
like `numactl --weighted-interleave`, with weights set either in numa
nodes or in mempolicy, is sufficient.


~Gregory


end of thread, other threads:[~2023-10-30  4:24 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2023-10-09 20:42 [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Gregory Price
2023-10-09 20:42 ` [RFC PATCH v2 1/3] mm/memory-tiers: change mutex to rw semaphore Gregory Price
2023-10-09 20:42 ` [RFC PATCH v2 2/3] mm/memory-tiers: Introduce sysfs for tier interleave weights Gregory Price
2023-10-09 20:42 ` [RFC PATCH v2 3/3] mm/mempolicy: modify interleave mempolicy to use memtier weights Gregory Price
2023-10-11 21:15 ` [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving Matthew Wilcox
2023-10-10  1:07   ` Gregory Price
2023-10-16  7:57 ` Huang, Ying
2023-10-17  1:28   ` Gregory Price
2023-10-18  8:29     ` Huang, Ying
2023-10-17  2:52       ` Gregory Price
     [not found]         ` <87edhrunvp.fsf@yhuang6-desk2.ccr.corp.intel.com>
2023-10-18  2:47           ` Gregory Price
     [not found]             ` <87fs25g6w3.fsf@yhuang6-desk2.ccr.corp.intel.com>
2023-10-19 13:26               ` Gregory Price
     [not found]                 ` <87ttqidr7v.fsf@yhuang6-desk2.ccr.corp.intel.com>
2023-10-24 15:32                   ` Gregory Price
     [not found]                     ` <87lebrec82.fsf@yhuang6-desk2.ccr.corp.intel.com>
2023-10-25 19:51                       ` Gregory Price
     [not found]                         ` <87a5s0df6p.fsf@yhuang6-desk2.ccr.corp.intel.com>
2023-10-30  4:19                           ` Gregory Price
