* [PATCH v2 0/3] mm/mempolicy: weighted interleave mempolicy and sysfs extension @ 2024-01-19 17:57 Gregory Price 2024-01-19 17:57 ` [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price ` (2 more replies) 0 siblings, 3 replies; 15+ messages in thread From: Gregory Price @ 2024-01-19 17:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, gregory.price, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, ying.huang, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Hasan Al Maruf, Hao Wang, Michal Hocko, Zhongkun He, Frank van der Linden, John Groves, Jonathan Cameron, Andi Kleen Hi Andrew, Can you please replace the patches on mm-unstable with this line, it has bulk-allocator bug fixes and some design changes at the request of Ying Huang. Full v2 notes are just before the test info. = Cover Letter Weighted interleave is a new interleave policy intended to make use of heterogeneous memory environments appearing with CXL. The existing interleave mechanism does an even round-robin distribution of memory across all nodes in a nodemask, while weighted interleave distributes memory across nodes according to a provided weight. (Weight = # of page allocations per round) Weighted interleave is intended to reduce average latency when bandwidth is pressured - therefore increasing total throughput. In other words: It allows greater use of the total available bandwidth in a heterogeneous hardware environment (different hardware provides different bandwidth capacity). As bandwidth is pressured, latency increases - first linearly and then exponentially. By keeping bandwidth usage distributed according to available bandwidth, we therefore can reduce the average latency of a cacheline fetch. A good explanation of the bandwidth vs latency response curve: https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/ From the article: ``` Constant region: The latency response is fairly constant for the first 40% of the sustained bandwidth. Linear region: In between 40% to 80% of the sustained bandwidth, the latency response increases almost linearly with the bandwidth demand of the system due to contention overhead by numerous memory requests. Exponential region: Between 80% to 100% of the sustained bandwidth, the memory latency is dominated by the contention latency which can be as much as twice the idle latency or more. Maximum sustained bandwidth : Is 65% to 75% of the theoretical maximum bandwidth. ``` As a general rule of thumb: * If bandwidth usage is low, latency does not increase. It is optimal to place data in the nearest (lowest latency) device. * If bandwidth usage is high, latency increases. It is optimal to place data such that bandwidth use is optimized per-device. This is the top line goal: Provide a user a mechanism to target using the "maximum sustained bandwidth" of each hardware component in a heterogenous memory system. For example, the stream benchmark demonstrates that 1:1 (default) interleave is actively harmful, while weighted interleave can be beneficial. Default interleave distributes data such that too much pressure is placed on devices with lower available bandwidth. Stream Benchmark (High level results, 1 Socket + 1 CXL Device) Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) Targeted weights : +2.5% to +4% (consistently better than DRAM) Global means the task-policy was set (set_mempolicy), while targeted means VMA policies were set (mbind2). We see weighted interleave is not always beneficial when applied globally, but is always beneficial when applied to bandwidth-driving memory regions. We implement sysfs entries for "system global" weights which can be set by a daemon or administrator. There are 3 patches in this set: 1) Implement system-global interleave weights as sysfs extension in mm/mempolicy.c. These weights are RCU protected, and the default/minimum weight is 1 for all nodes. In future work, we intend to expose an interface for HMAT/CDAT information to be used during boot to set reasonable system default values based on the memory configuration of the system discovered at boot or during device hotplug. 2) A mild refactor of some interleave-logic for re-use in the new weighted interleave logic. 3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind Included below are some performance and LTP test information, and a sample numactl branch which can be used for testing. = Performance summary = (tests may have different configurations, see extended info below) 1) MLC (W2) : +38% over DRAM. +264% over default interleave. MLC (W5) : +40% over DRAM. +226% over default interleave. 2) Stream : -6% to +4% over DRAM, +430% over default interleave. 3) XSBench : +19% over DRAM. +47% over default interleave. = LTP Testing Summary = existing mempolicy & mbind tests: pass mempolicy & mbind + weighted interleave (global weights): pass = version history v2: - MAJOR: Torture tested bulk allocator, fixed edge conditions tracking the next me->il_node. Added documentation. Prior version was stable, but the resulting me->il_node could be wrong under certain circumstances. - naming: iw_table_mtx -> iw_table_lock - RCU: use synchronize+kfree and simplify the weight structure - default: remove default table, since it's static for now - sysfs setup: simplify setup, if table==NULL presume 1's - node_store: only allocate (sizeof(u8) * nr_node_ids) - allocators: update to deal with NULL table pointer - read_once: __builtin_memcpy -> memcpy - formatting v1: - RCU: This version protects the weight array with RCU. - ktest fix: proper include (types.h) in uapi header - doc: make mpol_params in docs reflect definition - doc: mempolicy.c comments in MPOL_WEIGHTED_INTERLEAVE patch - Dropped task-local weights and syscalls from the proposal until affirmative use cases for task-local weights appear. Link: https://lore.kernel.org/linux-mm/20240103224209.2541-1-gregory.price@memverge.com/ ===================================================================== Performance tests - MLC From - Ravi Jonnalagadda <ravis.opensrc@micron.com> Hardware: Single-socket, multiple CXL memory expanders. Workload: W2 Data Signature: 2:1 read:write DRAM only bandwidth (GBps): 298.8 DRAM + CXL (default interleave) (GBps): 113.04 DRAM + CXL (weighted interleave)(GBps): 412.5 Gain over DRAM only: 1.38x Gain over default interleave: 2.64x Workload: W5 Data Signature: 1:1 read:write DRAM only bandwidth (GBps): 273.2 DRAM + CXL (default interleave) (GBps): 117.23 DRAM + CXL (weighted interleave)(GBps): 382.7 Gain over DRAM only: 1.4x Gain over default interleave: 2.26x ===================================================================== Performance test - Stream From - Gregory Price <gregory.price@memverge.com> Hardware: Single socket, single CXL expander numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times Default interleave : -78% (slower than DRAM) Global weighting : -6% to +4% (workload dependant) mbind2 weights : +2.5% to +4% (consistently better than DRAM) dram only: numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Function Direction BestRateMBs AvgTime MinTime MaxTime Copy: 0->0 200923.2 0.032662 0.031853 0.033301 Scale: 0->0 202123.0 0.032526 0.031664 0.032970 Add: 0->0 208873.2 0.047322 0.045961 0.047884 Triad: 0->0 208523.8 0.047262 0.046038 0.048414 CXL-only: numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 22209.7 0.288661 0.288162 0.289342 Scale: 0->0 22288.2 0.287549 0.287147 0.288291 Add: 0->0 24419.1 0.393372 0.393135 0.393735 Triad: 0->0 24484.6 0.392337 0.392083 0.394331 Based on the above, the optimal weights are ~9:1 echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1 echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2 default interleave: numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 44666.2 0.143671 0.143285 0.144174 Scale: 0->0 44781.6 0.143256 0.142916 0.143713 Add: 0->0 48600.7 0.197719 0.197528 0.197858 Triad: 0->0 48727.5 0.197204 0.197014 0.197439 global weighted interleave: numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc Copy: 0->0 190085.9 0.034289 0.033669 0.034645 Scale: 0->0 207677.4 0.031909 0.030817 0.033061 Add: 0->0 202036.8 0.048737 0.047516 0.053409 Triad: 0->0 217671.5 0.045819 0.044103 0.046755 targted regions w/ global weights (modified stream to mbind2 malloc'd regions)) numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc Copy: 0->0 205827.0 0.031445 0.031094 0.031984 Scale: 0->0 208171.8 0.031320 0.030744 0.032505 Add: 0->0 217352.0 0.045087 0.044168 0.046515 Triad: 0->0 216884.8 0.045062 0.044263 0.046982 ===================================================================== Performance tests - XSBench From - Hyeongtak Ji <hyeongtak.ji@sk.com> Hardware: Single socket, Single CXL memory Expander NUMA node 0: 56 logical cores, 128 GB memory NUMA node 2: 96 GB CXL memory Threads: 56 Lookups: 170,000,000 Summary: +19% over DRAM. +47% over default interleave. Performance tests - XSBench 1. dram only $ numactl -m 0 ./XSBench -s XL –p 5000000 Runtime: 36.235 seconds Lookups/s: 4,691,618 2. default interleave $ numactl –i 0,2 ./XSBench –s XL –p 5000000 Runtime: 55.243 seconds Lookups/s: 3,077,293 3. weighted interleave numactl –w –i 0,2 ./XSBench –s XL –p 5000000 Runtime: 29.262 seconds Lookups/s: 5,809,513 ===================================================================== LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2 = Existing tests set_mempolicy, get_mempolicy, mbind MPOL_WEIGHTED_INTERLEAVE added manually to test basic functionality but did not adjust tests for weighting. Basically the weights were set to 1, which is the default, and it should behavior like standard MPOL_INTERLEAVE if logic is correct. == set_mempolicy01 : passed 18, failed 0 == set_mempolicy02 : passed 10, failed 0 == set_mempolicy03 : passed 64, failed 0 == set_mempolicy04 : passed 32, failed 0 == set_mempolicy05 - n/a on non-x86 == set_mempolicy06 : passed 10, failed 0 this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE == set_mempolicy07 : passed 32, failed 0 set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE == get_mempolicy01 : passed 12, failed 0 change: added MPOL_WEIGHTED_INTERLEAVE == get_mempolicy02 : passed 2, failed 0 == mbind01 : passed 15, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind02 : passed 4, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind03 : passed 16, failed 0 added MPOL_WEIGHTED_INTERLEAVE == mbind04 : passed 48, failed 0 added MPOL_WEIGHTED_INTERLEAVE ===================================================================== numactl (set_mempolicy) w/ global weighting test numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master command: numactl -w --interleave=0,1 ./eatmem result (weights 1:1): 0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4 7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4 50% distribution is correct result (weights 5:1): 01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4 7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4 16.666% distribution is correct result (weights 1:5): 01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4 7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4 16.666% distribution is correct #include <stdio.h> #include <stdlib.h> #include <string.h> int main (void) { char* mem = malloc(1024*1024*256); memset(mem, 1, 1024*1024*256); for (int i = 0; i < ((1024*1024*256)/4096); i++) { mem = malloc(4096); mem[0] = 1; } printf("done\n"); getchar(); return 0; } ===================================================================== Suggested-by: Gregory Price <gregory.price@memverge.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Hasan Al Maruf <hasanalmaruf@fb.com> Suggested-by: Hao Wang <haowang3@fb.com> Suggested-by: Ying Huang <ying.huang@intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Zhongkun He <hezhongkun.hzk@bytedance.com> Suggested-by: Frank van der Linden <fvdl@google.com> Suggested-by: John Groves <john@jagalactic.com> Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com> Suggested-by: Srinivasulu Thanneeru <sthanneeru@micron.com> Suggested-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com> Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Suggested-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Gregory Price <gregory.price@memverge.com> Gregory Price (2): mm/mempolicy: refactor a read-once mechanism into a function for re-use mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Rakie Kim (1): mm/mempolicy: implement the sysfs-based weighted_interleave interface .../ABI/testing/sysfs-kernel-mm-mempolicy | 4 + ...fs-kernel-mm-mempolicy-weighted-interleave | 26 + .../admin-guide/mm/numa_memory_policy.rst | 9 + include/linux/mempolicy.h | 5 + include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 491 +++++++++++++++++- 6 files changed, 523 insertions(+), 13 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave -- 2.39.1 ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface 2024-01-19 17:57 [PATCH v2 0/3] mm/mempolicy: weighted interleave mempolicy and sysfs extension Gregory Price @ 2024-01-19 17:57 ` Gregory Price 2024-01-22 8:03 ` Huang, Ying 2024-01-19 17:57 ` [PATCH v2 2/3] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price 2024-01-19 17:57 ` [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price 2 siblings, 1 reply; 15+ messages in thread From: Gregory Price @ 2024-01-19 17:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, gregory.price, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, ying.huang, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams From: Rakie Kim <rakie.kim@sk.com> This patch provides a way to set interleave weight information under sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN The sysfs structure is designed as follows. $ tree /sys/kernel/mm/mempolicy/ /sys/kernel/mm/mempolicy/ [1] └── weighted_interleave [2] ├── node0 [3] └── node1 Each file above can be explained as follows. [1] mm/mempolicy: configuration interface for mempolicy subsystem [2] weighted_interleave/: config interface for weighted interleave policy [3] weighted_interleave/nodeN: weight for nodeN If a node value is set to `0`, the system-default value will be used. As of this patch, the system-default for all nodes is always 1. Suggested-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Co-developed-by: Gregory Price <gregory.price@memverge.com> Signed-off-by: Gregory Price <gregory.price@memverge.com> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> --- .../ABI/testing/sysfs-kernel-mm-mempolicy | 4 + ...fs-kernel-mm-mempolicy-weighted-interleave | 26 ++ mm/mempolicy.c | 231 ++++++++++++++++++ 3 files changed, 261 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy new file mode 100644 index 000000000000..2dcf24f4384a --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy @@ -0,0 +1,4 @@ +What: /sys/kernel/mm/mempolicy/ +Date: December 2023 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: Interface for Mempolicy diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave new file mode 100644 index 000000000000..e6a38139bf0f --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave @@ -0,0 +1,26 @@ +What: /sys/kernel/mm/mempolicy/weighted_interleave/ +Date: December 2023 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: Configuration Interface for the Weighted Interleave policy + +What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN +Date: December 2023 +Contact: Linux memory management mailing list <linux-mm@kvack.org> +Description: Weight configuration interface for nodeN + + The interleave weight for a memory node (N). These weights are + utilized by processes which have set their mempolicy to + MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by + omitting a task-local weight array. + + These weights only affect new allocations, and changes at runtime + will not cause migrations on already allocated pages. + + The minimum weight for a node is always 1. + + Minimum weight: 1 + Maximum weight: 255 + + Writing an empty string or `0` will reset the weight to the + system default. The system default may be set by the kernel + or drivers at boot or during hotplug events. diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 10a590ee1c89..ae925216798f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -131,6 +131,16 @@ static struct mempolicy default_policy = { static struct mempolicy preferred_node_policy[MAX_NUMNODES]; +/* + * iw_table is the sysfs-set interleave weight table, a value of 0 denotes + * system-default value should be used. Until system-defaults are implemented, + * the system-default is always 1. + * + * iw_table is RCU protected + */ +static u8 __rcu *iw_table; +static DEFINE_MUTEX(iw_table_lock); + /** * numa_nearest_node - Find nearest node by state * @node: Node id to start the search @@ -3067,3 +3077,224 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) p += scnprintf(p, buffer + maxlen - p, ":%*pbl", nodemask_pr_args(&nodes)); } + +#ifdef CONFIG_SYSFS +struct iw_node_attr { + struct kobj_attribute kobj_attr; + int nid; +}; + +static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + struct iw_node_attr *node_attr; + u8 weight; + u8 __rcu *table; + + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); + + rcu_read_lock(); + table = rcu_dereference(iw_table); + weight = table ? table[node_attr->nid] : 1; + rcu_read_unlock(); + + return sysfs_emit(buf, "%d\n", weight); +} + +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + struct iw_node_attr *node_attr; + u8 __rcu *new; + u8 __rcu *old; + u8 weight = 0; + + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); + if (count == 0 || sysfs_streq(buf, "")) + weight = 0; + else if (kstrtou8(buf, 0, &weight)) + return -EINVAL; + + /* + * The default weight is 1 (for now), when the kernel-internal + * default weight array is implemented, this should be updated to + * collect the system-default weight of the node if the user passes 0. + */ + if (!weight) + weight = 1; + + /* We only need to allocate up to the number of possible nodes */ + new = kmalloc(nr_node_ids, GFP_KERNEL); + if (!new) + return -ENOMEM; + + mutex_lock(&iw_table_lock); + old = rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + if (old) + memcpy(new, old, nr_node_ids); + else + memset(new, 1, nr_node_ids); + new[node_attr->nid] = weight; + rcu_assign_pointer(iw_table, new); + mutex_unlock(&iw_table_lock); + synchronize_rcu(); + kfree(old); + return count; +} + +static struct iw_node_attr *node_attrs[MAX_NUMNODES]; + +static void sysfs_wi_node_release(struct iw_node_attr *node_attr, + struct kobject *parent) +{ + if (!node_attr) + return; + sysfs_remove_file(parent, &node_attr->kobj_attr.attr); + kfree(node_attr->kobj_attr.attr.name); + kfree(node_attr); +} + +static void sysfs_wi_release(struct kobject *wi_kobj) +{ + int i; + + for (i = 0; i < MAX_NUMNODES; i++) + sysfs_wi_node_release(node_attrs[i], wi_kobj); + kobject_put(wi_kobj); +} + +static const struct kobj_type wi_ktype = { + .sysfs_ops = &kobj_sysfs_ops, + .release = sysfs_wi_release, +}; + +static int add_weight_node(int nid, struct kobject *wi_kobj) +{ + struct iw_node_attr *node_attr; + char *name; + + node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL); + if (!node_attr) + return -ENOMEM; + + name = kasprintf(GFP_KERNEL, "node%d", nid); + if (!name) { + kfree(node_attr); + return -ENOMEM; + } + + sysfs_attr_init(&node_attr->kobj_attr.attr); + node_attr->kobj_attr.attr.name = name; + node_attr->kobj_attr.attr.mode = 0644; + node_attr->kobj_attr.show = node_show; + node_attr->kobj_attr.store = node_store; + node_attr->nid = nid; + + if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) { + kfree(node_attr->kobj_attr.attr.name); + kfree(node_attr); + pr_err("failed to add attribute to weighted_interleave\n"); + return -ENOMEM; + } + + node_attrs[nid] = node_attr; + return 0; +} + +static int add_weighted_interleave_group(struct kobject *root_kobj) +{ + struct kobject *wi_kobj; + int nid, err; + + wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL); + if (!wi_kobj) + return -ENOMEM; + + err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj, + "weighted_interleave"); + if (err) { + kfree(wi_kobj); + return err; + } + + memset(node_attrs, 0, sizeof(node_attrs)); + for_each_node_state(nid, N_POSSIBLE) { + err = add_weight_node(nid, wi_kobj); + if (err) { + pr_err("failed to add sysfs [node%d]\n", nid); + break; + } + } + if (err) + kobject_put(wi_kobj); + return 0; +} + +static void mempolicy_kobj_release(struct kobject *kobj) +{ + u8 __rcu *old; + + mutex_lock(&iw_table_lock); + old = rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + rcu_assign_pointer(iw_table, NULL); + mutex_unlock(&iw_table_lock); + synchronize_rcu(); + /* Never free the default table, it's always in use */ + kfree(old); + kfree(kobj); +} + +static const struct kobj_type mempolicy_ktype = { + .release = mempolicy_kobj_release +}; + +static struct kobject *mempolicy_kobj; +static int __init mempolicy_sysfs_init(void) +{ + int err; + struct kobject *mempolicy_kobj; + + /* A NULL iw_table is interpreted by interleave logic as "all 1s" */ + iw_table = NULL; + mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL); + if (!mempolicy_kobj) { + pr_err("failed to add mempolicy kobject to the system\n"); + return -ENOMEM; + } + err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj, + "mempolicy"); + if (err) { + kfree(mempolicy_kobj); + return err; + } + + err = add_weighted_interleave_group(mempolicy_kobj); + + if (err) { + kobject_put(mempolicy_kobj); + return err; + } + + return err; +} + +static void __exit mempolicy_exit(void) +{ + if (mempolicy_kobj) + kobject_put(mempolicy_kobj); +} + +#else +static int __init mempolicy_sysfs_init(void) +{ + /* A NULL iw_table is interpreted by interleave logic as "all 1s" */ + iw_table = NULL; + return 0; +} + +static void __exit mempolicy_exit(void) { } +#endif /* CONFIG_SYSFS */ +late_initcall(mempolicy_sysfs_init); +module_exit(mempolicy_exit); -- 2.39.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface 2024-01-19 17:57 ` [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price @ 2024-01-22 8:03 ` Huang, Ying 2024-01-22 16:58 ` Gregory Price 0 siblings, 1 reply; 15+ messages in thread From: Huang, Ying @ 2024-01-22 8:03 UTC (permalink / raw) To: Gregory Price Cc: linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, gregory.price, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams Gregory Price <gourry.memverge@gmail.com> writes: > From: Rakie Kim <rakie.kim@sk.com> > > This patch provides a way to set interleave weight information under > sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN > > The sysfs structure is designed as follows. > > $ tree /sys/kernel/mm/mempolicy/ > /sys/kernel/mm/mempolicy/ [1] > └── weighted_interleave [2] > ├── node0 [3] > └── node1 > > Each file above can be explained as follows. > > [1] mm/mempolicy: configuration interface for mempolicy subsystem > > [2] weighted_interleave/: config interface for weighted interleave policy > > [3] weighted_interleave/nodeN: weight for nodeN > > If a node value is set to `0`, the system-default value will be used. > As of this patch, the system-default for all nodes is always 1. > > Suggested-by: Huang Ying <ying.huang@intel.com> > Signed-off-by: Rakie Kim <rakie.kim@sk.com> > Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> > Co-developed-by: Gregory Price <gregory.price@memverge.com> > Signed-off-by: Gregory Price <gregory.price@memverge.com> > Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > --- > .../ABI/testing/sysfs-kernel-mm-mempolicy | 4 + > ...fs-kernel-mm-mempolicy-weighted-interleave | 26 ++ > mm/mempolicy.c | 231 ++++++++++++++++++ > 3 files changed, 261 insertions(+) > create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy > create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy > new file mode 100644 > index 000000000000..2dcf24f4384a > --- /dev/null > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy > @@ -0,0 +1,4 @@ > +What: /sys/kernel/mm/mempolicy/ > +Date: December 2023 > +Contact: Linux memory management mailing list <linux-mm@kvack.org> > +Description: Interface for Mempolicy > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > new file mode 100644 > index 000000000000..e6a38139bf0f > --- /dev/null > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave > @@ -0,0 +1,26 @@ > +What: /sys/kernel/mm/mempolicy/weighted_interleave/ > +Date: December 2023 May be not a big deal. The date should be "January 2024"? > +Contact: Linux memory management mailing list <linux-mm@kvack.org> > +Description: Configuration Interface for the Weighted Interleave policy > + > +What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN > +Date: December 2023 > +Contact: Linux memory management mailing list <linux-mm@kvack.org> > +Description: Weight configuration interface for nodeN > + > + The interleave weight for a memory node (N). These weights are > + utilized by processes which have set their mempolicy to s/processes/tasks or memory areas/ > + MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by > + omitting a task-local weight array. Now, we haven't introduced task-local weight array. So, leave this until we introduce that? > + > + These weights only affect new allocations, and changes at runtime > + will not cause migrations on already allocated pages. > + > + The minimum weight for a node is always 1. > + > + Minimum weight: 1 > + Maximum weight: 255 > + > + Writing an empty string or `0` will reset the weight to the > + system default. The system default may be set by the kernel > + or drivers at boot or during hotplug events. > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 10a590ee1c89..ae925216798f 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -131,6 +131,16 @@ static struct mempolicy default_policy = { > > static struct mempolicy preferred_node_policy[MAX_NUMNODES]; > > +/* > + * iw_table is the sysfs-set interleave weight table, a value of 0 denotes > + * system-default value should be used. Until system-defaults are implemented, > + * the system-default is always 1. > + * > + * iw_table is RCU protected > + */ > +static u8 __rcu *iw_table; > +static DEFINE_MUTEX(iw_table_lock); > + > /** > * numa_nearest_node - Find nearest node by state > * @node: Node id to start the search > @@ -3067,3 +3077,224 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) > p += scnprintf(p, buffer + maxlen - p, ":%*pbl", > nodemask_pr_args(&nodes)); > } > + > +#ifdef CONFIG_SYSFS > +struct iw_node_attr { > + struct kobj_attribute kobj_attr; > + int nid; > +}; > + > +static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr, > + char *buf) > +{ > + struct iw_node_attr *node_attr; > + u8 weight; > + u8 __rcu *table; > + > + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + weight = table ? table[node_attr->nid] : 1; > + rcu_read_unlock(); > + > + return sysfs_emit(buf, "%d\n", weight); > +} > + > +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, > + const char *buf, size_t count) > +{ > + struct iw_node_attr *node_attr; > + u8 __rcu *new; > + u8 __rcu *old; > + u8 weight = 0; > + > + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); > + if (count == 0 || sysfs_streq(buf, "")) > + weight = 0; > + else if (kstrtou8(buf, 0, &weight)) > + return -EINVAL; > + > + /* > + * The default weight is 1 (for now), when the kernel-internal > + * default weight array is implemented, this should be updated to > + * collect the system-default weight of the node if the user passes 0. > + */ > + if (!weight) > + weight = 1; From functionality point of view, it's OK to set "weight = 1" here now. But when we add system default weight table in the future, we need to use "weight = 0". Otherwise, we cannot distinguish whether the default value have been customized via sysfs. So, I suggest to use that rule. > + > + /* We only need to allocate up to the number of possible nodes */ This comment appears not necessary. > + new = kmalloc(nr_node_ids, GFP_KERNEL); > + if (!new) > + return -ENOMEM; > + > + mutex_lock(&iw_table_lock); > + old = rcu_dereference_protected(iw_table, > + lockdep_is_held(&iw_table_lock)); > + if (old) > + memcpy(new, old, nr_node_ids); > + else > + memset(new, 1, nr_node_ids); With similar reason as above ("From functionality..."), I suggest to set "0" here. > + new[node_attr->nid] = weight; > + rcu_assign_pointer(iw_table, new); > + mutex_unlock(&iw_table_lock); > + synchronize_rcu(); > + kfree(old); > + return count; > +} > + > +static struct iw_node_attr *node_attrs[MAX_NUMNODES]; node_attrs[] can be allocated dynamically too. Just a suggestion. > + > +static void sysfs_wi_node_release(struct iw_node_attr *node_attr, > + struct kobject *parent) > +{ > + if (!node_attr) > + return; > + sysfs_remove_file(parent, &node_attr->kobj_attr.attr); > + kfree(node_attr->kobj_attr.attr.name); > + kfree(node_attr); > +} > + > +static void sysfs_wi_release(struct kobject *wi_kobj) > +{ > + int i; > + > + for (i = 0; i < MAX_NUMNODES; i++) Nitpick, nr_node_ids should be OK here. > + sysfs_wi_node_release(node_attrs[i], wi_kobj); > + kobject_put(wi_kobj); > +} > + > +static const struct kobj_type wi_ktype = { > + .sysfs_ops = &kobj_sysfs_ops, > + .release = sysfs_wi_release, > +}; > + > +static int add_weight_node(int nid, struct kobject *wi_kobj) > +{ > + struct iw_node_attr *node_attr; > + char *name; > + > + node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL); > + if (!node_attr) > + return -ENOMEM; > + > + name = kasprintf(GFP_KERNEL, "node%d", nid); > + if (!name) { > + kfree(node_attr); > + return -ENOMEM; > + } > + > + sysfs_attr_init(&node_attr->kobj_attr.attr); > + node_attr->kobj_attr.attr.name = name; > + node_attr->kobj_attr.attr.mode = 0644; > + node_attr->kobj_attr.show = node_show; > + node_attr->kobj_attr.store = node_store; > + node_attr->nid = nid; > + > + if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) { > + kfree(node_attr->kobj_attr.attr.name); > + kfree(node_attr); > + pr_err("failed to add attribute to weighted_interleave\n"); > + return -ENOMEM; > + } > + > + node_attrs[nid] = node_attr; > + return 0; > +} > + > +static int add_weighted_interleave_group(struct kobject *root_kobj) > +{ > + struct kobject *wi_kobj; > + int nid, err; > + > + wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL); > + if (!wi_kobj) > + return -ENOMEM; > + > + err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj, > + "weighted_interleave"); > + if (err) { > + kfree(wi_kobj); > + return err; > + } > + > + memset(node_attrs, 0, sizeof(node_attrs)); > + for_each_node_state(nid, N_POSSIBLE) { > + err = add_weight_node(nid, wi_kobj); > + if (err) { > + pr_err("failed to add sysfs [node%d]\n", nid); > + break; > + } > + } > + if (err) > + kobject_put(wi_kobj); > + return 0; > +} > + > +static void mempolicy_kobj_release(struct kobject *kobj) > +{ > + u8 __rcu *old; > + > + mutex_lock(&iw_table_lock); > + old = rcu_dereference_protected(iw_table, > + lockdep_is_held(&iw_table_lock)); > + rcu_assign_pointer(iw_table, NULL); > + mutex_unlock(&iw_table_lock); > + synchronize_rcu(); > + /* Never free the default table, it's always in use */ Obsolete comment? > + kfree(old); It appears unnecessary to free iw_table in error path. But this isn't a big deal because error path will almost never be executed in practice. > + kfree(kobj); > +} > + > +static const struct kobj_type mempolicy_ktype = { > + .release = mempolicy_kobj_release > +}; > + > +static struct kobject *mempolicy_kobj; > +static int __init mempolicy_sysfs_init(void) > +{ > + int err; > + struct kobject *mempolicy_kobj; This overrides the global "mempolicy_kobj" defined before function. But I don't think we need the global definition. > + > + /* A NULL iw_table is interpreted by interleave logic as "all 1s" */ As I suggested above, it will be "all 0s", that is, use default weight. > + iw_table = NULL; The default value is NULL already, it appears unnecessary to do this. > + mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL); > + if (!mempolicy_kobj) { > + pr_err("failed to add mempolicy kobject to the system\n"); > + return -ENOMEM; > + } > + err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj, > + "mempolicy"); > + if (err) { > + kfree(mempolicy_kobj); > + return err; > + } > + > + err = add_weighted_interleave_group(mempolicy_kobj); > + > + if (err) { > + kobject_put(mempolicy_kobj); > + return err; > + } > + > + return err; > +} > + > +static void __exit mempolicy_exit(void) > +{ > + if (mempolicy_kobj) > + kobject_put(mempolicy_kobj); > +} > + > +#else > +static int __init mempolicy_sysfs_init(void) > +{ > + /* A NULL iw_table is interpreted by interleave logic as "all 1s" */ > + iw_table = NULL; > + return 0; > +} > + > +static void __exit mempolicy_exit(void) { } > +#endif /* CONFIG_SYSFS */ > +late_initcall(mempolicy_sysfs_init); > +module_exit(mempolicy_exit); mempolicy.c will not be compiled as module, so we don't need module_exit(). -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface 2024-01-22 8:03 ` Huang, Ying @ 2024-01-22 16:58 ` Gregory Price 0 siblings, 0 replies; 15+ messages in thread From: Gregory Price @ 2024-01-22 16:58 UTC (permalink / raw) To: Huang, Ying Cc: Gregory Price, linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams On Mon, Jan 22, 2024 at 04:03:53PM +0800, Huang, Ying wrote: > Gregory Price <gourry.memverge@gmail.com> writes: > > > + /* > > + * The default weight is 1 (for now), when the kernel-internal > > + * default weight array is implemented, this should be updated to > > + * collect the system-default weight of the node if the user passes 0. > > + */ > > + if (!weight) > > + weight = 1; > > From functionality point of view, it's OK to set "weight = 1" here now. > But when we add system default weight table in the future, we need to > use "weight = 0". Otherwise, we cannot distinguish whether the default > value have been customized via sysfs. So, I suggest to use that rule. > [... snip ...] > > + else > > + memset(new, 1, nr_node_ids); > > With similar reason as above ("From functionality..."), I suggest to set > "0" here. > blah - the comment is misleading at best. The future patch should pass 0 through to the sysfs table and the allocators updated to collect the system-default weight of the node. re: doing it this way right now - I chose to do it this way for now because it ultimately simplifies the logic in the allocators - all of which will need to be updated with the future patch set regardless of our implementation choice now. e.g. rcu_read_lock(); table = rcu_dereference(iw_table); if (!policy->wil.cur_weight) policy->wil.cur_weight = table ? table[next] : 1; ^^^ only need single conditional now rcu_read_unlock(); This logic will need to be updated to use default table values, so I chose the simpler implementation and left the change to be explicit at the time the default table is implemented. If you prefer it the other way now, I can change it, but this seemed cleaner and simpler for the time being. > > + new[node_attr->nid] = weight; > > + rcu_assign_pointer(iw_table, new); > > + mutex_unlock(&iw_table_lock); > > + synchronize_rcu(); > > + kfree(old); > > + return count; > > +} > > + > > +static struct iw_node_attr *node_attrs[MAX_NUMNODES]; > > node_attrs[] can be allocated dynamically too. Just a suggestion. > ack to this and other references to nr_node_ids, will change. > > + kfree(old); > > It appears unnecessary to free iw_table in error path. But this isn't a > big deal because error path will almost never be executed in practice. > checkpatch.pl yells at you if you do null checks before kfree :] > > + int err; > > + struct kobject *mempolicy_kobj; > > This overrides the global "mempolicy_kobj" defined before function. But > I don't think we need the global definition. > Assuming the exit path isn't needed then yeah the global isn't needed. > > +static int __init mempolicy_sysfs_init(void) > > +{ > > + /* A NULL iw_table is interpreted by interleave logic as "all 1s" */ > > + iw_table = NULL; > > + return 0; > > +} > > + > > +static void __exit mempolicy_exit(void) { } > > +#endif /* CONFIG_SYSFS */ > > +late_initcall(mempolicy_sysfs_init); > > +module_exit(mempolicy_exit); > > mempolicy.c will not be compiled as module, so we don't need > module_exit(). > ack ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 2/3] mm/mempolicy: refactor a read-once mechanism into a function for re-use 2024-01-19 17:57 [PATCH v2 0/3] mm/mempolicy: weighted interleave mempolicy and sysfs extension Gregory Price 2024-01-19 17:57 ` [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price @ 2024-01-19 17:57 ` Gregory Price 2024-01-19 17:57 ` [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price 2 siblings, 0 replies; 15+ messages in thread From: Gregory Price @ 2024-01-19 17:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, gregory.price, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, ying.huang, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams move the use of barrier() to force policy->nodemask onto the stack into a function `read_once_policy_nodemask` so that it may be re-used. Suggested-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Gregory Price <gregory.price@memverge.com> --- mm/mempolicy.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index ae925216798f..427bddf115df 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1893,6 +1893,20 @@ unsigned int mempolicy_slab_node(void) } } +static unsigned int read_once_policy_nodemask(struct mempolicy *pol, + nodemask_t *mask) +{ + /* + * barrier stabilizes the nodemask locally so that it can be iterated + * over safely without concern for changes. Allocators validate node + * selection does not violate mems_allowed, so this is safe. + */ + barrier(); + memcpy(mask, &pol->nodes, sizeof(nodemask_t)); + barrier(); + return nodes_weight(*mask); +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -1900,20 +1914,12 @@ unsigned int mempolicy_slab_node(void) */ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx) { - nodemask_t nodemask = pol->nodes; + nodemask_t nodemask; unsigned int target, nnodes; int i; int nid; - /* - * The barrier will stabilize the nodemask in a register or on - * the stack so that it will stop changing under the code. - * - * Between first_node() and next_node(), pol->nodes could be changed - * by other threads. So we put pol->nodes in a local stack. - */ - barrier(); - nnodes = nodes_weight(nodemask); + nnodes = read_once_policy_nodemask(pol, &nodemask); if (!nnodes) return numa_node_id(); target = ilx % nnodes; -- 2.39.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-19 17:57 [PATCH v2 0/3] mm/mempolicy: weighted interleave mempolicy and sysfs extension Gregory Price 2024-01-19 17:57 ` [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price 2024-01-19 17:57 ` [PATCH v2 2/3] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price @ 2024-01-19 17:57 ` Gregory Price 2024-01-23 3:02 ` Huang, Ying 2024-01-23 8:40 ` Huang, Ying 2 siblings, 2 replies; 15+ messages in thread From: Gregory Price @ 2024-01-19 17:57 UTC (permalink / raw) To: linux-mm Cc: linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, gregory.price, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, ying.huang, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be an wise option. However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics. Instead, interleave is more effective when the allocation policy follows each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple numa nodes, preferably apportioned to match the bandwidth of each node. For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1). Weights for each node can be assigned via the new sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/ For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time. The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1. The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2). There are 3 integration points: weighted_interleave_nodes: Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. weighted_interleave_nid: Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. bulk_array_weighted_interleave: Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them. If a node was scheduled for interleave via interleave_nodes, the current weight (pol->cur_weight) will be allocated first, before the remaining bulk calculation is done. One piece of complexity is the interaction between a recent refactor which split the logic to acquire the "ilx" (interleave index) of an allocation and the actually application of the interleave. The calculation of the `interleave index` is done by `get_vma_policy()`, while the actual selection of the node will be later appliex by the relevant weighted_interleave function. Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com> Signed-off-by: Gregory Price <gregory.price@memverge.com> Co-developed-by: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Rakie Kim <rakie.kim@sk.com> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> --- .../admin-guide/mm/numa_memory_policy.rst | 9 + include/linux/mempolicy.h | 5 + include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 234 +++++++++++++++++- 4 files changed, 246 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index eca38fa81e0f..a70f20ce1ffb 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY can fall back to all existing numa nodes. This is effectively MPOL_PREFERRED allowed for a mask rather than a single node. +MPOL_WEIGHTED_INTERLEAVE + This mode operates the same as MPOL_INTERLEAVE, except that + interleaving behavior is executed based on weights set in + /sys/kernel/mm/mempolicy/weighted_interleave/ + + Weighted interleave allocates pages on nodes according to a + weight. For example if nodes [0,1] are weighted [5,2], 5 pages + will be allocated on node0 for every 2 pages allocated on node1. + NUMA memory policy supports the following optional mode flags: MPOL_F_STATIC_NODES diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 931b118336f4..c1a083eb0dd5 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -54,6 +54,11 @@ struct mempolicy { nodemask_t cpuset_mems_allowed; /* relative to these nodes */ nodemask_t user_nodemask; /* nodemask passed by user */ } w; + + /* Weighted interleave settings */ + struct { + u8 cur_weight; + } wil; }; /* diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index a8963f7ef4c2..1f9bb10d1a47 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -23,6 +23,7 @@ enum { MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY, + MPOL_WEIGHTED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 427bddf115df..aa3b2389d3e0 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -19,6 +19,13 @@ * for anonymous memory. For process policy an process counter * is used. * + * weighted interleave + * Allocate memory interleaved over a set of nodes based on + * a set of weights (per-node), with normal fallback if it + * fails. Otherwise operates the same as interleave. + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated + * on node 0 for every 1 page allocated on node 1. + * * bind Only allocate memory on a specific set of nodes, * no fallback. * FIXME: memory is allocated starting with the first node @@ -313,6 +320,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, policy->mode = mode; policy->flags = flags; policy->home_node = NUMA_NO_NODE; + policy->wil.cur_weight = 0; return policy; } @@ -425,6 +433,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { .create = mpol_new_nodemask, .rebind = mpol_rebind_preferred, }, + [MPOL_WEIGHTED_INTERLEAVE] = { + .create = mpol_new_nodemask, + .rebind = mpol_rebind_nodemask, + }, }; static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE)) current->il_prev = MAX_NUMNODES-1; task_unlock(current); mpol_put(old); @@ -872,6 +885,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: *nodes = pol->nodes; break; case MPOL_LOCAL: @@ -956,6 +970,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, } else if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE) { *policy = next_node_in(current->il_prev, pol->nodes); + } else if (pol == current->mempolicy && + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) { + if (pol->wil.cur_weight) + *policy = current->il_prev; + else + *policy = next_node_in(current->il_prev, + pol->nodes); } else { err = -EINVAL; goto out; @@ -1785,7 +1806,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, pol = __get_vma_policy(vma, addr, ilx); if (!pol) pol = get_task_policy(current); - if (pol->mode == MPOL_INTERLEAVE) { + if (pol->mode == MPOL_INTERLEAVE || + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { *ilx += vma->vm_pgoff >> order; *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); } @@ -1835,6 +1857,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) return zone >= dynamic_policy_zone; } +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) +{ + unsigned int next; + struct task_struct *me = current; + u8 __rcu *table; + + next = next_node_in(me->il_prev, policy->nodes); + if (next == MAX_NUMNODES) + return next; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + if (!policy->wil.cur_weight) + policy->wil.cur_weight = table ? table[next] : 1; + rcu_read_unlock(); + + policy->wil.cur_weight--; + if (!policy->wil.cur_weight) + me->il_prev = next; + return next; +} + /* Do dynamic interleaving for a process */ static unsigned int interleave_nodes(struct mempolicy *policy) { @@ -1869,6 +1913,9 @@ unsigned int mempolicy_slab_node(void) case MPOL_INTERLEAVE: return interleave_nodes(policy); + case MPOL_WEIGHTED_INTERLEAVE: + return weighted_interleave_nodes(policy); + case MPOL_BIND: case MPOL_PREFERRED_MANY: { @@ -1907,6 +1954,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol, return nodes_weight(*mask); } +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) +{ + nodemask_t nodemask; + unsigned int target, nr_nodes; + u8 __rcu *table; + unsigned int weight_total = 0; + u8 weight; + int nid; + + nr_nodes = read_once_policy_nodemask(pol, &nodemask); + if (!nr_nodes) + return numa_node_id(); + + rcu_read_lock(); + table = rcu_dereference(iw_table); + /* calculate the total weight */ + for_each_node_mask(nid, nodemask) + weight_total += table ? table[nid] : 1; + + /* Calculate the node offset based on totals */ + target = ilx % weight_total; + nid = first_node(nodemask); + while (target) { + weight = table ? table[nid] : 1; + if (target < weight) + break; + target -= weight; + nid = next_node_in(nid, nodemask); + } + rcu_read_unlock(); + return nid; +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -1967,6 +2047,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, *nid = (ilx == NO_INTERLEAVE_INDEX) ? interleave_nodes(pol) : interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + *nid = (ilx == NO_INTERLEAVE_INDEX) ? + weighted_interleave_nodes(pol) : + weighted_interleave_nid(pol, ilx); + break; } return nodemask; @@ -2028,6 +2113,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: *mask = mempolicy->nodes; break; @@ -2127,7 +2213,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, * If the policy is interleave or does not allow the current * node in its nodemask, we allocate the standard way. */ - if (pol->mode != MPOL_INTERLEAVE && + if ((pol->mode != MPOL_INTERLEAVE && + pol->mode != MPOL_WEIGHTED_INTERLEAVE) && (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but @@ -2263,6 +2350,135 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, return total_allocated; } +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, + struct mempolicy *pol, unsigned long nr_pages, + struct page **page_array) +{ + struct task_struct *me = current; + unsigned long total_allocated = 0; + unsigned long nr_allocated; + unsigned long rounds; + unsigned long node_pages, delta; + u8 weight, resume_weight; + u8 __rcu *table; + u8 *weights; + unsigned int weight_total = 0; + unsigned long rem_pages = nr_pages; + nodemask_t nodes; + int nnodes, node, weight_nodes, resume_node; + int prev_node = NUMA_NO_NODE; + bool delta_depleted = false; + int i; + + if (!nr_pages) + return 0; + + nnodes = read_once_policy_nodemask(pol, &nodes); + if (!nnodes) + return 0; + + /* Continue allocating from most recent node and adjust the nr_pages */ + if (pol->wil.cur_weight) { + node = next_node_in(me->il_prev, nodes); + node_pages = pol->wil.cur_weight; + if (node_pages > rem_pages) + node_pages = rem_pages; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + /* if that's all the pages, no need to interleave */ + if (rem_pages <= pol->wil.cur_weight) { + pol->wil.cur_weight -= rem_pages; + return total_allocated; + } + /* Otherwise we adjust nr_pages down, and continue from there */ + rem_pages -= pol->wil.cur_weight; + pol->wil.cur_weight = 0; + prev_node = node; + } + + /* fetch the weights for this operation and calculate total weight */ + weights = kmalloc(nnodes, GFP_KERNEL); + if (!weights) + return total_allocated; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + weight_nodes = 0; + while (weight_nodes < nnodes) { + node = next_node_in(prev_node, nodes); + weight = table ? table[node] : 1; + weights[weight_nodes++] = weight; + weight_total += weight; + } + rcu_read_unlock(); + + /* + * Now we can continue allocating from 0 instead of an offset + * We calculate the number of rounds and any partial rounds so + * that we minimize the number of calls to __alloc_pages_bulk + * This requires us to track which node we should resume from. + * + * if (rounds > 0) and (delta == 0), resume_node will always be + * the current me->il_prev + * + * if (delta > 0) and delta is depleted exactly on a node-weight + * boundary, resume node will be the node last allocated from when + * delta reached 0. + * + * if (delta > 0) and delta is not depleted on a node-weight boundary, + * resume node will be the node prior to the node last allocated from. + * + * (rounds == 0) and (delta == 0) is not possible (earlier exit) + */ + rounds = rem_pages / weight_total; + delta = rem_pages % weight_total; + /* If no delta, we'll resume from current prev_node and first weight */ + for (i = 0; i < nnodes; i++) { + node = next_node_in(prev_node, nodes); + weight = weights[i]; + node_pages = weight * rounds; + /* If a delta exists, add this node's portion of the delta */ + if (delta >= weight) { + node_pages += weight; + delta -= weight; + resume_node = node; + resume_weight = i < (nnodes - 1) ? weights[i+1] : + weights[0]; + /* stop tracking iff (delta == weight) */ + delta_depleted = !delta; + } else if (delta) { /* <= weight */ + /* if delta depleted, resume from this node */ + node_pages += delta; + delta = 0; + resume_node = prev_node; + resume_weight = weight - (node_pages % weight); + delta_depleted = true; /* stop tracking */ + } else if (!delta_depleted) { + /* if there was no delta, track last allocated node */ + resume_node = node; + resume_weight = i < (nnodes - 1) ? weights[i+1] : + weights[0]; + } + /* node_pages can be 0 if an allocation fails and rounds == 0 */ + if (!node_pages) + break; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + if (total_allocated == nr_pages) + break; + prev_node = node; + } + /* resume allocating from the calculated node and weight */ + me->il_prev = resume_node; + pol->wil.cur_weight = resume_weight; + kfree(weights); + return total_allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2303,6 +2519,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE) + return alloc_pages_bulk_array_weighted_interleave( + gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2378,6 +2598,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; @@ -2514,6 +2735,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, polnid = interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + polnid = weighted_interleave_nid(pol, ilx); + break; + case MPOL_PREFERRED: if (node_isset(curnid, pol->nodes)) goto out; @@ -2888,6 +3113,7 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave", [MPOL_LOCAL] = "local", [MPOL_PREFERRED_MANY] = "prefer (many)", }; @@ -2947,6 +3173,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } break; case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: /* * Default to online nodes with memory if no nodelist */ @@ -3057,6 +3284,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: nodes = pol->nodes; break; default: -- 2.39.1 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-19 17:57 ` [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price @ 2024-01-23 3:02 ` Huang, Ying 2024-01-23 4:54 ` Gregory Price 2024-01-23 8:40 ` Huang, Ying 1 sibling, 1 reply; 15+ messages in thread From: Huang, Ying @ 2024-01-23 3:02 UTC (permalink / raw) To: Gregory Price Cc: linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, gregory.price, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru Gregory Price <gourry.memverge@gmail.com> writes: > When a system has multiple NUMA nodes and it becomes bandwidth hungry, > using the current MPOL_INTERLEAVE could be an wise option. > > However, if those NUMA nodes consist of different types of memory such > as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin > based interleave policy does not optimally distribute data to make use > of their different bandwidth characteristics. > > Instead, interleave is more effective when the allocation policy follows > each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. > > This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, > enabling weighted interleave between NUMA nodes. Weighted interleave > allows for proportional distribution of memory across multiple numa > nodes, preferably apportioned to match the bandwidth of each node. > > For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), > with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate > weight distribution is (2:1). > > Weights for each node can be assigned via the new sysfs extension: > /sys/kernel/mm/mempolicy/weighted_interleave/ > > For now, the default value of all nodes will be `1`, which matches > the behavior of standard 1:1 round-robin interleave. An extension > will be added in the future to allow default values to be registered > at kernel and device bringup time. > > The policy allocates a number of pages equal to the set weights. For > example, if the weights are (2,1), then 2 pages will be allocated on > node0 for every 1 page allocated on node1. > > The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) > and mbind(2). > > There are 3 integration points: > > weighted_interleave_nodes: > Counts the number of allocations as they occur, and applies the > weight for the current node. When the weight reaches 0, switch > to the next node. > > weighted_interleave_nid: > Gets the total weight of the nodemask as well as each individual > node weight, then calculates the node based on the given index. > > bulk_array_weighted_interleave: > Gets the total weight of the nodemask as well as each individual > node weight, then calculates the number of "interleave rounds" as > well as any delta ("partial round"). Calculates the number of > pages for each node and allocates them. > > If a node was scheduled for interleave via interleave_nodes, the > current weight (pol->cur_weight) will be allocated first, before > the remaining bulk calculation is done. > > One piece of complexity is the interaction between a recent refactor > which split the logic to acquire the "ilx" (interleave index) of an > allocation and the actually application of the interleave. The > calculation of the `interleave index` is done by `get_vma_policy()`, > while the actual selection of the node will be later appliex by the > relevant weighted_interleave function. > > Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com> > Signed-off-by: Gregory Price <gregory.price@memverge.com> > Co-developed-by: Rakie Kim <rakie.kim@sk.com> > Signed-off-by: Rakie Kim <rakie.kim@sk.com> > Co-developed-by: Honggyu Kim <honggyu.kim@sk.com> > Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> > Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> > Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> > Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> > Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> > --- > .../admin-guide/mm/numa_memory_policy.rst | 9 + > include/linux/mempolicy.h | 5 + > include/uapi/linux/mempolicy.h | 1 + > mm/mempolicy.c | 234 +++++++++++++++++- > 4 files changed, 246 insertions(+), 3 deletions(-) > > diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst > index eca38fa81e0f..a70f20ce1ffb 100644 > --- a/Documentation/admin-guide/mm/numa_memory_policy.rst > +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst > @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY > can fall back to all existing numa nodes. This is effectively > MPOL_PREFERRED allowed for a mask rather than a single node. > > +MPOL_WEIGHTED_INTERLEAVE > + This mode operates the same as MPOL_INTERLEAVE, except that > + interleaving behavior is executed based on weights set in > + /sys/kernel/mm/mempolicy/weighted_interleave/ > + > + Weighted interleave allocates pages on nodes according to a > + weight. For example if nodes [0,1] are weighted [5,2], 5 pages > + will be allocated on node0 for every 2 pages allocated on node1. > + > NUMA memory policy supports the following optional mode flags: > > MPOL_F_STATIC_NODES > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 931b118336f4..c1a083eb0dd5 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -54,6 +54,11 @@ struct mempolicy { > nodemask_t cpuset_mems_allowed; /* relative to these nodes */ > nodemask_t user_nodemask; /* nodemask passed by user */ > } w; > + > + /* Weighted interleave settings */ > + struct { > + u8 cur_weight; > + } wil; > }; > > /* > diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h > index a8963f7ef4c2..1f9bb10d1a47 100644 > --- a/include/uapi/linux/mempolicy.h > +++ b/include/uapi/linux/mempolicy.h > @@ -23,6 +23,7 @@ enum { > MPOL_INTERLEAVE, > MPOL_LOCAL, > MPOL_PREFERRED_MANY, > + MPOL_WEIGHTED_INTERLEAVE, > MPOL_MAX, /* always last member of enum */ > }; > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 427bddf115df..aa3b2389d3e0 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -19,6 +19,13 @@ > * for anonymous memory. For process policy an process counter > * is used. > * > + * weighted interleave > + * Allocate memory interleaved over a set of nodes based on > + * a set of weights (per-node), with normal fallback if it > + * fails. Otherwise operates the same as interleave. > + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated > + * on node 0 for every 1 page allocated on node 1. > + * > * bind Only allocate memory on a specific set of nodes, > * no fallback. > * FIXME: memory is allocated starting with the first node > @@ -313,6 +320,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, > policy->mode = mode; > policy->flags = flags; > policy->home_node = NUMA_NO_NODE; > + policy->wil.cur_weight = 0; > > return policy; > } > @@ -425,6 +433,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { > .create = mpol_new_nodemask, > .rebind = mpol_rebind_preferred, > }, > + [MPOL_WEIGHTED_INTERLEAVE] = { > + .create = mpol_new_nodemask, > + .rebind = mpol_rebind_nodemask, > + }, > }; > > static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, > @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, > > old = current->mempolicy; > current->mempolicy = new; > - if (new && new->mode == MPOL_INTERLEAVE) > + if (new && (new->mode == MPOL_INTERLEAVE || > + new->mode == MPOL_WEIGHTED_INTERLEAVE)) > current->il_prev = MAX_NUMNODES-1; > task_unlock(current); > mpol_put(old); > @@ -872,6 +885,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) > case MPOL_INTERLEAVE: > case MPOL_PREFERRED: > case MPOL_PREFERRED_MANY: > + case MPOL_WEIGHTED_INTERLEAVE: > *nodes = pol->nodes; > break; > case MPOL_LOCAL: > @@ -956,6 +970,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, > } else if (pol == current->mempolicy && > pol->mode == MPOL_INTERLEAVE) { > *policy = next_node_in(current->il_prev, pol->nodes); > + } else if (pol == current->mempolicy && > + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) { > + if (pol->wil.cur_weight) > + *policy = current->il_prev; > + else > + *policy = next_node_in(current->il_prev, > + pol->nodes); > } else { > err = -EINVAL; > goto out; > @@ -1785,7 +1806,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > pol = __get_vma_policy(vma, addr, ilx); > if (!pol) > pol = get_task_policy(current); > - if (pol->mode == MPOL_INTERLEAVE) { > + if (pol->mode == MPOL_INTERLEAVE || > + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { Should change the comments above get_vma_policy() definition too. > *ilx += vma->vm_pgoff >> order; > *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); > } > @@ -1835,6 +1857,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > return zone >= dynamic_policy_zone; > } > > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) > +{ > + unsigned int next; > + struct task_struct *me = current; > + u8 __rcu *table; > + > + next = next_node_in(me->il_prev, policy->nodes); > + if (next == MAX_NUMNODES) > + return next; > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + if (!policy->wil.cur_weight) > + policy->wil.cur_weight = table ? table[next] : 1; > + rcu_read_unlock(); > + > + policy->wil.cur_weight--; > + if (!policy->wil.cur_weight) > + me->il_prev = next; > + return next; > +} > + > /* Do dynamic interleaving for a process */ > static unsigned int interleave_nodes(struct mempolicy *policy) > { > @@ -1869,6 +1913,9 @@ unsigned int mempolicy_slab_node(void) > case MPOL_INTERLEAVE: > return interleave_nodes(policy); > > + case MPOL_WEIGHTED_INTERLEAVE: > + return weighted_interleave_nodes(policy); > + > case MPOL_BIND: > case MPOL_PREFERRED_MANY: > { > @@ -1907,6 +1954,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol, > return nodes_weight(*mask); > } > > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) > +{ > + nodemask_t nodemask; > + unsigned int target, nr_nodes; > + u8 __rcu *table; > + unsigned int weight_total = 0; > + u8 weight; > + int nid; > + > + nr_nodes = read_once_policy_nodemask(pol, &nodemask); > + if (!nr_nodes) > + return numa_node_id(); > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + /* calculate the total weight */ > + for_each_node_mask(nid, nodemask) > + weight_total += table ? table[nid] : 1; > + > + /* Calculate the node offset based on totals */ > + target = ilx % weight_total; > + nid = first_node(nodemask); > + while (target) { > + weight = table ? table[nid] : 1; > + if (target < weight) > + break; > + target -= weight; > + nid = next_node_in(nid, nodemask); > + } > + rcu_read_unlock(); > + return nid; > +} > + > /* > * Do static interleaving for interleave index @ilx. Returns the ilx'th > * node in pol->nodes (starting from ilx=0), wrapping around if ilx > @@ -1967,6 +2047,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, > *nid = (ilx == NO_INTERLEAVE_INDEX) ? > interleave_nodes(pol) : interleave_nid(pol, ilx); > break; > + case MPOL_WEIGHTED_INTERLEAVE: > + *nid = (ilx == NO_INTERLEAVE_INDEX) ? > + weighted_interleave_nodes(pol) : > + weighted_interleave_nid(pol, ilx); > + break; > } > > return nodemask; > @@ -2028,6 +2113,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) > case MPOL_PREFERRED_MANY: > case MPOL_BIND: > case MPOL_INTERLEAVE: > + case MPOL_WEIGHTED_INTERLEAVE: > *mask = mempolicy->nodes; > break; > > @@ -2127,7 +2213,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > * If the policy is interleave or does not allow the current > * node in its nodemask, we allocate the standard way. > */ > - if (pol->mode != MPOL_INTERLEAVE && > + if ((pol->mode != MPOL_INTERLEAVE && > + pol->mode != MPOL_WEIGHTED_INTERLEAVE) && > (!nodemask || node_isset(nid, *nodemask))) { > /* > * First, try to allocate THP only on local node, but > @@ -2263,6 +2350,135 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, > return total_allocated; > } > > +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, > + struct mempolicy *pol, unsigned long nr_pages, > + struct page **page_array) > +{ > + struct task_struct *me = current; > + unsigned long total_allocated = 0; > + unsigned long nr_allocated; > + unsigned long rounds; > + unsigned long node_pages, delta; > + u8 weight, resume_weight; > + u8 __rcu *table; > + u8 *weights; > + unsigned int weight_total = 0; > + unsigned long rem_pages = nr_pages; > + nodemask_t nodes; > + int nnodes, node, weight_nodes, resume_node; > + int prev_node = NUMA_NO_NODE; It appears that we should initialize prev_node with me->il_prev? Details are as below. > + bool delta_depleted = false; > + int i; > + > + if (!nr_pages) > + return 0; > + > + nnodes = read_once_policy_nodemask(pol, &nodes); > + if (!nnodes) > + return 0; > + > + /* Continue allocating from most recent node and adjust the nr_pages */ > + if (pol->wil.cur_weight) { > + node = next_node_in(me->il_prev, nodes); > + node_pages = pol->wil.cur_weight; > + if (node_pages > rem_pages) > + node_pages = rem_pages; > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > + NULL, page_array); > + page_array += nr_allocated; > + total_allocated += nr_allocated; > + /* if that's all the pages, no need to interleave */ > + if (rem_pages <= pol->wil.cur_weight) { > + pol->wil.cur_weight -= rem_pages; If "pol->wil.cur_weight == 0" here, we need to change me->il_prev? > + return total_allocated; > + } > + /* Otherwise we adjust nr_pages down, and continue from there */ > + rem_pages -= pol->wil.cur_weight; > + pol->wil.cur_weight = 0; > + prev_node = node; > + } else { prev_node = me->il_prev; } > + > + /* fetch the weights for this operation and calculate total weight */ > + weights = kmalloc(nnodes, GFP_KERNEL); > + if (!weights) > + return total_allocated; > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + weight_nodes = 0; We can replace "weight_nodes" with "i" and use a "for" loop? > + while (weight_nodes < nnodes) { > + node = next_node_in(prev_node, nodes); IIUC, "node" will not change in the loop, so all "weight" below will be the same value. To keep it simple, I think we can just copy weights from the global iw_table and consider the default value? > + weight = table ? table[node] : 1; > + weights[weight_nodes++] = weight; > + weight_total += weight; > + } > + rcu_read_unlock(); > + > + /* > + * Now we can continue allocating from 0 instead of an offset > + * We calculate the number of rounds and any partial rounds so > + * that we minimize the number of calls to __alloc_pages_bulk > + * This requires us to track which node we should resume from. > + * > + * if (rounds > 0) and (delta == 0), resume_node will always be > + * the current me->il_prev > + * > + * if (delta > 0) and delta is depleted exactly on a node-weight > + * boundary, resume node will be the node last allocated from when > + * delta reached 0. > + * > + * if (delta > 0) and delta is not depleted on a node-weight boundary, > + * resume node will be the node prior to the node last allocated from. > + * > + * (rounds == 0) and (delta == 0) is not possible (earlier exit) > + */ > + rounds = rem_pages / weight_total; > + delta = rem_pages % weight_total; > + /* If no delta, we'll resume from current prev_node and first weight */ > + for (i = 0; i < nnodes; i++) { > + node = next_node_in(prev_node, nodes); > + weight = weights[i]; > + node_pages = weight * rounds; > + /* If a delta exists, add this node's portion of the delta */ > + if (delta >= weight) { > + node_pages += weight; > + delta -= weight; > + resume_node = node; > + resume_weight = i < (nnodes - 1) ? weights[i+1] : > + weights[0]; > + /* stop tracking iff (delta == weight) */ > + delta_depleted = !delta; > + } else if (delta) { /* <= weight */ The comment is unnecessary and wrong. > + /* if delta depleted, resume from this node */ > + node_pages += delta; > + delta = 0; > + resume_node = prev_node; > + resume_weight = weight - (node_pages % weight); > + delta_depleted = true; /* stop tracking */ > + } else if (!delta_depleted) { > + /* if there was no delta, track last allocated node */ > + resume_node = node; > + resume_weight = i < (nnodes - 1) ? weights[i+1] : > + weights[0]; > + } Can the above code be simplified as something like below? resume_node = prev_node; resume_weight = 0; for (...) { ... if (delta > weight) { node_pages += weight; delta -= weight; } else if (delta) { node_pages += delta; /* if delta depleted, resume from this node */ if (delta < weight) { resume_node = prev_node; resume_weight = weight - delta; } else { resume_node = node; } delta = 0; } ... } -- Best Regards, Huang, Ying > + /* node_pages can be 0 if an allocation fails and rounds == 0 */ > + if (!node_pages) > + break; > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > + NULL, page_array); > + page_array += nr_allocated; > + total_allocated += nr_allocated; > + if (total_allocated == nr_pages) > + break; > + prev_node = node; > + } > + /* resume allocating from the calculated node and weight */ > + me->il_prev = resume_node; > + pol->wil.cur_weight = resume_weight; > + kfree(weights); > + return total_allocated; > +} > + > static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, > struct mempolicy *pol, unsigned long nr_pages, > struct page **page_array) > @@ -2303,6 +2519,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, > return alloc_pages_bulk_array_interleave(gfp, pol, > nr_pages, page_array); > > + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE) > + return alloc_pages_bulk_array_weighted_interleave( > + gfp, pol, nr_pages, page_array); > + > if (pol->mode == MPOL_PREFERRED_MANY) > return alloc_pages_bulk_array_preferred_many(gfp, > numa_node_id(), pol, nr_pages, page_array); > @@ -2378,6 +2598,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) > case MPOL_INTERLEAVE: > case MPOL_PREFERRED: > case MPOL_PREFERRED_MANY: > + case MPOL_WEIGHTED_INTERLEAVE: > return !!nodes_equal(a->nodes, b->nodes); > case MPOL_LOCAL: > return true; > @@ -2514,6 +2735,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, > polnid = interleave_nid(pol, ilx); > break; > > + case MPOL_WEIGHTED_INTERLEAVE: > + polnid = weighted_interleave_nid(pol, ilx); > + break; > + > case MPOL_PREFERRED: > if (node_isset(curnid, pol->nodes)) > goto out; > @@ -2888,6 +3113,7 @@ static const char * const policy_modes[] = > [MPOL_PREFERRED] = "prefer", > [MPOL_BIND] = "bind", > [MPOL_INTERLEAVE] = "interleave", > + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave", > [MPOL_LOCAL] = "local", > [MPOL_PREFERRED_MANY] = "prefer (many)", > }; > @@ -2947,6 +3173,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) > } > break; > case MPOL_INTERLEAVE: > + case MPOL_WEIGHTED_INTERLEAVE: > /* > * Default to online nodes with memory if no nodelist > */ > @@ -3057,6 +3284,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) > case MPOL_PREFERRED_MANY: > case MPOL_BIND: > case MPOL_INTERLEAVE: > + case MPOL_WEIGHTED_INTERLEAVE: > nodes = pol->nodes; > break; > default: ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-23 3:02 ` Huang, Ying @ 2024-01-23 4:54 ` Gregory Price 2024-01-23 5:16 ` Gregory Price 2024-01-23 8:13 ` Huang, Ying 0 siblings, 2 replies; 15+ messages in thread From: Gregory Price @ 2024-01-23 4:54 UTC (permalink / raw) To: Huang, Ying Cc: Gregory Price, linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru On Tue, Jan 23, 2024 at 11:02:03AM +0800, Huang, Ying wrote: > Gregory Price <gourry.memverge@gmail.com> writes: > > > + int prev_node = NUMA_NO_NODE; > > It appears that we should initialize prev_node with me->il_prev? > Details are as below. > yeah good catch, was a rebase error from my tested code, where this is the case. patching now. > > + if (rem_pages <= pol->wil.cur_weight) { > > + pol->wil.cur_weight -= rem_pages; > > If "pol->wil.cur_weight == 0" here, we need to change me->il_prev? > you are right, and also need to fetch the next cur_weight. Seems I missed this specific case in my tests. (had this tested with a single node but not 2, so it looked right). Added to my test suite. > We can replace "weight_nodes" with "i" and use a "for" loop? > > > + while (weight_nodes < nnodes) { > > + node = next_node_in(prev_node, nodes); > > IIUC, "node" will not change in the loop, so all "weight" below will be > the same value. To keep it simple, I think we can just copy weights > from the global iw_table and consider the default value? > another rebase error here from my tested code, this should have been node = prev_node; while (...) node = next_node_in(node, nodes); I can change it to a for loop as suggested, but for more info on why I did it this way, see the chunk below > > + } else if (!delta_depleted) { > > + /* if there was no delta, track last allocated node */ > > + resume_node = node; > > + resume_weight = i < (nnodes - 1) ? weights[i+1] : > > + weights[0]; ^ this line acquires the weight of the *NEXT* node another chunk prior to this does the same thing. I suppose i can use next_node_in() instead and just copy the entire weigh array though, if that is preferable. > > + } > > Can the above code be simplified as something like below? > > resume_node = prev_node; > resume_weight = 0; > for (...) { > ... > if (delta > weight) { > node_pages += weight; > delta -= weight; > } else if (delta) { > node_pages += delta; > /* if delta depleted, resume from this node */ > if (delta < weight) { > resume_node = prev_node; > resume_weight = weight - delta; > } else { > resume_node = node; > } > delta = 0; > } > ... > } > I'll take another look at it, but this logic is annoying because of the corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node. If it's NUMA_NO_NODE, then the logic you have above will say "the next node has no remaining weights assigned" and skip it on the next call to weighted_interleave_nid or weighted_interleave_nodes. This is incorrect - we want the weight of the first node to be resume_weight, which is what this chunk does: if (delta >= weight) { /* if delta == weight, get next node weight */ resume_weight = i < (nnodes - 1) ? weights[i+1] : weights[0]; else if (delta) { /* delta < weight */ /* there's a remaining weight, use the that for resume weight */ resume_weight = weight - (node_pages % weight); } else if (!delta_depleted) { /* there was never a delta, track the last node and get the weight * of the node AFTER that node, that's the resume weight */ resume_weight = i < (nnodes - 1) ? weights[i+1] : weights[0]; } If il_prev is an actual node, and delta == 0, we want to return with (il_prev = prev_node) but with the weight set to the weight of the first node we're about to allocate from. This is the reason for the annoying logic here: We have to come out of this loop with the actual node and the actual weight. I'll try to clean it up further and get my test suite to pass. ~Gregory ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-23 4:54 ` Gregory Price @ 2024-01-23 5:16 ` Gregory Price 2024-01-23 8:35 ` Huang, Ying 2024-01-23 8:13 ` Huang, Ying 1 sibling, 1 reply; 15+ messages in thread From: Gregory Price @ 2024-01-23 5:16 UTC (permalink / raw) To: Huang, Ying Cc: Gregory Price, linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru On Mon, Jan 22, 2024 at 11:54:34PM -0500, Gregory Price wrote: > > > > Can the above code be simplified as something like below? > > > > resume_node = prev_node; --- resume_weight = 0; +++ resume_weight = weights[node]; > > for (...) { > > ... > > } > > > > I'll take another look at it, but this logic is annoying because of the > corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node. > After a quick look, as long as no one objects to (me->il_prev) remaining NUMA_NO_NODE while having a weight assigned to pol->wil.cur_weight, then this looks like it can be simplified as above. I don't think it's harmful, but i'll have to take a quick look at what happens on rebind to make sure we don't have a stale weight. ~Gregory ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-23 5:16 ` Gregory Price @ 2024-01-23 8:35 ` Huang, Ying 2024-01-23 21:27 ` Gregory Price 0 siblings, 1 reply; 15+ messages in thread From: Huang, Ying @ 2024-01-23 8:35 UTC (permalink / raw) To: Gregory Price Cc: Gregory Price, linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru Gregory Price <gregory.price@memverge.com> writes: > On Mon, Jan 22, 2024 at 11:54:34PM -0500, Gregory Price wrote: >> > >> > Can the above code be simplified as something like below? >> > >> > resume_node = prev_node; > --- resume_weight = 0; > +++ resume_weight = weights[node]; >> > for (...) { >> > ... >> > } >> > >> >> I'll take another look at it, but this logic is annoying because of the >> corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node. >> > > After a quick look, as long as no one objects to (me->il_prev) remaining > NUMA_NO_NODE MAX_NUMNODES-1 ? > while having a weight assigned to pol->wil.cur_weight, I think that it is OK. And, IIUC, pol->wil.cur_weight can be 0, as in weighted_interleave_nodes(), if it's 0, it will be assigned to default weight for the node. > then > this looks like it can be simplified as above. > > I don't think it's harmful, but i'll have to take a quick look at what > happens on rebind to make sure we don't have a stale weight. Make sense. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-23 8:35 ` Huang, Ying @ 2024-01-23 21:27 ` Gregory Price 2024-01-24 1:51 ` Huang, Ying 0 siblings, 1 reply; 15+ messages in thread From: Gregory Price @ 2024-01-23 21:27 UTC (permalink / raw) To: Huang, Ying Cc: Gregory Price, linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru On Tue, Jan 23, 2024 at 04:35:19PM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@memverge.com> writes: > > > On Mon, Jan 22, 2024 at 11:54:34PM -0500, Gregory Price wrote: > >> > > >> > Can the above code be simplified as something like below? > >> > > >> > resume_node = prev_node; > > --- resume_weight = 0; > > +++ resume_weight = weights[node]; > >> > for (...) { > >> > ... > >> > } > >> > > >> > >> I'll take another look at it, but this logic is annoying because of the > >> corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node. > >> > > > > After a quick look, as long as no one objects to (me->il_prev) remaining > > NUMA_NO_NODE > > MAX_NUMNODES-1 ? > When setting a new policy, the il_prev gets set to NUMA_NO_NODE. It's not harmful and is just (-1), which is functionally the same as (MAX_NUMNODES-1) for the purpose of iterating the nodemask with next_node_in(). So it's fine to set (resume_node = me->il_prev) as discussed. I have a cleaned up function I'll push when i fix up a few other spots. > > while having a weight assigned to pol->wil.cur_weight, > > I think that it is OK. > > And, IIUC, pol->wil.cur_weight can be 0, as in > weighted_interleave_nodes(), if it's 0, it will be assigned to default > weight for the node. > cur_weight is different than the global weights. cur_weight tells us how many pages are remaining to allocate for the current node. (cur_weight = 0) can happen in two scenarios: - initial setting of mempolicy (NUMA_NO_NODE w/ cur_weight=0) - weighted_interleave_nodes decrements it down to 0 Now that i'm looking at it - the second condition should not exist, and we can eliminate it. The logic in weighted_interleave_nodes is actually annoyingly unclear at the moment, so I'm going to re-factor it a bit to be more explicit. ~Gregory ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-23 21:27 ` Gregory Price @ 2024-01-24 1:51 ` Huang, Ying 2024-01-24 18:01 ` Gregory Price 0 siblings, 1 reply; 15+ messages in thread From: Huang, Ying @ 2024-01-24 1:51 UTC (permalink / raw) To: Gregory Price Cc: Gregory Price, linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru Gregory Price <gregory.price@memverge.com> writes: > On Tue, Jan 23, 2024 at 04:35:19PM +0800, Huang, Ying wrote: >> Gregory Price <gregory.price@memverge.com> writes: >> >> > On Mon, Jan 22, 2024 at 11:54:34PM -0500, Gregory Price wrote: >> >> > >> >> > Can the above code be simplified as something like below? >> >> > >> >> > resume_node = prev_node; >> > --- resume_weight = 0; >> > +++ resume_weight = weights[node]; >> >> > for (...) { >> >> > ... >> >> > } >> >> > >> >> >> >> I'll take another look at it, but this logic is annoying because of the >> >> corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node. >> >> >> > >> > After a quick look, as long as no one objects to (me->il_prev) remaining >> > NUMA_NO_NODE >> >> MAX_NUMNODES-1 ? >> > > When setting a new policy, the il_prev gets set to NUMA_NO_NODE. It's IIUC, it is set to MAX_NUMNODES-1 as below, @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE)) current->il_prev = MAX_NUMNODES-1; task_unlock(current); mpol_put(old); I don't think we need to change this. > not harmful and is just (-1), which is functionally the same as > (MAX_NUMNODES-1) for the purpose of iterating the nodemask with > next_node_in(). So it's fine to set (resume_node = me->il_prev) > as discussed. > > I have a cleaned up function I'll push when i fix up a few other spots. > >> > while having a weight assigned to pol->wil.cur_weight, >> >> I think that it is OK. >> >> And, IIUC, pol->wil.cur_weight can be 0, as in >> weighted_interleave_nodes(), if it's 0, it will be assigned to default >> weight for the node. >> > > cur_weight is different than the global weights. cur_weight tells us > how many pages are remaining to allocate for the current node. > > (cur_weight = 0) can happen in two scenarios: > - initial setting of mempolicy (NUMA_NO_NODE w/ cur_weight=0) > - weighted_interleave_nodes decrements it down to 0 > > Now that i'm looking at it - the second condition should not exist, and > we can eliminate it. The logic in weighted_interleave_nodes is actually > annoyingly unclear at the moment, so I'm going to re-factor it a bit to > be more explicit. I am OK with either way. Just a reminder, the first condition may be true in alloc_pages_bulk_array_weighted_interleave() and perhaps some other places. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-24 1:51 ` Huang, Ying @ 2024-01-24 18:01 ` Gregory Price 0 siblings, 0 replies; 15+ messages in thread From: Gregory Price @ 2024-01-24 18:01 UTC (permalink / raw) To: Huang, Ying Cc: Gregory Price, linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru On Wed, Jan 24, 2024 at 09:51:20AM +0800, Huang, Ying wrote: > Gregory Price <gregory.price@memverge.com> writes: > > + if (new && (new->mode == MPOL_INTERLEAVE || > + new->mode == MPOL_WEIGHTED_INTERLEAVE)) > current->il_prev = MAX_NUMNODES-1; > task_unlock(current); > mpol_put(old); > > I don't think we need to change this. > Ah you're right it's set to MAX_NUMNODES-1 here, but NUMA_NO_NODE can be passed in as an argument to alloc_pages_bulk_array_mempolicy, like here: vm_area_alloc_pages() if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE) nr = alloc_pages_bulk_array_mempolicy(bulk_gfp, nr_pages_request, pages + nr_allocated); > > (cur_weight = 0) can happen in two scenarios: > > - initial setting of mempolicy (NUMA_NO_NODE w/ cur_weight=0) > > - weighted_interleave_nodes decrements it down to 0 > > > > Now that i'm looking at it - the second condition should not exist, and > > we can eliminate it. The logic in weighted_interleave_nodes is actually > > annoyingly unclear at the moment, so I'm going to re-factor it a bit to > > be more explicit. > > I am OK with either way. Just a reminder, the first condition may be > true in alloc_pages_bulk_array_weighted_interleave() and perhaps some > other places. > Yeah, the bulk allocator handles it correctly, it's just a matter of clarity for weighted_interleave_nodes. What isn't necessarily handled correctly is the rebind code. Rebind due to a cgroup/mems_allowed change can cause a stale weight to be carried. Basically cur_weight is not cleared, but the node it applied to may no longer be the next node when next_node_in() is called. The race condition is 1) exceedingly rare, and 2) not necessarily harmful, just inaccurate. The worst case scenario is that a node receives up to 255 additional allocations once after a rebind (but more likely 10-20). I was considering forcing the interleave forward like this: @@ -356,6 +361,10 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes) tmp = *nodes; pol->nodes = tmp; + + /* Weighted interleave policies are forced forward to the next node */ + if (pol->mode & MPOL_WEIGHTED_INTERLEAVE) + pol->wil.cur_weight = 0; } But this creates 2 race conditions when we read cur_weight and nodemask in the allocator path. Example 1: 1) bulk allocator READ_ONCE(mask), READ_ONCE(cur_weight) 2) rebind changes nodemask and { cur_weight = 0; } 3) bulk allocator sets pol->wil.cur_weight In this scenario, resume_weight is stale coming out of bulk allocations if the resume_node has been removed from the node mask. Example 2: 1) rebind changes nodemask 2) bulk allocator READ_ONCE(mask), READ_ONCE(cur_weight) 3) rebind sets { cur_weight = 0; } In this scenario, cur_weight is stale going into bulk allocations. Neither of these can force a violation of mems_allowed, just a mis-application of a weight. I'll need to think on this a bit. We can either leave this as-is, meaning the first allocation after a rebind may apply the wrong weight to a node, or we can try to track the current-interleave-node and validate next_node_in(mask) == current-interleave-node before leaving the allocator path (this may also be just as racey). turns out concurrent counting is still hard :] ~Gregory ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-23 4:54 ` Gregory Price 2024-01-23 5:16 ` Gregory Price @ 2024-01-23 8:13 ` Huang, Ying 1 sibling, 0 replies; 15+ messages in thread From: Huang, Ying @ 2024-01-23 8:13 UTC (permalink / raw) To: Gregory Price Cc: Gregory Price, linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru Gregory Price <gregory.price@memverge.com> writes: > On Tue, Jan 23, 2024 at 11:02:03AM +0800, Huang, Ying wrote: >> Gregory Price <gourry.memverge@gmail.com> writes: >> >> > + int prev_node = NUMA_NO_NODE; >> >> It appears that we should initialize prev_node with me->il_prev? >> Details are as below. >> > > yeah good catch, was a rebase error from my tested code, where this is > the case. patching now. > >> > + if (rem_pages <= pol->wil.cur_weight) { >> > + pol->wil.cur_weight -= rem_pages; >> >> If "pol->wil.cur_weight == 0" here, we need to change me->il_prev? >> > you are right, and also need to fetch the next cur_weight. Seems I > missed this specific case in my tests. (had this tested with a single > node but not 2, so it looked right). > > Added to my test suite. > >> We can replace "weight_nodes" with "i" and use a "for" loop? >> >> > + while (weight_nodes < nnodes) { >> > + node = next_node_in(prev_node, nodes); >> >> IIUC, "node" will not change in the loop, so all "weight" below will be >> the same value. To keep it simple, I think we can just copy weights >> from the global iw_table and consider the default value? >> > > another rebase error here from my tested code, this should have been > node = prev_node; > while (...) > node = next_node_in(node, nodes); > > I can change it to a for loop as suggested, but for more info on why I > did it this way, see the chunk below > >> > + } else if (!delta_depleted) { >> > + /* if there was no delta, track last allocated node */ >> > + resume_node = node; >> > + resume_weight = i < (nnodes - 1) ? weights[i+1] : >> > + weights[0]; > ^ this line acquires the weight of the *NEXT* node > another chunk prior to this does the same > thing. I suppose i can use next_node_in() > instead and just copy the entire weigh array > though, if that is preferable. Yes. I think copy the entire weight array make code logic simpler. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving 2024-01-19 17:57 ` [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price 2024-01-23 3:02 ` Huang, Ying @ 2024-01-23 8:40 ` Huang, Ying 1 sibling, 0 replies; 15+ messages in thread From: Huang, Ying @ 2024-01-23 8:40 UTC (permalink / raw) To: Gregory Price Cc: linux-mm, linux-kernel, linux-doc, linux-fsdevel, linux-api, corbet, akpm, gregory.price, honggyu.kim, rakie.kim, hyeongtak.ji, mhocko, vtavarespetr, jgroves, ravis.opensrc, sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, hannes, dan.j.williams, Srinivasulu Thanneeru Gregory Price <gourry.memverge@gmail.com> writes: > When a system has multiple NUMA nodes and it becomes bandwidth hungry, > using the current MPOL_INTERLEAVE could be an wise option. > > However, if those NUMA nodes consist of different types of memory such > as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin > based interleave policy does not optimally distribute data to make use > of their different bandwidth characteristics. > > Instead, interleave is more effective when the allocation policy follows > each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. > > This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, > enabling weighted interleave between NUMA nodes. Weighted interleave > allows for proportional distribution of memory across multiple numa > nodes, preferably apportioned to match the bandwidth of each node. > > For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), > with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate > weight distribution is (2:1). > > Weights for each node can be assigned via the new sysfs extension: > /sys/kernel/mm/mempolicy/weighted_interleave/ > > For now, the default value of all nodes will be `1`, which matches > the behavior of standard 1:1 round-robin interleave. An extension > will be added in the future to allow default values to be registered > at kernel and device bringup time. > > The policy allocates a number of pages equal to the set weights. For > example, if the weights are (2,1), then 2 pages will be allocated on > node0 for every 1 page allocated on node1. > > The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) > and mbind(2). > > There are 3 integration points: > > weighted_interleave_nodes: > Counts the number of allocations as they occur, and applies the > weight for the current node. When the weight reaches 0, switch > to the next node. > > weighted_interleave_nid: > Gets the total weight of the nodemask as well as each individual > node weight, then calculates the node based on the given index. > > bulk_array_weighted_interleave: > Gets the total weight of the nodemask as well as each individual > node weight, then calculates the number of "interleave rounds" as > well as any delta ("partial round"). Calculates the number of > pages for each node and allocates them. > > If a node was scheduled for interleave via interleave_nodes, the > current weight (pol->cur_weight) will be allocated first, before > the remaining bulk calculation is done. > > One piece of complexity is the interaction between a recent refactor > which split the logic to acquire the "ilx" (interleave index) of an > allocation and the actually application of the interleave. The > calculation of the `interleave index` is done by `get_vma_policy()`, > while the actual selection of the node will be later appliex by the > relevant weighted_interleave function. > > Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com> > Signed-off-by: Gregory Price <gregory.price@memverge.com> > Co-developed-by: Rakie Kim <rakie.kim@sk.com> > Signed-off-by: Rakie Kim <rakie.kim@sk.com> > Co-developed-by: Honggyu Kim <honggyu.kim@sk.com> > Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> > Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> > Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> > Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> > Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> > --- > .../admin-guide/mm/numa_memory_policy.rst | 9 + > include/linux/mempolicy.h | 5 + > include/uapi/linux/mempolicy.h | 1 + > mm/mempolicy.c | 234 +++++++++++++++++- > 4 files changed, 246 insertions(+), 3 deletions(-) > > diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst > index eca38fa81e0f..a70f20ce1ffb 100644 > --- a/Documentation/admin-guide/mm/numa_memory_policy.rst > +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst > @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY > can fall back to all existing numa nodes. This is effectively > MPOL_PREFERRED allowed for a mask rather than a single node. > > +MPOL_WEIGHTED_INTERLEAVE > + This mode operates the same as MPOL_INTERLEAVE, except that > + interleaving behavior is executed based on weights set in > + /sys/kernel/mm/mempolicy/weighted_interleave/ > + > + Weighted interleave allocates pages on nodes according to a > + weight. For example if nodes [0,1] are weighted [5,2], 5 pages > + will be allocated on node0 for every 2 pages allocated on node1. > + > NUMA memory policy supports the following optional mode flags: > > MPOL_F_STATIC_NODES > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 931b118336f4..c1a083eb0dd5 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -54,6 +54,11 @@ struct mempolicy { > nodemask_t cpuset_mems_allowed; /* relative to these nodes */ > nodemask_t user_nodemask; /* nodemask passed by user */ > } w; > + > + /* Weighted interleave settings */ > + struct { > + u8 cur_weight; > + } wil; > }; > > /* > diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h > index a8963f7ef4c2..1f9bb10d1a47 100644 > --- a/include/uapi/linux/mempolicy.h > +++ b/include/uapi/linux/mempolicy.h > @@ -23,6 +23,7 @@ enum { > MPOL_INTERLEAVE, > MPOL_LOCAL, > MPOL_PREFERRED_MANY, > + MPOL_WEIGHTED_INTERLEAVE, > MPOL_MAX, /* always last member of enum */ > }; > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 427bddf115df..aa3b2389d3e0 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -19,6 +19,13 @@ > * for anonymous memory. For process policy an process counter > * is used. > * > + * weighted interleave > + * Allocate memory interleaved over a set of nodes based on > + * a set of weights (per-node), with normal fallback if it > + * fails. Otherwise operates the same as interleave. > + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated > + * on node 0 for every 1 page allocated on node 1. > + * > * bind Only allocate memory on a specific set of nodes, > * no fallback. > * FIXME: memory is allocated starting with the first node > @@ -313,6 +320,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, > policy->mode = mode; > policy->flags = flags; > policy->home_node = NUMA_NO_NODE; > + policy->wil.cur_weight = 0; > > return policy; > } > @@ -425,6 +433,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { > .create = mpol_new_nodemask, > .rebind = mpol_rebind_preferred, > }, > + [MPOL_WEIGHTED_INTERLEAVE] = { > + .create = mpol_new_nodemask, > + .rebind = mpol_rebind_nodemask, > + }, > }; > > static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, > @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, > > old = current->mempolicy; > current->mempolicy = new; > - if (new && new->mode == MPOL_INTERLEAVE) > + if (new && (new->mode == MPOL_INTERLEAVE || > + new->mode == MPOL_WEIGHTED_INTERLEAVE)) > current->il_prev = MAX_NUMNODES-1; > task_unlock(current); > mpol_put(old); > @@ -872,6 +885,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) > case MPOL_INTERLEAVE: > case MPOL_PREFERRED: > case MPOL_PREFERRED_MANY: > + case MPOL_WEIGHTED_INTERLEAVE: > *nodes = pol->nodes; > break; > case MPOL_LOCAL: > @@ -956,6 +970,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, > } else if (pol == current->mempolicy && > pol->mode == MPOL_INTERLEAVE) { > *policy = next_node_in(current->il_prev, pol->nodes); > + } else if (pol == current->mempolicy && > + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) { > + if (pol->wil.cur_weight) > + *policy = current->il_prev; > + else > + *policy = next_node_in(current->il_prev, > + pol->nodes); Per my understanding, we should always use "*policy = next_node_in()" here, as in weighted_interleave_nodes(). > } else { > err = -EINVAL; > goto out; > @@ -1785,7 +1806,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > pol = __get_vma_policy(vma, addr, ilx); > if (!pol) > pol = get_task_policy(current); > - if (pol->mode == MPOL_INTERLEAVE) { > + if (pol->mode == MPOL_INTERLEAVE || > + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { > *ilx += vma->vm_pgoff >> order; > *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); > } > @@ -1835,6 +1857,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > return zone >= dynamic_policy_zone; > } > > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) > +{ > + unsigned int next; > + struct task_struct *me = current; > + u8 __rcu *table; > + > + next = next_node_in(me->il_prev, policy->nodes); > + if (next == MAX_NUMNODES) > + return next; > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + if (!policy->wil.cur_weight) > + policy->wil.cur_weight = table ? table[next] : 1; > + rcu_read_unlock(); > + > + policy->wil.cur_weight--; > + if (!policy->wil.cur_weight) > + me->il_prev = next; > + return next; > +} > + [snip] -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-01-24 18:01 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-01-19 17:57 [PATCH v2 0/3] mm/mempolicy: weighted interleave mempolicy and sysfs extension Gregory Price 2024-01-19 17:57 ` [PATCH v2 1/3] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price 2024-01-22 8:03 ` Huang, Ying 2024-01-22 16:58 ` Gregory Price 2024-01-19 17:57 ` [PATCH v2 2/3] mm/mempolicy: refactor a read-once mechanism into a function for re-use Gregory Price 2024-01-19 17:57 ` [PATCH v2 3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price 2024-01-23 3:02 ` Huang, Ying 2024-01-23 4:54 ` Gregory Price 2024-01-23 5:16 ` Gregory Price 2024-01-23 8:35 ` Huang, Ying 2024-01-23 21:27 ` Gregory Price 2024-01-24 1:51 ` Huang, Ying 2024-01-24 18:01 ` Gregory Price 2024-01-23 8:13 ` Huang, Ying 2024-01-23 8:40 ` Huang, Ying
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).