* [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave
@ 2023-12-23 18:10 Gregory Price
  2023-12-23 18:10 ` [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
                   ` (11 more replies)
  0 siblings, 12 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko,
	Zhongkun He, Frank van der Linden, John Groves, Jonathan Cameron

Weighted interleave is a new interleave policy intended to make
use of the new distributed-memory environment made available
by CXL.  The existing interleave mechanism does an even round-robin
distribution of memory across all nodes in a nodemask, while
weighted interleave can distribute memory across nodes according
to the bandwidth that each node provides.

As the tests below show, "default interleave" can cause major performance
degradation when the distribution does not match the available bandwidth,
while "weighted interleave" can provide a performance increase.

For example, the stream benchmark demonstrates that default interleave
is actively harmful, where weighted interleave is beneficial.

Hardware: 1-socket 8 channel DDR5 + 1 CXL expander in PCIe x16
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
Targeted weights   : +2.5% to +4% (consistently better than DRAM)

If nothing else, this shows how awful round-robin interleave is.

Rather than implement yet another specific syscall to set one
particular field of a mempolicy, we chose to implement an extensible
mempolicy interface so that future extensions can be captured.

To implement weighted interleave, we need an interface to set the
node weights along with a MPOL_WEIGHTED_INTERLEAVE mode.  We implement
a sysfs extension for "system global" weights which can be set by
a daemon or administrator, and new extensible syscalls (mempolicy2,
mbind2) which allow task-local weights to be set via user software.

The benefit of the sysfs extension is that MPOL_WEIGHTED_INTERLEAVE
can be used by the existing set_mempolicy and mbind via numactl.

There are 3 "phases" in the patch set that could be considered
as separate merge candidates, but they are presented here as a single
series, as the goal is a fully functional MPOL_WEIGHTED_INTERLEAVE.

1) Implement MPOL_WEIGHTED_INTERLEAVE with a sysfs extension for
   setting system-global weights via sysfs.
   (Patches 1 & 2)

2) Refactor mempolicy creation mechanism to use an extensible arg
   struct `struct mempolicy_args` to promote code re-use between
   the original mempolicy/mbind interfaces and the new interfaces.
   (Patches 3-6)

3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
   along with the addition of task-local weights so that per-task
   weights can be registered for MPOL_WEIGHTED_INTERLEAVE.
   (Patches 7-11)

Included at the bottom of this cover letter are Linux Test Project
tests for backward and forward compatibility, some sample software
which can be used for quick tests, as well as a numactl branch
which implements `numactl -w --interleave` for testing.

= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

= LTP Testing Summary =
existing mempolicy & mbind tests: pass
mempolicy & mbind + weighted interleave (global weights): pass
mempolicy2 & mbind2 + weighted interleave (global weights): pass
mempolicy2 & mbind2 + weighted interleave (local weights): pass

= v5 (full notes at bottom) =
- stupid mistake fixing arm64 build (get_mbind2 -> mbind2)
- resolve ABI breakage within patch set.
- remove MPOL_F_NODE functionality from mempolicy2
  Suggested-by: Ying Huang <ying.huang@intel.com>
- Documentation and cover letter updates

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <ravis.opensrc@micron.com>

Hardware: Single-socket, multiple CXL memory expanders.

Workload:                               W2
Data Signature:                         2:1 read:write
DRAM only bandwidth (GBps):             298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only:                    1.38x
Gain over default interleave:           2.64x

Workload:                               W5
Data Signature:                         1:1 read:write
DRAM only bandwidth (GBps):             273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only:                    1.4x
Gain over default interleave:           2.26x

=====================================================================
Performance test - Stream
From - Gregory Price <gregory.price@memverge.com>

Hardware: Single socket, single CXL expander
numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependent)
mbind2 weights     : +2.5% to +4% (consistently better than DRAM)

dram only:
numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Function     Direction    BestRateMBs     AvgTime      MinTime      MaxTime
Copy:        0->0            200923.2     0.032662     0.031853     0.033301
Scale:       0->0            202123.0     0.032526     0.031664     0.032970
Add:         0->0            208873.2     0.047322     0.045961     0.047884
Triad:       0->0            208523.8     0.047262     0.046038     0.048414

CXL-only:
numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0             22209.7     0.288661     0.288162     0.289342
Scale:       0->0             22288.2     0.287549     0.287147     0.288291
Add:         0->0             24419.1     0.393372     0.393135     0.393735
Triad:       0->0             24484.6     0.392337     0.392083     0.394331

Based on the above (~200GB/s DRAM vs ~22GB/s CXL), the optimal weights are ~9:1:
echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2

default interleave:
numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0             44666.2     0.143671     0.143285     0.144174
Scale:       0->0             44781.6     0.143256     0.142916     0.143713
Add:         0->0             48600.7     0.197719     0.197528     0.197858
Triad:       0->0             48727.5     0.197204     0.197014     0.197439

global weighted interleave:
numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
Copy:        0->0            190085.9     0.034289     0.033669     0.034645
Scale:       0->0            207677.4     0.031909     0.030817     0.033061
Add:         0->0            202036.8     0.048737     0.047516     0.053409
Triad:       0->0            217671.5     0.045819     0.044103     0.046755

targeted regions w/ global weights (modified stream to mbind2 malloc'd regions)
numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
Copy:        0->0            205827.0     0.031445     0.031094     0.031984
Scale:       0->0            208171.8     0.031320     0.030744     0.032505
Add:         0->0            217352.0     0.045087     0.044168     0.046515
Triad:       0->0            216884.8     0.045062     0.044263     0.046982

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <hyeongtak.ji@sk.com>

Hardware: Single socket, Single CXL memory Expander

NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads:     56
Lookups:     170,000,000

Summary: +19% over DRAM. +47% over default interleave.

1. dram only
$ numactl -m 0 ./XSBench -s XL -p 5000000
Runtime:     36.235 seconds
Lookups/s:   4,691,618

2. default interleave
$ numactl -i 0,2 ./XSBench -s XL -p 5000000
Runtime:     55.243 seconds
Lookups/s:   3,077,293

3. weighted interleave
$ numactl -w -i 0,2 ./XSBench -s XL -p 5000000
Runtime:     29.262 seconds
Lookups/s:   5,809,513

=====================================================================
(Patch 1) : sysfs addition - /sys/kernel/mm/mempolicy/

This feature provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/

    The sysfs structure is designed as follows.

      $ tree /sys/kernel/mm/mempolicy/
      /sys/kernel/mm/mempolicy/
      └── weighted_interleave
          ├── nodeN
          └── nodeN+X

'mempolicy' is added to '/sys/kernel/mm/' as a control group for
the mempolicy subsystem.

Internally, weights are represented as an array of unsigned char

static unsigned char iw_table[MAX_NUMNODES];

unsigned char was chosen because most reasonable distributions can be
represented as factors <100, and to minimize memory usage (1KB)

We present possible nodes, instead of online nodes, to simplify the
management interface, considering that a) the table is of size
MAX_NUMNODES anyway to simplify fetching of weights (no need to track
sizes, and MAX_NUMNODES is typically at most 1024, so the table is at
most 1KB), and b) it simplifies management of hotplug events, allowing
for weights to be set prior to a node coming online, which may be
beneficial for immediate use.

The 'weight' of a node (an unsigned char of value 1-255) is the number
of pages that are allocated during a "weighted interleave" round.
(See 'weighted interleave' below for more details.)
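
For illustration, a small user-space helper can read and update these
weights through the files above.  This is only a sketch against the
sysfs layout described here (it is not part of the series); the node
numbers are examples and error handling is minimal.

#include <stdio.h>

/* Sketch: read and write the global weight of a node via sysfs */
static int read_weight(int nid)
{
        char path[128];
        int weight = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", nid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        if (fscanf(f, "%d", &weight) != 1)
                weight = -1;
        fclose(f);
        return weight;
}

static int write_weight(int nid, int weight)
{
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", nid);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", weight);
        return fclose(f);
}

int main(void)
{
        /* equivalent to: echo 9 > .../node1; echo 1 > .../node2 */
        write_weight(1, 9);
        write_weight(2, 1);
        printf("node1=%d node2=%d\n", read_weight(1), read_weight(2));
        return 0;
}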

=====================================================================
(Patch 2) set_mempolicy: MPOL_WEIGHTED_INTERLEAVE

Weighted interleave is a new memory policy that interleaves memory
across numa nodes in the provided nodemask based on the weights
described in patch 1 (sysfs global weights).

When a system has multiple NUMA nodes and becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be a wise option.

However, if those NUMA nodes consist of different types of memory such
as having local DRAM and CXL memory together, the current round-robin
based interleaving policy doesn't maximize the overall bandwidth
because of their different bandwidth characteristics.

Instead, the interleaving can be more efficient when the allocation
policy follows each NUMA node's bandwidth weight rather than using a
1:1 round-robin allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
which enables weighted interleaving between NUMA nodes.  Weighted
interleave allows for a proportional distribution of memory across
multiple numa nodes, preferably apportioned to match the bandwidth
capacity of each node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight array exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the
set weights.  For example, if the weights are (2,1), then 2 pages
will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).
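
As a point of reference, the following sketch opts a task into the new
mode through the existing set_mempolicy(2) interface, relying entirely
on the global sysfs weights.  MPOL_WEIGHTED_INTERLEAVE is defined
locally because released uapi headers do not carry it yet; the value 6
matches the enum position added by this patch.

#define _GNU_SOURCE
#include <stdio.h>
#include <numa.h>
#include <numaif.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6      /* added by this series */
#endif

int main(void)
{
        struct bitmask *nodes = numa_allocate_nodemask();

        numa_bitmask_setbit(nodes, 0);
        numa_bitmask_setbit(nodes, 1);

        /* no task-local weights via this syscall: global weights apply */
        if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, nodes->maskp,
                          nodes->size + 1))
                perror("set_mempolicy");

        /* subsequent allocations are distributed by the sysfs weights */
        return 0;
}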

=====================================================================
(Patches 3-6) Refactoring mempolicy for code-reuse

To avoid multiple paths of mempolicy creation, we should refactor the
existing code to enable the designed extensibility, and refactor
existing users to utilize the new interface (while retaining the
existing userland interface).

This set of patches introduces a new mempolicy_args structure, which
is used to more fully describe a requested mempolicy - to include
existing and future extensions.

/*
 * Describes settings of a mempolicy during set/get syscalls and
 * kernel internal calls to do_set_mempolicy()
 */
struct mempolicy_args {
    unsigned short mode;            /* policy mode */
    unsigned short mode_flags;      /* policy mode flags */
    int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
    nodemask_t *policy_nodes;       /* get/set/mbind */
    unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
};

This arg structure will eventually be utilized by the following
interfaces:
    mpol_new() - new mempolicy creation
    do_get_mempolicy() - acquiring information about mempolicy
    do_set_mempolicy() - setting the task mempolicy
    do_mbind()         - setting a vma mempolicy

do_get_mempolicy() is completely refactored to break it out into
separate functionality based on the flags provided by get_mempolicy(2)
    MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
    MPOL_F_ADDR: acquires information on vma policies
    MPOL_F_NODE: changes the output for the policy arg to node info

We refactor the get_mempolicy syscall to flatten the logic based on
these flags, and allow set_mempolicy2() to re-use the underlying logic.

The result of this refactor, and the new mempolicy_args structure, is
that extensions like 'sys_set_mempolicy_home_node' can now be directly
integrated into the initial call to 'set_mempolicy2', and that more
complete information about a mempolicy can be returned with a single
call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
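
To illustrate the intended re-use, the legacy entry points can be
reduced to thin wrappers that populate the argument structure.  The
sketch below is not taken from the series; it assumes do_set_mempolicy()
has been converted to accept the args struct, and borrows the existing
get_nodes()/sanitize_mpol_flags() helpers from mm/mempolicy.c.

/* Sketch: legacy set_mempolicy(2) funneled through mempolicy_args */
static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
                                 unsigned long maxnode)
{
        struct mempolicy_args args;
        unsigned short mode_flags;
        nodemask_t nodes;
        int err;

        err = get_nodes(&nodes, nmask, maxnode);
        if (err)
                return err;

        err = sanitize_mpol_flags(&mode, &mode_flags);
        if (err)
                return err;

        memset(&args, 0, sizeof(args));
        args.mode = mode;
        args.mode_flags = mode_flags;
        args.policy_nodes = &nodes;
        args.home_node = NUMA_NO_NODE;  /* not settable via this syscall */
        args.il_weights = NULL;         /* global weights only */

        return do_set_mempolicy(&args);
}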


=====================================================================
(Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2

These interfaces are the 'extended' counterpart to their relatives.
They use the userland 'struct mpol_args' structure to communicate a
complete mempolicy configuration to the kernel.  This structure
looks very much like the kernel-internal 'struct mempolicy_args':

struct mpol_args {
        /* Basic mempolicy settings */
        __u16 mode;
        __u16 mode_flags;
        __s32 home_node;
        __u64 pol_maxnodes;
        __aligned_u64 pol_nodes;
        __aligned_u64 *il_weights;      /* of size pol_maxnodes */
};

The basic mempolicy settings which are shared across all interfaces
are captured at the top of the structure, while extensions such as
'policy_node' and 'addr' are collected beneath.

The syscalls are uniform and defined as follows:

long sys_mbind2(unsigned long addr, unsigned long len,
                struct mpol_args *args, size_t usize,
                unsigned long flags);

long sys_get_mempolicy2(struct mpol_args *args, size_t size,
                        unsigned long addr, unsigned long flags);

long sys_set_mempolicy2(struct mpol_args *args, size_t size,
                        unsigned long flags);

The 'flags' argument for mbind2 is the same as 'mbind', except with
the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
field should be utilized.

The 'flags' argument for get_mempolicy2 allows for MPOL_F_ADDR to
allow operating on VMA policies, but MPOL_F_NODE and MPOL_F_MEMS_ALLOWED
behavior has been omitted, since get_mempolicy() provides this already.

The 'flags' argument is not used by 'set_mempolicy2' at this time, but
may end up allowing the use of MPOL_MF_HOME_NODE if such functionality
is desired.

The extensions can be summed up as follows:

get_mempolicy2 extensions:
    - mode and mode flags are split into separate fields
    - MPOL_F_MEMS_ALLOWED and MPOL_F_NODE are not supported

set_mempolicy2:
    - task-local interleave weights can be set via 'il_weights'

mbind2:
    - home_node field sets policy home node w/ MPOL_MF_HOME_NODE
    - task-local interleave weights can be set via 'il_weights'
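
The following user-space sketch exercises mbind2 with an explicit home
node.  It is illustrative only: the syscall number and the value of
MPOL_MF_HOME_NODE are placeholders (only set_mempolicy2's number, 457,
appears in the sample program further below), so both must be taken
from the applied series; struct mpol_args is copied from the sample
program at the end of this letter.

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define __NR_mbind2             459      /* placeholder: check syscall table */
#define MPOL_BIND               2
#define MPOL_MF_HOME_NODE       (1 << 4) /* placeholder: check uapi header */

struct mpol_args {
        uint16_t mode;
        uint16_t mode_flags;
        int32_t home_node;
        uint64_t pol_maxnodes;
        uint64_t pol_nodes;
        uint64_t il_weights;
};

int main(void)
{
        unsigned long nodemask = (1ul << 0) | (1ul << 1);  /* nodes 0-1 */
        size_t len = 1ul << 21;
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct mpol_args args;

        memset(&args, 0, sizeof(args));
        args.mode = MPOL_BIND;
        args.home_node = 1;             /* preferred node within the mask */
        args.pol_maxnodes = 2;
        args.pol_nodes = (uint64_t)(uintptr_t)&nodemask;

        /* MPOL_MF_HOME_NODE tells the kernel to honor args.home_node */
        return syscall(__NR_mbind2, (unsigned long)addr, len, &args,
                       sizeof(args), MPOL_MF_HOME_NODE);
}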

=====================================================================
(Patch 11) set_mempolicy2/mbind2: MPOL_WEIGHTED_INTERLEAVE

This patch shows the explicit extension pattern when adding new
policies to mempolicy2/mbind2.  This adds the 'il_weights' field
to mpol_args and adds the logic to fill in task-local weights.

There are now two ways to weight a mempolicy: global and local.
To denote which mode the task is in, we add the internal flag:

MPOL_F_GWEIGHT /* Utilize global weights */

When MPOL_F_GWEIGHT is set, the global weights are used, and
when it is not set, task-local weights are used.

Example logic:
if (pol->flags & MPOL_F_GWEIGHT)
       pol_weights = iw_table;
else
       pol_weights = pol->wil.weights;

set_mempolicy is changed to always set MPOL_F_GWEIGHT, since this
syscall is incapable of passing weights via its interfaces, while
set_mempolicy2 sets MPOL_F_GWEIGHT if MPOL_WEIGHTED_INTERLEAVE
is requested but 'il_weights' in mpol_args is NULL.

The operation of task-local weighting is otherwise exactly the
same, except for what occurs on task migration.

On task migration, the system presently has no way of determining
what the new weights "should be", or what the user "intended".

For this reason, we default all weights to '1' and do not allow
weights to be '0'.  This means that, should a migration occur where
one or more new nodes appear in the nodemask, the effective weight
for those nodes will be '1'.  This avoids a potential allocation
failure condition if a migration occurs and introduces a node
which otherwise did not have a weight.

Consequently, users should use task-local weighting when
migrations are not expected/possible, and global weighting when
migrations are expected/possible.
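
One way to express the resulting rule (not the series code, just a
sketch using the names from the example above): whatever the source of
the weights, a node without a recorded weight behaves as weight 1.

/* Sketch: effective weight of a node under MPOL_WEIGHTED_INTERLEAVE.
 * Nodes introduced after the policy was created (e.g. by migration)
 * have no task-local entry and therefore fall back to weight 1. */
static unsigned char effective_weight(struct mempolicy *pol, int nid)
{
        unsigned char weight;

        if (pol->flags & MPOL_F_GWEIGHT)        /* global (sysfs) weights */
                weight = iw_table[nid];
        else                                    /* task-local weights */
                weight = pol->wil.weights[nid];

        return weight ? weight : 1;             /* zero weights not allowed */
}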

=====================================================================
Existing LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2

LTP set_mempolicy, get_mempolicy, mbind regression tests:

MPOL_WEIGHTED_INTERLEAVE was added manually to test basic functionality,
but the tests were not adjusted for weighting.  Basically the weights
were left at 1, which is the default, so it should behave like standard
MPOL_INTERLEAVE if the logic is correct.

== set_mempolicy01
passed   18
failed   0

== set_mempolicy02
passed   10
failed   0

== set_mempolicy03
passed   64
failed   0

== set_mempolicy04
passed   32
failed   0

== set_mempolicy05 - n/a on non-x86

== set_mempolicy06 - set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
passed   10
failed   0

== set_mempolicy07 - set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
passed   32
failed   0

== get_mempolicy01 - added MPOL_WEIGHTED_INTERLEAVE
passed   12
failed   0

== get_mempolicy02
passed   2
failed   0

== mbind01 - added WEIGHTED_INTERLEAVE
passed   15
failed   0

== mbind02 - added WEIGHTED_INTERLEAVE
passed   4
failed   0

== mbind03 - added WEIGHTED_INTERLEAVE
passed   16
failed   0

== mbind04 - added WEIGHTED_INTERLEAVE
passed   48
failed   0

=====================================================================
New LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2

set_mempolicy2, get_mempolicy2, mbind2

Took the original set_mempolicy and get_mempolicy tests, and updated
them to utilize the new mempolicy2 interfaces.  Added additional tests
for setting task-local weights to validate behavior.

== set_mempolicy201 - set_mempolicy01 equiv
passed   18
failed   0

== set_mempolicy202 - set_mempolicy02 equiv
passed   10
failed   0

== set_mempolicy203 - set_mempolicy03 equiv
passed   64
failed   0

== set_mempolicy204 - set_mempolicy04 equiv
passed   32
failed   0

== set_mempolicy205 - set_mempolicy06 equiv
passed   10
failed   0

== set_mempolicy206 - set_mempolicy07 equiv
passed   32
failed   0

== set_mempolicy207 - MPOL_WEIGHTED_INTERLEAVE with task-local weights
passed   6
failed   0

== get_mempolicy201 - get_mempolicy01 equiv
passed   12
failed   0

== get_mempolicy202 - get_mempolicy02 equiv
passed   2
failed   0

== get_mempolicy203 - NEW - fetch global and local weights
passed   6
failed   0

== mbind201 - mbind01 equiv
passed   15
failed   0

== mbind202 - mbind02 equiv
passed   4
failed   0

== mbind203 - mbind03 equiv
passed   16
failed   0

== mbind204 - mbind04 equiv
passed   48
failed   0

=====================================================================
Basic set_mempolicy2 test

set_mempolicy2 w/ weighted interleave and task-local weights; uses
pthread_create to demonstrate that the mempolicy is overridden by the child.

Manually validating the distribution via numa_maps

007c0000 weighted interleave:0-1 heap anon=65794 dirty=65794 active=0 N0=54829 N1=10965 kernelpagesize_kB=4
7f3f2c000000 weighted interleave:0-1 anon=32768 dirty=32768 active=0 N0=5461 N1=27307 kernelpagesize_kB=4
7f3f34000000 weighted interleave:0-1 anon=16384 dirty=16384 active=0 N0=2731 N1=13653 kernelpagesize_kB=4
7f3f3bffe000 weighted interleave:0-1 anon=65538 dirty=65538 active=0 N0=10924 N1=54614 kernelpagesize_kB=4
7f3f5c000000 weighted interleave:0-1 anon=16384 dirty=16384 active=0 N0=2731 N1=13653 kernelpagesize_kB=4
7f3f60dfe000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54615 N1=10922 kernelpagesize_kB=4

Expected distribution is 5:1 or 1:5 (the lesser node should receive ~16.666%)
1) 10965/65794 : 16.6656... pass
2) 5461/32768  : 16.6656... pass
3) 2731/16384  : 16.6687... pass
4) 10924/65538 : 16.6682... pass
5) 2731/16384  : 16.6687... pass
6) 10922/65537 : 16.6653... pass


#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>     /* strerror, memset */
#include <numa.h>
#include <errno.h>
#include <numaif.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/uio.h>
#include <sys/types.h>
#include <stdint.h>

#define MPOL_WEIGHTED_INTERLEAVE 6
#define SET_MEMPOLICY2(a, b) syscall(457, a, b, 0)

#define M256 (1024*1024*256)
#define PAGE_SIZE (4096)

struct mpol_args {
        /* Basic mempolicy settings */
        uint16_t mode;
        uint16_t mode_flags;
        int32_t home_node;
        uint64_t pol_maxnodes;
        uint64_t pol_nodes;
        uint64_t il_weights;
};

struct mpol_args wil_args;
struct bitmask *wil_nodes;
unsigned char *weights;
int total_nodes = -1;
pthread_t tid;

void set_mempolicy_call(int which)
{
        weights = (unsigned char *)calloc(total_nodes, sizeof(unsigned char));
        wil_nodes = numa_allocate_nodemask();

        numa_bitmask_setbit(wil_nodes, 0); weights[0] = which ? 1 : 5;
        numa_bitmask_setbit(wil_nodes, 1); weights[1] = which ? 5 : 1;

        memset(&wil_args, 0, sizeof(wil_args));
        wil_args.mode = MPOL_WEIGHTED_INTERLEAVE;
        wil_args.mode_flags = 0;
        wil_args.pol_nodes = (uint64_t)(uintptr_t)wil_nodes->maskp;
        wil_args.pol_maxnodes = total_nodes;
        wil_args.il_weights = (uint64_t)(uintptr_t)weights;

        int ret = SET_MEMPOLICY2(&wil_args, sizeof(wil_args));
        fprintf(stderr, "set_mempolicy2 result: %d(%s)\n", ret, strerror(errno));
}

void *func(void *arg)
{
        char *mainmem;  /* allocated after the child policy is set */
        int i;

        set_mempolicy_call(1); /* weight 1 heavier */

        mainmem = malloc(M256);
        memset(mainmem, 1, M256);
        for (i = 0; i < (M256/PAGE_SIZE); i++) {
                mainmem = malloc(PAGE_SIZE);
                mainmem[0] = 1;
        }
        printf("thread done %d\n", getpid());
        getchar();
        return arg;
}

int main()
{
        char * mainmem;
        int i;

        total_nodes = numa_max_node() + 1;

        set_mempolicy_call(0); /* weight 0 heavier */
        pthread_create(&tid, NULL, func, NULL);

        mainmem = malloc(M256);
        memset(mainmem, 1, M256);
        for (i = 0; i < (M256/PAGE_SIZE); i++) {
                mainmem = malloc(PAGE_SIZE);
                mainmem[0] = 1;
        }
        printf("main done %d\n", getpid());
        getchar();

        return 0;
}

=====================================================================
numactl (set_mempolicy) w/ global weighting test
numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master

command: numactl -w --interleave=0,1 ./eatmem

result (weights 1:1):
0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
50% distribution is correct

result (weights 5:1):
01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
16.666% distribution is correct

result (weights 1:5):
01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
16.666% distribution is correct

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (void)
{
        char* mem = malloc(1024*1024*256);
        memset(mem, 1, 1024*1024*256);
        for (int i = 0; i  < ((1024*1024*256)/4096); i++)
        {
                mem = malloc(4096);
                mem[0] = 1;
        }
        printf("done\n");
        getchar();
        return 0;
}

=====================================================================
v4:
- CONFIG_MMU COND_SYSCALL fix for mempolicy2/mbind2 syscalls
- ifdef CONFIG_SYSFS handling.  If sysfs is disabled, set global
  weights to 1 to have it default to standard interleave.
- tools/perf config variation syscall table fix
- sysfs attr init build fix
- arch/arm64 syscall wire-ups (Thanks Arnd!)

=====================================================================
v3:
  changes / adds:
- get2(): actually fetch the il_weights (doh!)
- get2(): retrieve home_node
- get2(): addr as arg instead of struct member, drop MPOL_F_NODE flag
          get_mempolicy() can be used for this, don't duplicate warts
- get2(): only copy weights if mode is weighted interleave
- mbind2(): addr/len instead of iovec
            user can use a for loop...
- sysfs: remove possible_nodes
- sysfs: simplify to weighted_interleave/nodeN
- sysfs: add default weight mechanism (echo > nodeN)

  fixes:
- build: syscalls.h mpol_args definition missing
- build: missing `__user` from weights_ptr definition
- bug:   uninitialized weight_total in bulk allocator
- bug:   bad pointer to copy_struct_from_user in mbind2
- bug:   get_mempolicy2 uninitialized data copied to user
- bug:   get_vma_mempolicy policy reference counting
- bug:   MPOL_F_GWEIGHTS not set correctly in set_mempolicy2
- bug:   MPOL_F_GWEIGHTS not set correctly in mbind2
- bug:   get_mempolicy2 error not checked on nodemask userland copy
- bug:   mbind2 did not parse nodemask correctly

  tests:
- ltp branch: https://github.com/gmprice/ltp/tree/mempolicy2
- new set_mempolicy2() tests
     1) set_mempolicy() tests w/ new syscall
     2) weighted interleave validation
- new get_mempolicy2() tests
     1) get_mempolicy() tests w/ new syscall
     2) weighted interleave validation
- new mbind2() tests
     1) mbind() tests w/ new syscall
- new performance tests (MLC) from Ravi @ Micron
     Example:
        Workload:                               W5
        Data Signature:                         1:1 read:write
        DRAM only bandwidth (GBps):             273.2
        DRAM + CXL (default interleave) (GBps): 117.23
        DRAM + CXL (weighted interleave)(GBps): 382.7
        Gain over DRAM only:                    1.4x

=====================================================================
v2:
  changes / adds:
- flattened weight matrix to an array at the request of Ying Huang
- Updated ABI docs per Davidlohr Bueso's request
- change uapi structure to use aligned/fixed-length members
- Implemented weight fetch logic in get_mempolicy2
- mbind2 was changed to take (iovec,len) as function arguments
  rather than add them to the uapi structure, since they describe
  where to apply the mempolicy - as opposed to being part of it.

  fixes:
- fixed bug reported by Seungjun Ha <seungjun.ha@samsung.com>
  Link: https://lore.kernel.org/linux-cxl/20231206080944epcms2p76ebb230b9f4595f5cfcd2531d67ab3ce@epcms2p7/
- fixed bug in mbind2 where MPOL_F_GWEIGHTS was not set when il_weights
  was omitted after local weights were added as an option
- fixed bug in interleave logic where an OOB access was made if
  next_node_in returned MAX_NUMNODES
- fixed bug in bulk weighted interleave allocator where over-allocation
  could occur.

  tests:
- LTP: validated existing get_mempolicy, set_mempolicy, and mbind tests
- LTP: validated existing get_mempolicy, set_mempolicy, and mbind with
       MPOL_WEIGHTED_INTERLEAVE added.
- basic set_mempolicy2 tests and numactl -w --interleave tests

  numactl:
- Sample numactl extension for set_mempolicy available here:
  Link: https://github.com/gmprice/numactl/tree/weighted_interleave_master

(added summary of test reports to end of cover letter)

=====================================================================

Suggested-by: Gregory Price <gregory.price@memverge.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Hasan Al Maruf <hasanalmaruf@fb.com>
Suggested-by: Hao Wang <haowang3@fb.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: tj <tj@kernel.org>
Suggested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: John Groves <john@jagalactic.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Srinivasulu Thanneeru <sthanneeru@micron.com>
Suggested-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>

Gregory Price (10):
  mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted
    interleaving
  mm/mempolicy: refactor sanitize_mpol_flags for reuse
  mm/mempolicy: create struct mempolicy_args for creating new
    mempolicies
  mm/mempolicy: refactor kernel_get_mempolicy for code re-use
  mm/mempolicy: allow home_node to be set by mpol_new
  mm/mempolicy: add userland mempolicy arg structure
  mm/mempolicy: add set_mempolicy2 syscall
  mm/mempolicy: add get_mempolicy2 syscall
  mm/mempolicy: add the mbind2 syscall
  mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted
    interleave

Rakie Kim (1):
  mm/mempolicy: implement the sysfs-based weighted_interleave interface

 .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  22 +
 .../admin-guide/mm/numa_memory_policy.rst     |  67 ++
 arch/alpha/kernel/syscalls/syscall.tbl        |   3 +
 arch/arm/tools/syscall.tbl                    |   3 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   6 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   3 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   3 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   3 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   3 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   3 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   3 +
 arch/s390/kernel/syscalls/syscall.tbl         |   3 +
 arch/sh/kernel/syscalls/syscall.tbl           |   3 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   3 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   3 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   3 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   3 +
 include/linux/mempolicy.h                     |  18 +
 include/linux/syscalls.h                      |   8 +
 include/uapi/asm-generic/unistd.h             |   8 +-
 include/uapi/linux/mempolicy.h                |  16 +-
 kernel/sys_ni.c                               |   3 +
 mm/mempolicy.c                                | 932 +++++++++++++++---
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |   3 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |   3 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |   3 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |   3 +
 29 files changed, 1024 insertions(+), 116 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

-- 
2.39.1



* [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2023-12-27  6:42   ` Huang, Ying
  2023-12-23 18:10 ` [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

From: Rakie Kim <rakie.kim@sk.com>

This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  └── weighted_interleave [2]
      ├── node0 [3]
      └── node1

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] weighted_interleave/: config interface for weighted interleave policy

[3] weighted_interleave/nodeN: weight for nodeN

If sysfs is disabled in the config, the global interleave weights
will default to "1" for all nodes.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
---
 .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  22 +++
 mm/mempolicy.c                                | 156 ++++++++++++++++++
 3 files changed, 182 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..2dcf24f4384a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,4 @@
+What:		/sys/kernel/mm/mempolicy/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for Mempolicy
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..aa27fdf08c19
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,22 @@
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Configuration Interface for the Weighted Interleave policy
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Weight configuration interface for nodeN
+
+		The interleave weight for a memory node (N). These weights are
+		utilized by processes which have set their mempolicy to
+		MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
+		omitting a task-local weight array.
+
+		These weights only affect new allocations, and changes at runtime
+		will not cause migrations on already allocated pages.
+
+		Writing an empty string resets the weight value to 1.
+
+		Minimum weight: 1
+		Maximum weight: 255
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..0e77633b07a5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+static char iw_table[MAX_NUMNODES];
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -3067,3 +3069,157 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+#ifdef CONFIG_SYSFS
+struct iw_node_attr {
+	struct kobj_attribute kobj_attr;
+	int nid;
+};
+
+static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
+			 char *buf)
+{
+	struct iw_node_attr *node_attr;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+	return sysfs_emit(buf, "%d\n", iw_table[node_attr->nid]);
+}
+
+static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
+			  const char *buf, size_t count)
+{
+	struct iw_node_attr *node_attr;
+	unsigned char weight = 0;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+	/* If no input, set default weight to 1 */
+	if (count == 0 || sysfs_streq(buf, ""))
+		weight = 1;
+	else if (kstrtou8(buf, 0, &weight) || !weight)
+		return -EINVAL;
+
+	iw_table[node_attr->nid] = weight;
+	return count;
+}
+
+static struct iw_node_attr *node_attrs[MAX_NUMNODES];
+
+static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
+				  struct kobject *parent)
+{
+	if (!node_attr)
+		return;
+	sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
+	kfree(node_attr->kobj_attr.attr.name);
+	kfree(node_attr);
+}
+
+static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
+{
+	int i;
+
+	for (i = 0; i < MAX_NUMNODES; i++)
+		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);
+	kobject_put(mempolicy_kobj);
+}
+
+static const struct kobj_type mempolicy_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.release = sysfs_mempolicy_release,
+};
+
+static int add_weight_node(int nid, struct kobject *wi_kobj)
+{
+	struct iw_node_attr *node_attr;
+	char *name;
+
+	node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
+	if (!node_attr)
+		return -ENOMEM;
+
+	name = kasprintf(GFP_KERNEL, "node%d", nid);
+	if (!name) {
+		kfree(node_attr);
+		return -ENOMEM;
+	}
+
+	sysfs_attr_init(&node_attr->kobj_attr.attr);
+	node_attr->kobj_attr.attr.name = name;
+	node_attr->kobj_attr.attr.mode = 0644;
+	node_attr->kobj_attr.show = node_show;
+	node_attr->kobj_attr.store = node_store;
+	node_attr->nid = nid;
+
+	if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
+		kfree(node_attr->kobj_attr.attr.name);
+		kfree(node_attr);
+		pr_err("failed to add attribute to weighted_interleave\n");
+		return -ENOMEM;
+	}
+
+	node_attrs[nid] = node_attr;
+	return 0;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+	struct kobject *wi_kobj;
+	int nid, err;
+
+	wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (!wi_kobj)
+		return -ENOMEM;
+
+	err = kobject_init_and_add(wi_kobj, &mempolicy_ktype, root_kobj,
+				   "weighted_interleave");
+	if (err) {
+		kfree(wi_kobj);
+		return err;
+	}
+
+	memset(node_attrs, 0, sizeof(node_attrs));
+	for_each_node_state(nid, N_POSSIBLE) {
+		err = add_weight_node(nid, wi_kobj);
+		if (err) {
+			pr_err("failed to add sysfs [node%d]\n", nid);
+			break;
+		}
+	}
+	if (err)
+		kobject_put(wi_kobj);
+	return 0;
+}
+
+static int __init mempolicy_sysfs_init(void)
+{
+	int err;
+	struct kobject *root_kobj;
+
+	memset(&iw_table, 1, sizeof(iw_table));
+
+	root_kobj = kobject_create_and_add("mempolicy", mm_kobj);
+	if (!root_kobj) {
+		pr_err("failed to add mempolicy kobject to the system\n");
+		return -ENOMEM;
+	}
+
+	err = add_weighted_interleave_group(root_kobj);
+
+	if (err)
+		kobject_put(root_kobj);
+	return err;
+
+}
+#else
+static int __init mempolicy_sysfs_init(void)
+{
+	/*
+	 * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
+	 * MPOL_INTERLEAVE behavior, but is still defined separately to
+	 * allow task-local weighted interleave to operate as intended.
+	 */
+	memset(&iw_table, 1, sizeof(iw_table));
+	return 0;
+}
+#endif /* CONFIG_SYSFS */
+late_initcall(mempolicy_sysfs_init);
-- 
2.39.1



* [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
  2023-12-23 18:10 ` [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2023-12-27  8:32   ` Huang, Ying
  2023-12-23 18:10 ` [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Gregory Price
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Srinivasulu Thanneeru

When a system has multiple NUMA nodes and becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be a wise option.

However, if those NUMA nodes consist of different types of memory such
as having local DRAM and CXL memory together, the current round-robin
based interleaving policy doesn't maximize the overall bandwidth because
of their different bandwidth characteristics.

Instead, the interleaving can be more efficient when the allocation
policy follows each NUMA node's bandwidth weight rather than using a
1:1 round-robin allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, which
enables weighted interleaving between NUMA nodes.  Weighted interleave
allows for a proportional distribution of memory across multiple numa
nodes, preferably apportioned to match the bandwidth capacity of each
node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight array exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the
set weights.  For example, if the weights are (2,1), then 2 pages
will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

There are 3 integration points:

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node.  When the weight reaches 0, switch
    to the next node. Applied by `mempolicy_slab_node()` and
    `policy_nodemask()`

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Applied by `policy_nodemask()` and `mpol_misplaced()`

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight (pol->cur_weight) will be allocated first, before
    the remaining bulk calculation is done. This simplifies the
    calculation at the cost of an additional allocation call.

One piece of complexity is the interaction between a recent refactor
which split the logic to acquire the "ilx" (interleave index) of an
allocation and the actual application of the interleave.  The
calculation of the `interleave index` is done by `get_vma_policy()`,
while the actual selection of the node will later be applied by the
relevant weighted_interleave function.

If CONFIG_SYSFS is disabled, the weight table will be initialized
to set all nodes to weight 1, but the weighting code is still called.
This is so that task-local weights (future patch) can still be
engaged cleanly without ifdef spaghetti.

Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  11 +
 include/linux/mempolicy.h                     |   5 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 197 +++++++++++++++++-
 4 files changed, 211 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..d2c8e712785b 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,17 @@ MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_WEIGHTED_INTERLEAVE
+	This mode operates the same as MPOL_INTERLEAVE, except that
+	interleaving behavior is executed based on weights set in
+	/sys/kernel/mm/mempolicy/weighted_interleave/
+
+	Weighted interleave allocates pages on nodes according to
+	their weight.  For example if nodes [0,1] are weighted [5,2]
+	respectively, 5 pages will be allocated on node0 for every
+	2 pages allocated on node1.  This can better distribute data
+	according to bandwidth on heterogeneous memory systems.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..ba09167e80f7 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,11 @@ struct mempolicy {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
 	} w;
+
+	/* Weighted interleave settings */
+	struct {
+		unsigned char cur_weight;
+	} wil;
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_WEIGHTED_INTERLEAVE,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e77633b07a5..0a180c670f0c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -305,6 +305,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->wil.cur_weight = 0;
 
 	return policy;
 }
@@ -417,6 +418,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_WEIGHTED_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -838,7 +843,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
 		current->il_prev = MAX_NUMNODES-1;
 	task_unlock(current);
 	mpol_put(old);
@@ -864,6 +870,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*nodes = pol->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -948,6 +955,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
 			*policy = next_node_in(current->il_prev, pol->nodes);
+		} else if (pol == current->mempolicy &&
+				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
+			if (pol->wil.cur_weight)
+				*policy = current->il_prev;
+			else
+				*policy = next_node_in(current->il_prev,
+						       pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1777,7 +1791,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	pol = __get_vma_policy(vma, addr, ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
 		*ilx += vma->vm_pgoff >> order;
 		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
 	}
@@ -1827,6 +1842,24 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+	unsigned int next;
+	struct task_struct *me = current;
+
+	next = next_node_in(me->il_prev, policy->nodes);
+	if (next == MAX_NUMNODES)
+		return next;
+
+	if (!policy->wil.cur_weight)
+		policy->wil.cur_weight = iw_table[next];
+
+	policy->wil.cur_weight--;
+	if (!policy->wil.cur_weight)
+		me->il_prev = next;
+	return next;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
@@ -1861,6 +1894,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		return weighted_interleave_nodes(policy);
+
 	case MPOL_BIND:
 	case MPOL_PREFERRED_MANY:
 	{
@@ -1885,6 +1921,41 @@ unsigned int mempolicy_slab_node(void)
 	}
 }
 
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+	nodemask_t nodemask = pol->nodes;
+	unsigned int target, weight_total = 0;
+	int nid;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char weight;
+
+	barrier();
+
+	/* first ensure we have a valid nodemask */
+	nid = first_node(nodemask);
+	if (nid == MAX_NUMNODES)
+		return nid;
+
+	/* Then collect weights on stack and calculate totals */
+	for_each_node_mask(nid, nodemask) {
+		weight = iw_table[nid];
+		weight_total += weight;
+		weights[nid] = weight;
+	}
+
+	/* Finally, calculate the node offset based on totals */
+	target = (unsigned int)ilx % weight_total;
+	nid = first_node(nodemask);
+	while (target) {
+		weight = weights[nid];
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+	return nid;
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1953,6 +2024,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
 			interleave_nodes(pol) : interleave_nid(pol, ilx);
 		break;
+	case MPOL_WEIGHTED_INTERLEAVE:
+		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
+			weighted_interleave_nodes(pol) :
+			weighted_interleave_nid(pol, ilx);
+		break;
 	}
 
 	return nodemask;
@@ -2014,6 +2090,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*mask = mempolicy->nodes;
 		break;
 
@@ -2113,7 +2190,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		 * If the policy is interleave or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		if (pol->mode != MPOL_INTERLEAVE &&
+		if ((pol->mode != MPOL_INTERLEAVE &&
+		    pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
 		    (!nodemask || node_isset(nid, *nodemask))) {
 			/*
 			 * First, try to allocate THP only on local node, but
@@ -2249,6 +2327,106 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 	return total_allocated;
 }
 
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	struct task_struct *me = current;
+	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	unsigned char weight;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned int weight_total = 0;
+	unsigned long rem_pages = nr_pages;
+	nodemask_t nodes = pol->nodes;
+	int nnodes, node, prev_node;
+	int i;
+
+	/* Stabilize the nodemask on the stack */
+	barrier();
+
+	nnodes = nodes_weight(nodes);
+
+	/* Collect weights and save them on stack so they don't change */
+	for_each_node_mask(node, nodes) {
+		weight = iw_table[node];
+		weight_total += weight;
+		weights[node] = weight;
+	}
+
+	/* Continue allocating from most recent node and adjust the nr_pages */
+	if (pol->wil.cur_weight) {
+		node = next_node_in(me->il_prev, nodes);
+		node_pages = pol->wil.cur_weight;
+		if (node_pages > rem_pages)
+			node_pages = rem_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (rem_pages <= pol->wil.cur_weight) {
+			pol->wil.cur_weight -= rem_pages;
+			return total_allocated;
+		}
+		/* Otherwise we adjust nr_pages down, and continue from there */
+		rem_pages -= pol->wil.cur_weight;
+		pol->wil.cur_weight = 0;
+		prev_node = node;
+	}
+
+	/* Now we can continue allocating as if from 0 instead of an offset */
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, nodes);
+		weight = weights[node];
+		node_pages = weight * rounds;
+		if (delta) {
+			if (delta > weight) {
+				node_pages += weight;
+				delta -= weight;
+			} else {
+				node_pages += delta;
+				delta = 0;
+			}
+		}
+		/* We may not make it all the way around */
+		if (!node_pages)
+			break;
+		/* If an over-allocation would occur, floor it */
+		if (node_pages + total_allocated > nr_pages) {
+			node_pages = nr_pages - total_allocated;
+			delta = 0;
+		}
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		prev_node = node;
+	}
+
+	/*
+	 * Finally, we need to update me->il_prev and pol->wil.cur_weight
+	 * if there were overflow pages, but not equivalent to the node
+	 * weight, set the cur_weight to node_weight - delta and the
+	 * me->il_prev to the previous node. Otherwise if it was perfect
+	 * we can simply set il_prev to node and cur_weight to 0
+	 */
+	if (node_pages) {
+		me->il_prev = prev_node;
+		node_pages %= weight;
+		pol->wil.cur_weight = weight - node_pages;
+	} else {
+		me->il_prev = node;
+		pol->wil.cur_weight = 0;
+	}
+
+	return total_allocated;
+}
+
 static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
@@ -2289,6 +2467,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
 		return alloc_pages_bulk_array_interleave(gfp, pol,
 							 nr_pages, page_array);
 
+	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+		return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
+								  nr_pages,
+								  page_array);
+
 	if (pol->mode == MPOL_PREFERRED_MANY)
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
@@ -2364,6 +2547,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2500,6 +2684,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		polnid = interleave_nid(pol, ilx);
 		break;
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		polnid = weighted_interleave_nid(pol, ilx);
+		break;
+
 	case MPOL_PREFERRED:
 		if (node_isset(curnid, pol->nodes))
 			goto out;
@@ -2874,6 +3062,7 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
+	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
 	[MPOL_LOCAL]      = "local",
 	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
@@ -2933,6 +3122,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		}
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		/*
 		 * Default to online nodes with memory if no nodelist
 		 */
@@ -3043,6 +3233,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default:
-- 
2.39.1
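
To make the distribution arithmetic in alloc_pages_bulk_array_weighted_interleave()
above easier to follow, here is a small self-contained userspace sketch of the
rounds/delta split.  The weights and page count are invented illustration values,
not taken from the patch:

/* illustrative only, not kernel code; build with: gcc rounds_delta.c */
#include <stdio.h>

int main(void)
{
	unsigned char weights[] = { 3, 1 };	/* e.g. node0 weight 3, node1 weight 1 */
	unsigned long nr_pages = 10, weight_total = 0, rounds, delta, node_pages;
	int i;

	for (i = 0; i < 2; i++)
		weight_total += weights[i];

	rounds = nr_pages / weight_total;	/* full passes over the nodemask */
	delta = nr_pages % weight_total;	/* leftover pages, spread in weight order */

	for (i = 0; i < 2; i++) {
		node_pages = weights[i] * rounds;
		if (delta > weights[i]) {
			node_pages += weights[i];
			delta -= weights[i];
		} else {
			node_pages += delta;
			delta = 0;
		}
		printf("node%d gets %lu pages\n", i, node_pages);
	}
	return 0;	/* prints: node0 gets 8 pages, node1 gets 2 pages */
}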


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
  2023-12-23 18:10 ` [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
  2023-12-23 18:10 ` [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2023-12-27  8:39   ` Huang, Ying
  2023-12-23 18:10 ` [PATCH v5 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies Gregory Price
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

Split sanitize_mpol_flags into separate sanitize and validate steps.

sanitize_mpol_flags is used by set_mempolicy/mbind to split the
combined (int mode) argument into mode and mode_flags, and then
validates them.

validate_mpol_flags validates mode and mode_flags that have already
been split.

validate_mpol_flags will be reused by new syscalls that accept mode
and mode_flags as separate arguments.
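
As a rough illustration of the split/validate calling convention (a
userspace sketch only; the MPOL_* values are copied from the uapi
header, with MPOL_F_NUMA_BALANCING assumed to be 1 << 13):

#include <stdio.h>

#define MPOL_INTERLEAVE		3
#define MPOL_F_STATIC_NODES	(1 << 15)
#define MPOL_F_RELATIVE_NODES	(1 << 14)
#define MPOL_F_NUMA_BALANCING	(1 << 13)
#define MPOL_MODE_FLAGS \
	(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES | MPOL_F_NUMA_BALANCING)

int main(void)
{
	/* legacy set_mempolicy/mbind pack the flags into the mode argument... */
	int combined = MPOL_INTERLEAVE | MPOL_F_STATIC_NODES;

	/* ...which the sanitize step splits apart before validation */
	unsigned int flags = combined & MPOL_MODE_FLAGS;
	unsigned int mode  = combined & ~MPOL_MODE_FLAGS;

	/* the new syscalls pass mode and flags separately, so only validation is needed */
	printf("mode=%u flags=0x%x\n", mode, flags);
	return 0;
}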

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0a180c670f0c..59ac0da24f56 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1463,24 +1463,39 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
 	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
 }
 
-/* Basic parameter sanity check used by both mbind() and set_mempolicy() */
-static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
+/*
+ * Basic parameter sanity check used by mbind/set_mempolicy
+ * May modify flags to include internal flags (e.g. MPOL_F_MOF/F_MORON)
+ */
+static inline int validate_mpol_flags(unsigned short mode, unsigned short *flags)
 {
-	*flags = *mode & MPOL_MODE_FLAGS;
-	*mode &= ~MPOL_MODE_FLAGS;
-
-	if ((unsigned int)(*mode) >=  MPOL_MAX)
+	if ((unsigned int)(mode) >= MPOL_MAX)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
 	if (*flags & MPOL_F_NUMA_BALANCING) {
-		if (*mode != MPOL_BIND)
+		if (mode != MPOL_BIND)
 			return -EINVAL;
 		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
 	}
 	return 0;
 }
 
+/*
+ * Used by mbind/set_mempolicy to split and validate mode/flags.
+ * set_mempolicy combines (mode | flags); split them into separate
+ * fields, returning the mode in mode_arg and the flags in flags.
+ */
+static inline int sanitize_mpol_flags(int *mode_arg, unsigned short *flags)
+{
+	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
+
+	*flags = *mode_arg & MPOL_MODE_FLAGS;
+	*mode_arg = mode;
+
+	return validate_mpol_flags(mode, flags);
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (2 preceding siblings ...)
  2023-12-23 18:10 ` [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2023-12-23 18:10 ` [PATCH v5 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

This patch adds a new kernel structure `struct mempolicy_args`,
intended to be used for an extensible get/set_mempolicy interface.

This implements the fields required to support the existing syscall
interfaces, but does not expose any user-facing argument structure.

mpol_new is refactored to take the argument structure so that future
mempolicy extensions can all be managed in the mempolicy constructor.

The set_mempolicy and mbind syscalls are refactored to utilize the
new argument structure, as are all other callers of mpol_new() and
do_set_mempolicy().

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 include/linux/mempolicy.h | 11 +++++++
 mm/mempolicy.c            | 69 +++++++++++++++++++++++++++++----------
 2 files changed, 62 insertions(+), 18 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ba09167e80f7..0f1c85527626 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -61,6 +61,17 @@ struct mempolicy {
 	} wil;
 };
 
+/*
+ * Describes settings of a mempolicy during set/get syscalls and
+ * kernel internal calls to do_set_mempolicy()
+ */
+struct mempolicy_args {
+	unsigned short mode;		/* policy mode */
+	unsigned short mode_flags;	/* policy mode flags */
+	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
+	nodemask_t *policy_nodes;	/* get/set/mbind */
+};
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 59ac0da24f56..42037b7ff6d6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -265,10 +265,12 @@ static int mpol_set_nodemask(struct mempolicy *pol,
  * This function just creates a new policy, does some check and simple
  * initialization. You must invoke mpol_set_nodemask() to set nodes.
  */
-static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
-				  nodemask_t *nodes)
+static struct mempolicy *mpol_new(struct mempolicy_args *args)
 {
 	struct mempolicy *policy;
+	unsigned short mode = args->mode;
+	unsigned short flags = args->mode_flags;
+	nodemask_t *nodes = args->policy_nodes;
 
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
@@ -817,8 +819,7 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma,
 }
 
 /* Set the process memory policy */
-static long do_set_mempolicy(unsigned short mode, unsigned short flags,
-			     nodemask_t *nodes)
+static long do_set_mempolicy(struct mempolicy_args *args)
 {
 	struct mempolicy *new, *old;
 	NODEMASK_SCRATCH(scratch);
@@ -827,14 +828,14 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	if (!scratch)
 		return -ENOMEM;
 
-	new = mpol_new(mode, flags, nodes);
+	new = mpol_new(args);
 	if (IS_ERR(new)) {
 		ret = PTR_ERR(new);
 		goto out;
 	}
 
 	task_lock(current);
-	ret = mpol_set_nodemask(new, nodes, scratch);
+	ret = mpol_set_nodemask(new, args->policy_nodes, scratch);
 	if (ret) {
 		task_unlock(current);
 		mpol_put(new);
@@ -1232,8 +1233,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 #endif
 
 static long do_mbind(unsigned long start, unsigned long len,
-		     unsigned short mode, unsigned short mode_flags,
-		     nodemask_t *nmask, unsigned long flags)
+		     struct mempolicy_args *margs, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
@@ -1253,7 +1253,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (margs->mode == MPOL_DEFAULT)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = PAGE_ALIGN(len);
@@ -1264,7 +1264,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (end == start)
 		return 0;
 
-	new = mpol_new(mode, mode_flags, nmask);
+	new = mpol_new(margs);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
@@ -1281,7 +1281,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		NODEMASK_SCRATCH(scratch);
 		if (scratch) {
 			mmap_write_lock(mm);
-			err = mpol_set_nodemask(new, nmask, scratch);
+			err = mpol_set_nodemask(new, margs->policy_nodes,
+						scratch);
 			if (err)
 				mmap_write_unlock(mm);
 		} else
@@ -1295,7 +1296,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	 * Lock the VMAs before scanning for pages to migrate,
 	 * to ensure we don't miss a concurrently inserted page.
 	 */
-	nr_failed = queue_pages_range(mm, start, end, nmask,
+	nr_failed = queue_pages_range(mm, start, end, margs->policy_nodes,
 			flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist);
 
 	if (nr_failed < 0) {
@@ -1500,6 +1501,7 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
 {
+	struct mempolicy_args margs;
 	unsigned short mode_flags;
 	nodemask_t nodes;
 	int lmode = mode;
@@ -1514,7 +1516,12 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
 
-	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
+	memset(&margs, 0, sizeof(margs));
+	margs.mode = lmode;
+	margs.mode_flags = mode_flags;
+	margs.policy_nodes = &nodes;
+
+	return do_mbind(start, len, &margs, flags);
 }
 
 SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
@@ -1595,6 +1602,7 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 				 unsigned long maxnode)
 {
+	struct mempolicy_args args;
 	unsigned short mode_flags;
 	nodemask_t nodes;
 	int lmode = mode;
@@ -1608,7 +1616,12 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	if (err)
 		return err;
 
-	return do_set_mempolicy(lmode, mode_flags, &nodes);
+	memset(&args, 0, sizeof(args));
+	args.mode = lmode;
+	args.mode_flags = mode_flags;
+	args.policy_nodes = &nodes;
+
+	return do_set_mempolicy(&args);
 }
 
 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
@@ -2890,6 +2903,7 @@ static int shared_policy_replace(struct shared_policy *sp, pgoff_t start,
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 {
 	int ret;
+	struct mempolicy_args margs;
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
 	rwlock_init(&sp->lock);
@@ -2902,8 +2916,12 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		if (!scratch)
 			goto put_mpol;
 
+		memset(&margs, 0, sizeof(margs));
+		margs.mode = mpol->mode;
+		margs.mode_flags = mpol->flags;
+		margs.policy_nodes = &mpol->w.user_nodemask;
 		/* contextualize the tmpfs mount point mempolicy to this file */
-		npol = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
+		npol = mpol_new(&margs);
 		if (IS_ERR(npol))
 			goto free_scratch; /* no valid nodemask intersection */
 
@@ -3011,6 +3029,7 @@ static inline void __init check_numabalancing_enable(void)
 
 void __init numa_policy_init(void)
 {
+	struct mempolicy_args args;
 	nodemask_t interleave_nodes;
 	unsigned long largest = 0;
 	int nid, prefer = 0;
@@ -3056,7 +3075,11 @@ void __init numa_policy_init(void)
 	if (unlikely(nodes_empty(interleave_nodes)))
 		node_set(prefer, interleave_nodes);
 
-	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
+	memset(&args, 0, sizeof(args));
+	args.mode = MPOL_INTERLEAVE;
+	args.policy_nodes = &interleave_nodes;
+
+	if (do_set_mempolicy(&args))
 		pr_err("%s: interleaving failed\n", __func__);
 
 	check_numabalancing_enable();
@@ -3065,7 +3088,12 @@ void __init numa_policy_init(void)
 /* Reset policy of current process to default */
 void numa_default_policy(void)
 {
-	do_set_mempolicy(MPOL_DEFAULT, 0, NULL);
+	struct mempolicy_args args;
+
+	memset(&args, 0, sizeof(args));
+	args.mode = MPOL_DEFAULT;
+
+	do_set_mempolicy(&args);
 }
 
 /*
@@ -3095,6 +3123,7 @@ static const char * const policy_modes[] =
  */
 int mpol_parse_str(char *str, struct mempolicy **mpol)
 {
+	struct mempolicy_args margs;
 	struct mempolicy *new = NULL;
 	unsigned short mode_flags;
 	nodemask_t nodes;
@@ -3181,7 +3210,11 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 			goto out;
 	}
 
-	new = mpol_new(mode, mode_flags, &nodes);
+	memset(&margs, 0, sizeof(margs));
+	margs.mode = mode;
+	margs.mode_flags = mode_flags;
+	margs.policy_nodes = &nodes;
+	new = mpol_new(&margs);
 	if (IS_ERR(new))
 		goto out;
 
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (3 preceding siblings ...)
  2023-12-23 18:10 ` [PATCH v5 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2023-12-23 18:10 ` [PATCH v5 06/11] mm/mempolicy: allow home_node to be set by mpol_new Gregory Price
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

Pull operation flag checking from inside do_get_mempolicy out
to kernel_get_mempolicy.  This allows us to flatten the internal
code and break it into separate functions that future syscalls
(get_mempolicy2, process_get_mempolicy) can re-use, even after
additional extensions are made.

The primary change is that the flags argument is treated as the
multiplexer that it actually is.  For get_mempolicy, the flags
select 3 different primary operations:

if (flags & MPOL_F_MEMS_ALLOWED)
	return task->mems_allowed
else if (flags & MPOL_F_ADDR)
	return vma mempolicy information
else
	return task mempolicy information

Plus the behavior modifying flag:

if (flags & MPOL_F_NODE)
	change the return value of (int __user *policy)
	based on whether MPOL_F_ADDR was set.

The original behavior of get_mempolicy is retained, but we utilize
the new mempolicy_args structure to pass the operations down the
stack.  This will allow us to extend the internal functions without
affecting the legacy behavior of get_mempolicy.
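
To make the three operations concrete, a small userspace example against
the existing get_mempolicy() interface (the wrapper is declared in
libnuma's <numaif.h>; build with -lnuma, error handling omitted):

#include <numaif.h>
#include <stdio.h>

int main(void)
{
	unsigned long mask = 0;
	int val = 0;

	/* no flags: return the task policy mode and its nodemask */
	get_mempolicy(&val, &mask, sizeof(mask) * 8, NULL, 0);
	printf("task mode=%d nodemask=0x%lx\n", val, mask);

	/* MPOL_F_MEMS_ALLOWED: return the task's mems_allowed instead */
	get_mempolicy(NULL, &mask, sizeof(mask) * 8, NULL, MPOL_F_MEMS_ALLOWED);
	printf("mems_allowed=0x%lx\n", mask);

	/* MPOL_F_NODE | MPOL_F_ADDR: return the node backing an address */
	get_mempolicy(&val, NULL, 0, &val, MPOL_F_NODE | MPOL_F_ADDR);
	printf("page holding 'val' is on node %d\n", val);
	return 0;
}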

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 244 +++++++++++++++++++++++++++++++------------------
 1 file changed, 154 insertions(+), 90 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42037b7ff6d6..da84dc33a645 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -895,106 +895,109 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
 	return ret;
 }
 
-/* Retrieve NUMA policy */
-static long do_get_mempolicy(int *policy, nodemask_t *nmask,
-			     unsigned long addr, unsigned long flags)
+/* Retrieve the mems_allowed for current task */
+static inline long do_get_mems_allowed(nodemask_t *nmask)
 {
-	int err;
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	struct mempolicy *pol = current->mempolicy, *pol_refcount = NULL;
+	task_lock(current);
+	*nmask  = cpuset_current_mems_allowed;
+	task_unlock(current);
+	return 0;
+}
 
-	if (flags &
-		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
-		return -EINVAL;
+/* If the policy has additional node information to retrieve, return it */
+static long do_get_policy_node(struct mempolicy *pol)
+{
+	/*
+	 * For MPOL_INTERLEAVE, the extended node information is the next
+	 * node that will be selected for interleave. For weighted interleave
+	 * we return the next node based on the current weight.
+	 */
+	if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE)
+		return next_node_in(current->il_prev, pol->nodes);
 
-	if (flags & MPOL_F_MEMS_ALLOWED) {
-		if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
-			return -EINVAL;
-		*policy = 0;	/* just so it's initialized */
+	if (pol == current->mempolicy &&
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
+		if (pol->wil.cur_weight)
+			return current->il_prev;
+		else
+			return next_node_in(current->il_prev, pol->nodes);
+	}
+	return -EINVAL;
+}
+
+/* Handle user_nodemask condition when fetching nodemask for userspace */
+static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *nmask)
+{
+	if (mpol_store_user_nodemask(pol)) {
+		*nmask = pol->w.user_nodemask;
+	} else {
 		task_lock(current);
-		*nmask  = cpuset_current_mems_allowed;
+		get_policy_nodemask(pol, nmask);
 		task_unlock(current);
-		return 0;
 	}
+}
 
-	if (flags & MPOL_F_ADDR) {
-		pgoff_t ilx;		/* ignored here */
-		/*
-		 * Do NOT fall back to task policy if the
-		 * vma/shared policy at addr is NULL.  We
-		 * want to return MPOL_DEFAULT in this case.
-		 */
-		mmap_read_lock(mm);
-		vma = vma_lookup(mm, addr);
-		if (!vma) {
-			mmap_read_unlock(mm);
-			return -EFAULT;
-		}
-		pol = __get_vma_policy(vma, addr, &ilx);
-	} else if (addr)
-		return -EINVAL;
+/* Retrieve NUMA policy for a VMA associated with a given address */
+static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
+				 struct mempolicy_args *args)
+{
+	pgoff_t ilx;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	struct mempolicy *pol = NULL;
 
+	mmap_read_lock(mm);
+	vma = vma_lookup(mm, addr);
+	if (!vma) {
+		mmap_read_unlock(mm);
+		return -EFAULT;
+	}
+	pol = __get_vma_policy(vma, addr, &ilx);
 	if (!pol)
-		pol = &default_policy;	/* indicates default behavior */
+		pol = &default_policy;
+	else
+		mpol_get(pol);
+	mmap_read_unlock(mm);
 
-	if (flags & MPOL_F_NODE) {
-		if (flags & MPOL_F_ADDR) {
-			/*
-			 * Take a refcount on the mpol, because we are about to
-			 * drop the mmap_lock, after which only "pol" remains
-			 * valid, "vma" is stale.
-			 */
-			pol_refcount = pol;
-			vma = NULL;
-			mpol_get(pol);
-			mmap_read_unlock(mm);
-			err = lookup_node(mm, addr);
-			if (err < 0)
-				goto out;
-			*policy = err;
-		} else if (pol == current->mempolicy &&
-				pol->mode == MPOL_INTERLEAVE) {
-			*policy = next_node_in(current->il_prev, pol->nodes);
-		} else if (pol == current->mempolicy &&
-				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
-			if (pol->wil.cur_weight)
-				*policy = current->il_prev;
-			else
-				*policy = next_node_in(current->il_prev,
-						       pol->nodes);
-		} else {
-			err = -EINVAL;
-			goto out;
-		}
-	} else {
-		*policy = pol == &default_policy ? MPOL_DEFAULT :
-						pol->mode;
-		/*
-		 * Internal mempolicy flags must be masked off before exposing
-		 * the policy to userspace.
-		 */
-		*policy |= (pol->flags & MPOL_MODE_FLAGS);
-	}
+	/* Fetch the node for the given address */
+	if (addr_node)
+		*addr_node = lookup_node(mm, addr);
 
-	err = 0;
-	if (nmask) {
-		if (mpol_store_user_nodemask(pol)) {
-			*nmask = pol->w.user_nodemask;
-		} else {
-			task_lock(current);
-			get_policy_nodemask(pol, nmask);
-			task_unlock(current);
-		}
+	args->mode = pol == &default_policy ? MPOL_DEFAULT : pol->mode;
+	args->mode_flags = (pol->flags & MPOL_MODE_FLAGS);
+	args->home_node = pol->home_node;
+
+	if (args->policy_nodes)
+		do_get_mempolicy_nodemask(pol, args->policy_nodes);
+
+	if (pol != &default_policy) {
+		mpol_put(pol);
+		mpol_cond_put(pol);
 	}
 
- out:
-	mpol_cond_put(pol);
-	if (vma)
-		mmap_read_unlock(mm);
-	if (pol_refcount)
-		mpol_put(pol_refcount);
-	return err;
+	return 0;
+}
+
+/* Retrieve NUMA policy for the current task */
+static long do_get_task_mempolicy(struct mempolicy_args *args, int *pol_node)
+{
+	struct mempolicy *pol = current->mempolicy;
+
+	if (!pol)
+		pol = &default_policy;	/* indicates default behavior */
+
+	args->mode = pol == &default_policy ? MPOL_DEFAULT : pol->mode;
+	/* Internal flags must be masked off before exposing to userspace */
+	args->mode_flags = (pol->flags & MPOL_MODE_FLAGS);
+	args->home_node = NUMA_NO_NODE;
+
+	if (pol_node)
+		*pol_node = do_get_policy_node(pol);
+
+	if (args->policy_nodes)
+		do_get_mempolicy_nodemask(pol, args->policy_nodes);
+
+	return 0;
 }
 
 #ifdef CONFIG_MIGRATION
@@ -1731,16 +1734,77 @@ static int kernel_get_mempolicy(int __user *policy,
 				unsigned long addr,
 				unsigned long flags)
 {
+	struct mempolicy_args args;
 	int err;
-	int pval;
+	int address_node = NUMA_NO_NODE;
+	int pval = 0;
+	int pol_node = 0;
 	nodemask_t nodes;
 
 	if (nmask != NULL && maxnode < nr_node_ids)
 		return -EINVAL;
 
-	addr = untagged_addr(addr);
+	if (flags &
+		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
+		return -EINVAL;
 
-	err = do_get_mempolicy(&pval, &nodes, addr, flags);
+	/* Ensure any data that may be copied to userland is initialized */
+	memset(&args, 0, sizeof(args));
+	args.policy_nodes = &nodes;
+
+	/*
+	 * get_mempolicy was originally multiplexed based on 3 flags:
+	 *   MPOL_F_MEMS_ALLOWED:  fetch task->mems_allowed
+	 *   MPOL_F_ADDR        :  operate on vma->mempolicy
+	 *   MPOL_F_NODE        :  change return value of *policy
+	 *
+	 * Split this behavior out here, rather than internal functions,
+	 * so that the internal functions can be re-used by future
+	 * get_mempolicy2 interfaces and the arg structure made extensible
+	 */
+	if (flags & MPOL_F_MEMS_ALLOWED) {
+		if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
+			return -EINVAL;
+		pval = 0;	/* just so it's initialized */
+		err = do_get_mems_allowed(&nodes);
+	} else if (flags & MPOL_F_ADDR) {
+		/* If F_ADDR, we operate on a vma policy (or default) */
+		err = do_get_vma_mempolicy(untagged_addr(addr),
+					   &address_node, &args);
+		if (err)
+			return err;
+		 /* if (F_ADDR | F_NODE), *pval is the address' node */
+		if (flags & MPOL_F_NODE) {
+			/* if we failed to fetch, that's likely an EFAULT */
+			if (address_node < 0)
+				return address_node;
+			pval = address_node;
+		} else
+			pval = args.mode | args.mode_flags;
+	} else {
+		 /* if not F_ADDR and addr != null, EINVAL */
+		if (addr)
+			return -EINVAL;
+
+		err = do_get_task_mempolicy(&args, &pol_node);
+		if (err)
+			return err;
+		/*
+		 * if F_NODE was set and mode was MPOL_INTERLEAVE
+		 * *pval is equal to next interleave node.
+		 *
+		 * if pol_node < 0, this means the mode did not have a
+		 * compatible policy.  This presently emulates the
+		 * original behavior of (F_NODE) & (!MPOL_INTERLEAVE)
+		 * producing -EINVAL
+		 */
+		if (flags & MPOL_F_NODE) {
+			if (pol_node < 0)
+				return pol_node;
+			pval = pol_node;
+		} else
+			pval = args.mode | args.mode_flags;
+	}
 
 	if (err)
 		return err;
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 06/11] mm/mempolicy: allow home_node to be set by mpol_new
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (4 preceding siblings ...)
  2023-12-23 18:10 ` [PATCH v5 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2023-12-23 18:10 ` [PATCH v5 07/11] mm/mempolicy: add userland mempolicy arg structure Gregory Price
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

This patch adds the plumbing into mpol_new() to allow the argument
structure's home_node field to be set during mempolicy creation.

The syscall sys_set_mempolicy_home_node was added to allow a home
node to be registered for a vma.

For set_mempolicy2 and mbind2 syscalls, it would be useful to add
this as an extension to allow the user to submit a fully formed
mempolicy configuration in a single call, rather than require
multiple calls to configure a mempolicy.

This will become particularly useful if/when pidfd interfaces to
change process mempolicies from outside the task appear, as each
call to change the mempolicy does an atomic swap of that policy
in the task, rather than mutate the policy.
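
For context, a hedged sketch of the two-call sequence required today,
which an mbind2 call carrying home_node could collapse into one step.
The mbind() wrapper comes from libnuma; set_mempolicy_home_node has no
libc wrapper, so the x86_64 syscall number (450) is assumed here:

#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_set_mempolicy_home_node
#define __NR_set_mempolicy_home_node 450	/* assumed: x86_64 */
#endif

int main(void)
{
	size_t len = 2UL << 20;
	unsigned long nodes = (1UL << 0) | (1UL << 1);	/* nodes 0 and 1 */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* call 1: bind the range (home node is honored for MPOL_BIND) */
	mbind(buf, len, MPOL_BIND, &nodes, sizeof(nodes) * 8, 0);

	/* call 2: separately register node 0 as the home node for the range */
	return syscall(__NR_set_mempolicy_home_node,
		       (unsigned long)buf, len, 0UL, 0UL);
}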

Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da84dc33a645..35a0f8630ead 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -306,7 +306,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	atomic_set(&policy->refcnt, 1);
 	policy->mode = mode;
 	policy->flags = flags;
-	policy->home_node = NUMA_NO_NODE;
+	policy->home_node = args->home_node;
 	policy->wil.cur_weight = 0;
 
 	return policy;
@@ -1623,6 +1623,7 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	args.mode = lmode;
 	args.mode_flags = mode_flags;
 	args.policy_nodes = &nodes;
+	args.home_node = NUMA_NO_NODE;
 
 	return do_set_mempolicy(&args);
 }
@@ -2984,6 +2985,8 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		margs.mode = mpol->mode;
 		margs.mode_flags = mpol->flags;
 		margs.policy_nodes = &mpol->w.user_nodemask;
+		margs.home_node = NUMA_NO_NODE;
+
 		/* contextualize the tmpfs mount point mempolicy to this file */
 		npol = mpol_new(&margs);
 		if (IS_ERR(npol))
@@ -3142,6 +3145,7 @@ void __init numa_policy_init(void)
 	memset(&args, 0, sizeof(args));
 	args.mode = MPOL_INTERLEAVE;
 	args.policy_nodes = &interleave_nodes;
+	args.home_node = NUMA_NO_NODE;
 
 	if (do_set_mempolicy(&args))
 		pr_err("%s: interleaving failed\n", __func__);
@@ -3156,6 +3160,7 @@ void numa_default_policy(void)
 
 	memset(&args, 0, sizeof(args));
 	args.mode = MPOL_DEFAULT;
+	args.home_node = NUMA_NO_NODE;
 
 	do_set_mempolicy(&args);
 }
@@ -3278,6 +3283,8 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	margs.mode = mode;
 	margs.mode_flags = mode_flags;
 	margs.policy_nodes = &nodes;
+	margs.home_node = NUMA_NO_NODE;
+
 	new = mpol_new(&margs);
 	if (IS_ERR(new))
 		goto out;
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 07/11] mm/mempolicy: add userland mempolicy arg structure
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (5 preceding siblings ...)
  2023-12-23 18:10 ` [PATCH v5 06/11] mm/mempolicy: allow home_node to be set by mpol_new Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2023-12-23 18:10 ` [PATCH v5 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha,
	Frank van der Linden

This patch adds the new user-api argument structure intended for
set_mempolicy2 and mbind2.

struct mpol_args {
  __u16 mode;
  __u16 mode_flags;
  __s32 home_node;          /* mbind2: policy home node */
  __u64 pol_maxnodes;
  __aligned_u64 *pol_nodes;
};

This structure is intended to be extensible as new mempolicy extensions
are added.

For example, set_mempolicy_home_node was added to allow vma mempolicies
to have a preferred/home node assigned.  This structure allows the
home node to be set at the time the mempolicy is created, rather than
requiring an additional syscall.

Full breakdown of arguments as of this patch:
    mode:         Mempolicy mode (MPOL_DEFAULT, MPOL_INTERLEAVE)

    mode_flags:   Flags previously or'd into mode in set_mempolicy
                  (e.g.: MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES)

    home_node:    for mbind2.  Allows the setting of a policy's home
                  node with the use of MPOL_MF_HOME_NODE

    pol_maxnodes: Max number of nodes in the policy nodemask

    pol_nodes:    Policy nodemask

Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst       | 17 +++++++++++++++++
 include/linux/syscalls.h                        |  1 +
 include/uapi/linux/mempolicy.h                  |  8 ++++++++
 3 files changed, 26 insertions(+)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index d2c8e712785b..5ee047b0d981 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -482,6 +482,23 @@ closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.
 
+Extended Mempolicy Arguments::
+
+	struct mpol_args {
+		__u16 mode;
+		__u16 mode_flags;
+		__s32 home_node;	 /* mbind2: set home node */
+		__u64 pol_maxnodes;
+		__aligned_u64 pol_nodes; /* nodemask pointer */
+	};
+
+The extended mempolicy argument structure is defined to allow the mempolicy
+interfaces future extensibility without the need for additional system calls.
+
+The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
+all interfaces relative to their non-extended counterparts. Each additional
+field may only apply to specific extended interfaces.  See the respective
+extended interface man page for more details.
 
 Memory Policy Command Line Interface
 ====================================
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..a52395ca3f00 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -74,6 +74,7 @@ struct landlock_ruleset_attr;
 enum landlock_rule_type;
 struct cachestat_range;
 struct cachestat;
+struct mpol_args;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 1f9bb10d1a47..4dd2d2e0d2ed 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -27,6 +27,14 @@ enum {
 	MPOL_MAX,	/* always last member of enum */
 };
 
+struct mpol_args {
+	__u16 mode;
+	__u16 mode_flags;
+	__s32 home_node;	/* mbind2: policy home node */
+	__u64 pol_maxnodes;
+	__aligned_u64 pol_nodes;
+};
+
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 08/11] mm/mempolicy: add set_mempolicy2 syscall
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (6 preceding siblings ...)
  2023-12-23 18:10 ` [PATCH v5 07/11] mm/mempolicy: add userland mempolicy arg structure Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2024-01-02 14:38   ` Geert Uytterhoeven
  2023-12-23 18:10 ` [PATCH v5 09/11] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko

set_mempolicy2 is an extensible set_mempolicy interface which allows
a user to set the per-task memory policy.

Defined as:

set_mempolicy2(struct mpol_args *args, size_t size, unsigned long flags);

relevant mpol_args fields include the following:

mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)
mode_flags:   The MPOL_F_* flags that were previously passed in or'd
              into the mode.  This is split out to give future
              extensions additional mode/flag space.
home_node:    ignored (see note below)
pol_nodes:    the nodemask to apply for the memory policy
pol_maxnodes: The max number of nodes described by pol_nodes

The usize arg is intended for the user to pass in sizeof(mpol_args)
to allow forward/backward compatibility whenever possible.

The flags argument is intended to future-proof the syscall against
extensions which may require interpreting the arguments in the
structure differently.

Semantics of `set_mempolicy2` are otherwise the same as `set_mempolicy`
as of this patch.

As of this patch, setting the home node of a task-policy is not
supported, as this functionality was not supported by set_mempolicy.
Additional research should be done to determine whether adding this
functionality is safe, but doing so would only require setting
MPOL_MF_HOME_NODE and providing a valid home node value.
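
A hedged userspace sketch of invoking the proposed syscall.  The syscall
number (457) and the struct layout are taken from this series and exist
in no released kernel or libc, so everything below is illustrative only:

#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_set_mempolicy2
#define __NR_set_mempolicy2 457		/* number proposed by this series */
#endif

struct mpol_args {			/* layout proposed by this series */
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint64_t pol_maxnodes;
	uint64_t pol_nodes;		/* pointer to the nodemask */
};

int main(void)
{
	unsigned long nodemask = (1UL << 0) | (1UL << 1);	/* nodes 0 and 1 */
	struct mpol_args args;

	memset(&args, 0, sizeof(args));
	args.mode = 3;			/* MPOL_INTERLEAVE in the uapi enum */
	args.pol_nodes = (uint64_t)(uintptr_t)&nodemask;
	args.pol_maxnodes = sizeof(nodemask) * 8;

	/* usize lets older/newer kernels and userspace interoperate */
	return syscall(__NR_set_mempolicy2, &args, sizeof(args), 0);
}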

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 10 ++++++
 arch/alpha/kernel/syscalls/syscall.tbl        |  1 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/arm64/include/asm/unistd.h               |  2 +-
 arch/arm64/include/asm/unistd32.h             |  2 ++
 arch/m68k/kernel/syscalls/syscall.tbl         |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  1 +
 arch/s390/kernel/syscalls/syscall.tbl         |  1 +
 arch/sh/kernel/syscalls/syscall.tbl           |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  1 +
 include/linux/syscalls.h                      |  2 ++
 include/uapi/asm-generic/unistd.h             |  4 ++-
 kernel/sys_ni.c                               |  1 +
 mm/mempolicy.c                                | 36 +++++++++++++++++++
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |  1 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |  1 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |  1 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |  1 +
 25 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 5ee047b0d981..4720978ab1c2 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -432,6 +432,8 @@ Set [Task] Memory Policy::
 
 	long set_mempolicy(int mode, const unsigned long *nmask,
 					unsigned long maxnode);
+	long set_mempolicy2(struct mpol_args args, size_t size,
+			    unsigned long flags);
 
 Set's the calling task's "task/process memory policy" to mode
 specified by the 'mode' argument and the set of nodes defined by
@@ -440,6 +442,12 @@ specified by the 'mode' argument and the set of nodes defined by
 'mode' argument with the flag (for example: MPOL_INTERLEAVE |
 MPOL_F_STATIC_NODES).
 
+set_mempolicy2() is an extended version of set_mempolicy() capable
+of setting a mempolicy which requires more information than can be
+passed via set_mempolicy().  For example, weighted interleave with
+task-local weights requires a weight array to be passed via the
+'mpol_args->il_weights' argument in the 'struct mpol_args' arg.
+
 See the set_mempolicy(2) man page for more details
 
 
@@ -495,6 +503,8 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
+Extended interfaces (set_mempolicy2) use this argument structure.
+
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
 field may only apply to specific extended interfaces.  See the respective
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 18c842ca6c32..0dc288a1118a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -496,3 +496,4 @@
 564	common	futex_wake			sys_futex_wake
 565	common	futex_wait			sys_futex_wait
 566	common	futex_requeue			sys_futex_requeue
+567	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 584f9528c996..50172ec0e1f5 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -470,3 +470,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 531effca5f1f..298313d2e0af 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		457
+#define __NR_compat_syscalls		458
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 9f7c1bf99526..cee8d669c342 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -919,6 +919,8 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_set_mempolicy2 457
+__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7a4b780e82cb..839d90c535f2 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 5b6a0b02b7de..567c8b883735 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -462,3 +462,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index a842b41c8e06..cc0640e16f2f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -395,3 +395,4 @@
 454	n32	futex_wake			sys_futex_wake
 455	n32	futex_wait			sys_futex_wait
 456	n32	futex_requeue			sys_futex_requeue
+457	n32	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 525cc54bc63b..f7262fde98d9 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -444,3 +444,4 @@
 454	o32	futex_wake			sys_futex_wake
 455	o32	futex_wait			sys_futex_wait
 456	o32	futex_requeue			sys_futex_requeue
+457	o32	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index a47798fed54e..e10f0e8bd064 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -455,3 +455,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 7fab411378f2..4f03f5f42b78 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -543,3 +543,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 86fec9b080f6..f98dadc2e9df 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -459,3 +459,4 @@
 454  common	futex_wake		sys_futex_wake			sys_futex_wake
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
+457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 363fae0fe9bf..f47ba9f2d05d 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -459,3 +459,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7bcaa3d5ea44..53fb16616728 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -502,3 +502,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c8fac5205803..4b4dc41b24ee 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -461,3 +461,4 @@
 454	i386	futex_wake		sys_futex_wake
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
+457	i386	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..1bc2190bec27 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 06eefa9c1458..e26dc89399eb 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a52395ca3f00..451f0089601f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -823,6 +823,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long addr, unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
+asmlinkage long sys_set_mempolicy2(struct mpol_args __user *args, size_t size,
+				   unsigned long flags);
 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *from,
 				const unsigned long __user *to);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..55486aba099f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_set_mempolicy2 457
+__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 9a846439b36a..fa1373c8bff8 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -189,6 +189,7 @@ COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL(get_mempolicy);
 COND_SYSCALL(set_mempolicy);
+COND_SYSCALL(set_mempolicy2);
 COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 35a0f8630ead..d1abb1fc5a53 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1634,6 +1634,42 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
 	return kernel_set_mempolicy(mode, nmask, maxnode);
 }
 
+SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	int err;
+	nodemask_t policy_nodemask;
+	unsigned long __user *nodes_ptr;
+
+	if (flags)
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return err;
+
+	err = validate_mpol_flags(kargs.mode, &kargs.mode_flags);
+	if (err)
+		return err;
+
+	memset(&margs, 0, sizeof(margs));
+	margs.mode = kargs.mode;
+	margs.mode_flags = kargs.mode_flags;
+	if (kargs.pol_nodes) {
+		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
+		err = get_nodes(&policy_nodemask, nodes_ptr,
+				kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.policy_nodes = &policy_nodemask;
+	} else
+		margs.policy_nodes = NULL;
+
+	return do_set_mempolicy(&margs);
+}
+
 static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *old_nodes,
 				const unsigned long __user *new_nodes)
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index 116ff501bf92..bb1351df51d9 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -371,3 +371,4 @@
 454	n64	futex_wake			sys_futex_wake
 455	n64	futex_wait			sys_futex_wait
 456	n64	futex_requeue			sys_futex_requeue
+457	n64	set_mempolicy2			sys_set_mempolicy2
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 7fab411378f2..4f03f5f42b78 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -543,3 +543,4 @@
 454	common	futex_wake			sys_futex_wake
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
+457	common	set_mempolicy2			sys_set_mempolicy2
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index 86fec9b080f6..f98dadc2e9df 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -459,3 +459,4 @@
 454  common	futex_wake		sys_futex_wake			sys_futex_wake
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
+457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..21f2579679d4 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common 	set_mempolicy2		sys_set_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 09/11] mm/mempolicy: add get_mempolicy2 syscall
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (7 preceding siblings ...)
  2023-12-23 18:10 ` [PATCH v5 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
@ 2023-12-23 18:10 ` Gregory Price
  2024-01-02 14:46   ` Geert Uytterhoeven
  2023-12-23 18:11 ` [PATCH v5 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:10 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko

get_mempolicy2 is an extensible get_mempolicy interface which allows
a user to retrieve the memory policy for a task or address.

Defined as:

get_mempolicy2(struct mpol_args *args, size_t size,
               unsigned long addr, unsigned long flags)

Top level input values:

mpol_args:    The structure which collects information about the
              mempolicy returned to userspace.
addr:         if MPOL_F_ADDR is passed in `flags`, this address will be
              used to return the mempolicy details of the vma the
              address belongs to
flags:        if MPOL_F_ADDR, return mempolicy info for the vma
              containing addr; else, return task mempolicy information

Input values include the following fields of mpol_args:

pol_nodes:    if set, the nodemask of the policy returned here
pol_maxnodes: if pol_nodes is set, must describe max number of nodes
              to be copied to pol_nodes

Output values include the following fields of mpol_args:

mode:         mempolicy mode
mode_flags:   mempolicy mode flags
home_node:    policy home node will be returned here, or -1 if none is set.
pol_nodes:    if set, the nodemask for the mempolicy
policy_node:  if the policy has extended node information, it will
              be placed here.  For example MPOL_INTERLEAVE will
              return the next node which will be used for allocation

MPOL_F_NODE has been dropped from get_mempolicy2 (EINVAL).
MPOL_F_MEMS_ALLOWED has been dropped from get_mempolicy2 (EINVAL).
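
A matching hedged sketch of querying the task policy through the proposed
interface (syscall number 458 and the struct layout come from this series
and may change; nothing below exists in released kernels or libc):

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_get_mempolicy2
#define __NR_get_mempolicy2 458		/* number proposed by this series */
#endif

struct mpol_args {			/* layout proposed by this series */
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint64_t pol_maxnodes;
	uint64_t pol_nodes;
};

int main(void)
{
	unsigned long nodemask = 0;
	struct mpol_args args;

	memset(&args, 0, sizeof(args));
	args.pol_nodes = (uint64_t)(uintptr_t)&nodemask;
	args.pol_maxnodes = sizeof(nodemask) * 8;

	/* addr = 0 and flags = 0: query the calling task's policy */
	if (syscall(__NR_get_mempolicy2, &args, sizeof(args), 0UL, 0UL))
		return 1;

	printf("mode=%u flags=0x%x home_node=%d nodemask=0x%lx\n",
	       (unsigned)args.mode, (unsigned)args.mode_flags,
	       (int)args.home_node, nodemask);
	return 0;
}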

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 11 ++++-
 arch/alpha/kernel/syscalls/syscall.tbl        |  1 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/arm64/include/asm/unistd.h               |  2 +-
 arch/arm64/include/asm/unistd32.h             |  2 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  1 +
 arch/s390/kernel/syscalls/syscall.tbl         |  1 +
 arch/sh/kernel/syscalls/syscall.tbl           |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  1 +
 include/linux/syscalls.h                      |  2 +
 include/uapi/asm-generic/unistd.h             |  4 +-
 kernel/sys_ni.c                               |  1 +
 mm/mempolicy.c                                | 42 +++++++++++++++++++
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |  1 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |  1 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |  1 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |  1 +
 25 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 4720978ab1c2..f50b7f7ddbf9 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -456,11 +456,20 @@ Get [Task] Memory Policy or Related Information::
 	long get_mempolicy(int *mode,
 			   const unsigned long *nmask, unsigned long maxnode,
 			   void *addr, int flags);
+	long get_mempolicy2(struct mpol_args args, size_t size,
+			    unsigned long addr, unsigned long flags);
 
 Queries the "task/process memory policy" of the calling task, or the
 policy or location of a specified virtual address, depending on the
 'flags' argument.
 
+get_mempolicy2() is an extended version of get_mempolicy() capable of
+acquiring extended information about a mempolicy, including those
+that can only be set via set_mempolicy2() or mbind2().
+
+MPOL_F_NODE functionality has been removed from get_mempolicy2(),
+but can still be accessed via get_mempolicy().
+
 See the get_mempolicy(2) man page for more details
 
 
@@ -503,7 +512,7 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
-Extended interfaces (set_mempolicy2) use this argument structure.
+Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
 
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0dc288a1118a..0301a8b0a262 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -497,3 +497,4 @@
 565	common	futex_wait			sys_futex_wait
 566	common	futex_requeue			sys_futex_requeue
 567	common	set_mempolicy2			sys_set_mempolicy2
+568	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 50172ec0e1f5..771a33446e8e 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -471,3 +471,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 298313d2e0af..b63f870debaf 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		458
+#define __NR_compat_syscalls		459
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index cee8d669c342..f8d01007aee0 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -921,6 +921,8 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 #define __NR_set_mempolicy2 457
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 458
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 839d90c535f2..048a409e684c 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 567c8b883735..327b01bd6793 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -463,3 +463,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index cc0640e16f2f..921d58e1da23 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -396,3 +396,4 @@
 455	n32	futex_wait			sys_futex_wait
 456	n32	futex_requeue			sys_futex_requeue
 457	n32	set_mempolicy2			sys_set_mempolicy2
+458	n32	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index f7262fde98d9..9271c83c9993 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -445,3 +445,4 @@
 455	o32	futex_wait			sys_futex_wait
 456	o32	futex_requeue			sys_futex_requeue
 457	o32	set_mempolicy2			sys_set_mempolicy2
+458	o32	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index e10f0e8bd064..0654f3f89fc7 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4f03f5f42b78..ac11d2064e7a 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -544,3 +544,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index f98dadc2e9df..1cdcafe1ccca 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
+458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index f47ba9f2d05d..f71742024c29 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 53fb16616728..2fbf5dbe0620 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -503,3 +503,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4b4dc41b24ee..0af813b9a118 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -462,3 +462,4 @@
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
 457	i386	set_mempolicy2		sys_set_mempolicy2
+458	i386	get_mempolicy2		sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1bc2190bec27..0b777876fc15 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -379,6 +379,7 @@
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
 457	common	set_mempolicy2		sys_set_mempolicy2
+458	common	get_mempolicy2		sys_get_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index e26dc89399eb..4536c9a4227d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -428,3 +428,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 451f0089601f..f696855cbe8c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -821,6 +821,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned long addr, unsigned long flags);
+asmlinkage long sys_get_mempolicy2(struct mpol_args __user *args, size_t size,
+				   unsigned long addr, unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
 asmlinkage long sys_set_mempolicy2(struct mpol_args __user *args, size_t size,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 55486aba099f..719accc731db 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -830,9 +830,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 #define __NR_set_mempolicy2 457
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 458
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 458
+#define __NR_syscalls 459
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index fa1373c8bff8..6afbd3a41319 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -188,6 +188,7 @@ COND_SYSCALL(process_mrelease);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
 COND_SYSCALL(get_mempolicy);
+COND_SYSCALL(get_mempolicy2);
 COND_SYSCALL(set_mempolicy);
 COND_SYSCALL(set_mempolicy2);
 COND_SYSCALL(migrate_pages);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d1abb1fc5a53..f2c12a8ff7b8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1862,6 +1862,48 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
 	return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags);
 }
 
+SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, addr, unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	int err;
+	nodemask_t policy_nodemask;
+	unsigned long __user *nodes_ptr;
+
+	if (flags & ~(MPOL_F_ADDR))
+		return -EINVAL;
+
+	/* initialize any memory liable to be copied to userland */
+	memset(&margs, 0, sizeof(margs));
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return -EINVAL;
+
+	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
+	if (flags & MPOL_F_ADDR)
+		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &margs);
+	else
+		err = do_get_task_mempolicy(&margs, NULL);
+
+	if (err)
+		return err;
+
+	kargs.mode = margs.mode;
+	kargs.mode_flags = margs.mode_flags;
+	kargs.home_node = margs.home_node;
+	if (kargs.pol_nodes) {
+		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
+		err = copy_nodes_to_user(nodes_ptr, kargs.pol_maxnodes,
+					 margs.policy_nodes);
+		if (err)
+			return err;
+	}
+
+	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
+}
+
 bool vma_migratable(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & (VM_IO | VM_PFNMAP))
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index bb1351df51d9..c34c6877379e 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -372,3 +372,4 @@
 455	n64	futex_wait			sys_futex_wait
 456	n64	futex_requeue			sys_futex_requeue
 457	n64	set_mempolicy2			sys_set_mempolicy2
+458	n64	get_mempolicy2			sys_get_mempolicy2
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 4f03f5f42b78..ac11d2064e7a 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -544,3 +544,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index f98dadc2e9df..1cdcafe1ccca 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455  common	futex_wait		sys_futex_wait			sys_futex_wait
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
+458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index 21f2579679d4..edf338f32645 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -379,6 +379,7 @@
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
 457	common 	set_mempolicy2		sys_set_mempolicy2
+458	common 	get_mempolicy2		sys_get_mempolicy2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 10/11] mm/mempolicy: add the mbind2 syscall
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (8 preceding siblings ...)
  2023-12-23 18:10 ` [PATCH v5 09/11] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
@ 2023-12-23 18:11 ` Gregory Price
  2024-01-02 14:47   ` Geert Uytterhoeven
  2023-12-23 18:11 ` [PATCH v5 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Gregory Price
  2023-12-25  7:54 ` [PATCH v5 00/11] mempolicy2, mbind2, and " Huang, Ying
  11 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko,
	Frank van der Linden

mbind2 is an extensible mbind interface which allows a user to
set the mempolicy for one or more address ranges.

Defined as:

mbind2(unsigned long addr, unsigned long len, struct mpol_args *args,
       size_t size, unsigned long flags)

addr:         address of the memory range to operate on
len:          length of the memory range
flags:        MPOL_MF_HOME_NODE + original mbind() flags

Input values include the following fields of mpol_args:

mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)
mode_flags:   The MPOL_F_* flags that were previously passed in or'd
	      into the mode.  This was split out to leave room for future
	      mode/flag extensions.
home_node:    if (flags & MPOL_MF_HOME_NODE), set the home node of the
	      policy to this value; otherwise this field is ignored.
pol_maxnodes: The max number of nodes described by pol_nodes
pol_nodes:    the nodemask to apply for the memory policy

The semantics are otherwise the same as mbind(), except that
the home_node can be set.
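
As a rough illustration (not part of this series), a userspace caller
might drive mbind2 as in the sketch below.  The header location, the
__NR_mbind2 number, and the helper name are assumptions, since no libc
wrapper exists yet:

#include <linux/mempolicy.h>	/* struct mpol_args, MPOL_* definitions */
#include <sys/syscall.h>
#include <unistd.h>

/* bind a range to nodes 0 and 2, with node 2 as the home node */
static long bind_range(void *start, unsigned long len)
{
	unsigned long long nodemask = (1ULL << 0) | (1ULL << 2);
	struct mpol_args args = { 0 };

	args.mode = MPOL_BIND;
	args.pol_maxnodes = 3;		/* covers nodes 0-2 */
	args.pol_nodes = (unsigned long long)(unsigned long)&nodemask;
	args.home_node = 2;		/* used due to MPOL_MF_HOME_NODE */

	return syscall(__NR_mbind2, (unsigned long)start, len,
		       &args, sizeof(args), MPOL_MF_HOME_NODE);
}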

Suggested-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Frank van der Linden <fvdl@google.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Suggested-by: Rakie Kim <rakie.kim@sk.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     | 12 +++++-
 arch/alpha/kernel/syscalls/syscall.tbl        |  1 +
 arch/arm/tools/syscall.tbl                    |  1 +
 arch/arm64/include/asm/unistd.h               |  2 +-
 arch/arm64/include/asm/unistd32.h             |  2 +
 arch/m68k/kernel/syscalls/syscall.tbl         |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |  1 +
 arch/s390/kernel/syscalls/syscall.tbl         |  1 +
 arch/sh/kernel/syscalls/syscall.tbl           |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |  1 +
 include/linux/syscalls.h                      |  3 ++
 include/uapi/asm-generic/unistd.h             |  4 +-
 include/uapi/linux/mempolicy.h                |  5 ++-
 kernel/sys_ni.c                               |  1 +
 mm/mempolicy.c                                | 43 +++++++++++++++++++
 .../arch/mips/entry/syscalls/syscall_n64.tbl  |  1 +
 .../arch/powerpc/entry/syscalls/syscall.tbl   |  1 +
 .../perf/arch/s390/entry/syscalls/syscall.tbl |  1 +
 .../arch/x86/entry/syscalls/syscall_64.tbl    |  1 +
 26 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index f50b7f7ddbf9..7edee775cd2f 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -478,12 +478,18 @@ Install VMA/Shared Policy for a Range of Task's Address Space::
 	long mbind(void *start, unsigned long len, int mode,
 		   const unsigned long *nmask, unsigned long maxnode,
 		   unsigned flags);
+	long mbind2(void* start, unsigned long len, struct mpol_args args,
+		    size_t size, unsigned long flags);
 
 mbind() installs the policy specified by (mode, nmask, maxnodes) as a
 VMA policy for the range of the calling task's address space specified
 by the 'start' and 'len' arguments.  Additional actions may be
 requested via the 'flags' argument.
 
+mbind2() is an extended version of mbind() capable of setting extended
+mempolicy features. For example, one can set the home node for the memory
+policy without an additional call to set_mempolicy_home_node().
+
 See the mbind(2) man page for more details.
 
 Set home node for a Range of Task's Address Space::
@@ -499,6 +505,9 @@ closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.
 
+mbind2() also provides a way for the home node to be set at the time the
+mempolicy is set. See the mbind(2) man page for more details.
+
 Extended Mempolicy Arguments::
 
 	struct mpol_args {
@@ -512,7 +521,8 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
 
-Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
+Extended interfaces (set_mempolicy2, get_mempolicy2, and mbind2) use this
+argument structure.
 
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0301a8b0a262..e8239293c35a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -498,3 +498,4 @@
 566	common	futex_requeue			sys_futex_requeue
 567	common	set_mempolicy2			sys_set_mempolicy2
 568	common	get_mempolicy2			sys_get_mempolicy2
+569	common	mbind2				sys_mbind2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 771a33446e8e..a3f39750257a 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -472,3 +472,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index b63f870debaf..abe10a833fcd 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		459
+#define __NR_compat_syscalls		460
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index f8d01007aee0..446b7f034332 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -923,6 +923,8 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
 #define __NR_get_mempolicy2 458
 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
+#define __NR_mbind2 459
+__SYSCALL(__NR_mbind2, sys_mbind2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 048a409e684c..9a12dface18e 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -458,3 +458,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 327b01bd6793..6cb740123137 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -464,3 +464,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 921d58e1da23..52cf720f8ae2 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -397,3 +397,4 @@
 456	n32	futex_requeue			sys_futex_requeue
 457	n32	set_mempolicy2			sys_set_mempolicy2
 458	n32	get_mempolicy2			sys_get_mempolicy2
+459	n32	mbind2				sys_mbind2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 9271c83c9993..fd37c5301a48 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -446,3 +446,4 @@
 456	o32	futex_requeue			sys_futex_requeue
 457	o32	set_mempolicy2			sys_set_mempolicy2
 458	o32	get_mempolicy2			sys_get_mempolicy2
+459	o32	mbind2				sys_mbind2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 0654f3f89fc7..fcd67bc405b1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index ac11d2064e7a..89715417014c 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -545,3 +545,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 1cdcafe1ccca..c8304e0d0aa7 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
 458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
+459  common	mbind2			sys_mbind2			sys_mbind2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index f71742024c29..e5c51b6c367f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 2fbf5dbe0620..74527f585500 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -504,3 +504,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 0af813b9a118..be2e2aa17dd8 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -463,3 +463,4 @@
 456	i386	futex_requeue		sys_futex_requeue
 457	i386	set_mempolicy2		sys_set_mempolicy2
 458	i386	get_mempolicy2		sys_get_mempolicy2
+459	i386	mbind2			sys_mbind2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 0b777876fc15..6e2347eb8773 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -380,6 +380,7 @@
 456	common	futex_requeue		sys_futex_requeue
 457	common	set_mempolicy2		sys_set_mempolicy2
 458	common	get_mempolicy2		sys_get_mempolicy2
+459	common	mbind2			sys_mbind2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 4536c9a4227d..f00a21317dc0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -429,3 +429,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f696855cbe8c..b42622ea9ed9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -817,6 +817,9 @@ asmlinkage long sys_mbind(unsigned long start, unsigned long len,
 				const unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned flags);
+asmlinkage long sys_mbind2(unsigned long start, unsigned long len,
+			   const struct mpol_args __user *uargs, size_t usize,
+			   unsigned long flags);
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 719accc731db..cd31599bb9cc 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -832,9 +832,11 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
 #define __NR_get_mempolicy2 458
 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)
+#define __NR_mbind2 459
+__SYSCALL(__NR_mbind2, sys_mbind2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 459
+#define __NR_syscalls 460
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 4dd2d2e0d2ed..8880b753a446 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -52,13 +52,14 @@ struct mpol_args {
 #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
 #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
 
-/* Flags for mbind */
+/* Flags for mbind/mbind2 */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
 #define MPOL_MF_MOVE	 (1<<1)	/* Move pages owned by this process to conform
 				   to policy */
 #define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
 #define MPOL_MF_LAZY	 (1<<3)	/* UNSUPPORTED FLAG: Lazy migrate on fault */
-#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+#define MPOL_MF_HOME_NODE (1<<4)	/* mbind2: set home node */
+#define MPOL_MF_INTERNAL (1<<5)	/* Internal flags start here */
 
 #define MPOL_MF_VALID	(MPOL_MF_STRICT   | 	\
 			 MPOL_MF_MOVE     | 	\
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6afbd3a41319..2483b5afa99f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -187,6 +187,7 @@ COND_SYSCALL(process_madvise);
 COND_SYSCALL(process_mrelease);
 COND_SYSCALL(remap_file_pages);
 COND_SYSCALL(mbind);
+COND_SYSCALL(mbind2);
 COND_SYSCALL(get_mempolicy);
 COND_SYSCALL(get_mempolicy2);
 COND_SYSCALL(set_mempolicy);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f2c12a8ff7b8..b5aca779249a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1601,6 +1601,49 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 	return kernel_mbind(start, len, mode, nmask, maxnode, flags);
 }
 
+SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
+		const struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	nodemask_t policy_nodes;
+	unsigned long __user *nodes_ptr;
+	int err;
+
+	if (!start || !len)
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return -EINVAL;
+
+	err = validate_mpol_flags(kargs.mode, &kargs.mode_flags);
+	if (err)
+		return err;
+
+	margs.mode = kargs.mode;
+	margs.mode_flags = kargs.mode_flags;
+
+	/* if home node given, validate it is online */
+	if (flags & MPOL_MF_HOME_NODE) {
+		if ((kargs.home_node >= MAX_NUMNODES) ||
+			!node_online(kargs.home_node))
+			return -EINVAL;
+		margs.home_node = kargs.home_node;
+	} else
+		margs.home_node = NUMA_NO_NODE;
+	flags &= ~MPOL_MF_HOME_NODE;
+
+	nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
+	err = get_nodes(&policy_nodes, nodes_ptr, kargs.pol_maxnodes);
+	if (err)
+		return err;
+	margs.policy_nodes = &policy_nodes;
+
+	return do_mbind(untagged_addr(start), len, &margs, flags);
+}
+
 /* Set the process memory policy */
 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 				 unsigned long maxnode)
diff --git a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
index c34c6877379e..4fd9f742d903 100644
--- a/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
+++ b/tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl
@@ -373,3 +373,4 @@
 456	n64	futex_requeue			sys_futex_requeue
 457	n64	set_mempolicy2			sys_set_mempolicy2
 458	n64	get_mempolicy2			sys_get_mempolicy2
+459	n64	mbind2				sys_mbind2
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index ac11d2064e7a..89715417014c 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -545,3 +545,4 @@
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
 458	common	get_mempolicy2			sys_get_mempolicy2
+459	common	mbind2				sys_mbind2
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index 1cdcafe1ccca..c8304e0d0aa7 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -461,3 +461,4 @@
 456  common	futex_requeue		sys_futex_requeue		sys_futex_requeue
 457  common	set_mempolicy2		sys_set_mempolicy2		sys_set_mempolicy2
 458  common	get_mempolicy2		sys_get_mempolicy2		sys_get_mempolicy2
+459  common	mbind2			sys_mbind2			sys_mbind2
diff --git a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
index edf338f32645..3fc74241da5d 100644
--- a/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/tools/perf/arch/x86/entry/syscalls/syscall_64.tbl
@@ -380,6 +380,7 @@
 456	common	futex_requeue		sys_futex_requeue
 457	common 	set_mempolicy2		sys_set_mempolicy2
 458	common 	get_mempolicy2		sys_get_mempolicy2
+459	common 	mbind2			sys_mbind2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH v5 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (9 preceding siblings ...)
  2023-12-23 18:11 ` [PATCH v5 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
@ 2023-12-23 18:11 ` Gregory Price
  2023-12-25  7:54 ` [PATCH v5 00/11] mempolicy2, mbind2, and " Huang, Ying
  11 siblings, 0 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-23 18:11 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-doc, linux-fsdevel, linux-kernel, linux-api, x86, akpm,
	arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha

Extend set_mempolicy2 and mbind2 to support weighted interleave, and
demonstrate the extensibility of the mpol_args structure.

To support weighted interleave we add interleave weight fields to the
following structures:

Kernel Internal:  (include/linux/mempolicy.h)
struct mempolicy {
	/* task-local weights to apply to weighted interleave */
	unsigned char weights[MAX_NUMNODES];
}
struct mempolicy_args {
	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
	unsigned char *il_weights;	/* of size MAX_NUMNODES */
}

UAPI: (/include/uapi/linux/mempolicy.h)
struct mpol_args {
	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
	unsigned char *il_weights;	/* of size pol_max_nodes */
}

The task-local weights are a single, one-dimensional array of weights
that apply to all possible nodes on the system.  If a node is set in
the mempolicy nodemask, the weight in `il_weights` must be >= 1,
otherwise set_mempolicy2() will return -EINVAL.  If a node is not
set in pol_nodes, the weight will default to `1` in the task policy.

The default value of `1` is required to handle the situation where a
task migrates to a set of nodes for which weights were not set (up to
and including the local numa node).  For example, a migrated task whose
nodemask changes entirely will have all its weights defaulted back
to `1`; if the nodemask changes to include a mix of nodes that were
not previously accounted for, the weighted interleave may be
suboptimal.

If migrations are expected, a task should prefer not to use task-local
interleave weights, and instead utilize the global settings for natural
re-weighting on migration.

To support global vs local weighting,  we add the kernel-internal flag:
MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */

This flag is set when il_weights is omitted by set_mempolicy2(), or
when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy(). This internal
mode_flag dictates whether global weights or task-local weights are
utilized by the various weighted interleave functions:

* weighted_interleave_nodes
* weighted_interleave_nid
* alloc_pages_bulk_array_weighted_interleave

if (pol->flags & MPOL_F_GWEIGHT)
	pol_weights = iw_table;
else
	pol_weights = pol->wil.weights;

To simplify creation and duplication of mempolicies, the weights are
added as a structure directly within mempolicy. This allows the
existing logic in __mpol_dup to copy the weights without additional
allocations:

if (old == current->mempolicy) {
	task_lock(current);
	*new = *old;
	task_unlock(current);
} else
	*new = *old
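
As a rough userspace sketch (not part of this series; the
__NR_set_mempolicy2 number is an assumption), registering task-local
weights of 5 and 2 for nodes 0 and 1 might look like:

unsigned long long nodemask = (1ULL << 0) | (1ULL << 1);
unsigned char weights[2] = { 5, 2 };	/* indexed by node id */
struct mpol_args args = { 0 };
long err;

args.mode = MPOL_WEIGHTED_INTERLEAVE;
args.pol_maxnodes = 2;
args.pol_nodes = (unsigned long long)(unsigned long)&nodemask;
args.il_weights = (unsigned long long)(unsigned long)weights;

err = syscall(__NR_set_mempolicy2, &args, sizeof(args), 0);

Leaving il_weights as 0 would instead fall back to the global sysfs
weights via MPOL_F_GWEIGHT, as described above.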

Suggested-by: Rakie Kim <rakie.kim@sk.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  10 ++
 include/linux/mempolicy.h                     |   2 +
 include/uapi/linux/mempolicy.h                |   2 +
 mm/mempolicy.c                                | 129 +++++++++++++++++-
 4 files changed, 139 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 7edee775cd2f..4f52a9108576 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -254,6 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE
 	This mode operates the same as MPOL_INTERLEAVE, except that
 	interleaving behavior is executed based on weights set in
 	/sys/kernel/mm/mempolicy/weighted_interleave/
+	when configured to utilize global weights, or based on task-local
+	weights configured with set_mempolicy2(2) or mbind2(2).
 
 	Weighted interleave allocates pages on nodes according to
 	their weight.  For example if nodes [0,1] are weighted [5,2]
@@ -261,6 +263,13 @@ MPOL_WEIGHTED_INTERLEAVE
 	2 pages allocated on node1.  This can better distribute data
 	according to bandwidth on heterogeneous memory systems.
 
+	When utilizing task-local weights, weights are not rebalanced
+	in the event of a task migration.  If a weight has not been
+	explicitly set for a node set in the new nodemask, the
+	value of that weight defaults to "1".  For this reason, if
+	migrations are expected or possible, users should consider
+	utilizing global interleave weights.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
@@ -516,6 +525,7 @@ Extended Mempolicy Arguments::
 		__s32 home_node;	 /* mbind2: set home node */
 		__u64 pol_maxnodes;
 		__aligned_u64 pol_nodes; /* nodemask pointer */
+		__aligned_u64 il_weights;  /* u8 buf of size pol_maxnodes */
 	};
 
 The extended mempolicy argument structure is defined to allow the mempolicy
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 0f1c85527626..06ec3a3b0f22 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -58,6 +58,7 @@ struct mempolicy {
 	/* Weighted interleave settings */
 	struct {
 		unsigned char cur_weight;
+		unsigned char weights[MAX_NUMNODES];
 	} wil;
 };
 
@@ -70,6 +71,7 @@ struct mempolicy_args {
 	unsigned short mode_flags;	/* policy mode flags */
 	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
 	nodemask_t *policy_nodes;	/* get/set/mbind */
+	unsigned char *il_weights;	/* for mode MPOL_WEIGHTED_INTERLEAVE */
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8880b753a446..16ee2359ef55 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -33,6 +33,7 @@ struct mpol_args {
 	__s32 home_node;	/* mbind2: policy home node */
 	__u64 pol_maxnodes;
 	__aligned_u64 pol_nodes;
+	__aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(char) */
 };
 
 /* Flags for set_mempolicy */
@@ -73,6 +74,7 @@ struct mpol_args {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
+#define MPOL_F_GWEIGHT	(1 << 5) /* Utilize global weights */
 
 /*
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b5aca779249a..6bed4151e0c2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -271,6 +271,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	unsigned short mode = args->mode;
 	unsigned short flags = args->mode_flags;
 	nodemask_t *nodes = args->policy_nodes;
+	int node;
 
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
@@ -297,6 +298,19 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 		    (flags & MPOL_F_STATIC_NODES) ||
 		    (flags & MPOL_F_RELATIVE_NODES))
 			return ERR_PTR(-EINVAL);
+	} else if (mode == MPOL_WEIGHTED_INTERLEAVE) {
+		/* weighted interleave requires a nodemask and weights > 0 */
+		if (nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		if (args->il_weights) {
+			node = first_node(*nodes);
+			while (node != MAX_NUMNODES) {
+				if (!args->il_weights[node])
+					return ERR_PTR(-EINVAL);
+				node = next_node(node, *nodes);
+			}
+		} else if (!(args->mode_flags & MPOL_F_GWEIGHT))
+			return ERR_PTR(-EINVAL);
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 
@@ -309,6 +323,17 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	policy->home_node = args->home_node;
 	policy->wil.cur_weight = 0;
 
+	if (policy->mode == MPOL_WEIGHTED_INTERLEAVE && args->il_weights) {
+		policy->wil.cur_weight = 0;
+		/* Minimum weight value is always 1 */
+		memset(policy->wil.weights, 1, MAX_NUMNODES);
+		node = first_node(*nodes);
+		while (node != MAX_NUMNODES) {
+			policy->wil.weights[node] = args->il_weights[node];
+			node = next_node(node, *nodes);
+		}
+	}
+
 	return policy;
 }
 
@@ -937,6 +962,17 @@ static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *nmask)
 	}
 }
 
+static void do_get_mempolicy_il_weights(struct mempolicy *pol,
+					unsigned char weights[MAX_NUMNODES])
+{
+	if (pol->mode != MPOL_WEIGHTED_INTERLEAVE)
+		memset(weights, 0, MAX_NUMNODES);
+	else if (pol->flags & MPOL_F_GWEIGHT)
+		memcpy(weights, iw_table, MAX_NUMNODES);
+	else
+		memcpy(weights, pol->wil.weights, MAX_NUMNODES);
+}
+
 /* Retrieve NUMA policy for a VMA associated with a given address  */
 static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
 				 struct mempolicy_args *args)
@@ -970,6 +1006,9 @@ static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
 	if (args->policy_nodes)
 		do_get_mempolicy_nodemask(pol, args->policy_nodes);
 
+	if (args->il_weights)
+		do_get_mempolicy_il_weights(pol, args->il_weights);
+
 	if (pol != &default_policy) {
 		mpol_put(pol);
 		mpol_cond_put(pol);
@@ -997,6 +1036,9 @@ static long do_get_task_mempolicy(struct mempolicy_args *args, int *pol_node)
 	if (args->policy_nodes)
 		do_get_mempolicy_nodemask(pol, args->policy_nodes);
 
+	if (args->il_weights)
+		do_get_mempolicy_il_weights(pol, args->il_weights);
+
 	return 0;
 }
 
@@ -1519,6 +1561,9 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
 
+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
 	memset(&margs, 0, sizeof(margs));
 	margs.mode = lmode;
 	margs.mode_flags = mode_flags;
@@ -1609,6 +1654,8 @@ SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
 	struct mempolicy_args margs;
 	nodemask_t policy_nodes;
 	unsigned long __user *nodes_ptr;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char __user *weights_ptr;
 	int err;
 
 	if (!start || !len)
@@ -1641,6 +1688,23 @@ SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
 		return err;
 	margs.policy_nodes = &policy_nodes;
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		if (weights_ptr) {
+			err = copy_struct_from_user(weights,
+						    sizeof(weights),
+						    weights_ptr,
+						    kargs.pol_maxnodes);
+			if (err)
+				return err;
+			margs.il_weights = weights;
+		} else {
+			margs.il_weights = NULL;
+			margs.mode_flags |= MPOL_F_GWEIGHT;
+		}
+	} else
+		margs.il_weights = NULL;
+
 	return do_mbind(untagged_addr(start), len, &margs, flags);
 }
 
@@ -1662,6 +1726,9 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	if (err)
 		return err;
 
+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
 	memset(&args, 0, sizeof(args));
 	args.mode = lmode;
 	args.mode_flags = mode_flags;
@@ -1685,6 +1752,8 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char __user *weights_ptr;
 
 	if (flags)
 		return -EINVAL;
@@ -1710,6 +1779,20 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	} else
 		margs.policy_nodes = NULL;
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_struct_from_user(weights,
+					    sizeof(weights),
+					    weights_ptr,
+					    kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.il_weights = weights;
+	} else {
+		margs.il_weights = NULL;
+		margs.mode_flags |= MPOL_F_GWEIGHT;
+	}
+
 	return do_set_mempolicy(&margs);
 }
 
@@ -1913,17 +1996,25 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	unsigned char __user *weights_ptr;
+	unsigned char weights[MAX_NUMNODES];
 
 	if (flags & ~(MPOL_F_ADDR))
 		return -EINVAL;
 
 	/* initialize any memory liable to be copied to userland */
 	memset(&margs, 0, sizeof(margs));
+	memset(weights, 0, sizeof(weights));
 
 	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
 	if (err)
 		return -EINVAL;
 
+	if (kargs.il_weights)
+		margs.il_weights = weights;
+	else
+		margs.il_weights = NULL;
+
 	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
 	if (flags & MPOL_F_ADDR)
 		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &margs);
@@ -1944,6 +2035,13 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 			return err;
 	}
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_to_user(weights_ptr, weights, kargs.pol_maxnodes);
+		if (err)
+			return err;
+	}
+
 	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
 }
 
@@ -2060,13 +2158,18 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int next;
 	struct task_struct *me = current;
+	unsigned char next_weight;
 
 	next = next_node_in(me->il_prev, policy->nodes);
 	if (next == MAX_NUMNODES)
 		return next;
 
-	if (!policy->wil.cur_weight)
-		policy->wil.cur_weight = iw_table[next];
+	if (!policy->wil.cur_weight) {
+		next_weight = (policy->flags & MPOL_F_GWEIGHT) ?
+				iw_table[next] :
+				policy->wil.weights[next];
+		policy->wil.cur_weight = next_weight ? next_weight : 1;
+	}
 
 	policy->wil.cur_weight--;
 	if (!policy->wil.cur_weight)
@@ -2140,6 +2243,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	nodemask_t nodemask = pol->nodes;
 	unsigned int target, weight_total = 0;
 	int nid;
+	unsigned char *pol_weights;
 	unsigned char weights[MAX_NUMNODES];
 	unsigned char weight;
 
@@ -2151,8 +2255,13 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 		return nid;
 
 	/* Then collect weights on stack and calculate totals */
+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table;
+	else
+		pol_weights = pol->wil.weights;
+
 	for_each_node_mask(nid, nodemask) {
-		weight = iw_table[nid];
+		weight = pol_weights[nid];
 		weight_total += weight;
 		weights[nid] = weight;
 	}
@@ -2550,6 +2659,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
+	unsigned char *pol_weights;
 	unsigned char weight;
 	unsigned char weights[MAX_NUMNODES];
 	unsigned int weight_total = 0;
@@ -2563,9 +2673,14 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 
 	nnodes = nodes_weight(nodes);
 
+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table;
+	else
+		pol_weights = pol->wil.weights;
+
 	/* Collect weights and save them on stack so they don't change */
 	for_each_node_mask(node, nodes) {
-		weight = iw_table[node];
+		weight = pol_weights[node];
 		weight_total += weight;
 		weights[node] = weight;
 	}
@@ -3090,6 +3205,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 {
 	int ret;
 	struct mempolicy_args margs;
+	unsigned char weights[MAX_NUMNODES];
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
 	rwlock_init(&sp->lock);
@@ -3107,6 +3223,11 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		margs.mode_flags = mpol->flags;
 		margs.policy_nodes = &mpol->w.user_nodemask;
 		margs.home_node = NUMA_NO_NODE;
+		if (margs.mode == MPOL_WEIGHTED_INTERLEAVE &&
+		    !(margs.mode_flags & MPOL_F_GWEIGHT)) {
+			memcpy(weights, mpol->wil.weights, sizeof(weights));
+			margs.il_weights = weights;
+		}
 
 		/* contextualize the tmpfs mount point mempolicy to this file */
 		npol = mpol_new(&margs);
-- 
2.39.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave
  2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
                   ` (10 preceding siblings ...)
  2023-12-23 18:11 ` [PATCH v5 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Gregory Price
@ 2023-12-25  7:54 ` Huang, Ying
  2023-12-26  7:45   ` Gregory Price
  11 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2023-12-25  7:54 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-fsdevel, linux-kernel, linux-api, x86,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	gregory.price, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko,
	Zhongkun He, Frank van der Linden, John Groves, Jonathan Cameron

Gregory Price <gourry.memverge@gmail.com> writes:

> Weighted interleave is a new interleave policy intended to make
> use of a the new distributed-memory environment made available
> by CXL.  The existing interleave mechanism does an even round-robin
> distribution of memory across all nodes in a nodemask, while
> weighted interleave can distribute memory across nodes according
> the available bandwidth that that node provides.
>
> As tests below show, "default interleave" can cause major performance
> degredation due to distribution not matching bandwidth available,
> while "weighted interleave" can provide a performance increase.
>
> For example, the stream benchmark demonstrates that default interleave
> is actively harmful, where weighted interleave is beneficial.
>
> Hardware: 1-socket 8 channel DDR5 + 1 CXL expander in PCIe x16
> Default interleave : -78% (slower than DRAM)
> Global weighting   : -6% to +4% (workload dependant)
> Targeted weights   : +2.5% to +4% (consistently better than DRAM)
>
> If nothing else, this shows how awful round-robin interleave is.

I guess the performance of the default policy, local (fast memory)
first, may be even better in some situations?  For example, before the
bandwidth of DRAM is saturated?

I understand that you may want to limit the memory usage of the fast
memory too.  But IMHO, that is another requirement.  That should be
enforced by something like a per-node memory limit.

> Rather than implement yet another specific syscall to set one
> particular field of a mempolicy, we chose to implement an extensible
> mempolicy interface so that future extensions can be captured.
>
> To implement weighted interleave, we need an interface to set the
> node weights along with a MPOL_WEIGHTED_INTERLEAVE. We implement a
> a sysfs extension for "system global" weights which can be set by
> a daemon or administrator, and new extensible syscalls (mempolicy2,
> mbind2) which allow task-local weights to be set via user-software.
>
> The benefit of the sysfs extension is that MPOL_WEIGHTED_INTERLEAVE
> can be used by the existing set_mempolicy and mbind via numactl.
>
> There are 3 "phases" in the patch set that could be considered
> for separate merge candidates, but are presented here as a single
> line as the goal is a fully functional MPOL_WEIGHTED_INTERLEAVE.
>
> 1) Implement MPOL_WEIGHTED_INTERLEAVE with a sysfs extension for
>    setting system-global weights via sysfs.
>    (Patches 1 & 2)
>
> 2) Refactor mempolicy creation mechanism to use an extensible arg
>    struct `struct mempolicy_args` to promote code re-use between
>    the original mempolicy/mbind interfaces and the new interfaces.
>    (Patches 3-6)
>
> 3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
>    along with the addition of task-local weights so that per-task
>    weights can be registered for MPOL_WEIGHTED_INTERLEAVE.
>    (Patches 7-11)
>
> Included at the bottom of this cover letter is linux test project
> tests for backward and forward compatibility, some sample software
> which can be used for quick tests, as well as a numactl branch
> which implements `numactl -w --interleave` for testing.
>
> = Performance summary =
> (tests may have different configurations, see extended info below)
> 1) MLC (W2) : +38% over DRAM. +264% over default interleave.
>    MLC (W5) : +40% over DRAM. +226% over default interleave.
> 2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
> 3) XSBench  : +19% over DRAM. +47% over default interleave.
>
> = LTP Testing Summary =
> existing mempolicy & mbind tests: pass
> mempolicy & mbind + weighted interleave (global weights): pass
> mempolicy2 & mbind2 + weighted interleave (global weights): pass
> mempolicy2 & mbind2 + weighted interleave (local weights): pass
>

[snip]

> =====================================================================
> (Patches 3-6) Refactoring mempolicy for code-reuse
>
> To avoid multiple paths of mempolicy creation, we should refactor the
> existing code to enable the designed extensibility, and refactor
> existing users to utilize the new interface (while retaining the
> existing userland interface).
>
> This set of patches introduces a new mempolicy_args structure, which
> is used to more fully describe a requested mempolicy - to include
> existing and future extensions.
>
> /*
>  * Describes settings of a mempolicy during set/get syscalls and
>  * kernel internal calls to do_set_mempolicy()
>  */
> struct mempolicy_args {
>     unsigned short mode;            /* policy mode */
>     unsigned short mode_flags;      /* policy mode flags */
>     int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
>     nodemask_t *policy_nodes;       /* get/set/mbind */
>     unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
> };

According to

https://www.geeksforgeeks.org/difference-between-argument-and-parameter-in-c-c-with-examples/

it appears that "parameter" is better than "argument" for the struct name
here.  It appears that the current kernel source supports this too.

$ grep 'struct[\t ]\+[a-zA-Z0-9]\+_param' -r include/linux | wc -l
411
$ grep 'struct[\t ]\+[a-zA-Z0-9]\+_arg' -r include/linux | wc -l
25

> This arg structure will eventually be utilized by the following
> interfaces:
>     mpol_new() - new mempolicy creation
>     do_get_mempolicy() - acquiring information about mempolicy
>     do_set_mempolicy() - setting the task mempolicy
>     do_mbind()         - setting a vma mempolicy
>
> do_get_mempolicy() is completely refactored to break it out into
> separate functionality based on the flags provided by get_mempolicy(2)
>     MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
>     MPOL_F_ADDR: acquires information on vma policies
>     MPOL_F_NODE: changes the output for the policy arg to node info
>
> We refactor the get_mempolicy syscall to flatten the logic based on these
> flags, and allow for set_mempolicy2() to re-use the underlying logic.
>
> The result of this refactor, and the new mempolicy_args structure, is
> that extensions like 'sys_set_mempolicy_home_node' can now be directly
> integrated into the initial call to 'set_mempolicy2', and that more
> complete information about a mempolicy can be returned with a single
> call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
>
>
> =====================================================================
> (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>
> These interfaces are the 'extended' counterpart to their relatives.
> They use the userland 'struct mpol_args' structure to communicate a
> complete mempolicy configuration to the kernel.  This structure
> looks very much like the kernel-internal 'struct mempolicy_args':
>
> struct mpol_args {
>         /* Basic mempolicy settings */
>         __u16 mode;
>         __u16 mode_flags;
>         __s32 home_node;
>         __u64 pol_maxnodes;

I understand that we want to avoid a hole in the struct.  But I still feel
uncomfortable using __u64 for such a small value, and I don't have a better
solution either.  Anyone else have an idea?

>         __aligned_u64 pol_nodes;
>         __aligned_u64 *il_weights;      /* of size pol_maxnodes */

Typo?  Should be,

         __aligned_u64 il_weights;      /* of size pol_maxnodes */

?

Found this in some patch descriptions too.

> };
>
> The basic mempolicy settings which are shared across all interfaces
> are captured at the top of the structure, while extensions such as
> 'policy_node' and 'addr' are collected beneath.
>
> The syscalls are uniform and defined as follows:
>
> long sys_mbind2(unsigned long addr, unsigned long len,
>                 struct mpol_args *args, size_t usize,
>                 unsigned long flags);
>
> long sys_get_mempolicy2(struct mpol_args *args, size_t size,
>                         unsigned long addr, unsigned long flags);
>
> long sys_set_mempolicy2(struct mpol_args *args, size_t size,
>                         unsigned long flags);
>
> The 'flags' argument for mbind2 is the same as 'mbind', except with
> the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
> field should be utilized.
>
> The 'flags' argument for get_mempolicy2 allows for MPOL_F_ADDR to
> allow operating on VMA policies, but MPOL_F_NODE and MPOL_F_MEMS_ALLOWED
> behavior has been omitted, since get_mempolicy() provides this already.

I still think that it's a good idea to make it possible to deprecate
get_mempolicy().  How about using a union as follows?

struct mpol_mems_allowed {
         __u64 maxnodes;
         __aligned_u64 nodemask;
};

union mpol_info {
        struct mpol_args args;
        struct mpol_mems_allowed mems_allowed;
        __s32 node;
};

> The 'flags' argument is not used by 'set_mempolicy' at this time, but
> may end up allowing the use of MPOL_MF_HOME_NODE if such functionality
> is desired.
>
> The extensions can be summed up as follows:
>
> get_mempolicy2 extensions:
>     - mode and mode flags are split into separate fields
>     - MPOL_F_MEMS_ALLOWED and MPOL_F_NODE are not supported
>
> set_mempolicy2:
>     - task-local interleave weights can be set via 'il_weights'
>
> mbind2:
>     - home_node field sets policy home node w/ MPOL_MF_HOME_NODE
>     - task-local interleave weights can be set via 'il_weights'
>

--
Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2023-12-27  6:42   ` Huang, Ying
@ 2023-12-26  6:48     ` Gregory Price
  2024-01-02  7:41       ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-26  6:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

On Wed, Dec 27, 2023 at 02:42:15PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > +		These weights only affect new allocations, and changes at runtime
> > +		will not cause migrations on already allocated pages.
> > +
> > +		Writing an empty string resets the weight value to 1.
> 
> I still think that it's a good idea to provide some better default
> weight value with HMAT or CDAT if available.  So, better not to make "1"
> as part of ABI?
> 

That's the eventual goal, but this is just the initial mechanism.

My current thought is that the CXL driver will apply weights as the
system iterates through devices and creates numa nodes.  In the
meantime, you have to give the "possible" nodes a default value to
prevent nodes onlined after boot from showing up with 0-value.

Not allowing 0-value weights is simply easier in many respects.
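
(Aside, for anyone playing with this: a daemon or administrator sets a
global weight with a plain write to the sysfs file this patch adds.  A
minimal user-space sketch, path taken from the ABI doc above, error
handling kept short:)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Path comes from the sysfs ABI documented in this patch */
	int fd = open("/sys/kernel/mm/mempolicy/weighted_interleave/node1",
		      O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Valid weights are 1..255; writing an empty string resets to 1 */
	if (write(fd, "4", 1) != 1)
		perror("write");
	close(fd);
	return 0;
}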

> > +
> > +		Minimum weight: 1
> 
> Can weight be "0"?  Do we need a way to specify that a node doesn't want
> to participate in weighted interleave?
> 

In this code, weight cannot be 0.  My thought is that removing the node
from the nodemask is the way to denote 0.

The problem with 0 is hotplug, migration, and cpusets.mems_allowed.  

Example issue:  A task sets local weights to [1,0,1,0] for nodes [0-3],
and has a cpusets.mems_allowed mask of (0, 2).

Let's say the user migrates the task via cgroups from nodes (0,2) to
(1,3).

The task will effectively OOM almost instantly, because weights of
[1,0,1,0] will prevent memory from being allocated on the nodes it is
now restricted to.

Not allowing node weights of 0 is defensive.  Instead, simply removing
the node from the nodemask and/or mems_allowed is both equivalent to and
the preferred way to apply a weight of 0.
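
(To make the hazard concrete, a contrived user-space sketch of the
arithmetic - the weights and masks below are made up, this is not the
kernel code:)

#include <stdio.h>

/* Hypothetical task-local weights for nodes 0-3; 0 meaning "skip node" */
static const unsigned char weights[4] = { 1, 0, 1, 0 };

/* Sum the weights of the nodes currently allowed by cpuset mems */
static unsigned int effective_weight(unsigned int allowed_mask)
{
	unsigned int total = 0;
	int nid;

	for (nid = 0; nid < 4; nid++)
		if (allowed_mask & (1u << nid))
			total += weights[nid];
	return total;
}

int main(void)
{
	/* Before migration: mems_allowed = {0,2} -> total weight 2, fine */
	printf("mems {0,2}: total weight %u\n", effective_weight(0x5));
	/* After migration:  mems_allowed = {1,3} -> total weight 0, so no
	 * node is eligible and the task effectively OOMs under the policy */
	printf("mems {1,3}: total weight %u\n", effective_weight(0xa));
	return 0;
}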

> > +		Maximum weight: 255
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 10a590ee1c89..0e77633b07a5 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
> >  
> >  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
> >  
> > +static char iw_table[MAX_NUMNODES];
> > +
> 
> It's kind of obscure whether "char" is "signed" or "unsigned".  Given
> the max weight is 255 above, it's better to use "u8"?
>

bah, stupid mistake.  I will switch this to u8.

> And, we may need a way to specify whether the weight has been overridden
> by the user.
> A special value (such as 255) can be used for that.  If
> so, the maximum weight should be 254 instead of 255.  As a user space
> interface, is it better to use 100 as the maximum value?
> 

There are global weights and local weights.  These are the global weights.

Local weights are stored in task->mempolicy.wil.il_weights.

(policy->mode_flags & MPOL_F_GWEIGHT) denotes the override.
This is set if (mempolicy_args->il_weights) was provided.

This simplifies the interface.

(note: local weights are not introduced until the last patch 11/11)

> > +
> > +static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < MAX_NUMNODES; i++)
> > +		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);
> 
> IIUC, if this is called in error path (such as, in
> add_weighted_interleave_group()), some node_attrs[] element may be
> "NULL"?
> 

The null check is present in sysfs_wi_node_release

if (!node_attr)
	return;

Is it preferable to pull this out? Seemed more defensive to put it
inside the function.

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2023-12-27  8:32   ` Huang, Ying
@ 2023-12-26  7:01     ` Gregory Price
  2023-12-26  8:06       ` Gregory Price
                         ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-26  7:01 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> > +{
> > +	nodemask_t nodemask = pol->nodes;
> > +	unsigned int target, weight_total = 0;
> > +	int nid;
> > +	unsigned char weights[MAX_NUMNODES];
> 
> MAX_NUMNODES could be as large as 1024.  1KB stack space may be too
> large?
> 

I've been struggling with a good solution to this.  We need a local copy
of weights to prevent weights from changing out from under us during
allocation (which may take quite some time), but it seemed unwise to
allocate 1KB of heap in this particular path.

Is my concern unfounded?  If so, I can go ahead and add the allocation
code.

> > +	unsigned char weight;
> > +
> > +	barrier();
> 
> Memory barrier needs comments.
> 

The barrier is to stabilize the nodemask on the stack, but yes, I'll carry
the comment from interleave_nid over to this barrier as well.

> > +
> > +	/* first ensure we have a valid nodemask */
> > +	nid = first_node(nodemask);
> > +	if (nid == MAX_NUMNODES)
> > +		return nid;
> 
> It appears that this isn't necessary, because we can check whether
> weight_total == 0 after the next loop.
> 

fair, will snip.

> > +
> > +	/* Then collect weights on stack and calculate totals */
> > +	for_each_node_mask(nid, nodemask) {
> > +		weight = iw_table[nid];
> > +		weight_total += weight;
> > +		weights[nid] = weight;
> > +	}
> > +
> > +	/* Finally, calculate the node offset based on totals */
> > +	target = (unsigned int)ilx % weight_total;
> 
> Why use type casting?
> 

Artifact of old prototypes, snipped.
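
(For anyone following along, a stand-alone sketch of the node selection
that this offset feeds, with made-up weights and interleave index:)

#include <stdio.h>

int main(void)
{
	/* Made-up example: nodes 0 and 1 weighted 5 and 2, interleave index 12 */
	unsigned char weights[2] = { 5, 2 };
	unsigned int weight_total = 5 + 2;
	unsigned long ilx = 12;
	unsigned int target = ilx % weight_total;	/* 12 % 7 == 5 */
	int nid = 0;

	/* Walk node weights until the remaining offset fits in the node */
	while (target >= weights[nid]) {
		target -= weights[nid];
		nid = (nid + 1) % 2;
	}
	printf("ilx %lu -> node%d\n", ilx, nid);	/* node1 */
	return 0;
}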

> > +
> > +	/* Stabilize the nodemask on the stack */
> > +	barrier();
> 
> I don't think barrier() is needed to wait for memory operations for
> stack.  It's usually used for cross-processor memory order.
>

This is present in the old interleave code.  To the best of my
understanding, the concern is for mempolicy->nodemask rebinding that can
occur when cgroups.cpusets.mems_allowed changes.

So we can't iterate over (mempolicy->nodemask) directly; we have to take a
local copy.

My *best* understanding of the barrier here is to prevent the compiler
from reordering operations such that it attempts to optimize out the
local copy (or do lazy-fetch).

It is present in the original interleave code, so I pulled it forward to
this, but I have not tested whether this is a bit paranoid or not.

from `interleave_nid`:

 /*
  * The barrier will stabilize the nodemask in a register or on
  * the stack so that it will stop changing under the code.
  *
  * Between first_node() and next_node(), pol->nodes could be changed
  * by other threads. So we put pol->nodes in a local stack.
  */
 barrier();

> > +		/* Otherwise we adjust nr_pages down, and continue from there */
> > +		rem_pages -= pol->wil.cur_weight;
> > +		pol->wil.cur_weight = 0;
> > +		prev_node = node;
> 
> If pol->wil.cur_weight == 0, prev_node will be used without being
> initialized below.
> 

pol->wil.cur_weight is not used below.

> > +	}
> > +
> > +	/* Now we can continue allocating as if from 0 instead of an offset */
> > +	rounds = rem_pages / weight_total;
> > +	delta = rem_pages % weight_total;
> > +	for (i = 0; i < nnodes; i++) {
> > +		node = next_node_in(prev_node, nodes);
> > +		weight = weights[node];
> > +		node_pages = weight * rounds;
> > +		if (delta) {
> > +			if (delta > weight) {
> > +				node_pages += weight;
> > +				delta -= weight;
> > +			} else {
> > +				node_pages += delta;
> > +				delta = 0;
> > +			}
> > +		}
> > +		/* We may not make it all the way around */
> > +		if (!node_pages)
> > +			break;
> > +		/* If an over-allocation would occur, floor it */
> > +		if (node_pages + total_allocated > nr_pages) {
> 
> Why is this possible?
> 

This may have been a paranoid artifact from an early prototype; I will
snip and validate.
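
(For anyone checking the math, a stand-alone sketch of the rounds/delta
split that the quoted loop implements, with made-up weights and page count:)

#include <stdio.h>

int main(void)
{
	/* Made-up example: two nodes weighted 5 and 2, 17 pages requested */
	unsigned char weights[2] = { 5, 2 };
	unsigned long rem_pages = 17, weight_total = 5 + 2;
	unsigned long rounds = rem_pages / weight_total;	/* 2 */
	unsigned long delta = rem_pages % weight_total;		/* 3 */
	int node;

	for (node = 0; node < 2; node++) {
		unsigned long node_pages = weights[node] * rounds;

		/* distribute the partial round, same as the kernel loop */
		if (delta) {
			if (delta > weights[node]) {
				node_pages += weights[node];
				delta -= weights[node];
			} else {
				node_pages += delta;
				delta = 0;
			}
		}
		printf("node%d: %lu pages\n", node, node_pages); /* 13, 4 */
	}
	return 0;
}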

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
  2023-12-27  8:39   ` Huang, Ying
@ 2023-12-26  7:05     ` Gregory Price
  2023-12-26 11:48       ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-26  7:05 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

On Wed, Dec 27, 2023 at 04:39:29PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > + * fields and return just the mode in mode_arg and flags in flags.
> > + */
> > +static inline int sanitize_mpol_flags(int *mode_arg, unsigned short *flags)
> > +{
> > +	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
> > +
> > +	*flags = *mode_arg & MPOL_MODE_FLAGS;
> > +	*mode_arg = mode;
> 
> It appears that it's unnecessary to introduce a local variable to split
> mode/flags.  Just reuse the original code?
> 

ack

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave
  2023-12-25  7:54 ` [PATCH v5 00/11] mempolicy2, mbind2, and " Huang, Ying
@ 2023-12-26  7:45   ` Gregory Price
  2024-01-02  4:27     ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-26  7:45 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko,
	Zhongkun He, Frank van der Linden, John Groves, Jonathan Cameron

On Mon, Dec 25, 2023 at 03:54:18PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > For example, the stream benchmark demonstrates that default interleave
> > is actively harmful, where weighted interleave is beneficial.
> >
> > Hardware: 1-socket 8 channel DDR5 + 1 CXL expander in PCIe x16
> > Default interleave : -78% (slower than DRAM)
> > Global weighting   : -6% to +4% (workload dependant)
> > Targeted weights   : +2.5% to +4% (consistently better than DRAM)
> >
> > If nothing else, this shows how awful round-robin interleave is.
> 
> I guess the performance of the default policy, local (fast memory)
> first, may be even better in some situation?  For example, before the
> bandwidth of DRAM is saturated?
> 

Yes - but it's more complicated than that.

Global weighting here means we did `numactl -w --interleave ...`, which
means *all* memory regions will be interleaved.  Code, stack, heap, etc.

Targeted weights means we used mbind2() with local weights, which only
targeted specific heap regions.

The default policy was better than global weighting likely because
things like stack/code being distributed to higher latency memory
produced a measurable overhead.

To provide this, we only applied weights to bandwidth driving regions,
and as a result we demonstrated a measurable performance increase.

So yes, the default policy may be better in some situations - but that
will be true of any policy.
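
(For the curious, roughly what the "targeted weights" runs look like from
user space.  This is only a sketch: the syscall number is a placeholder,
the struct is trimmed to the fields shown in the cover letter, and the
weight element type is an assumption here - the sample software linked in
the cover letter is the real reference:)

#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder: fill in the real syscall number once the series lands */
#define __NR_mbind2 -1
/* Assumed enum value, check the uapi header from patch 2 */
#define MPOL_WEIGHTED_INTERLEAVE 6

/* Trimmed userland arg struct, per the cover letter (not a final ABI) */
struct mpol_args {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint64_t pol_maxnodes;
	uint64_t pol_nodes;	/* pointer to nodemask bits */
	uint64_t il_weights;	/* pointer to pol_maxnodes weights (u8 assumed) */
};

int main(void)
{
	size_t len = 64UL << 20;	/* a bandwidth-driving buffer */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 0x3;		/* nodes 0 and 1 */
	unsigned char weights[2] = { 5, 2 };	/* e.g. DRAM:CXL bandwidth ratio */
	struct mpol_args args = {
		.mode = MPOL_WEIGHTED_INTERLEAVE,
		.pol_maxnodes = 2,
		.pol_nodes = (uintptr_t)&nodemask,
		.il_weights = (uintptr_t)weights,
	};

	if (buf == MAP_FAILED)
		return 1;

	/* Only this mapping is weighted; code/stack keep the default policy */
	syscall(__NR_mbind2, (unsigned long)buf, len, &args, sizeof(args), 0UL);
	return 0;
}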

> I understand that you may want to limit the memory usage of the fast
> memory too.  But IMHO, that is another requirements.  That should be
> enforced by something like per-node memory limit.
> 

This interface does not limit memory usage of a particular node, it 
distributes data according to the requested policy.

Nuanced distinction, but important.  If nodes become exhausted, tasks
are still free to allocate memory from any node in the nodemask, even if
it violates the requested mempolicy.

This is consistent with the existing behavior of mempolicy.

> > =====================================================================
> > (Patches 3-6) Refactoring mempolicy for code-reuse
> >
> > To avoid multiple paths of mempolicy creation, we should refactor the
> > existing code to enable the designed extensibility, and refactor
> > existing users to utilize the new interface (while retaining the
> > existing userland interface).
> >
> > This set of patches introduces a new mempolicy_args structure, which
> > is used to more fully describe a requested mempolicy - to include
> > existing and future extensions.
> >
> > /*
> >  * Describes settings of a mempolicy during set/get syscalls and
> >  * kernel internal calls to do_set_mempolicy()
> >  */
> > struct mempolicy_args {
> >     unsigned short mode;            /* policy mode */
> >     unsigned short mode_flags;      /* policy mode flags */
> >     int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
> >     nodemask_t *policy_nodes;       /* get/set/mbind */
> >     unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
> > };
> 
> According to
> 
> https://www.geeksforgeeks.org/difference-between-argument-and-parameter-in-c-c-with-examples/
> 
> it appears that "parameter" is better than "argument" for the struct name
> here.  The current kernel source appears to support this too.
> 
> $ grep 'struct[\t ]\+[a-zA-Z0-9]\+_param' -r include/linux | wc -l
> 411
> $ grep 'struct[\t ]\+[a-zA-Z0-9]\+_arg' -r include/linux | wc -l
> 25
> 

Will change.

> > This arg structure will eventually be utilized by the following
> > interfaces:
> >     mpol_new() - new mempolicy creation
> >     do_get_mempolicy() - acquiring information about mempolicy
> >     do_set_mempolicy() - setting the task mempolicy
> >     do_mbind()         - setting a vma mempolicy
> >
> > do_get_mempolicy() is completely refactored to break it out into
> > separate functionality based on the flags provided by get_mempolicy(2)
> >     MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
> >     MPOL_F_ADDR: acquires information on vma policies
> >     MPOL_F_NODE: changes the output for the policy arg to node info
> >
> > We refactor the get_mempolicy syscall to flatten the logic based on these
> > flags, and allow set_mempolicy2() to re-use the underlying logic.
> >
> > The result of this refactor, and the new mempolicy_args structure, is
> > that extensions like 'sys_set_mempolicy_home_node' can now be directly
> > integrated into the initial call to 'set_mempolicy2', and that more
> > complete information about a mempolicy can be returned with a single
> > call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
> >
> >
> > =====================================================================
> > (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
> >
> > These interfaces are the 'extended' counterpart to their relatives.
> > They use the userland 'struct mpol_args' structure to communicate a
> > complete mempolicy configuration to the kernel.  This structure
> > looks very much like the kernel-internal 'struct mempolicy_args':
> >
> > struct mpol_args {
> >         /* Basic mempolicy settings */
> >         __u16 mode;
> >         __u16 mode_flags;
> >         __s32 home_node;
> >         __u64 pol_maxnodes;
> 
> I understand that we want to avoid holes in the struct.  But I still feel
> uncomfortable using __u64 for such a small value.  I don't have a better
> solution either.  Anyone else have some idea?
>

maxnode has been an `unsigned long` in every other interface for quite
some time.  It seems better to keep this consistent rather than have it
suddenly become `unsigned long` over here and `unsigned short` over there.

> >         __aligned_u64 pol_nodes;
> >         __aligned_u64 *il_weights;      /* of size pol_maxnodes */
> 
> Typo?  Should be,
> 

derp derp

> >
> > The 'flags' argument for mbind2 is the same as 'mbind', except with
> > the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
> > field should be utilized.
> >
> > The 'flags' argument for get_mempolicy2 allows for MPOL_F_ADDR to
> > allow operating on VMA policies, but MPOL_F_NODE and MPOL_F_MEMS_ALLOWED
> > behavior has been omitted, since get_mempolicy() provides this already.
> 
> I still think that it's a good idea to make it possible to deprecate
> get_mempolicy().  How about using a union as follows?
> 
> struct mpol_mems_allowed {
>          __u64 maxnodes;
>          __aligned_u64 nodemask;
> };
> 
> union mpol_info {
>         struct mpol_args args;
>         struct mpol_mems_allowed mems_allowed;
>         __s32 node;
> };
> 

See my other email.  I've come around to see mems_allowed as a wart that
needs to be removed.  The same information is already available via
sysfs cpusets.mems and cpusets.mems_effective.

Additionally, mems_allowed isn't even technically part of the mempolicy,
so if we did want an interface to acquire the information, you'd prefer
to just implement a stand-alone syscall.

The sysfs interface seems sufficient though.


`policy_node` is a similar "why does this even exist" type feature,
except that it can still be used from get_mempolicy() and if there is an
actual reason to extend it to get_mempolicy2() it can be added to
mpol_params.

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2023-12-26  7:01     ` Gregory Price
@ 2023-12-26  8:06       ` Gregory Price
  2023-12-26 11:32       ` Gregory Price
  2024-01-02  8:42       ` Huang, Ying
  2 siblings, 0 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-26  8:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

On Tue, Dec 26, 2023 at 02:01:57AM -0500, Gregory Price wrote:
> > 
> > If pol->wil.cur_weight == 0, prev_node will be used without being
> > initialized below.
> > 
> 
> pol->wil.cur_weight is not used below.
> 

Disregard, I misread your comment.  prev_node should be initialized to
NUMA_NO_NODE.  Will fix.

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2023-12-26  7:01     ` Gregory Price
  2023-12-26  8:06       ` Gregory Price
@ 2023-12-26 11:32       ` Gregory Price
  2024-01-02  8:42       ` Huang, Ying
  2 siblings, 0 replies; 46+ messages in thread
From: Gregory Price @ 2023-12-26 11:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

On Tue, Dec 26, 2023 at 02:01:57AM -0500, Gregory Price wrote:
> On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
> > Gregory Price <gourry.memverge@gmail.com> writes:
> 
> Barrier is to stabilize nodemask on the stack, but yes i'll carry the
> comment from interleave_nid into this barrier as well.
> 
> > > +
> > > +	/* first ensure we have a valid nodemask */
> > > +	nid = first_node(nodemask);
> > > +	if (nid == MAX_NUMNODES)
> > > +		return nid;
> > 
> > It appears that this isn't necessary, because we can check whether
> > weight_total == 0 after the next loop.
> > 
> 
> fair, will snip.
> 

Follow up - this is only possible if the nodemask is invalid / has no
nodes, so it's better to check for this explicitly.  If nodemask is
valid, then it's not possible to have a weight_total of 0, because
weights cannot be 0.

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
  2023-12-26  7:05     ` Gregory Price
@ 2023-12-26 11:48       ` Gregory Price
  2024-01-02  9:09         ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2023-12-26 11:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

On Tue, Dec 26, 2023 at 02:05:35AM -0500, Gregory Price wrote:
> On Wed, Dec 27, 2023 at 04:39:29PM +0800, Huang, Ying wrote:
> > Gregory Price <gourry.memverge@gmail.com> writes:
> > 
> > > +	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
> > > +
> > > +	*flags = *mode_arg & MPOL_MODE_FLAGS;
> > > +	*mode_arg = mode;
> > 
> > It appears that it's unnecessary to introduce a local variable to split
> > mode/flags.  Just reuse the original code?
> > 

Revisiting during fixes: Note the change from int to short.

I chose to make this explicit because validate_mpol_flags takes a short.

I'm fairly sure changing it back throws a truncation warning.

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2023-12-23 18:10 ` [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
@ 2023-12-27  6:42   ` Huang, Ying
  2023-12-26  6:48     ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2023-12-27  6:42 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-fsdevel, linux-kernel, linux-api, x86,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	gregory.price, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

Gregory Price <gourry.memverge@gmail.com> writes:

> From: Rakie Kim <rakie.kim@sk.com>
>
> This patch provides a way to set interleave weight information under
> sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
>
> The sysfs structure is designed as follows.
>
>   $ tree /sys/kernel/mm/mempolicy/
>   /sys/kernel/mm/mempolicy/ [1]
>   └── weighted_interleave [2]
>       ├── node0 [3]
>       └── node1
>
> Each file above can be explained as follows.
>
> [1] mm/mempolicy: configuration interface for mempolicy subsystem
>
> [2] weighted_interleave/: config interface for weighted interleave policy
>
> [3] weighted_interleave/nodeN: weight for nodeN
>
> If sysfs is disabled in the config, the global interleave weights
> will default to "1" for all nodes.
>
> Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> Co-developed-by: Gregory Price <gregory.price@memverge.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> ---
>  .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  22 +++
>  mm/mempolicy.c                                | 156 ++++++++++++++++++
>  3 files changed, 182 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
> new file mode 100644
> index 000000000000..2dcf24f4384a
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
> @@ -0,0 +1,4 @@
> +What:		/sys/kernel/mm/mempolicy/
> +Date:		December 2023
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Interface for Mempolicy
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> new file mode 100644
> index 000000000000..aa27fdf08c19
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -0,0 +1,22 @@
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/
> +Date:		December 2023
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Configuration Interface for the Weighted Interleave policy
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN
> +Date:		December 2023
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Weight configuration interface for nodeN
> +
> +		The interleave weight for a memory node (N). These weights are
> +		utilized by processes which have set their mempolicy to
> +		MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
> +		omitting a task-local weight array.
> +
> +		These weights only affect new allocations, and changes at runtime
> +		will not cause migrations on already allocated pages.
> +
> +		Writing an empty string resets the weight value to 1.

I still think that it's a good idea to provide some better default
weight value with HMAT or CDAT if available.  So, better not to make "1"
as part of ABI?

> +
> +		Minimum weight: 1

Can weight be "0"?  Do we need a way to specify that a node doesn't want
to participate in weighted interleave?

> +		Maximum weight: 255
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..0e77633b07a5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
>  
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>  
> +static char iw_table[MAX_NUMNODES];
> +

It's kind of obscure whether "char" is "signed" or "unsigned".  Given
the max weight is 255 above, it's better to use "u8"?

And, we may need a way to specify whether the weight has been overridden
by the user.  A special value (such as 255) can be used for that.  If
so, the maximum weight should be 254 instead of 255.  As a user space
interface, is it better to use 100 as the maximum value?

>  /**
>   * numa_nearest_node - Find nearest node by state
>   * @node: Node id to start the search
> @@ -3067,3 +3069,157 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>  		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
>  			       nodemask_pr_args(&nodes));
>  }
> +
> +#ifdef CONFIG_SYSFS
> +struct iw_node_attr {
> +	struct kobj_attribute kobj_attr;
> +	int nid;
> +};
> +
> +static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
> +			 char *buf)
> +{
> +	struct iw_node_attr *node_attr;
> +
> +	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> +	return sysfs_emit(buf, "%d\n", iw_table[node_attr->nid]);
> +}
> +
> +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> +			  const char *buf, size_t count)
> +{
> +	struct iw_node_attr *node_attr;
> +	unsigned char weight = 0;
> +
> +	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> +	/* If no input, set default weight to 1 */
> +	if (count == 0 || sysfs_streq(buf, ""))
> +		weight = 1;
> +	else if (kstrtou8(buf, 0, &weight) || !weight)
> +		return -EINVAL;
> +
> +	iw_table[node_attr->nid] = weight;

kstrtou8(), "unsigned char weight", "char iw_table[]" isn't completely
consistent.  It's better to make them consistent?

> +	return count;
> +}
> +
> +static struct iw_node_attr *node_attrs[MAX_NUMNODES];
> +
> +static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
> +				  struct kobject *parent)
> +{
> +	if (!node_attr)
> +		return;
> +	sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
> +	kfree(node_attr->kobj_attr.attr.name);
> +	kfree(node_attr);
> +}
> +
> +static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++)
> +		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);

IIUC, if this is called in error path (such as, in
add_weighted_interleave_group()), some node_attrs[] element may be
"NULL"?

> +	kobject_put(mempolicy_kobj);
> +}
> +
> +static const struct kobj_type mempolicy_ktype = {
> +	.sysfs_ops = &kobj_sysfs_ops,
> +	.release = sysfs_mempolicy_release,
> +};
> +
> +static int add_weight_node(int nid, struct kobject *wi_kobj)
> +{
> +	struct iw_node_attr *node_attr;
> +	char *name;
> +
> +	node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
> +	if (!node_attr)
> +		return -ENOMEM;
> +
> +	name = kasprintf(GFP_KERNEL, "node%d", nid);
> +	if (!name) {
> +		kfree(node_attr);
> +		return -ENOMEM;
> +	}
> +
> +	sysfs_attr_init(&node_attr->kobj_attr.attr);
> +	node_attr->kobj_attr.attr.name = name;
> +	node_attr->kobj_attr.attr.mode = 0644;
> +	node_attr->kobj_attr.show = node_show;
> +	node_attr->kobj_attr.store = node_store;
> +	node_attr->nid = nid;
> +
> +	if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
> +		kfree(node_attr->kobj_attr.attr.name);
> +		kfree(node_attr);
> +		pr_err("failed to add attribute to weighted_interleave\n");
> +		return -ENOMEM;
> +	}
> +
> +	node_attrs[nid] = node_attr;
> +	return 0;
> +}
> +
> +static int add_weighted_interleave_group(struct kobject *root_kobj)
> +{
> +	struct kobject *wi_kobj;
> +	int nid, err;
> +
> +	wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
> +	if (!wi_kobj)
> +		return -ENOMEM;
> +
> +	err = kobject_init_and_add(wi_kobj, &mempolicy_ktype, root_kobj,
> +				   "weighted_interleave");
> +	if (err) {
> +		kfree(wi_kobj);
> +		return err;
> +	}
> +
> +	memset(node_attrs, 0, sizeof(node_attrs));
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		err = add_weight_node(nid, wi_kobj);
> +		if (err) {
> +			pr_err("failed to add sysfs [node%d]\n", nid);
> +			break;
> +		}
> +	}
> +	if (err)
> +		kobject_put(wi_kobj);
> +	return 0;
> +}
> +
> +static int __init mempolicy_sysfs_init(void)
> +{
> +	int err;
> +	struct kobject *root_kobj;
> +
> +	memset(&iw_table, 1, sizeof(iw_table));
> +
> +	root_kobj = kobject_create_and_add("mempolicy", mm_kobj);
> +	if (!root_kobj) {
> +		pr_err("failed to add mempolicy kobject to the system\n");
> +		return -ENOMEM;
> +	}
> +
> +	err = add_weighted_interleave_group(root_kobj);
> +
> +	if (err)
> +		kobject_put(root_kobj);
> +	return err;
> +
> +}
> +#else
> +static int __init mempolicy_sysfs_init(void)
> +{
> +	/*
> +	 * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
> +	 * MPOL_INTERLEAVE behavior, but is still defined separately to
> +	 * allow task-local weighted interleave to operate as intended.
> +	 */
> +	memset(&iw_table, 1, sizeof(iw_table));
> +	return 0;
> +}
> +#endif /* CONFIG_SYSFS */
> +late_initcall(mempolicy_sysfs_init);

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2023-12-23 18:10 ` [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
@ 2023-12-27  8:32   ` Huang, Ying
  2023-12-26  7:01     ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2023-12-27  8:32 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-fsdevel, linux-kernel, linux-api, x86,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	gregory.price, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

Gregory Price <gourry.memverge@gmail.com> writes:

> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> the current MPOL_INTERLEAVE could be a wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as having local DRAM and CXL memory together, the current round-robin
> based interleaving policy doesn't maximize the overall bandwidth because
> of their different bandwidth characteristics.
>
> Instead, the interleaving can be more efficient when the allocation
> policy follows each NUMA nodes' bandwidth weight rather than having 1:1
> round-robin allocation.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, which
> enables weighted interleaving between NUMA nodes.  Weighted interleave
> allows for a proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth capacity of each
> node from the perspective of the accessing node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
> appropriate weight distribution is (2:1).
>
> Weights will be acquired from the global weight matrix exposed by the
> sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/
>
> The policy will then allocate the number of pages according to the
> set weights.  For example, if the weights are (2,1), then 2 pages
> will be allocated on node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> There are 3 integration points:
>
> weighted_interleave_nodes:
>     Counts the number of allocations as they occur, and applies the
>     weight for the current node.  When the weight reaches 0, switch
>     to the next node. Applied by `mempolicy_slab_node()` and
>     `policy_nodemask()`
>
> weighted_interleave_nid:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the node based on the given index.
>     Applied by `policy_nodemask()` and `mpol_misplaced()`
>
> bulk_array_weighted_interleave:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the number of "interleave rounds" as
>     well as any delta ("partial round").  Calculates the number of
>     pages for each node and allocates them.
>
>     If a node was scheduled for interleave via interleave_nodes, the
>     current weight (pol->cur_weight) will be allocated first, before
>     the remaining bulk calculation is done. This simplifies the
>     calculation at the cost of an additional allocation call.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actual application of the interleave.  The
> calculation of the `interleave index` is done by `get_vma_policy()`,
> while the actual selection of the node will later be applied by the
> relevant weighted_interleave function.
>
> If CONFIG_SYSFS is disabled, the weight table will be initialized
> to set all nodes to weight 1, but the weighting code is still called.
> This is so that task-local weights (future patch) can still be
> engaged cleanly without ifdef spaghetti.
>
> Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Rakie Kim <rakie.kim@sk.com>
> Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst     |  11 +
>  include/linux/mempolicy.h                     |   5 +
>  include/uapi/linux/mempolicy.h                |   1 +
>  mm/mempolicy.c                                | 197 +++++++++++++++++-
>  4 files changed, 211 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..d2c8e712785b 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,17 @@ MPOL_PREFERRED_MANY
>  	can fall back to all existing numa nodes. This is effectively
>  	MPOL_PREFERRED allowed for a mask rather than a single node.
>  
> +MPOL_WEIGHTED_INTERLEAVE
> +	This mode operates the same as MPOL_INTERLEAVE, except that
> +	interleaving behavior is executed based on weights set in
> +	/sys/kernel/mm/mempolicy/weighted_interleave/
> +
> +	Weighted interleave allocations pages on nodes according to
> +	their weight.  For example if nodes [0,1] are weighted [5,2]
> +	respectively, 5 pages will be allocated on node0 for every
> +	2 pages allocated on node1.  This can better distribute data
> +	according to bandwidth on heterogeneous memory systems.
> +
>  NUMA memory policy supports the following optional mode flags:
>  
>  MPOL_F_STATIC_NODES
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..ba09167e80f7 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -54,6 +54,11 @@ struct mempolicy {
>  		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
>  		nodemask_t user_nodemask;	/* nodemask passed by user */
>  	} w;
> +
> +	/* Weighted interleave settings */
> +	struct {
> +		unsigned char cur_weight;
> +	} wil;
>  };
>  
>  /*
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
>  	MPOL_PREFERRED_MANY,
> +	MPOL_WEIGHTED_INTERLEAVE,
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0e77633b07a5..0a180c670f0c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -305,6 +305,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
>  	policy->mode = mode;
>  	policy->flags = flags;
>  	policy->home_node = NUMA_NO_NODE;
> +	policy->wil.cur_weight = 0;
>  
>  	return policy;
>  }
> @@ -417,6 +418,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_preferred,
>  	},
> +	[MPOL_WEIGHTED_INTERLEAVE] = {
> +		.create = mpol_new_nodemask,
> +		.rebind = mpol_rebind_nodemask,
> +	},
>  };
>  
>  static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -838,7 +843,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>  
>  	old = current->mempolicy;
>  	current->mempolicy = new;
> -	if (new && new->mode == MPOL_INTERLEAVE)
> +	if (new && (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
>  		current->il_prev = MAX_NUMNODES-1;
>  	task_unlock(current);
>  	mpol_put(old);
> @@ -864,6 +870,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*nodes = pol->nodes;
>  		break;
>  	case MPOL_LOCAL:
> @@ -948,6 +955,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  		} else if (pol == current->mempolicy &&
>  				pol->mode == MPOL_INTERLEAVE) {
>  			*policy = next_node_in(current->il_prev, pol->nodes);
> +		} else if (pol == current->mempolicy &&
> +				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> +			if (pol->wil.cur_weight)
> +				*policy = current->il_prev;
> +			else
> +				*policy = next_node_in(current->il_prev,
> +						       pol->nodes);
>  		} else {
>  			err = -EINVAL;
>  			goto out;
> @@ -1777,7 +1791,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  	pol = __get_vma_policy(vma, addr, ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -	if (pol->mode == MPOL_INTERLEAVE) {
> +	if (pol->mode == MPOL_INTERLEAVE ||
> +	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  		*ilx += vma->vm_pgoff >> order;
>  		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
>  	}
> @@ -1827,6 +1842,24 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  	return zone >= dynamic_policy_zone;
>  }
>  
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> +	unsigned int next;
> +	struct task_struct *me = current;
> +
> +	next = next_node_in(me->il_prev, policy->nodes);
> +	if (next == MAX_NUMNODES)
> +		return next;
> +
> +	if (!policy->wil.cur_weight)
> +		policy->wil.cur_weight = iw_table[next];
> +
> +	policy->wil.cur_weight--;
> +	if (!policy->wil.cur_weight)
> +		me->il_prev = next;
> +	return next;
> +}
> +
>  /* Do dynamic interleaving for a process */
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
> @@ -1861,6 +1894,9 @@ unsigned int mempolicy_slab_node(void)
>  	case MPOL_INTERLEAVE:
>  		return interleave_nodes(policy);
>  
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		return weighted_interleave_nodes(policy);
> +
>  	case MPOL_BIND:
>  	case MPOL_PREFERRED_MANY:
>  	{
> @@ -1885,6 +1921,41 @@ unsigned int mempolicy_slab_node(void)
>  	}
>  }
>  
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> +	nodemask_t nodemask = pol->nodes;
> +	unsigned int target, weight_total = 0;
> +	int nid;
> +	unsigned char weights[MAX_NUMNODES];

MAX_NUMNODES could be as large as 1024.  1KB stack space may be too
large?

> +	unsigned char weight;
> +
> +	barrier();

Memory barrier needs comments.

> +
> +	/* first ensure we have a valid nodemask */
> +	nid = first_node(nodemask);
> +	if (nid == MAX_NUMNODES)
> +		return nid;

It appears that this isn't necessary, because we can check whether
weight_total == 0 after the next loop.

> +
> +	/* Then collect weights on stack and calculate totals */
> +	for_each_node_mask(nid, nodemask) {
> +		weight = iw_table[nid];
> +		weight_total += weight;
> +		weights[nid] = weight;
> +	}
> +
> +	/* Finally, calculate the node offset based on totals */
> +	target = (unsigned int)ilx % weight_total;

Why use type casting?

> +	nid = first_node(nodemask);
> +	while (target) {
> +		weight = weights[nid];
> +		if (target < weight)
> +			break;
> +		target -= weight;
> +		nid = next_node_in(nid, nodemask);
> +	}
> +	return nid;
> +}
> +
>  /*
>   * Do static interleaving for interleave index @ilx.  Returns the ilx'th
>   * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1953,6 +2024,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>  		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
>  			interleave_nodes(pol) : interleave_nid(pol, ilx);
>  		break;
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
> +			weighted_interleave_nodes(pol) :
> +			weighted_interleave_nid(pol, ilx);
> +		break;
>  	}
>  
>  	return nodemask;
> @@ -2014,6 +2090,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*mask = mempolicy->nodes;
>  		break;
>  
> @@ -2113,7 +2190,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  		 * If the policy is interleave or does not allow the current
>  		 * node in its nodemask, we allocate the standard way.
>  		 */
> -		if (pol->mode != MPOL_INTERLEAVE &&
> +		if ((pol->mode != MPOL_INTERLEAVE &&
> +		    pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
>  		    (!nodemask || node_isset(nid, *nodemask))) {
>  			/*
>  			 * First, try to allocate THP only on local node, but
> @@ -2249,6 +2327,106 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
>  	return total_allocated;
>  }
>  
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> +		struct mempolicy *pol, unsigned long nr_pages,
> +		struct page **page_array)
> +{
> +	struct task_struct *me = current;
> +	unsigned long total_allocated = 0;
> +	unsigned long nr_allocated;
> +	unsigned long rounds;
> +	unsigned long node_pages, delta;
> +	unsigned char weight;
> +	unsigned char weights[MAX_NUMNODES];
> +	unsigned int weight_total = 0;
> +	unsigned long rem_pages = nr_pages;
> +	nodemask_t nodes = pol->nodes;
> +	int nnodes, node, prev_node;
> +	int i;
> +
> +	/* Stabilize the nodemask on the stack */
> +	barrier();

I don't think barrier() is needed to wait for memory operations for
stack.  It's usually used for cross-processor memory order.

> +
> +	nnodes = nodes_weight(nodes);
> +
> +	/* Collect weights and save them on stack so they don't change */
> +	for_each_node_mask(node, nodes) {
> +		weight = iw_table[node];
> +		weight_total += weight;
> +		weights[node] = weight;
> +	}
> +
> +	/* Continue allocating from most recent node and adjust the nr_pages */
> +	if (pol->wil.cur_weight) {
> +		node = next_node_in(me->il_prev, nodes);
> +		node_pages = pol->wil.cur_weight;
> +		if (node_pages > rem_pages)
> +			node_pages = rem_pages;
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		/* if that's all the pages, no need to interleave */
> +		if (rem_pages <= pol->wil.cur_weight) {
> +			pol->wil.cur_weight -= rem_pages;
> +			return total_allocated;
> +		}
> +		/* Otherwise we adjust nr_pages down, and continue from there */
> +		rem_pages -= pol->wil.cur_weight;
> +		pol->wil.cur_weight = 0;
> +		prev_node = node;

If pol->wil.cur_weight == 0, prev_node will be used without being
initialized below.

> +	}
> +
> +	/* Now we can continue allocating as if from 0 instead of an offset */
> +	rounds = rem_pages / weight_total;
> +	delta = rem_pages % weight_total;
> +	for (i = 0; i < nnodes; i++) {
> +		node = next_node_in(prev_node, nodes);
> +		weight = weights[node];
> +		node_pages = weight * rounds;
> +		if (delta) {
> +			if (delta > weight) {
> +				node_pages += weight;
> +				delta -= weight;
> +			} else {
> +				node_pages += delta;
> +				delta = 0;
> +			}
> +		}
> +		/* We may not make it all the way around */
> +		if (!node_pages)
> +			break;
> +		/* If an over-allocation would occur, floor it */
> +		if (node_pages + total_allocated > nr_pages) {

Why is this possible?

> +			node_pages = nr_pages - total_allocated;
> +			delta = 0;
> +		}
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		prev_node = node;
> +	}
> +
> +	/*
> +	 * Finally, we need to update me->il_prev and pol->wil.cur_weight
> +	 * if there were overflow pages, but not equivalent to the node
> +	 * weight, set the cur_weight to node_weight - delta and the
> +	 * me->il_prev to the previous node. Otherwise if it was perfect
> +	 * we can simply set il_prev to node and cur_weight to 0
> +	 */
> +	if (node_pages) {
> +		me->il_prev = prev_node;
> +		node_pages %= weight;
> +		pol->wil.cur_weight = weight - node_pages;
> +	} else {
> +		me->il_prev = node;
> +		pol->wil.cur_weight = 0;
> +	}
> +
> +	return total_allocated;
> +}
> +
>  static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
> @@ -2289,6 +2467,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
>  		return alloc_pages_bulk_array_interleave(gfp, pol,
>  							 nr_pages, page_array);
>  
> +	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> +		return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
> +								  nr_pages,
> +								  page_array);
> +
>  	if (pol->mode == MPOL_PREFERRED_MANY)
>  		return alloc_pages_bulk_array_preferred_many(gfp,
>  				numa_node_id(), pol, nr_pages, page_array);
> @@ -2364,6 +2547,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		return !!nodes_equal(a->nodes, b->nodes);
>  	case MPOL_LOCAL:
>  		return true;
> @@ -2500,6 +2684,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>  		polnid = interleave_nid(pol, ilx);
>  		break;
>  
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		polnid = weighted_interleave_nid(pol, ilx);
> +		break;
> +
>  	case MPOL_PREFERRED:
>  		if (node_isset(curnid, pol->nodes))
>  			goto out;
> @@ -2874,6 +3062,7 @@ static const char * const policy_modes[] =
>  	[MPOL_PREFERRED]  = "prefer",
>  	[MPOL_BIND]       = "bind",
>  	[MPOL_INTERLEAVE] = "interleave",
> +	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
>  	[MPOL_LOCAL]      = "local",
>  	[MPOL_PREFERRED_MANY]  = "prefer (many)",
>  };
> @@ -2933,6 +3122,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
>  		}
>  		break;
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		/*
>  		 * Default to online nodes with memory if no nodelist
>  		 */
> @@ -3043,6 +3233,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		nodes = pol->nodes;
>  		break;
>  	default:

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
  2023-12-23 18:10 ` [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Gregory Price
@ 2023-12-27  8:39   ` Huang, Ying
  2023-12-26  7:05     ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2023-12-27  8:39 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-fsdevel, linux-kernel, linux-api, x86,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	gregory.price, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

Gregory Price <gourry.memverge@gmail.com> writes:

> split sanitize_mpol_flags into sanitize and validate.
>
> Sanitize is used by set_mempolicy to split (int mode) into mode
> and mode_flags, and then validates them.
>
> Validate validates already split flags.
>
> Validate will be reused for new syscalls that accept already
> split mode and mode_flags.
>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> ---
>  mm/mempolicy.c | 29 ++++++++++++++++++++++-------
>  1 file changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0a180c670f0c..59ac0da24f56 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1463,24 +1463,39 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
>  	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
>  }
>  
> -/* Basic parameter sanity check used by both mbind() and set_mempolicy() */
> -static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
> +/*
> + * Basic parameter sanity check used by mbind/set_mempolicy
> + * May modify flags to include internal flags (e.g. MPOL_F_MOF/F_MORON)
> + */
> +static inline int validate_mpol_flags(unsigned short mode, unsigned short *flags)
>  {
> -	*flags = *mode & MPOL_MODE_FLAGS;
> -	*mode &= ~MPOL_MODE_FLAGS;
> -
> -	if ((unsigned int)(*mode) >=  MPOL_MAX)
> +	if ((unsigned int)(mode) >= MPOL_MAX)
>  		return -EINVAL;
>  	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
>  		return -EINVAL;
>  	if (*flags & MPOL_F_NUMA_BALANCING) {
> -		if (*mode != MPOL_BIND)
> +		if (mode != MPOL_BIND)
>  			return -EINVAL;
>  		*flags |= (MPOL_F_MOF | MPOL_F_MORON);
>  	}
>  	return 0;
>  }
>  
> +/*
> + * Used by mbind/set_memplicy to split and validate mode/flags
> + * set_mempolicy combines (mode | flags), split them out into separate

mbind() uses mode flags too.

> + * fields and return just the mode in mode_arg and flags in flags.
> + */
> +static inline int sanitize_mpol_flags(int *mode_arg, unsigned short *flags)
> +{
> +	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
> +
> +	*flags = *mode_arg & MPOL_MODE_FLAGS;
> +	*mode_arg = mode;

It appears that it's unnecessary to introduce a local variable to split
mode/flags.  Just reuse the original code?

> +
> +	return validate_mpol_flags(mode, flags);
> +}
> +
>  static long kernel_mbind(unsigned long start, unsigned long len,
>  			 unsigned long mode, const unsigned long __user *nmask,
>  			 unsigned long maxnode, unsigned int flags)

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave
  2023-12-26  7:45   ` Gregory Price
@ 2024-01-02  4:27     ` Huang, Ying
  2024-01-02 19:06       ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2024-01-02  4:27 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko,
	Zhongkun He, Frank van der Linden, John Groves, Jonathan Cameron

Gregory Price <gregory.price@memverge.com> writes:

> On Mon, Dec 25, 2023 at 03:54:18PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > For example, the stream benchmark demonstrates that default interleave
>> > is actively harmful, where weighted interleave is beneficial.
>> >
>> > Hardware: 1-socket 8 channel DDR5 + 1 CXL expander in PCIe x16
>> > Default interleave : -78% (slower than DRAM)
>> > Global weighting   : -6% to +4% (workload dependant)
>> > Targeted weights   : +2.5% to +4% (consistently better than DRAM)
>> >
>> > If nothing else, this shows how awful round-robin interleave is.
>> 
>> I guess the performance of the default policy, local (fast memory)
>> first, may be even better in some situation?  For example, before the
>> bandwidth of DRAM is saturated?
>> 
>
> Yes - but it's more complicated than that.
>
> Global weighting here means we did `numactl -w --interleave ...`, which
> means *all* memory regions will be interleaved.  Code, stack, heap, etc.
>
> Targeted weights means we used mbind2() with local weights, which only
> targeted specific heap regions.
>
> The default policy was better than global weighting likely because
> things like stack/code being distributed to higher latency memory
> produced a measurable overhead.
>
> To provide this, we only applied weights to bandwidth driving regions,
> and as a result we demonstrated a measurable performance increase.
>
> So yes, the default policy may be better in some situations - but that
> will be true of any policy.

Yes.  Some memory areas may be more sensitive to memory latency than
other areas.

Per my understanding, memory latency increases with the actual memory
throughput, and it increases quickly when the throughput nears the
maximum memory bandwidth, as shown in the figures at the following
URL.

https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/

If the memory latency of DRAM will not increase much, it's better to
always place the hot data in DRAM.  But if the memory throughput nears
the maximum memory bandwidth, so that the memory latency of DRAM
increases greatly, possibly becoming even higher than that of CXL
memory, it's better to put some hot data in CXL memory to reduce the
overall memory latency.

If that's right, I suggest adding something like the above to the patch
description.

>> I understand that you may want to limit the memory usage of the fast
>> memory too.  But IMHO, that is another requirement.  That should be
>> enforced by something like a per-node memory limit.
>> 
>
> This interface does not limit the memory usage of a particular node; it
> distributes data according to the requested policy.
>
> Nuanced distinction, but important.  If nodes become exhausted, tasks
> are still free to allocate memory from any node in the nodemask, even if
> it violates the requested mempolicy.
>
> This is consistent with the existing behavior of mempolicy.

Good.

>> > =====================================================================
>> > (Patches 3-6) Refactoring mempolicy for code-reuse
>> >
>> > To avoid multiple paths of mempolicy creation, we should refactor the
>> > existing code to enable the designed extensibility, and refactor
>> > existing users to utilize the new interface (while retaining the
>> > existing userland interface).
>> >
>> > This set of patches introduces a new mempolicy_args structure, which
>> > is used to more fully describe a requested mempolicy - to include
>> > existing and future extensions.
>> >
>> > /*
>> >  * Describes settings of a mempolicy during set/get syscalls and
>> >  * kernel internal calls to do_set_mempolicy()
>> >  */
>> > struct mempolicy_args {
>> >     unsigned short mode;            /* policy mode */
>> >     unsigned short mode_flags;      /* policy mode flags */
>> >     int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
>> >     nodemask_t *policy_nodes;       /* get/set/mbind */
>> >     unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
>> > };
>> 
>> According to
>> 
>> https://www.geeksforgeeks.org/difference-between-argument-and-parameter-in-c-c-with-examples/
>> 
>> it appears that "parameter" is better than "argument" for the struct name
>> here.  It appears that the current kernel source supports this too.
>> 
>> $ grep 'struct[\t ]\+[a-zA-Z0-9]\+_param' -r include/linux | wc -l
>> 411
>> $ grep 'struct[\t ]\+[a-zA-Z0-9]\+_arg' -r include/linux | wc -l
>> 25
>> 
>
> Will change.
>
>> > This arg structure will eventually be utilized by the following
>> > interfaces:
>> >     mpol_new() - new mempolicy creation
>> >     do_get_mempolicy() - acquiring information about mempolicy
>> >     do_set_mempolicy() - setting the task mempolicy
>> >     do_mbind()         - setting a vma mempolicy
>> >
>> > do_get_mempolicy() is completely refactored to break it out into
>> > separate functionality based on the flags provided by get_mempolicy(2)
>> >     MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
>> >     MPOL_F_ADDR: acquires information on vma policies
>> >     MPOL_F_NODE: changes the output for the policy arg to node info
>> >
>> > We refactor the get_mempolicy syscall to flatten the logic based on these
>> > flags, and allow set_mempolicy2() to re-use the underlying logic.
>> >
>> > The result of this refactor, and the new mempolicy_args structure, is
>> > that extensions like 'sys_set_mempolicy_home_node' can now be directly
>> > integrated into the initial call to 'set_mempolicy2', and that more
>> > complete information about a mempolicy can be returned with a single
>> > call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
>> >
>> >
>> > =====================================================================
>> > (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>> >
>> > These interfaces are the 'extended' counterparts to their relatives.
>> > They use the userland 'struct mpol_args' structure to communicate a
>> > complete mempolicy configuration to the kernel.  This structure
>> > looks very much like the kernel-internal 'struct mempolicy_args':
>> >
>> > struct mpol_args {
>> >         /* Basic mempolicy settings */
>> >         __u16 mode;
>> >         __u16 mode_flags;
>> >         __s32 home_node;
>> >         __u64 pol_maxnodes;
>> 
>> I understand that we want to avoid a hole in the struct.  But I still feel
>> uncomfortable using __u64 for such a small value.  I don't have a solution either.
>> Does anyone else have an idea?
>>
>
> maxnode has been an `unsigned long` in every other interface for quite
> some time.  Seems better to keep this consistent rather than have it suddenly
> become `unsigned long` over here and `unsigned short` over there.

I don't think that it matters.  The actual maximum node number will be
less than the maximum `unsigned short`.

>> >         __aligned_u64 pol_nodes;
>> >         __aligned_u64 *il_weights;      /* of size pol_maxnodes */
>> 
>> Typo?  Should be,
>> 
>
> derp derp
>
>> >
>> > The 'flags' argument for mbind2 is the same as 'mbind', except with
>> > the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
>> > field should be utilized.
>> >
>> > The 'flags' argument for get_mempolicy2 allows for MPOL_F_ADDR to
>> > allow operating on VMA policies, but MPOL_F_NODE and MPOL_F_MEMS_ALLOWED
>> > behavior has been omitted, since get_mempolicy() provides this already.
>> 
>> I still think that it's a good idea to make it possible to deprecate
>> get_mempolicy().  How about using a union as follows?
>> 
>> struct mpol_mems_allowed {
>>          __u64 maxnodes;
>>          __aligned_u64 nodemask;
>> };
>> 
>> union mpol_info {
>>         struct mpol_args args;
>>         struct mpol_mems_allowed mems_allowed;
>>         __s32 node;
>> };
>> 
>
> See my other email.  I've come around to see mems_allowed as a wart that
> needs to be removed.  The same information is already available via
> sysfs cpusets.mems and cpusets.mems_effective.
>
> Additionally, mems_allowed isn't even technically part of the mempolicy,
> so if we did want an interface to acquire the information, you'd prefer
> to just implement a stand-alone syscall.
>
> The sysfs interface seems sufficient though.
>
> `policy_node` is a similar "why does this even exist" type feature,
> except that it can still be used from get_mempolicy() and if there is an
> actual reason to extend it to get_mempolicy2() it can be added to
> mpol_params.

OK.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2023-12-26  6:48     ` Gregory Price
@ 2024-01-02  7:41       ` Huang, Ying
  2024-01-02 19:45         ` Gregory Price
  2024-01-03  2:46         ` Gregory Price
  0 siblings, 2 replies; 46+ messages in thread
From: Huang, Ying @ 2024-01-02  7:41 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Dec 27, 2023 at 02:42:15PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > +		These weights only affect new allocations, and changes at runtime
>> > +		will not cause migrations on already allocated pages.
>> > +
>> > +		Writing an empty string resets the weight value to 1.
>> 
>> I still think that it's a good idea to provide some better default
>> weight value with HMAT or CDAT if available.  So, better not to make "1"
>> as part of ABI?
>> 
>
> That's the eventual goal,

Good to know this.

> but this is just the initial mechanism.
>
> My current thought is that the CXL driver will apply weights as the
> system iterates through devices and creates numa nodes.  In the
> meantime, you have to give the "possible" nodes a default value to
> prevent nodes onlined after boot from showing up with 0-value.
>
> Not allowing 0-value weights is simply easier in many respects.
>
>> > +
>> > +		Minimum weight: 1
>> 
>> Can weight be "0"?  Do we need a way to specify that a node doesn't want
>> to participate in weighted interleave?
>> 
>
> In this code, weight cannot be 0.  My thought is that removing the node
> from the nodemask is the way to denote 0.
>
> The problem with 0 is hotplug, migration, and cpusets.mems_allowed.  
>
> Example issue: a task sets local weights to [1,0,1,0] for nodes [0-3],
> and has a cpusets.mems_allowed mask of (0, 2).
>
> Let's say the user migrates the task via cgroups from nodes (0,2) to
> (1,3).
>
> The task will instantly crash, basically OOM, because weights of
> [1,0,1,0] will prevent memory from being allocated.
>
> Not allowing node weights of 0 is defensive.  Instead, simply removing
> the node from the nodemask and/or mems_allowed is both equivalent to and
> the preferred way to apply a weight of 0.

It sounds reasonable to set the minimum weight to 1.  But "1" may not be
the default weight, so I don't think it's a good idea to make "1" the
default in the ABI.

>> > +		Maximum weight: 255
>> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> > index 10a590ee1c89..0e77633b07a5 100644
>> > --- a/mm/mempolicy.c
>> > +++ b/mm/mempolicy.c
>> > @@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
>> >  
>> >  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>> >  
>> > +static char iw_table[MAX_NUMNODES];
>> > +
>> 
>> It's kind of obscure whether "char" is "signed" or "unsigned".  Given
>> the max weight is 255 above, it's better to use "u8"?
>>
>
> bah, stupid mistake.  I will switch this to u8.
>
>> And, we may need a way to specify whether the weight has been overridden
>> by the user.
>> A special value (such as 255) can be used for that.  If
>> so, the maximum weight should be 254 instead of 255.  As a user space
>> interface, is it better to use 100 as the maximum value?
>> 
>
> There's global weights and local weights.  These are the global weights.
>
> Local weights are stored in task->mempolicy.wil.il_weights.
>
> (policy->mode_flags & MPOL_F_GWEIGHT) denotes the override.
> This is set if (mempolicy_args->il_weights) was provided.
>
> This simplifies the interface.
>
> (note: local weights are not introduced until the last patch 11/11)

The global weight is writable via sysfs too, right?  Then, for global
weights, we have 2 sets of values,

iw_table_default[], and iw_table[].

iw_table_default[] is set to "1" now, and will be set to some other
values after we have enabled HMAT/CDAT-based default values.

iw_table[] is initialized with a special value (for example, "0", if "1"
is the minimal weight).  And users can change it via sysfs.  Then the
actual global weight for a node becomes

    iw_table[node] ? : iw_table_default[node]
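
Or, as a tiny sketch (the helper name is just made up for illustration):

static u8 global_interleave_weight(int node)
{
	/* a weight written via sysfs wins; 0 means "not set by the user" */
	return iw_table[node] ? : iw_table_default[node];
}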


As for global weights and local weights, we may need a way to copy from
the global weights to the local weights to simplify the memory policy
setup.  For example, if users want to use the global weights of nodes 0,
1, and 2 and override the weight of node 3, they can specify some
special value, for example 0, in mempolicy_args->il_weights[0], [1],
[2] to copy from the global values, and override [3] via some other
value.
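
As a rough sketch of what that could look like from user space (field
names taken from the proposed struct mpol_args; the "0 means copy the
global weight" encoding is only an idea at this point):

	unsigned char w[4] = { 0, 0, 0, 4 };	/* nodes 0-2: copy global, node 3: 4 */
	struct mpol_args args = {
		.mode         = MPOL_WEIGHTED_INTERLEAVE,
		.pol_maxnodes = 4,
		.il_weights   = (__u64)(uintptr_t)w,
		/* .pol_nodes (the nodemask) omitted for brevity */
	};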


Thinking about the default weight value via HMAT/CDAT again, it may not
be a good idea to use "1" as the default even for now.

For example,

- The memory bandwidth of DRAM is 100GB/s, whose default weight is "1".

- We hot-plug CXL.mem A with memory bandwidth 20GB/s.  So, we change the
  weight of DRAM to 5, and use "1" as the weight of CXL.mem A.

- We hot-plug CXL.mem B with memory bandwidth 10GB/s.  So, we change the
  weight of DRAM to 10, the weight of CXL.mem A to 2, and use "1" as the
  weight of CXL.mem B.

That is, if we use "1" as the default weight, we need to change the
weights of nodes frequently because we don't have a "base" weight.  The
best candidate base weight is the weight of the DRAM node.  For example,
if we set the default weight of the DRAM node to be "16" and use that as
the base weight, we don't need to change it in most cases.  The weight
of other nodes can be set according to the ratio of their memory
bandwidth to that of DRAM.

This makes it easy to set the default weight via HMAT/CDAT too.
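
Just to make the arithmetic concrete, a helper along these lines (purely
a sketch; the function name and the rounding are my own invention) could
derive a node's weight from the bandwidth ratio:

static u8 bw_to_weight(u64 node_bw, u64 dram_bw)
{
	/* base weight 16 for DRAM, scale others by their bandwidth ratio */
	u64 w = div64_u64(16 * node_bw + dram_bw / 2, dram_bw);

	return clamp_t(u64, w, 1, 255);
}

With the numbers above: DRAM 100GB/s -> 16, CXL.mem A 20GB/s -> 3,
CXL.mem B 10GB/s -> 2.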

What do you think about that?

>> > +
>> > +static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
>> > +{
>> > +	int i;
>> > +
>> > +	for (i = 0; i < MAX_NUMNODES; i++)
>> > +		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);
>> 
>> IIUC, if this is called in an error path (such as in
>> add_weighted_interleave_group()), some node_attrs[] elements may be
>> "NULL"?
>> 
>
> The null check is present in sysfs_wi_node_release
>
> if (!node_attr)
> 	return;

This works.  Sorry for noise.

> Is it preferable to pull this out? Seemed more defensive to put it
> inside the function.

Both are OK for me.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2023-12-26  7:01     ` Gregory Price
  2023-12-26  8:06       ` Gregory Price
  2023-12-26 11:32       ` Gregory Price
@ 2024-01-02  8:42       ` Huang, Ying
  2024-01-02 20:30         ` Gregory Price
  2 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2024-01-02  8:42 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> > +{
>> > +	nodemask_t nodemask = pol->nodes;
>> > +	unsigned int target, weight_total = 0;
>> > +	int nid;
>> > +	unsigned char weights[MAX_NUMNODES];
>> 
>> MAX_NUMNODES could be as large as 1024.  1KB stack space may be too
>> large?
>> 
>
> I've been struggling with a good solution to this.  We need a local copy
> of weights to prevent weights from changing out from under us during
> allocation (which may take quite some time), but it seemed unwise to
> allocate 1KB of heap in this particular path.
>
> Is my concern unfounded?  If so, I can go ahead and add the allocation
> code.

Please take a look at NODEMASK_ALLOC().

>> > +	unsigned char weight;
>> > +
>> > +	barrier();
>> 
>> Memory barrier needs comments.
>> 
>
> The barrier is to stabilize the nodemask on the stack, but yes, I'll carry the
> comment from interleave_nid into this barrier as well.

Please see below.

>> > +
>> > +	/* first ensure we have a valid nodemask */
>> > +	nid = first_node(nodemask);
>> > +	if (nid == MAX_NUMNODES)
>> > +		return nid;
>> 
>> It appears that this isn't necessary, because we can check whether
>> weight_total == 0 after the next loop.
>> 
>
> fair, will snip.
>
>> > +
>> > +	/* Then collect weights on stack and calculate totals */
>> > +	for_each_node_mask(nid, nodemask) {
>> > +		weight = iw_table[nid];
>> > +		weight_total += weight;
>> > +		weights[nid] = weight;
>> > +	}
>> > +
>> > +	/* Finally, calculate the node offset based on totals */
>> > +	target = (unsigned int)ilx % weight_total;
>> 
>> Why use type casting?
>> 
>
> Artifact of old prototypes, snipped.
>
>> > +
>> > +	/* Stabilize the nodemask on the stack */
>> > +	barrier();
>> 
>> I don't think barrier() is needed to wait for memory operations for
>> stack.  It's usually used for cross-processor memory order.
>>
>
> This is present in the old interleave code.  To the best of my
> understanding, the concern is for mempolicy->nodemask rebinding that can
> occur when cgroups.cpusets.mems_allowed changes.
>
> so we can't iterate over (mempolicy->nodemask), we have to take a local
> copy.
>
> My *best* understanding of the barrier here is to prevent the compiler
> from reordering operations such that it attempts to optimize out the
> local copy (or do lazy-fetch).
>
> It is present in the original interleave code, so I pulled it forward to
> this, but I have not tested whether this is a bit paranoid or not.
>
> from `interleave_nid`:
>
>  /*
>   * The barrier will stabilize the nodemask in a register or on
>   * the stack so that it will stop changing under the code.
>   *
>   * Between first_node() and next_node(), pol->nodes could be changed
>   * by other threads. So we put pol->nodes in a local stack.
>   */
>  barrier();

Got it.  This is kind of a READ_ONCE() for the nodemask.  To avoid adding
comments all over the place, can we implement a wrapper for it?  For
example, memcpy_once()?  __read_once_size() in
tools/include/linux/compiler.h can be used as a reference.

Because node_weights[] may be changed simultaneously too.  We may need
to consider a similar issue for it too.  But RCU seems more appropriate
for node_weights[].

>> > +		/* Otherwise we adjust nr_pages down, and continue from there */
>> > +		rem_pages -= pol->wil.cur_weight;
>> > +		pol->wil.cur_weight = 0;
>> > +		prev_node = node;
>> 
>> If pol->wil.cur_weight == 0, prev_node will be used without being
>> initialized below.
>> 
>
> pol->wil.cur_weight is not used below.
>
>> > +	}
>> > +
>> > +	/* Now we can continue allocating as if from 0 instead of an offset */
>> > +	rounds = rem_pages / weight_total;
>> > +	delta = rem_pages % weight_total;
>> > +	for (i = 0; i < nnodes; i++) {
>> > +		node = next_node_in(prev_node, nodes);
>> > +		weight = weights[node];
>> > +		node_pages = weight * rounds;
>> > +		if (delta) {
>> > +			if (delta > weight) {
>> > +				node_pages += weight;
>> > +				delta -= weight;
>> > +			} else {
>> > +				node_pages += delta;
>> > +				delta = 0;
>> > +			}
>> > +		}
>> > +		/* We may not make it all the way around */
>> > +		if (!node_pages)
>> > +			break;
>> > +		/* If an over-allocation would occur, floor it */
>> > +		if (node_pages + total_allocated > nr_pages) {
>> 
>> Why is this possible?
>> 
>
> this may have been a paranoid artifact from an early prototype, will
> snip and validate.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
  2023-12-26 11:48       ` Gregory Price
@ 2024-01-02  9:09         ` Huang, Ying
  2024-01-02 20:32           ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2024-01-02  9:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Dec 26, 2023 at 02:05:35AM -0500, Gregory Price wrote:
>> On Wed, Dec 27, 2023 at 04:39:29PM +0800, Huang, Ying wrote:
>> > Gregory Price <gourry.memverge@gmail.com> writes:
>> > 
>> > > +	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
>> > > +
>> > > +	*flags = *mode_arg & MPOL_MODE_FLAGS;
>> > > +	*mode_arg = mode;
>> > 
>> > It appears that it's unnecessary to introduce a local variable to split
>> > mode/flags.  Just reuse the original code?
>> > 
>
> Revisiting during fixes: Note the change from int to short.
>
> I chose to make this explicit because validate_mpol_flags takes a short.
>
> I'm fairly sure changing it back throws a truncation warning.

Why something like below doesn't work?

int sanitize_mpol_flags(int *mode, unsigned short *flags)
{
        *flags = *mode & MPOL_MODE_FLAGS;
        *mode &= ~MPOL_MODE_FLAGS;

        return validate_mpol_flags(*mode, flags);
}

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 08/11] mm/mempolicy: add set_mempolicy2 syscall
  2023-12-23 18:10 ` [PATCH v5 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
@ 2024-01-02 14:38   ` Geert Uytterhoeven
  0 siblings, 0 replies; 46+ messages in thread
From: Geert Uytterhoeven @ 2024-01-02 14:38 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-fsdevel, linux-kernel, linux-api, x86,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko

On Sat, Dec 23, 2023 at 7:13 PM Gregory Price <gourry.memverge@gmail.com> wrote:
> set_mempolicy2 is an extensible set_mempolicy interface which allows
> a user to set the per-task memory policy.
>
> Defined as:
>
> set_mempolicy2(struct mpol_args *args, size_t size, unsigned long flags);
>
> relevant mpol_args fields include the following:
>
> mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)
> mode_flags:   The MPOL_F_* flags that were previously passed in or'd
>               into the mode.  This was split to hopefully allow future
>               extensions additional mode/flag space.
> home_node:    ignored (see note below)
> pol_nodes:    the nodemask to apply for the memory policy
> pol_maxnodes: The max number of nodes described by pol_nodes
>
> The usize arg is intended for the user to pass in sizeof(mpol_args)
> to allow forward/backward compatibility whenever possible.
>
> The flags argument is intended to future proof the syscall against
> future extensions which may require interpreting the arguments in
> the structure differently.
>
> Semantics of `set_mempolicy2` are otherwise the same as `set_mempolicy`
> as of this patch.
>
> As of this patch, setting the home node of a task-policy is not
> supported, as this functionality was not supported by set_mempolicy.
> Additional research should be done to determine whether adding this
> functionality is safe, but doing so would only require setting
> MPOL_MF_HOME_NODE and providing a valid home node value.
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>

>  arch/m68k/kernel/syscalls/syscall.tbl         |  1 +

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 09/11] mm/mempolicy: add get_mempolicy2 syscall
  2023-12-23 18:10 ` [PATCH v5 09/11] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
@ 2024-01-02 14:46   ` Geert Uytterhoeven
  0 siblings, 0 replies; 46+ messages in thread
From: Geert Uytterhoeven @ 2024-01-02 14:46 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-fsdevel, linux-kernel, linux-api, x86,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko

On Sat, Dec 23, 2023 at 7:14 PM Gregory Price <gourry.memverge@gmail.com> wrote:
> get_mempolicy2 is an extensible get_mempolicy interface which allows
> a user to retrieve the memory policy for a task or address.
>
> Defined as:
>
> get_mempolicy2(struct mpol_args *args, size_t size,
>                unsigned long addr, unsigned long flags)
>
> Top level input values:
>
> mpol_args:    The field which collects information about the mempolicy
>               returned to userspace.
> addr:         if MPOL_F_ADDR is passed in `flags`, this address will be
>               used to return the mempolicy details of the vma the
>               address belongs to
> flags:        if MPOL_F_ADDR, return mempolicy info for the vma containing addr
>               else, returns task mempolicy information
>
> Input values include the following fields of mpol_args:
>
> pol_nodes:    if set, the nodemask of the policy returned here
> pol_maxnodes: if pol_nodes is set, must describe max number of nodes
>               to be copied to pol_nodes
>
> Output values include the following fields of mpol_args:
>
> mode:         mempolicy mode
> mode_flags:   mempolicy mode flags
> home_node:    policy home node will be returned here, or -1 if none.
> pol_nodes:    if set, the nodemask for the mempolicy
> policy_node:  if the policy has extended node information, it will
>               be placed here.  For example MPOL_INTERLEAVE will
>               return the next node which will be used for allocation
>
> MPOL_F_NODE has been dropped from get_mempolicy2 (EINVAL).
> MPOL_F_MEMS_ALLOWED has been dropped from get_mempolicy2 (EINVAL).
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>

>  arch/m68k/kernel/syscalls/syscall.tbl         |  1 +

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 10/11] mm/mempolicy: add the mbind2 syscall
  2023-12-23 18:11 ` [PATCH v5 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
@ 2024-01-02 14:47   ` Geert Uytterhoeven
  0 siblings, 0 replies; 46+ messages in thread
From: Geert Uytterhoeven @ 2024-01-02 14:47 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-mm, linux-doc, linux-fsdevel, linux-kernel, linux-api, x86,
	akpm, arnd, tglx, luto, mingo, bp, dave.hansen, hpa, mhocko, tj,
	ying.huang, gregory.price, corbet, rakie.kim, hyeongtak.ji,
	honggyu.kim, vtavarespetr, peterz, jgroves, ravis.opensrc,
	sthanneeru, emirakhur, Hasan.Maruf, seungjun.ha, Michal Hocko,
	Frank van der Linden

On Sat, Dec 23, 2023 at 7:14 PM Gregory Price <gourry.memverge@gmail.com> wrote:
> mbind2 is an extensible mbind interface which allows a user to
> set the mempolicy for one or more address ranges.
>
> Defined as:
>
> mbind2(unsigned long addr, unsigned long len, struct mpol_args *args,
>        size_t size, unsigned long flags)
>
> addr:         address of the memory range to operate on
> len:          length of the memory range
> flags:        MPOL_MF_HOME_NODE + original mbind() flags
>
> Input values include the following fields of mpol_args:
>
> mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)
> mode_flags:   The MPOL_F_* flags that were previously passed in or'd
>               into the mode.  This was split to hopefully allow future
>               extensions additional mode/flag space.
> home_node:    if (flags & MPOL_MF_HOME_NODE), set home node of policy
>               to this; otherwise it is ignored.
> pol_maxnodes: The max number of nodes described by pol_nodes
> pol_nodes:    the nodemask to apply for the memory policy
>
> The semantics are otherwise the same as mbind(), except that
> the home_node can be set.
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Suggested-by: Frank van der Linden <fvdl@google.com>
> Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Suggested-by: Rakie Kim <rakie.kim@sk.com>
> Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>

>  arch/m68k/kernel/syscalls/syscall.tbl         |  1 +

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave
  2024-01-02  4:27     ` Huang, Ying
@ 2024-01-02 19:06       ` Gregory Price
  2024-01-03  3:15         ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2024-01-02 19:06 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko,
	Zhongkun He, Frank van der Linden, John Groves, Jonathan Cameron

> >> > struct mpol_args {
> >> >         /* Basic mempolicy settings */
> >> >         __u16 mode;
> >> >         __u16 mode_flags;
> >> >         __s32 home_node;
> >> >         __u64 pol_maxnodes;
> >> 
> >> I understand that we want to avoid hole in struct.  But I still feel
> >> uncomfortable to use __u64 for a small value.  But I don't have a solution either.
> >> Anyone else has some idea?
> >>
> >
> > maxnode has been an `unsigned long` in every other interface for quite
> > some time.  Seems better to keep this consistent rather than it suddenly
> > become `unsigned long` over here and `unsigned short` over there.
> 
> I don't think that it matters.  The actual maximum node number will be
> less than maximum `unsigned short`.
> 

the structure will end up being

struct mpol_args {
	__u16 mode;
	__u16 mode_flags;
	__s32 home_node;
	__u16 pol_maxnodes;
	__u8  rsv[6];
	__aligned_u64 pol_nodes;
	__aligned_u64 il_weights;
}

If you're fine with that, I'll make the change.
~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2024-01-02  7:41       ` Huang, Ying
@ 2024-01-02 19:45         ` Gregory Price
  2024-01-03  2:45           ` Huang, Ying
  2024-01-03  2:46         ` Gregory Price
  1 sibling, 1 reply; 46+ messages in thread
From: Gregory Price @ 2024-01-02 19:45 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

On Tue, Jan 02, 2024 at 03:41:08PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> That is, if we use "1" as default weight, we need to change weights of
> nodes frequently because we haven't a "base" weight.  The best candidate
> base weight is the weight of DRAM node.  For example, if we set the
> default weight of DRAM node to be "16" and use that as the base weight,
> we don't need to change it in most cases.  The weight of other nodes can
> be set according to the ratio of its memory bandwidth to that of DRAM.
> 
> This makes it easy to set the default weight via HMAT/CDAT too.
> 
> What do you think about that?
> 

You're getting a bit ahead of the patch set.  There is "what is a
> reasonable default weight" and "what is the minimum functionality".

The minimum functionality is everything receiving a default weight of 1,
such that weighted interleave's behavior defaults to round-robin
interleave. This gets the system off the ground.

We can then expose an internal interface to drivers for them to set the
default weight to some reasonable number during system and device
initialization. The question at that point is what system is responsible
for setting the default weights... node? cxl? anything? What happens on
hotplug? etc.  That seems outside the scope of this patch set.


If you want me to add the default_iw_table with special value 0 denoting
"use default" at each layer, I can do that.

The basic change is this snippet:
```
if (pol->flags & MPOL_F_GWEIGHT)
	pol_weights = iw_table;
else
	pol_weights = pol->wil.weights;

for_each_node_mask(nid, nodemask) {
	weight = pol_weights[nid];
	weight_total += weight;
	weights[nid] = weight;
}
```

changes to:
```
for_each_node_mask(nid, nodemask) {
	weight = pol->wil.weights[nid];
	if (!weight)
		weight = iw_table[nid];
	if (!weight)
		weight = default_iw_table[nid];
	weight_total += weight;
	weights[nid] = weight;
}
```

It's a bit ugly, but it allows a 0 value to represent "use default",
and default_iw_table just ends up being initialized to `1` for now.

I think it also allows MPOL_F_GWEIGHT to be eliminated.

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2024-01-02  8:42       ` Huang, Ying
@ 2024-01-02 20:30         ` Gregory Price
  2024-01-03  5:46           ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2024-01-02 20:30 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry.memverge@gmail.com> writes:
> >> 
> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> >> > +{
> >> > +	nodemask_t nodemask = pol->nodes;
> >> > +	unsigned int target, weight_total = 0;
> >> > +	int nid;
> >> > +	unsigned char weights[MAX_NUMNODES];
> >> 
> >> MAX_NUMNODES could be as large as 1024.  1KB stack space may be too
> >> large?
> >> 
> >
> > I've been struggling with a good solution to this.  We need a local copy
> > of weights to prevent weights from changing out from under us during
> > allocation (which may take quite some time), but it seemed unwise to
> > allocate 1KB of heap in this particular path.
> >
> > Is my concern unfounded?  If so, I can go ahead and add the allocation
> > code.
> 
> Please take a look at NODEMASK_ALLOC().
>

This is not my question. NODEMASK_ALLOC calls kmalloc/kfree. 

Some of the allocations on the stack can be replaced with a scratch
allocation; that's no big deal.

I'm specifically concerned about:
	weighted_interleave_nid
	alloc_pages_bulk_array_weighted_interleave

I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
contexts. If kmalloc/kfree is safe, fine, this problem is trivial.

If not, there is no good solution to this without pre-allocating a
scratch area per-task.

> >> I don't think barrier() is needed to wait for memory operations for
> >> stack.  It's usually used for cross-processor memory order.
> >>
> >
> > This is present in the old interleave code.  To the best of my
> > understanding, the concern is for mempolicy->nodemask rebinding that can
> > occur when cgroups.cpusets.mems_allowed changes.
> >
> > so we can't iterate over (mempolicy->nodemask), we have to take a local
> > copy.
> >
> > My *best* understanding of the barrier here is to prevent the compiler
> > from reordering operations such that it attempts to optimize out the
> > local copy (or do lazy-fetch).
> >
> > It is present in the original interleave code, so I pulled it forward to
> > this, but I have not tested whether this is a bit paranoid or not.
> >
> > from `interleave_nid`:
> >
> >  /*
> >   * The barrier will stabilize the nodemask in a register or on
> >   * the stack so that it will stop changing under the code.
> >   *
> >   * Between first_node() and next_node(), pol->nodes could be changed
> >   * by other threads. So we put pol->nodes in a local stack.
> >   */
> >  barrier();
> 
> Got it.  This is kind of READ_ONCE() for nodemask.  To avoid to add
> comments all over the place.  Can we implement a wrapper for it?  For
> example, memcpy_once().  __read_once_size() in
> tools/include/linux/compiler.h can be used as reference.
> 
> Because node_weights[] may be changed simultaneously too.  We may need
> to consider similar issue for it too.  But RCU seems more appropriate
> for node_weights[].
> 

Weights are collected individually onto the stack because we have to sum
them up before we actually apply the weights.

A stale weight is not offensive.  RCU is not needed and doesn't help.

The reason the barrier is needed is not weights, it's the nodemask.

So you basically just want to replace barrier() with this and drop the
copy/pasted comments:

static void read_once_policy_nodemask(struct mempolicy *pol, nodemask_t *mask)
{
        /*
         * The barrier will stabilize the nodemask in a register or on
         * the stack so that it will stop changing under the code.
         *
         * Between first_node() and next_node(), pol->nodes could be changed
         * by other threads. So we put pol->nodes in a local stack.
         */
        barrier();
        __builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
        barrier();
}

- nodemask_t nodemask = pol->nodemask
- barrier()
+ nodemask_t nodemask;
+ read_once_policy_nodemask(pol, &nodemask)

Is that right?

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
  2024-01-02  9:09         ` Huang, Ying
@ 2024-01-02 20:32           ` Gregory Price
  0 siblings, 0 replies; 46+ messages in thread
From: Gregory Price @ 2024-01-02 20:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

On Tue, Jan 02, 2024 at 05:09:55PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Tue, Dec 26, 2023 at 02:05:35AM -0500, Gregory Price wrote:
> >> On Wed, Dec 27, 2023 at 04:39:29PM +0800, Huang, Ying wrote:
> >> > Gregory Price <gourry.memverge@gmail.com> writes:
> >> > 
> >> > > +	unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS);
> >> > > +
> >> > > +	*flags = *mode_arg & MPOL_MODE_FLAGS;
> >> > > +	*mode_arg = mode;
> >> > 
> >> > It appears that it's unnecessary to introduce a local variable to split
> >> > mode/flags.  Just reuse the original code?
> >> > 
> >
> > Revisiting during fixes: Note the change from int to short.
> >
> > I chose to make this explicit because validate_mpol_flags takes a short.
> >
> > I'm fairly sure changing it back throws a truncation warning.
> 
> Why something like below doesn't work?
> 
> int sanitize_mpol_flags(int *mode, unsigned short *flags)
> {
>         *flags = *mode & MPOL_MODE_FLAGS;
>         *mode &= ~MPOL_MODE_FLAGS;
> 
>         return validate_mpol_flags(*mode, flags);
> }

I was concerned with silent truncation of (*mode) (an int) to a short.

*shrug* happy to change it

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2024-01-02 19:45         ` Gregory Price
@ 2024-01-03  2:45           ` Huang, Ying
  2024-01-03  2:59             ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2024-01-03  2:45 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Jan 02, 2024 at 03:41:08PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> That is, if we use "1" as default weight, we need to change weights of
>> nodes frequently because we haven't a "base" weight.  The best candidate
>> base weight is the weight of DRAM node.  For example, if we set the
>> default weight of DRAM node to be "16" and use that as the base weight,
>> we don't need to change it in most cases.  The weight of other nodes can
>> be set according to the ratio of its memory bandwidth to that of DRAM.
>> 
>> This makes it easy to set the default weight via HMAT/CDAT too.
>> 
>> What do you think about that?
>> 
>
> You're getting a bit ahead of the patch set.  There is "what is a
> reasonable default weight" and "what is the minimum functionality".

I totally agree that we need the minimal functionality first.

> The minimum functionality is everything receiving a default weight of 1,
> such that weighted interleave's behavior defaults to round-robin
> interleave. This gets the system off the ground.

I don't think that we need to implement all functionalities now.  But
we may need to consider more, especially anything that may impact the
user space interface.  The default base weight is one such thing.  If we
change the default base weight from "1" to "16" later, users may be
surprised.  So I think it's better to discuss it now.

> We can then expose an internal interface to drivers for them to set the
> default weight to some reasonable number during system and device
> initialization. The question at that point is what system is responsible
> for setting the default weights... node? cxl? anything? What happens on
> hotplug? etc.  That seems outside the scope of this patch set.
>
>
> If you want me to add the default_iw_table with special value 0 denoting
> "use default" at each layer, I can do that.
>
> The basic change is this snippet:
> ```
> if (pol->flags & MPOL_F_GWEIGHT)
> 	pol_weights = iw_table;
> else
> 	pol_weights = pol->wil.weights;
>
> for_each_node_mask(nid, nodemask) {
> 	weight = pol_weights[nid];
> 	weight_total += weight;
> 	weights[nid] = weight;
> }
> ```
>
> changes to:
> ```
> for_each_node_mask(nid, nodemask) {
> 	weight = pol->wil.weights[node]
> 	if (!weight)
> 		weight = iw_table[node]
> 	if (!weight)
> 		weight = default_iw_table[node]
> 	weight_total += weight;
> 	weights[nid] = weight
> }
> ```
>
> It's a bit ugly,

We can use a wrapper function to hide the logic.
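For example, something like the following (the helper name is made up;
it just packages the fallback chain from your snippet above):

static u8 weighted_interleave_weight(struct mempolicy *pol, int node)
{
	u8 weight = pol->wil.weights[node];

	if (!weight)
		weight = iw_table[node];
	if (!weight)
		weight = default_iw_table[node];

	return weight;
}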

> but it allows a 0 value to represent "use default",
> and default_iw_table just ends up being initialized to `1` for now.

Because the contents of default_iw_table would just be the default
weight for now, we don't need it yet.  We can add it later.

> I think it also allows MPOL_F_GWEIGHT to be eliminated.

Do we need a way to distinguish whether to copy the global weights to
local weights when the memory policy is created?  That is, when the
global weights are changed later, will the changes be used?  One
possible solution is

- If no weights are specified in set_mempolicy2(), the global weights
  will be used always.

- If at least one weight is specified in set_mempolicy2(), it will be
  used, and the other weights in global weights will be copied to the
  local weights.  That is, changes to the global weights will not be
  used.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2024-01-02  7:41       ` Huang, Ying
  2024-01-02 19:45         ` Gregory Price
@ 2024-01-03  2:46         ` Gregory Price
  1 sibling, 0 replies; 46+ messages in thread
From: Gregory Price @ 2024-01-03  2:46 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

On Tue, Jan 02, 2024 at 03:41:08PM +0800, Huang, Ying wrote:
> Think about the default weight value via HMAT/CDAT again.  It may be not
> a good idea to use "1" as default even for now.
> 
> For example,
> 
> - The memory bandwidth of DRAM is 100GB, whose default weight is "1".
> 
> - We hot-plug CXL.mem A with memory bandwidth 20GB.  So, we change the
>   weight of DRAM to 5, and use "1" as the weight of CXL.mem A.
> 
> - We hot-plug CXL.mem B with memory bandwidth 10GB.  So, we change the
>   weight of DRAM to 10, the weight of CXL.mem A to 2, and use "1" as the
>   weight of CXL.mem B.
> 
> That is, if we use "1" as default weight, we need to change weights of
> nodes frequently because we haven't a "base" weight.  The best candidate
> base weight is the weight of DRAM node.  For example, if we set the
> default weight of DRAM node to be "16" and use that as the base weight,
> we don't need to change it in most cases.  The weight of other nodes can
> be set according to the ratio of its memory bandwidth to that of DRAM.
> 
> This makes it easy to set the default weight via HMAT/CDAT too.
> 
> What do you think about that?
> 

Giving this more thought.

Hotplug should be an incredibly rare event. I don't think swapping defaults
"frequently" is a real problem we should handle.

It's expected that dynamic capacity devices will not cause a node to
hotplug, but instead cause a node to grow/shrink.

Seems perfectly fine to rebalance weights in response to rare events.

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2024-01-03  2:45           ` Huang, Ying
@ 2024-01-03  2:59             ` Gregory Price
  2024-01-03  6:03               ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2024-01-03  2:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

On Wed, Jan 03, 2024 at 10:45:53AM +0800, Huang, Ying wrote:
> 
> > The minimum functionality is everything receiving a default weight of 1,
> > such that weighted interleave's behavior defaults to round-robin
> > interleave. This gets the system off the ground.
> 
> I don't think that we need to implement all functionalities now.  But,
> we may need to consider more especially if it may impact the user space
> interface.  The default base weight is something like that.  If we
> change the default base weight from "1" to "16" later, users may be
> surprised.  So, I think it's better to discuss it now.
>

This is a hill I don't particularly care to die on.  I think the weights
are likely to end up being set at boot and rebalanced as (rare) hotplug
events occur.

So if people think the default weight should be 3, 16, 24, or 123, I don't
think it's going to matter.

> 
> We can use a wrapper function to hide the logic.
>

Done.  I'll push a new set tomorrow.

> > I think it also allows MPOL_F_GWEIGHT to be eliminated.
> 
> Do we need a way to distinguish whether to copy the global weights to
> local weights when the memory policy is created?  That is, when the
> global weights are changed later, will the changes be used?  One
> possible solution is
> 
> - If no weights are specified in set_mempolicy2(), the global weights
>   will be used always.
> 
> - If at least one weight is specified in set_mempolicy2(), it will be
>   used, and the other weights in global weights will be copied to the
>   local weights.  That is, changes to the global weights will not be
>   used.
> 

What's confusing about that is that if a user sets a weight to 0,
they'll get a non-0 weight - always.

In my opinion, if we want to make '0' mean 'use system default', then
it should mean 'ALWAYS use system default for this node'.

"Use the system default at the time the syscall was called, and do not
update to use a new system default if that default is changed" is
confusing.

If you say use a global value, use the global value. Simple.

> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave
  2024-01-02 19:06       ` Gregory Price
@ 2024-01-03  3:15         ` Huang, Ying
  0 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2024-01-03  3:15 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Johannes Weiner,
	Hasan Al Maruf, Hao Wang, Dan Williams, Michal Hocko,
	Zhongkun He, Frank van der Linden, John Groves, Jonathan Cameron

Gregory Price <gregory.price@memverge.com> writes:

>> >> > struct mpol_args {
>> >> >         /* Basic mempolicy settings */
>> >> >         __u16 mode;
>> >> >         __u16 mode_flags;
>> >> >         __s32 home_node;
>> >> >         __u64 pol_maxnodes;
>> >> 
>> >> I understand that we want to avoid hole in struct.  But I still feel
>> >> uncomfortable to use __u64 for a small value.  But I don't have a solution either.
>> >> Anyone else has some idea?
>> >>
>> >
>> > maxnode has been an `unsigned long` in every other interface for quite
>> > some time.  Seems better to keep this consistent rather than it suddenly
>> > become `unsigned long` over here and `unsigned short` over there.
>> 
>> I don't think that it matters.  The actual maximum node number will be
>> less than maximum `unsigned short`.
>> 
>
> the structure will end up being
>
> struct mpol_args {
> 	__u16 mode;
> 	__u16 mode_flags;
> 	__s32 home_node;
> 	__u16 pol_maxnodes;
> 	__u8  rsv[6];
> 	__aligned_u64 pol_nodes;
> 	__aligned_u64 il_weights;
> }
>
> If you're fine with that, i'll make the change.

This looks OK to me.  But I don't know whether others think this is
better.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2024-01-02 20:30         ` Gregory Price
@ 2024-01-03  5:46           ` Huang, Ying
  2024-01-03 22:09             ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2024-01-03  5:46 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry.memverge@gmail.com> writes:
>> >> 
>> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> >> > +{
>> >> > +	nodemask_t nodemask = pol->nodes;
>> >> > +	unsigned int target, weight_total = 0;
>> >> > +	int nid;
>> >> > +	unsigned char weights[MAX_NUMNODES];
>> >> 
>> >> MAX_NUMNODES could be as large as 1024.  1KB stack space may be too
>> >> large?
>> >> 
>> >
>> > I've been struggling with a good solution to this.  We need a local copy
>> > of weights to prevent weights from changing out from under us during
>> > allocation (which may take quite some time), but it seemed unwise to
>> > allocate 1KB of heap in this particular path.
>> >
>> > Is my concern unfounded?  If so, I can go ahead and add the allocation
>> > code.
>> 
>> Please take a look at NODEMASK_ALLOC().
>>
>
> This is not my question. NODEMASK_ALLOC calls kmalloc/kfree. 
>
> Some of the allocations on the stack can be replaced with a scratch
> allocation, that's no big deal.
>
> I'm specifically concerned about:
> 	weighted_interleave_nid
> 	alloc_pages_bulk_array_weighted_interleave
>
> I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> contexts. If kmalloc/kfree is safe fine, this problem is trivial.
>
> If not, there is no good solution to this without pre-allocating a
> scratch area per-task.

You need to audit whether it's safe for all callers.  I guess that you
need to allocate pages after calling this, so you can use the same GFP flags
here.
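
As a sketch only (variable names assumed, not taken from your patch),
the bulk allocation path could do something like:

	u8 *weights;

	weights = kmalloc_array(nr_node_ids, sizeof(*weights), gfp);
	if (!weights)
		return total_allocated;	/* or fall back to unweighted interleave */

	/* snapshot the weights, do the weighted distribution ... */

	kfree(weights);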

>> >> I don't think barrier() is needed to wait for memory operations for
>> >> stack.  It's usually used for cross-processor memory order.
>> >>
>> >
>> > This is present in the old interleave code.  To the best of my
>> > understanding, the concern is for mempolicy->nodemask rebinding that can
>> > occur when cgroups.cpusets.mems_allowed changes.
>> >
>> > so we can't iterate over (mempolicy->nodemask), we have to take a local
>> > copy.
>> >
>> > My *best* understanding of the barrier here is to prevent the compiler
>> > from reordering operations such that it attempts to optimize out the
>> > local copy (or do lazy-fetch).
>> >
>> > It is present in the original interleave code, so I pulled it forward to
>> > this, but I have not tested whether this is a bit paranoid or not.
>> >
>> > from `interleave_nid`:
>> >
>> >  /*
>> >   * The barrier will stabilize the nodemask in a register or on
>> >   * the stack so that it will stop changing under the code.
>> >   *
>> >   * Between first_node() and next_node(), pol->nodes could be changed
>> >   * by other threads. So we put pol->nodes in a local stack.
>> >   */
>> >  barrier();
>> 
>> Got it.  This is kind of READ_ONCE() for nodemask.  To avoid to add
>> comments all over the place.  Can we implement a wrapper for it?  For
>> example, memcpy_once().  __read_once_size() in
>> tools/include/linux/compiler.h can be used as reference.
>> 
>> Because node_weights[] may be changed simultaneously too.  We may need
>> to consider similar issue for it too.  But RCU seems more appropriate
>> for node_weights[].
>> 
>
> Weights are collected individually onto the stack because we have to sum
> them up before we actually apply the weights.
>
> A stale weight is not offensive.  RCU is not needed and doesn't help.

When you copy weights from iw_table[] to the stack, it's possible for
the compiler to cache its contents in registers, or to merge or split
the memory operations.  At the same time, iw_table[] may be changed
simultaneously via the sysfs interface.  So, we need a mechanism to
guarantee that we read the latest contents consistently.

> The reason the barrier is needed is not weights, it's the nodemask.

Yes.  So I said that we need similar stuff for weights.

> So you basically just want to replace barrier() with this and drop the
> copy/pasted comments:
>
> static void read_once_policy_nodemask(struct mempolicy *pol, nodemask_t *mask)
> {
>         /*
>          * The barrier will stabilize the nodemask in a register or on
>          * the stack so that it will stop changing under the code.
>          *
>          * Between first_node() and next_node(), pol->nodes could be changed
>          * by other threads. So we put pol->nodes in a local stack.
>          */
>         barrier();
>         __builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
>         barrier();
> }
>
> - nodemask_t nodemask = pol->nodemask
> - barrier()
> + nodemask_t nodemask;
> + read_once_policy_nodemask(pol, &nodemask)
>
> Is that right?

Yes.  Something like that.  Or something even more general (it may need to be optimized):

static inline void memcpy_once(void *dst, const void *src, size_t n)
{
        barrier();
        memcpy(dst, src, n);
        barrier();
}

        memcpy_once(&nodemask, &pol->nodemask, sizeof(nodemask));

The comments can be based on that of READ_ONCE().
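To make the weight case concrete, here is a minimal sketch of the same
wrapper applied to the weight copy (assuming a plain global
u8 iw_table[MAX_NUMNODES]; the helper name is just illustrative):

static void read_once_iw_table(u8 *dst)
{
        /*
         * Snapshot the global weight table so that the later summation
         * sees one consistent copy, even if the sysfs interface rewrites
         * iw_table[] concurrently.
         */
        memcpy_once(dst, iw_table, MAX_NUMNODES * sizeof(*dst));
}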

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface
  2024-01-03  2:59             ` Gregory Price
@ 2024-01-03  6:03               ` Huang, Ying
  0 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2024-01-03  6:03 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha

Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Jan 03, 2024 at 10:45:53AM +0800, Huang, Ying wrote:
>> 
>> > The minimum functionality is everything receiving a default weight of 1,
>> > such that weighted interleave's behavior defaults to round-robin
>> > interleave. This gets the system off the ground.
>> 
>> I don't think that we need to implement all functionalities now.  But,
>> we may need to consider more especially if it may impact the user space
>> interface.  The default base weight is something like that.  If we
>> change the default base weight from "1" to "16" later, users may be
>> surprised.  So, I think it's better to discuss it now.
>>
>
> This is a hill I don't particularly care to die on.  I think the weights
> are likely to end up being set at boot and rebalanced as (rare) hotplug
> events occur.
>
> So if people think the default weight should be 3, 16, 24, or 123, I don't
> think it's going to matter.
>
>> 
>> We can use a wrapper function to hide the logic.
>>
>
> Done.  I'll push a new set tomorrow.
>
>> > I think it also allows MPOL_F_GWEIGHT to be eliminated.
>> 
>> Do we need a way to distinguish whether to copy the global weights to
>> local weights when the memory policy is created?  That is, when the
>> global weights are changed later, will the changes be used?  One
>> possible solution is
>> 
>> - If no weights are specified in set_mempolicy2(), the global weights
>>   will be used always.
>> 
>> - If at least one weight is specified in set_mempolicy2(), it will be
>>   used, and the other weights in global weights will be copied to the
>>   local weights.  That is, changes to the global weights will not be
>>   used.
>> 
>
> What's confusing about that is that if a user sets a weight to 0,
> they'll get a non-0 weight - always.
>
> In my opinion, if we want to make '0' mean 'use system default', then
> it should mean 'ALWAYS use system default for this node'.
>
> "Use the system default at the time the syscall was called, and do not
> update to use a new system default if that default is changed" is
> confusing.
>
> If you say use a global value, use the global value. Simple.
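
For reference, that semantic would resolve the effective weight per node
roughly as below (a sketch; the field and table names are illustrative,
not from the patch):

static u8 effective_weight(struct mempolicy *pol, int nid)
{
        u8 w = pol->wil.weights[nid];

        /* 0 means "always use the current system default for this node". */
        return w ? w : READ_ONCE(iw_table[nid]);
}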

I mainly have concerns about consistency.  The global weights can be
changed while the local weights are fixed.  For example,

- Weights of nodes 0 and 1 are [3, 1] initially

- Process A calls set_mempolicy2() to set weights to [4, 0], that is, use
  the default weight for node 1.

- After hotplug, the node weights are changed to [12, 4, 1]; now the
  effective weights used in process A become [4, 4], which is hardly
  desirable.

Another choice is to disallow "0" as a weight in set_mempolicy2().
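
If we choose that route, the check at set_mempolicy2() time is trivial
(a sketch; the function and parameter names are illustrative):

static int validate_weights(const u8 *weights, unsigned int nr_weights)
{
        unsigned int i;

        for (i = 0; i < nr_weights; i++) {
                /* Reject 0 so every local weight is explicit and stable. */
                if (!weights[i])
                        return -EINVAL;
        }
        return 0;
}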

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2024-01-03  5:46           ` Huang, Ying
@ 2024-01-03 22:09             ` Gregory Price
  2024-01-04  5:39               ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2024-01-03 22:09 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> > I'm specifically concerned about:
> > 	weighted_interleave_nid
> > 	alloc_pages_bulk_array_weighted_interleave
> >
> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> > contexts. If kmalloc/kfree is safe fine, this problem is trivial.
> >
> > If not, there is no good solution to this without pre-allocating a
> > scratch area per-task.
> 
> You need to audit whether it's safe for all callers.  I guess that you
> need to allocate pages after calling, so you can use the same GFP flags
> here.
> 

After picking away at this, I realized that this code is usually going
to get called during page fault handling - duh.  So kmalloc is almost
never safe (or can fail), and it's nasty to try to handle those errors.

Instead of doing that, I simply chose to implement the scratch space
in the mempolicy structure

mempolicy->wil.scratch_weights[MAX_NUMNODES].

We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
scratch space we can use any time the task is allocating memory, and
prevents the need for any fancy error handling.  That seems like a
perfectly reasonable tradeoff.
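
Roughly, the layout looks like this (a sketch with illustrative type
names; only the wil.scratch_weights field is what I described above):

struct weighted_interleave_state {
        /*
         * Per-task scratch copy of the node weights, only touched while
         * this task is in the allocation path; ~1KB when MAX_NUMNODES
         * is 1024.
         */
        u8 scratch_weights[MAX_NUMNODES];
};

struct mempolicy {
        /* ... existing fields ... */
        struct weighted_interleave_state wil;
};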

> >
> > Weights are collected individually onto the stack because we have to sum
> > them up before we actually apply the weights.
> >
> > A stale weight is not offensive.  RCU is not needed and doesn't help.
> 
> When you copy weights from iw_table[] to the stack, it's possible for the
> compiler to cache its contents in a register, or to merge or split the
> memory operations.  At the same time, iw_table[] may be changed
> simultaneously via the sysfs interface.  So, we need a mechanism to
> guarantee that we read the latest contents consistently.
> 

Fair enough, I went ahead and added a similar interaction.

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2024-01-03 22:09             ` Gregory Price
@ 2024-01-04  5:39               ` Huang, Ying
  2024-01-04 18:59                 ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2024-01-04  5:39 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> > I'm specifically concerned about:
>> > 	weighted_interleave_nid
>> > 	alloc_pages_bulk_array_weighted_interleave
>> >
>> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
>> > contexts. If kmalloc/kfree is safe fine, this problem is trivial.
>> >
>> > If not, there is no good solution to this without pre-allocating a
>> > scratch area per-task.
>> 
>> You need to audit whether it's safe for all callers.  I guess that you
>> need to allocate pages after calling, so you can use the same GFP flags
>> here.
>> 
>
> After picking away at this, I realized that this code is usually going
> to get called during page fault handling - duh.  So kmalloc is almost
> never safe (or can fail), and it's nasty to try to handle those errors.

Why not just OOM for allocation failure?

> Instead of doing that, I simply chose to implement the scratch space
> in the mempolicy structure
>
> mempolicy->wil.scratch_weights[MAX_NUMNODES].
>
> We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
> scratch space we can use any time the task is allocating memory, and
> prevents the need for any fancy error handling.  That seems like a
> perfectly reasonable tradeoff.

I don't think that this is a good idea.  The weight array is temporary.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2024-01-04  5:39               ` Huang, Ying
@ 2024-01-04 18:59                 ` Gregory Price
  2024-01-05  6:51                   ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2024-01-04 18:59 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

On Thu, Jan 04, 2024 at 01:39:31PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
> >> Gregory Price <gregory.price@memverge.com> writes:
> >> > I'm specifically concerned about:
> >> > 	weighted_interleave_nid
> >> > 	alloc_pages_bulk_array_weighted_interleave
> >> >
> >> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> >> > contexts. If kmalloc/kfree is safe fine, this problem is trivial.
> >> >
> >> > If not, there is no good solution to this without pre-allocating a
> >> > scratch area per-task.
> >> 
> >> You need to audit whether it's safe for all callers.  I guess that you
> >> need to allocate pages after calling, so you can use the same GFP flags
> >> here.
> >> 
> >
> > After picking away at this, I realized that this code is usually going
> > to get called during page fault handling - duh.  So kmalloc is almost
> > never safe (or can fail), and it's nasty to try to handle those errors.
> 
> Why not just OOM for allocation failure?
>

2 notes:

1) callers of weighted_interleave_nid do not expect OOM conditions, they
   expect a node selection.  On error, we would simply return the local
   numa node without indication of failure.

2) callers of alloc_pages_bulk_array_weighted_interleave receive the
   total number of pages allocated, and they are expected to detect
   pages allocated != pages requested, and then handle whether to
   OOM or simply retry (allocation may fail for a variety of reasons).

By introducing an allocation into this area, if an allocation failure
occurs, we would essentially need to silently squash it and return
either local_node (interleave_nid) or return 0 (bulk allocator) and
allow the allocation logic to handle any subsequent OOM condition.

That felt less desirable than just allocating a scratch space up front
in the mempolicy and avoiding the issue altogether.
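
To illustrate the squashing, the nid-selection path would have to
degrade roughly like this (a sketch, not the patch code; the node and
GFP choices here are only to show the silent fallback):

static unsigned int weighted_interleave_nid_fallback(struct mempolicy *pol)
{
        u8 *weights = kmalloc(MAX_NUMNODES, GFP_NOWAIT | __GFP_NOWARN);
        unsigned int nid;

        if (!weights) {
                /* Silently squash the failure: fall back to the local node. */
                return numa_node_id();
        }

        /* ... snapshot the weights and pick the weighted target node ... */
        nid = first_node(pol->nodes);

        kfree(weights);
        return nid;
}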

> > Instead of doing that, I simply chose to implement the scratch space
> > in the mempolicy structure
> >
> > mempolicy->wil.scratch_weights[MAX_NUMNODES].
> >
> > We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
> > scratch space we can use any time the task is allocating memory, and
> > prevents the need for any fancy error handling.  That seems like a
> > perfectly reasonable tradeoff.
> 
> I don't think that this is a good idea.  The weight array is temporary.
> 

It's temporary, but it's also only used in the context of the task while
the alloc lock is held.

If you think it's fine to introduce another potential OOM generating
spot, then I'll just go ahead and allocate the memory on the fly.

I do want to point out, though, that weighted_interleave_nid is called
per allocated page.  So now we're not just collecting weights to
calculate the offset, we're doing an allocation (that can fail) per page
allocated for that region.

The bulk allocator amortizes the cost of this allocation by doing it
once while allocating a chunk of pages - but the weighted_interleave_nid
function is called per-page.

By comparison, the memory cost to just allocate a static scratch area in
the mempolicy struct is only incurred by tasks with a mempolicy.


So we're talking ~1MB for 1024 threads with mempolicies to avoid error
conditions mid-page-allocation and to reduce the cost associated with
applying weighted interleave.
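(For scale: assuming MAX_NUMNODES is 1024 and one byte per node weight,
that's 1 KB per mempolicy, so 1024 tasks * 1 KB is about 1 MB.)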

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2024-01-04 18:59                 ` Gregory Price
@ 2024-01-05  6:51                   ` Huang, Ying
  2024-01-05  7:25                     ` Gregory Price
  0 siblings, 1 reply; 46+ messages in thread
From: Huang, Ying @ 2024-01-05  6:51 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

Gregory Price <gregory.price@memverge.com> writes:

> On Thu, Jan 04, 2024 at 01:39:31PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gregory.price@memverge.com> writes:
>> >> > I'm specifically concerned about:
>> >> > 	weighted_interleave_nid
>> >> > 	alloc_pages_bulk_array_weighted_interleave
>> >> >
>> >> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
>> >> > contexts. If kmalloc/kfree is safe fine, this problem is trivial.
>> >> >
>> >> > If not, there is no good solution to this without pre-allocating a
>> >> > scratch area per-task.
>> >> 
>> >> You need to audit whether it's safe for all callers.  I guess that you
>> >> need to allocate pages after calling, so you can use the same GFP flags
>> >> here.
>> >> 
>> >
>> > After picking away at this, I realized that this code is usually going
>> > to get called during page fault handling - duh.  So kmalloc is almost
>> > never safe (or can fail), and it's nasty to try to handle those errors.
>> 
>> Why not just OOM for allocation failure?
>>
>
> 2 notes:
>
> 1) callers of weighted_interleave_nid do not expect OOM conditions, they
>    expect a node selection.  On error, we would simply return the local
>    numa node without indication of failure.
>
> 2) callers of alloc_pages_bulk_array_weighted_interleave receive the
>    total number of pages allocated, and they are expected to detect
>    pages allocated != pages requested, and then handle whether to
>    OOM or simply retry (allocation may fail for a variety of reasons).
>
> By introducing an allocation into this area, if an allocation failure
> occurs, we would essentially need to silently squash it and return
> either local_node (interleave_nid) or return 0 (bulk allocator) and
> allow the allocation logic to handle any subsequent OOM condition.
>
> That felt less desirable than just allocating a scratch space up front
> in the mempolicy and avoiding the issue altogether.
>
>> > Instead of doing that, I simply chose to implement the scratch space
>> > in the mempolicy structure
>> >
>> > mempolicy->wil.scratch_weights[MAX_NUMNODES].
>> >
>> > We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
>> > scratch space we can use any time the task is allocating memory, and
>> > prevents the need for any fancy error handling.  That seems like a
>> > perfectly reasonable tradeoff.
>> 
>> I don't think that this is a good idea.  The weight array is temporary.
>> 
>
> It's temporary, but it's also only used in the context of the task while
> the alloc lock is held.
>
> If you think it's fine to introduce another potential OOM generating
> spot, then I'll just go ahead and allocate the memory on the fly.
>
> I do want to point out, though, that weighted_interleave_nid is called
> per allocated page.  So now we're not just collecting weights to
> calculate the offset, we're doing an allocation (that can fail) per page
> allocated for that region.
>
> The bulk allocator amortizes the cost of this allocation by doing it
> once while allocating a chunk of pages - but the weighted_interleave_nid
> function is called per-page.
>
> By comparison, the memory cost to just allocate a static scratch area in
> the mempolicy struct is only incurred by tasks with a mempolicy.
>
>
> So we're talking ~1MB for 1024 threads with mempolicies to avoid error
> conditions mid-page-allocation and to reduce the cost associated with
> applying weighted interleave.

Think about this again.  Why do we need weights array on stack?  I think
this is used to keep weights consistent.  If so, we don't need weights
array on stack.  Just use RCU to access global weights array.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2024-01-05  6:51                   ` Huang, Ying
@ 2024-01-05  7:25                     ` Gregory Price
  2024-01-08  7:08                       ` Huang, Ying
  0 siblings, 1 reply; 46+ messages in thread
From: Gregory Price @ 2024-01-05  7:25 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

On Fri, Jan 05, 2024 at 02:51:40PM +0800, Huang, Ying wrote:
> >
> > So we're talking ~1MB for 1024 threads with mempolicies to avoid error
> > conditions mid-page-allocation and to reduce the cost associated with
> > applying weighted interleave.
> 
> Think about this again.  Why do we need weights array on stack?  I think
> this is used to keep weights consistent.  If so, we don't need weights
> array on stack.  Just use RCU to access global weights array.
> 

From the bulk allocation code:

__alloc_pages_bulk(gfp, node, NULL, node_pages, NULL, page_array);

This function can block. You cannot block during an RCU read context.
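
For reference, the pattern that RCU would have to cover looks like this
(a deliberately wrong sketch to show the problem):

static unsigned long bulk_alloc_under_rcu(gfp_t gfp, int node, int node_pages,
                                          struct page **page_array)
{
        unsigned long nr;

        rcu_read_lock();        /* non-sleepable read-side section */
        nr = __alloc_pages_bulk(gfp, node, NULL, node_pages, NULL, page_array);
        rcu_read_unlock();      /* BUG: the bulk allocator may have slept */

        return nr;
}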

~Gregory

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
  2024-01-05  7:25                     ` Gregory Price
@ 2024-01-08  7:08                       ` Huang, Ying
  0 siblings, 0 replies; 46+ messages in thread
From: Huang, Ying @ 2024-01-08  7:08 UTC (permalink / raw)
  To: Gregory Price
  Cc: Gregory Price, linux-mm, linux-doc, linux-fsdevel, linux-kernel,
	linux-api, x86, akpm, arnd, tglx, luto, mingo, bp, dave.hansen,
	hpa, mhocko, tj, corbet, rakie.kim, hyeongtak.ji, honggyu.kim,
	vtavarespetr, peterz, jgroves, ravis.opensrc, sthanneeru,
	emirakhur, Hasan.Maruf, seungjun.ha, Srinivasulu Thanneeru

Gregory Price <gregory.price@memverge.com> writes:

> On Fri, Jan 05, 2024 at 02:51:40PM +0800, Huang, Ying wrote:
>> >
>> > So we're talking ~1MB for 1024 threads with mempolicies to avoid error
>> > conditions mid-page-allocation and to reduce the cost associated with
>> > applying weighted interleave.
>> 
>> Think about this again.  Why do we need weights array on stack?  I think
>> this is used to keep weights consistent.  If so, we don't need weights
>> array on stack.  Just use RCU to access global weights array.
>> 
>
> From the bulk allocation code:
>
> __alloc_pages_bulk(gfp, node, NULL, node_pages, NULL, page_array);
>
> This function can block. You cannot block during an RCU read context.

Yes.  You are right.  For __alloc_pages_bulk(), it should be OK to
allocate the weights array.  For weighted_interleave_nid(), we can use
RCU to avoid memory allocation in the relatively fast code path.

BTW, we can use nr_node_ids instead of MAX_NUMNODES if applicable.
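
For the fast path, a minimal sketch of the RCU-side read (assuming the
global table stays a plain u8 iw_table[]; the helper name is
illustrative):

static u8 wi_read_weight(int nid)
{
        u8 weight;

        /*
         * Keep the read-side section short and non-sleeping: only the
         * table access is covered, never a page allocation.
         */
        rcu_read_lock();
        weight = READ_ONCE(iw_table[nid]);
        rcu_read_unlock();

        return weight ? weight : 1;     /* 0 falls back to the default of 1 */
}

And on the bulk path the scratch array can then be sized with
nr_node_ids, e.g. kmalloc_array(nr_node_ids, sizeof(u8), gfp), instead
of MAX_NUMNODES.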

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2024-01-08  7:10 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-23 18:10 [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave Gregory Price
2023-12-23 18:10 ` [PATCH v5 01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface Gregory Price
2023-12-27  6:42   ` Huang, Ying
2023-12-26  6:48     ` Gregory Price
2024-01-02  7:41       ` Huang, Ying
2024-01-02 19:45         ` Gregory Price
2024-01-03  2:45           ` Huang, Ying
2024-01-03  2:59             ` Gregory Price
2024-01-03  6:03               ` Huang, Ying
2024-01-03  2:46         ` Gregory Price
2023-12-23 18:10 ` [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Gregory Price
2023-12-27  8:32   ` Huang, Ying
2023-12-26  7:01     ` Gregory Price
2023-12-26  8:06       ` Gregory Price
2023-12-26 11:32       ` Gregory Price
2024-01-02  8:42       ` Huang, Ying
2024-01-02 20:30         ` Gregory Price
2024-01-03  5:46           ` Huang, Ying
2024-01-03 22:09             ` Gregory Price
2024-01-04  5:39               ` Huang, Ying
2024-01-04 18:59                 ` Gregory Price
2024-01-05  6:51                   ` Huang, Ying
2024-01-05  7:25                     ` Gregory Price
2024-01-08  7:08                       ` Huang, Ying
2023-12-23 18:10 ` [PATCH v5 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse Gregory Price
2023-12-27  8:39   ` Huang, Ying
2023-12-26  7:05     ` Gregory Price
2023-12-26 11:48       ` Gregory Price
2024-01-02  9:09         ` Huang, Ying
2024-01-02 20:32           ` Gregory Price
2023-12-23 18:10 ` [PATCH v5 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies Gregory Price
2023-12-23 18:10 ` [PATCH v5 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use Gregory Price
2023-12-23 18:10 ` [PATCH v5 06/11] mm/mempolicy: allow home_node to be set by mpol_new Gregory Price
2023-12-23 18:10 ` [PATCH v5 07/11] mm/mempolicy: add userland mempolicy arg structure Gregory Price
2023-12-23 18:10 ` [PATCH v5 08/11] mm/mempolicy: add set_mempolicy2 syscall Gregory Price
2024-01-02 14:38   ` Geert Uytterhoeven
2023-12-23 18:10 ` [PATCH v5 09/11] mm/mempolicy: add get_mempolicy2 syscall Gregory Price
2024-01-02 14:46   ` Geert Uytterhoeven
2023-12-23 18:11 ` [PATCH v5 10/11] mm/mempolicy: add the mbind2 syscall Gregory Price
2024-01-02 14:47   ` Geert Uytterhoeven
2023-12-23 18:11 ` [PATCH v5 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave Gregory Price
2023-12-25  7:54 ` [PATCH v5 00/11] mempolicy2, mbind2, and " Huang, Ying
2023-12-26  7:45   ` Gregory Price
2024-01-02  4:27     ` Huang, Ying
2024-01-02 19:06       ` Gregory Price
2024-01-03  3:15         ` Huang, Ying

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).