From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price <gourry.memverge@gmail.com>
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org, x86@kernel.org,
akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de,
luto@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com, mhocko@kernel.org,
tj@kernel.org, gregory.price@memverge.com, corbet@lwn.net,
rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com,
vtavarespetr@micron.com, peterz@infradead.org,
jgroves@micron.com, ravis.opensrc@micron.com,
sthanneeru@micron.com, emirakhur@micron.com,
Hasan.Maruf@amd.com, seungjun.ha@samsung.com,
Johannes Weiner <hannes@cmpxchg.org>,
Hasan Al Maruf <hasanalmaruf@fb.com>,
Hao Wang <haowang3@fb.com>,
Dan Williams <dan.j.williams@intel.com>,
Michal Hocko <mhocko@suse.com>,
Zhongkun He <hezhongkun.hzk@bytedance.com>,
Frank van der Linden <fvdl@google.com>,
John Groves <john@jagalactic.com>,
Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Subject: Re: [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave
Date: Mon, 25 Dec 2023 15:54:18 +0800
Message-ID: <87frzqg1jp.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <20231223181101.1954-1-gregory.price@memverge.com> (Gregory Price's message of "Sat, 23 Dec 2023 13:10:50 -0500")
Gregory Price <gourry.memverge@gmail.com> writes:
> Weighted interleave is a new interleave policy intended to make
> use of the new distributed-memory environment made available
> by CXL. The existing interleave mechanism does an even round-robin
> distribution of memory across all nodes in a nodemask, while
> weighted interleave can distribute memory across nodes according
> to the available bandwidth that each node provides.
>
> As tests below show, "default interleave" can cause major performance
> degradation due to distribution not matching bandwidth available,
> while "weighted interleave" can provide a performance increase.
>
> For example, the stream benchmark demonstrates that default interleave
> is actively harmful, where weighted interleave is beneficial.
>
> Hardware: 1-socket 8 channel DDR5 + 1 CXL expander in PCIe x16
> Default interleave : -78% (slower than DRAM)
> Global weighting : -6% to +4% (workload dependent)
> Targeted weights : +2.5% to +4% (consistently better than DRAM)
>
> If nothing else, this shows how awful round-robin interleave is.
I guess the default policy (local first, i.e. fast memory first) may
perform even better in some situations? For example, before the
bandwidth of DRAM is saturated?
I understand that you may want to limit the memory usage of the fast
memory too. But IMHO, that is a separate requirement. It should be
enforced by something like a per-node memory limit.
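To make the comparison in the quoted summary concrete, here is a minimal
userspace sketch of how a weighted interleave could map successive page
allocations to nodes. It is not taken from the patch set: the helper
names are hypothetical, and it assumes each weight means "this many
consecutive pages per cycle". With all weights equal to 1 it degenerates
to plain round-robin interleave.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch set): return the node that
 * receives the n-th page when node i gets weights[i] consecutive pages
 * per interleave cycle. Returns -1 if all weights are zero.
 */
static int weighted_il_node(const unsigned char *weights, size_t nr_nodes,
			    unsigned long n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < nr_nodes; i++)
		total += weights[i];
	if (total == 0)
		return -1;

	n %= total;	/* position inside the current cycle */
	for (i = 0; i < nr_nodes; i++) {
		if (n < weights[i])
			return (int)i;
		n -= weights[i];
	}
	return -1;	/* unreachable, because n < total */
}

/*
 * Count how many of nr_pages land on each node; counts[] must have
 * nr_nodes slots. Assumes at least one weight is non-zero.
 */
static void weighted_il_count(const unsigned char *weights, size_t nr_nodes,
			      unsigned long nr_pages, unsigned long *counts)
{
	unsigned long n;
	size_t i;

	for (i = 0; i < nr_nodes; i++)
		counts[i] = 0;
	for (n = 0; n < nr_pages; n++) {
		int node = weighted_il_node(weights, nr_nodes, n);

		assert(node >= 0);
		counts[node]++;
	}
}
```

With weights {4, 1} and 10 pages, node 0 gets 8 pages and node 1 gets 2.
With weights {1, 1} each node gets 5, which is what default interleave
does regardless of the bandwidth behind each node.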
> Rather than implement yet another specific syscall to set one
> particular field of a mempolicy, we chose to implement an extensible
> mempolicy interface so that future extensions can be captured.
>
> To implement weighted interleave, we need an interface to set the
> node weights along with a MPOL_WEIGHTED_INTERLEAVE. We implement
> a sysfs extension for "system global" weights which can be set by
> a daemon or administrator, and new extensible syscalls (mempolicy2,
> mbind2) which allow task-local weights to be set via user-software.
>
> The benefit of the sysfs extension is that MPOL_WEIGHTED_INTERLEAVE
> can be used by the existing set_mempolicy and mbind via numactl.
>
> There are 3 "phases" in the patch set that could be considered
> for separate merge candidates, but are presented here as a single
> line as the goal is a fully functional MPOL_WEIGHTED_INTERLEAVE.
>
> 1) Implement MPOL_WEIGHTED_INTERLEAVE with a sysfs extension for
> setting system-global weights via sysfs.
> (Patches 1 & 2)
>
> 2) Refactor mempolicy creation mechanism to use an extensible arg
> struct `struct mempolicy_args` to promote code re-use between
> the original mempolicy/mbind interfaces and the new interfaces.
> (Patches 3-6)
>
> 3) Implementation of set_mempolicy2, get_mempolicy2, and mbind2,
> along with the addition of task-local weights so that per-task
> weights can be registered for MPOL_WEIGHTED_INTERLEAVE.
> (Patches 7-11)
>
> Included at the bottom of this cover letter is linux test project
> tests for backward and forward compatibility, some sample software
> which can be used for quick tests, as well as a numactl branch
> which implements `numactl -w --interleave` for testing.
>
> = Performance summary =
> (tests may have different configurations, see extended info below)
> 1) MLC (W2) : +38% over DRAM. +264% over default interleave.
> MLC (W5) : +40% over DRAM. +226% over default interleave.
> 2) Stream : -6% to +4% over DRAM, +430% over default interleave.
> 3) XSBench : +19% over DRAM. +47% over default interleave.
>
> = LTP Testing Summary =
> existing mempolicy & mbind tests: pass
> mempolicy & mbind + weighted interleave (global weights): pass
> mempolicy2 & mbind2 + weighted interleave (global weights): pass
> mempolicy2 & mbind2 + weighted interleave (local weights): pass
>
[snip]
> =====================================================================
> (Patches 3-6) Refactoring mempolicy for code-reuse
>
> To avoid multiple paths of mempolicy creation, we should refactor the
> existing code to enable the designed extensibility, and refactor
> existing users to utilize the new interface (while retaining the
> existing userland interface).
>
> This set of patches introduces a new mempolicy_args structure, which
> is used to more fully describe a requested mempolicy - to include
> existing and future extensions.
>
> /*
>  * Describes settings of a mempolicy during set/get syscalls and
>  * kernel internal calls to do_set_mempolicy()
>  */
> struct mempolicy_args {
>     unsigned short mode;        /* policy mode */
>     unsigned short mode_flags;  /* policy mode flags */
>     int home_node;              /* mbind: use MPOL_MF_HOME_NODE */
>     nodemask_t *policy_nodes;   /* get/set/mbind */
>     unsigned char *il_weights;  /* for mode MPOL_WEIGHTED_INTERLEAVE */
> };
According to
https://www.geeksforgeeks.org/difference-between-argument-and-parameter-in-c-c-with-examples/
it appears that "parameter" is a better word than "argument" for the
struct name here. The current kernel source appears to prefer it too:
$ grep 'struct[\t ]\+[a-zA-Z0-9]\+_param' -r include/linux | wc -l
411
$ grep 'struct[\t ]\+[a-zA-Z0-9]\+_arg' -r include/linux | wc -l
25
> This arg structure will eventually be utilized by the following
> interfaces:
> mpol_new() - new mempolicy creation
> do_get_mempolicy() - acquiring information about mempolicy
> do_set_mempolicy() - setting the task mempolicy
> do_mbind() - setting a vma mempolicy
>
> do_get_mempolicy() is completely refactored to break it out into
> separate functionality based on the flags provided by get_mempolicy(2)
> MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
> MPOL_F_ADDR: acquires information on vma policies
> MPOL_F_NODE: changes the output for the policy arg to node info
>
> We refactor the get_mempolicy syscall to flatten the logic based on these
> flags, and allow for set_mempolicy2() to re-use the underlying logic.
>
> The result of this refactor, and the new mempolicy_args structure, is
> that extensions like 'sys_set_mempolicy_home_node' can now be directly
> integrated into the initial call to 'set_mempolicy2', and that more
> complete information about a mempolicy can be returned with a single
> call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
>
>
> =====================================================================
> (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>
> These interfaces are the 'extended' counterpart to their relatives.
> They use the userland 'struct mpol_args' structure to communicate a
> complete mempolicy configuration to the kernel. This structure
> looks very much like the kernel-internal 'struct mempolicy_args':
>
> struct mpol_args {
>     /* Basic mempolicy settings */
>     __u16 mode;
>     __u16 mode_flags;
>     __s32 home_node;
>     __u64 pol_maxnodes;
I understand that we want to avoid a hole in the struct. But I still feel
uncomfortable using __u64 for such a small value. I don't have a solution
either. Does anyone else have an idea?
>     __aligned_u64 pol_nodes;
>     __aligned_u64 *il_weights; /* of size pol_maxnodes */
Typo? Should this be
    __aligned_u64 il_weights; /* of size pol_maxnodes */
?
I found the same thing in some of the patch descriptions too.
> };
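To illustrate why the field should be a plain __aligned_u64 rather than a
pointer to one, here is a small userspace sketch. It is only a stand-in
for the uapi struct: the sk_ names are hypothetical, uint*_t replaces the
__u16/__s32/__u64 types, and the weights buffer is assumed to be one byte
per node as in the kernel-internal struct above. The user pointer is
carried as a 64-bit integer, so the layout is the same for 32-bit and
64-bit userspace and there is no padding hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the proposed 'struct mpol_args' (sketch only). */
struct sk_mpol_args {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint64_t pol_maxnodes;
	uint64_t pol_nodes  __attribute__((aligned(8)));	/* user pointer */
	uint64_t il_weights __attribute__((aligned(8)));	/* user pointer */
};

/* The layout must not depend on the pointer size of the caller. */
_Static_assert(sizeof(struct sk_mpol_args) == 32, "unexpected size");
_Static_assert(offsetof(struct sk_mpol_args, pol_maxnodes) == 8, "hole");
_Static_assert(offsetof(struct sk_mpol_args, pol_nodes) == 16, "hole");
_Static_assert(offsetof(struct sk_mpol_args, il_weights) == 24, "hole");

/* Store a weights buffer address in the integer field. */
static void sk_set_weights(struct sk_mpol_args *args,
			   const unsigned char *weights)
{
	assert(args != NULL);
	args->il_weights = (uint64_t)(uintptr_t)weights;
}

/* Recover the buffer address from the integer field. */
static const unsigned char *sk_get_weights(const struct sk_mpol_args *args)
{
	assert(args != NULL);
	return (const unsigned char *)(uintptr_t)args->il_weights;
}
```

If the field were declared as '__aligned_u64 *il_weights', its size would
follow the native pointer size, which is what the integer type is meant
to avoid.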
>
> The basic mempolicy settings which are shared across all interfaces
> are captured at the top of the structure, while extensions such as
> 'policy_node' and 'addr' are collected beneath.
>
> The syscalls are uniform and defined as follows:
>
> long sys_mbind2(unsigned long addr, unsigned long len,
> struct mpol_args *args, size_t usize,
> unsigned long flags);
>
> long sys_get_mempolicy2(struct mpol_args *args, size_t size,
> unsigned long addr, unsigned long flags);
>
> long sys_set_mempolicy2(struct mpol_args *args, size_t size,
> unsigned long flags);
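The cover letter does not spell out here how the size argument is
handled. Assuming it follows the copy_struct_from_user() convention used
by other extensible syscalls (openat2, clone3), the semantics can be
sketched in userspace as below. copy_struct_sketch() is a hypothetical
name, and plain memory stands in for user memory.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of extensible-struct copy semantics:
 *  - usize < ksize: old userspace, unknown trailing fields are zeroed
 *  - usize > ksize: new userspace, accepted only if the extra bytes
 *    are all zero, otherwise -E2BIG
 */
static int copy_struct_sketch(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	size_t i;

	assert(dst != NULL && src != NULL);

	if (usize > ksize) {
		const unsigned char *tail = (const unsigned char *)src + ksize;

		for (i = 0; i < usize - ksize; i++)
			if (tail[i])
				return -E2BIG;
	}

	memcpy(dst, src, size);
	memset((unsigned char *)dst + size, 0, ksize - size);
	return 0;
}
```

Under this assumption a binary built against an older, smaller
'struct mpol_args' keeps working after new fields are appended, as long
as zero means "feature not used" for each new field.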
>
> The 'flags' argument for mbind2 is the same as 'mbind', except with
> the addition of MPOL_MF_HOME_NODE to denote whether the 'home_node'
> field should be utilized.
>
> The 'flags' argument for get_mempolicy2 allows for MPOL_F_ADDR to
> allow operating on VMA policies, but MPOL_F_NODE and MPOL_F_MEMS_ALLOWED
> behavior has been omitted, since get_mempolicy() provides this already.
I still think that it's a good idea to make it possible to deprecate
get_mempolicy(). How about using a union as follows?
struct mpol_mems_allowed {
    __u64 maxnodes;
    __aligned_u64 nodemask;
};

union mpol_info {
    struct mpol_args args;
    struct mpol_mems_allowed mems_allowed;
    __s32 node;
};
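A rough userspace sketch of how the flags could select the union member
is below. This is only an illustration of the idea, not a proposed
implementation: all sk_ names are hypothetical, the flag values are
copied from the existing uapi header, and rejecting MPOL_F_MEMS_ALLOWED
combined with the other two flags mirrors what get_mempolicy() does
today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same values as MPOL_F_NODE, MPOL_F_ADDR, MPOL_F_MEMS_ALLOWED. */
#define SK_MPOL_F_NODE		(1UL << 0)
#define SK_MPOL_F_ADDR		(1UL << 1)
#define SK_MPOL_F_MEMS_ALLOWED	(1UL << 2)

struct sk_mpol_args {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint64_t pol_maxnodes;
	uint64_t pol_nodes;
	uint64_t il_weights;
};

struct sk_mpol_mems_allowed {
	uint64_t maxnodes;
	uint64_t nodemask;
};

union sk_mpol_info {
	struct sk_mpol_args args;
	struct sk_mpol_mems_allowed mems_allowed;
	int32_t node;
};

/*
 * Return the size of the union member that the flags select,
 * or 0 for a flag combination that should be rejected.
 */
static size_t sk_mpol_info_size(unsigned long flags)
{
	if (flags & SK_MPOL_F_MEMS_ALLOWED) {
		if (flags & (SK_MPOL_F_NODE | SK_MPOL_F_ADDR))
			return 0;
		return sizeof(struct sk_mpol_mems_allowed);
	}
	if (flags & SK_MPOL_F_NODE)
		return sizeof(int32_t);
	return sizeof(struct sk_mpol_args);
}

/* Clear the union and store a node id, as an MPOL_F_NODE query might. */
static void sk_mpol_info_set_node(union sk_mpol_info *info, int32_t node)
{
	assert(info != NULL);
	memset(info, 0, sizeof(*info));
	info->node = node;
}
```

The union is as large as 'struct mpol_args', so one size argument covers
all three cases and get_mempolicy2() could then replace every mode of
get_mempolicy().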
> The 'flags' argument is not used by 'set_mempolicy' at this time, but
> may end up allowing the use of MPOL_MF_HOME_NODE if such functionality
> is desired.
>
> The extensions can be summed up as follows:
>
> get_mempolicy2 extensions:
> - mode and mode flags are split into separate fields
> - MPOL_F_MEMS_ALLOWED and MPOL_F_NODE are not supported
>
> set_mempolicy2:
> - task-local interleave weights can be set via 'il_weights'
>
> mbind2:
> - home_node field sets policy home node w/ MPOL_MF_HOME_NODE
> - task-local interleave weights can be set via 'il_weights'
>
--
Best Regards,
Huang, Ying
[snip]