From: Johannes Weiner <hannes@cmpxchg.org>
To: Tejun Heo <tj@kernel.org>
Cc: Abel Wu <wuyun.abel@bytedance.com>,
	akpm@linux-foundation.org, lizefan.x@bytedance.com,
	corbet@lwn.net, cgroups@vger.kernel.org, linux-mm@kvack.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/3] cgroup/cpuset: introduce cpuset.mems.migration
Date: Wed, 5 May 2021 18:30:44 -0400	[thread overview]
Message-ID: <YJMclPw7OVYxboEE@cmpxchg.org> (raw)
In-Reply-To: <YIgjE6CgU4nDsJiR@slm.duckdns.org>

On Tue, Apr 27, 2021 at 10:43:31AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Mon, Apr 26, 2021 at 02:59:45PM +0800, Abel Wu wrote:
> > When a NUMA node is assigned to numa-service, the workload
> > on that node needs to be moved away quickly and completely. The
> > main aspects we care about for the eviction are as follows:
> > 
> > a) it should complete quickly enough that numa-services are
> >    not kept waiting long enough to hurt user experience
> > b) the workloads to be evicted may use a large amount of
> >    memory, and migrating that much memory can cause a sudden,
> >    severe performance drop lasting tens of seconds, which
> >    some workloads cannot afford
> > c) the impact of the eviction should be limited to the
> >    source and destination nodes
> > d) cgroup interface is preferred
> > 
> > So we arrived at the following approach:
> > 
> > 1) fire up numa-services without waiting for memory migration
> > 2) memory migration can be done asynchronously by using spare
> >    memory bandwidth
> > 
> > AutoNUMA seems like a solution, but its scope is global, which
> > violates c) and d). And cpuset.memory_migrate works in a synchronous
> 
> I don't think d) in itself is a valid requirement. How does it violate c)?
> 
> > fashion, which breaks a) and b). So a mixture of the two, the new
> > cgroup2 interface cpuset.mems.migration, is introduced.
> > 
> > The new cpuset.mems.migration supports three modes:
> > 
> >  - "none" mode, meaning migration disabled
> >  - "sync" mode, which is exactly the same as the cgroup v1
> >    interface cpuset.memory_migrate
> >  - "lazy" mode, when walking through all the pages, unlike
> >    cpuset.memory_migrate, it only sets pages to protnone,
> >    and numa faults triggered by later touch will handle the
> >    movement.
> 
> cpuset is already involved in NUMA allocation, but it has always felt like
> something bolted on - it's weird to have CPU-to-NUMA-node settings at the
> global level and then possibly conflicting direct NUMA configuration via
> cpuset. My preference would be to put as much configuration as possible on
> the mm / autonuma side and let cpuset's node confinements further restrict
> their operation, rather than cpuset having its own set of policy
> configurations.
> 
> Johannes, what are your thoughts?

This is basically a cgroup interface for the existing MPOL_MF_LAZY /
MPOL_F_MOF flags, which have per-task (set_mempolicy()) and per-VMA
(mbind()) scope respectively. They're not per-node, so they cannot be
cgroupified through cpuset's node restrictions alone, and I understand
why a cgroup interface could be convenient.
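
For reference, here is a minimal userspace sketch of the two existing
scopes those flags hang off of - per task via set_mempolicy() and per
mapping via mbind() - shown with the plain uapi MPOL_BIND / MPOL_MF_MOVE
rather than the lazy flavor; node number and sizes are made up, link
with -lnuma:

#include <numaif.h>	/* mbind(), set_mempolicy(), MPOL_* */
#include <stdio.h>
#include <stdlib.h>

#define TARGET_NODE 1	/* hypothetical destination node */

int main(void)
{
	unsigned long nodemask = 1UL << TARGET_NODE;

	/* Per-task scope: affects this task's future allocations. */
	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8 + 1))
		perror("set_mempolicy");

	/* Per-VMA scope: one mapping, existing pages moved eagerly. */
	void *buf = aligned_alloc(4096, 1 << 20);
	if (mbind(buf, 1 << 20, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8 + 1, MPOL_MF_MOVE))
		perror("mbind");
	return 0;
}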

On the other hand, this is not really about configuring a shared
resource. Rather it's using cgroups to set an arbitrary task parameter
on a bunch of tasks simultaneously. It's the SIMD-type use case of
cgroup1 that we tried to get away from in cgroup2, simply because it's
so unbounded in scope. There are *a lot* of possible task parameters,
and we could add a lot of kernel interfaces that boil down to
css_task_iter and setting or clearing a task flag.
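
Just to illustrate the pattern (a sketch, not code from this patch):
such an interface more or less reduces to the following on the kernel
side, where ->lazy_migrate is a made-up stand-in for whatever per-task
flag would actually be set:

#include <linux/cgroup.h>
#include <linux/sched.h>

/* Sketch: walk every task in a css and flip a (hypothetical) flag. */
static void cpuset_apply_migration_mode(struct cgroup_subsys_state *css,
					bool enable)
{
	struct css_task_iter it;
	struct task_struct *task;

	css_task_iter_start(css, 0, &it);
	while ((task = css_task_iter_next(&it)))
		task->lazy_migrate = enable;	/* hypothetical task field */
	css_task_iter_end(&it);
}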

So I'm also thinking this cgroup interface isn't desirable.

If you want to control numa policies of tasks from the outside, it's
probably best to extend the numa syscall interface to work on pids.
And then use cgroup.procs to cgroupify the operation from userspace.
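
Roughly, userspace could then do the cgroupification itself - something
like the sketch below, where set_mempolicy_pid() is a made-up name for
the suggested pid-based extension, not an existing syscall:

#include <stdio.h>
#include <sys/types.h>

/* Hypothetical pid-based variant of set_mempolicy(); does not exist. */
int set_mempolicy_pid(pid_t pid, int mode, const unsigned long *nodemask,
		      unsigned long maxnode);

/* Apply the policy to every process listed in <cgroup>/cgroup.procs. */
static int apply_to_cgroup(const char *cgroup, int mode,
			   const unsigned long *nodemask,
			   unsigned long maxnode)
{
	char path[4096];
	FILE *f;
	int pid;

	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fscanf(f, "%d", &pid) == 1)
		set_mempolicy_pid(pid, mode, nodemask, maxnode);

	fclose(f);
	return 0;
}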

Or extend the NUMA interface to make the system-wide default behavior
configurable, so that you can set MPOL_F_MOF in there (without having
to enable autonuma).
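
Again only as a sketch of the idea (the knob name is invented, and this
would have to live in mm/mempolicy.c next to the static default_policy):

#include <linux/mempolicy.h>
#include <linux/sysctl.h>

static int sysctl_numa_migrate_on_fault;

/* Hypothetical vm.numa_migrate_on_fault: toggle MPOL_F_MOF on the
 * default policy so tasks without an explicit policy migrate on fault.
 */
static int numa_migrate_on_fault_handler(struct ctl_table *table, int write,
					 void *buffer, size_t *lenp,
					 loff_t *ppos)
{
	int err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

	if (!err && write) {
		if (sysctl_numa_migrate_on_fault)
			default_policy.flags |= MPOL_F_MOF;
		else
			default_policy.flags &= ~MPOL_F_MOF;
	}
	return err;
}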

But yeah, cgroups doesn't seem like the right place to do this.

Thanks
