archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <>
To: Yang Shi <>
Cc: Shakeel Butt <>,
	Andrew Morton <>,
	Michal Hocko <>, Tejun Heo <>,
	Roman Gushchin <>, Linux MM <>,
	Cgroups <>,
	LKML <>,
	Kernel Team <>
Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
Date: Thu, 27 Feb 2020 07:50:11 -0500	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Wed, Feb 26, 2020 at 04:12:23PM -0800, Yang Shi wrote:
> On 2/26/20 2:26 PM, Johannes Weiner wrote:
> > So we should be able to fully resolve this problem inside the kernel,
> > without going through userspace, by accounting CPU cycles used by the
> > background reclaim worker to the cgroup that is being reclaimed.
> Actually I'm wondering if we really need account CPU cycles used by
> background reclaimer or not. For our usecase (this may be not general), the
> purpose of background reclaimer is to avoid latency sensitive workloads get
> into direct relcaim (avoid the stall from direct relcaim). In fact it just
> "steal" CPU cycles from lower priority or best-effort workloads to guarantee
> latency sensitive workloads behave well. If the "stolen" CPU cycles are
> accounted, it means the latency sensitive workloads would get throttled from
> somewhere else later, i.e. by CPU share.

That doesn't sound right.

"Not accounting" isn't an option. If we don't annotate the reclaim
work, the cycles will go to the root cgroup. That means that the
latency-sensitive workload can steal cycles from the low-pri job, yes,
but also that the low-pri job can steal from the high-pri one.

Say your two workloads on the system are a web server and a compile
job and the CPU shares are allocated 80:20. The compile job will cause
most of the reclaim. If the reclaim cycles can escape to the root
cgroup, the compile job will effectively consume more than 20 shares
and the low-pri job will get less than 80.

But let's say we executed all background reclaim in the low-pri group,
to allow the high-pri group to steal cycles from the low-pri group,
but not the other way round. Again an 80:20 CPU distribution. Now the
reclaim work competes with the compile job over a very small share of
CPU. The reclaim work that the high priority job is relying on is
running at low priority. That means that the compile job can cause the
web server to go into direct reclaim. That's a priority inversion.

> We definitely don't want to the background reclaimer eat all CPU cycles. So,
> the whole background reclaimer is opt in stuff. The higher level cluster
> management and administration components make sure the cgroups are setup
> correctly, i.e. enable for specific cgroups, setup watermark properly, etc.
> Of course, this may be not universal and may be just fine for some specific
> configurations or usecases.

Yes, I suspect it works for you because you set up watermarks on the
high-pri job but not on the background jobs, thus making sure only
high-pri jobs can steal cycles from the rest of the system.

However, we do want low-pri jobs to have background reclaim as well. A
compile job may not be latency-sensitive, but it still benefits from a
throughput POV when the reclaim work runs concurrently. And if there
are idle CPU cycles available that the high-pri work isn't using right
now, it would be wasteful not to make use of them.

So yes, I can see how such an accounting loophole can be handy. By
letting reclaim CPU cycles sneak out of containment, you can kind of
use it for high-pri jobs. Or rather *one* high-pri job, because more
than one becomes unsafe again, where one can steal a large number of
cycles from others at the same priority.

But it's more universally useful to properly account CPU cycles that
are actually consumed by a cgroup, to that cgroup, and then reflect
the additional CPU explicitly in the CPU weight configuration. That
way you can safely have background reclaim on jobs of all priorities.

  parent reply	other threads:[~2020-02-27 12:50 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-19 18:12 [PATCH] mm: memcontrol: asynchronous reclaim for memory.high Johannes Weiner
2020-02-19 18:37 ` Michal Hocko
2020-02-19 19:16   ` Johannes Weiner
2020-02-19 19:53     ` Michal Hocko
2020-02-19 21:17       ` Johannes Weiner
2020-02-20  9:46         ` Michal Hocko
2020-02-20 14:41           ` Johannes Weiner
2020-02-19 21:41       ` Daniel Jordan
2020-02-19 22:08         ` Johannes Weiner
2020-02-20 15:45           ` Daniel Jordan
2020-02-20 15:56             ` Tejun Heo
2020-02-20 18:23               ` Daniel Jordan
2020-02-20 18:45                 ` Tejun Heo
2020-02-20 19:55                   ` Daniel Jordan
2020-02-20 20:54                     ` Tejun Heo
2020-02-19 19:17   ` Chris Down
2020-02-19 19:31   ` Andrew Morton
2020-02-19 21:33     ` Johannes Weiner
2020-02-26 20:25 ` Shakeel Butt
2020-02-26 22:26   ` Johannes Weiner
2020-02-26 23:36     ` Shakeel Butt
2020-02-26 23:46       ` Johannes Weiner
2020-02-27  0:12     ` Yang Shi
2020-02-27  2:42       ` Shakeel Butt
2020-02-27  9:58       ` Michal Hocko
2020-02-27 12:50       ` Johannes Weiner [this message]
2020-02-26 23:59   ` Yang Shi
2020-02-27  2:36     ` Shakeel Butt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).