From: Michal Hocko <mhocko@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>, Roman Gushchin <guro@fb.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] mm: memcontrol: asynchronous reclaim for memory.high
Date: Wed, 19 Feb 2020 20:53:32 +0100
Message-ID: <20200219195332.GE11847@dhcp22.suse.cz>
In-Reply-To: <20200219191618.GB54486@cmpxchg.org>

On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> > On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > > We have received regression reports from users whose workloads moved
> > > into containers and subsequently encountered new latencies. For some
> > > users these were a nuisance, but for some it meant missing their SLA
> > > response times. We tracked those delays down to cgroup limits, which
> > > inject direct reclaim stalls into the workload where previously all
> > > reclaim was handled by kswapd.
> > 
> > I am curious why is this unexpected when the high limit is explicitly
> > documented as a throttling mechanism.
> 
> Memory.high is supposed to curb aggressive growth using throttling
> instead of OOM killing. However, if the workload has plenty of easily
> reclaimable memory and just needs to recycle a couple of cache pages
> to permit an allocation, there is no need to throttle the workload -
> just as there wouldn't be any need to trigger the OOM killer.
> 
> So it's not unexpected, but it's unnecessarily heavy-handed: since
> memory.high allows some flexibility around the target size, we can
> move the routine reclaim activity (cache recycling) out of the main
> execution stream of the workload, just like we do with kswapd. If that
> cannot keep up, we can throttle and do direct reclaim.
> 
> It doesn't change the memory.high semantics, but it allows exploiting
> the fact that we have SMP systems and can parallelize the bookkeeping.

Thanks, this describes the problem much better, and I believe all of it
belongs in the changelog.
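
For concreteness, the mechanism described above boils down to something
like the following sketch (hypothetical names - memcg_async_wmark(),
async_reclaim_work, memcg_reclaim_wq - illustrating the idea, not the
actual patch):

/*
 * Per-memcg worker: reclaim the excess over memory.high outside the
 * allocating task's execution stream, the way kswapd handles the
 * global watermarks.
 */
static void memcg_async_reclaim_func(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						async_reclaim_work);
	unsigned long high = READ_ONCE(memcg->high);
	unsigned long usage = page_counter_read(&memcg->memory);

	if (usage > high)
		try_to_free_mem_cgroup_pages(memcg, usage - high,
					     GFP_KERNEL, true);
}

/*
 * Charge path: kick the worker once usage crosses a watermark below
 * memory.high; the task only falls back to synchronous (direct)
 * reclaim if the worker cannot keep up.
 */
if (page_counter_read(&memcg->memory) > memcg_async_wmark(memcg))
	queue_work(memcg_reclaim_wq, &memcg->async_reclaim_work);
if (page_counter_read(&memcg->memory) > READ_ONCE(memcg->high))
	current->memcg_nr_pages_over_high += nr_pages;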

> > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > while keeping direct reclaim as a fallback. In our testing, this
> > > eliminated all direct reclaim from the affected workload.
> > 
> > Who is all this work accounted to? Unless I am missing something, it
> > just gets hidden in general system activity, and that might hurt
> > isolation. I do see how moving the work to a different context is
> > desirable, but the work has to be accounted properly when it becomes a
> > normal mode of operation (rather than a rare exception like the
> > existing irq context handling).
> 
> Yes, the plan is to account it to the cgroup on whose behalf we're
> doing the work.

OK, great, because honestly I am not sure we can merge this work
without that being handled. We have seen similar attempts before -
mostly to parallelize work done on behalf of a process (e.g. address
space teardown) - and proper accounting was always the main obstacle,
so we need to solve this problem for other reasons anyway. This doesn't
sound very different. And your example of a workload missing its SLA
response times shows that the amount of work required for high-limit
reclaim can be non-trivial. Somebody has to do that work, and we cannot
simply let everybody else pay for it.

> The problem is that we have a general lack of usable CPU control right
> now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208.
> For workloads that are contended on CPU, we cannot enable the CPU
> controller because the scheduling latencies are too high. And for
> workloads that aren't CPU contended, well, it doesn't really matter
> where the reclaim cycles are accounted to.
> 
> Once we have the CPU controller up to speed, we can add annotations
> like these to account stretches of execution to specific
> cgroups. There just isn't much point to do it before we can actually
> enable CPU control on the real workloads where it would matter.
> 
> [ This is generally work in progress: for example, if you isolate
>   workloads with memory.low, kswapd cpu time isn't accounted to the
>   cgroup that causes it. Swap IO issued by kswapd isn't accounted to
>   the group that is getting swapped.

Well, kswapd is a system activity, and as such it is acceptable that it
is accounted to the system. But in this case we are talking about a
memcg configuration that influences all other workloads by stealing CPU
cycles from them, with little throttling on the consumer side -
especially when the memory is reclaimable without much sleeping or
contention on locks.

I am well aware that we will never achieve perfect isolation, due to
all sorts of shared data structures, lock contention and so on, but
this patch alone makes it far too easy to spill over into unaccounted
work, IMHO.
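
To illustrate: today's high_work_func() simply does reclaim_high(), and
the worker's CPU time vanishes into general system overhead. Proper
attribution would need something like the charge-back below (the
cgroup_charge_cputime() call is purely hypothetical - no such interface
exists):

static void high_work_func(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						high_work);
	u64 start = ktime_get_ns();

	reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);

	/*
	 * Hypothetical: charge the elapsed CPU time to the memcg's own
	 * cgroup so the CPU controller can see and throttle it, instead
	 * of the cost disappearing into system time.
	 */
	cgroup_charge_cputime(memcg->css.cgroup, ktime_get_ns() - start);
}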

>   The memory consumed by struct
>   cgroup itself, the percpu allocations for the vmstat arrays, etc.,
>   which are sizable, are not accounted to the cgroup that creates
>   subgroups, and so forth. ]

-- 
Michal Hocko
SUSE Labs
