linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Shakeel Butt <shakeelb@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>, Tejun Heo <tj@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>,  Kernel Team <kernel-team@fb.com>,
	Chris Down <chris@chrisdown.name>,
	 Cgroups <cgroups@vger.kernel.org>
Subject: Re: [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted
Date: Tue, 21 Apr 2020 15:39:01 -0700	[thread overview]
Message-ID: <CALvZod4gwBxT=nYnO8eEHr3jrffoMBoLpaB_uHMWL1VAi8XH4Q@mail.gmail.com> (raw)
In-Reply-To: <20200421215946.GA347151@cmpxchg.org>

On Tue, Apr 21, 2020 at 2:59 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
[snip]
>
> We do control very aggressive batch jobs to the extent where they have
> negligible latency impact on interactive services running on the same
> hosts. All the tools to do that are upstream and/or public, but it's
> still pretty new stuff (memory.low, io.cost, cpu headroom control,
> freezer) and they need to be put together just right.
>
> We're working on a demo application that showcases how it all fits
> together and hope to be ready to publish it soon.
>

That would be awesome.

>
[snip]
> >
> > What do you mean by not interchangeable? If I keep the hot memory (or
> > workingset) of a job in DRAM and cold memory in swap and control the
> > rate of refaults by controlling the definition of cold memory then I
> > am using the DRAM and swap interchangeably and transparently to the
> > job (that is what we actually do).
>
> Right, that's a more precise definition than my randomly chosen "80%"
> number above. There are parts of a workload's memory access curve
> (where x is distinct data accessed and y is the access frequency) that
> don't need to stay in RAM permanently and can be got on-demand from
> secondary storage without violating the workload's throughput/latency
> requirements. For that part, RAM, swap, disk can be interchangeable.
>
> I'm was specifically talking about the other half of that curve, and
> meant to imply that that's usually bigger than 20%. Usually ;-)
>
> I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't
> matter whether it gets it in ram or in swap. There is a line somewhere
> in between, and it'll vary with workload requirements, access patterns
> and IO speed. But no workload can actually run with 10G of swap and 0
> bytes worth of direct access memory, right?

Yes.

>
> Since you said before you're using combined memory+swap limits, I'm
> assuming that you configure the resource as interchangeable, but still
> have some form of determining where that cutoff line is between them -
> either by tuning proactive reclaim toward that line or having OOM kill
> policies when the line is crossed and latencies are violated?
>

Yes, more specifically tuning proactive reclaim towards that line. We
define that line in terms of acceptable refault rate for the job. The
acceptable refault rate is measured through re-use and idle page
histograms (these histograms are collected through our internal
implementation of Page Idle Tracking). I am planning to upstream and
open-source these.

> > I am also wondering if you guys explored the in-memory compression
> > based swap medium and if there are any reasons to not follow that
> > route.
>
> We played around with it, but I'm ambivalent about it.
>
> You need to identify that perfect "warm" middle section of the
> workingset curve that is 1) cold enough to not need permanent direct
> access memory, yet 2) warm enough to justify allocating RAM to it.
>
> A lot of our workloads have a distinguishable hot set and various
> amounts of fairly cold data during stable states, with not too much
> middle ground in between where compressed swap would really shine.
>
> Do you use compressed swap fairly universally, or more specifically
> for certain workloads?
>

Yes, we are using it fairly universally. There are few exceptions like
user space net and storage drivers.

> > Oh you mentioned DAX, that brings to mind a very interesting topic.
> > Are you guys exploring the idea of using PMEM as a cheap slow memory?
> > It is byte-addressable, so, regarding memcg accounting, will you treat
> > it as a memory or a separate resource like swap in v2? How does your
> > memory overcommit model work with such a type of memory?
>
> I think we (the kernel MM community, not we as in FB) are still some
> ways away from having dynamic/transparent data placement for pmem the
> same way we have for RAM. But I expect the kernel's high-level default
> strategy to be similar: order virtual memory (the data) by access
> frequency and distribute across physical memory/storage accordingly.
>
> (With pmem being divided into volatile space and filesystem space,
> where volatile space holds colder anon pages (and, if there is still a
> disk, disk cache), and the sizing decisions between them being similar
> as the ones we use for swap and filesystem today).
>
> I expect cgroup policy to be separate, because to users the
> performance difference matters. We won't want greedy batch
> applications displacing latency sensitive ones from RAM into pmem,
> just like we don't want this displacement into secondary storage
> today. Other than that, there isn't too much difference to users,
> because paging is already transparent - an mmapped() file looks the
> same whether it's backed by RAM, by disk or by pmem. The difference is
> access latencies and the aggregate throughput loss they add up to. So
> I could see pmem cgroup limits and protections (for the volatile space
> portion) the same way we have RAM limits and protections.
>
> But yeah, I think this is going a bit off topic ;-)

That's really interesting. Thanks for appeasing my curiosity.

thanks,
Shakeel


  reply	other threads:[~2020-04-21 22:39 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-17  1:06 [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted Jakub Kicinski
2020-04-17  1:06 ` [PATCH 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski
2020-04-17  1:06 ` [PATCH 2/3] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski
2020-04-17  1:06 ` [PATCH 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski
2020-04-17  7:37   ` Michal Hocko
2020-04-17 23:22     ` Jakub Kicinski
2020-04-17 16:11 ` [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted Shakeel Butt
2020-04-17 16:23   ` Tejun Heo
2020-04-17 17:18     ` Shakeel Butt
2020-04-17 17:36       ` Tejun Heo
2020-04-17 17:51         ` Shakeel Butt
2020-04-17 19:35           ` Tejun Heo
2020-04-17 21:51             ` Shakeel Butt
2020-04-17 22:59               ` Tejun Heo
2020-04-20 16:12                 ` Shakeel Butt
2020-04-20 16:47                   ` Tejun Heo
2020-04-20 17:03                     ` Michal Hocko
2020-04-20 17:06                       ` Tejun Heo
2020-04-21 11:06                         ` Michal Hocko
2020-04-21 14:27                           ` Johannes Weiner
2020-04-21 16:11                             ` Michal Hocko
2020-04-21 16:56                               ` Johannes Weiner
2020-04-22 13:26                                 ` Michal Hocko
2020-04-22 14:15                                   ` Johannes Weiner
2020-04-22 15:43                                     ` Michal Hocko
2020-04-22 17:13                                       ` Johannes Weiner
2020-04-22 18:49                                         ` Michal Hocko
2020-04-23 15:00                                           ` Johannes Weiner
2020-04-24 15:05                                             ` Michal Hocko
2020-04-28 14:24                                               ` Johannes Weiner
2020-04-29  9:55                                                 ` Michal Hocko
2020-04-21 19:09                             ` Shakeel Butt
2020-04-21 21:59                               ` Johannes Weiner
2020-04-21 22:39                                 ` Shakeel Butt [this message]
2020-04-21 15:20                           ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CALvZod4gwBxT=nYnO8eEHr3jrffoMBoLpaB_uHMWL1VAi8XH4Q@mail.gmail.com' \
    --to=shakeelb@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=chris@chrisdown.name \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@fb.com \
    --cc=kuba@kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).