archive mirror
 help / color / mirror / Atom feed
From: Shakeel Butt <>
To: Johannes Weiner <>
Cc: Michal Hocko <>, Tejun Heo <>,
	Jakub Kicinski <>,
	 Andrew Morton <>,
	Linux MM <>,  Kernel Team <>,
	Chris Down <>,
	 Cgroups <>
Subject: Re: [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted
Date: Tue, 21 Apr 2020 15:39:01 -0700	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Tue, Apr 21, 2020 at 2:59 PM Johannes Weiner <> wrote:
> We do control very aggressive batch jobs to the extent where they have
> negligible latency impact on interactive services running on the same
> hosts. All the tools to do that are upstream and/or public, but it's
> still pretty new stuff (memory.low, io.cost, cpu headroom control,
> freezer) and they need to be put together just right.
> We're working on a demo application that showcases how it all fits
> together and hope to be ready to publish it soon.

That would be awesome.

> >
> > What do you mean by not interchangeable? If I keep the hot memory (or
> > workingset) of a job in DRAM and cold memory in swap and control the
> > rate of refaults by controlling the definition of cold memory then I
> > am using the DRAM and swap interchangeably and transparently to the
> > job (that is what we actually do).
> Right, that's a more precise definition than my randomly chosen "80%"
> number above. There are parts of a workload's memory access curve
> (where x is distinct data accessed and y is the access frequency) that
> don't need to stay in RAM permanently and can be got on-demand from
> secondary storage without violating the workload's throughput/latency
> requirements. For that part, RAM, swap, disk can be interchangeable.
> I'm was specifically talking about the other half of that curve, and
> meant to imply that that's usually bigger than 20%. Usually ;-)
> I.e. we cannot say: workload x gets 10G of ram or swap, and it doesn't
> matter whether it gets it in ram or in swap. There is a line somewhere
> in between, and it'll vary with workload requirements, access patterns
> and IO speed. But no workload can actually run with 10G of swap and 0
> bytes worth of direct access memory, right?


> Since you said before you're using combined memory+swap limits, I'm
> assuming that you configure the resource as interchangeable, but still
> have some form of determining where that cutoff line is between them -
> either by tuning proactive reclaim toward that line or having OOM kill
> policies when the line is crossed and latencies are violated?

Yes, more specifically tuning proactive reclaim towards that line. We
define that line in terms of acceptable refault rate for the job. The
acceptable refault rate is measured through re-use and idle page
histograms (these histograms are collected through our internal
implementation of Page Idle Tracking). I am planning to upstream and
open-source these.

> > I am also wondering if you guys explored the in-memory compression
> > based swap medium and if there are any reasons to not follow that
> > route.
> We played around with it, but I'm ambivalent about it.
> You need to identify that perfect "warm" middle section of the
> workingset curve that is 1) cold enough to not need permanent direct
> access memory, yet 2) warm enough to justify allocating RAM to it.
> A lot of our workloads have a distinguishable hot set and various
> amounts of fairly cold data during stable states, with not too much
> middle ground in between where compressed swap would really shine.
> Do you use compressed swap fairly universally, or more specifically
> for certain workloads?

Yes, we are using it fairly universally. There are few exceptions like
user space net and storage drivers.

> > Oh you mentioned DAX, that brings to mind a very interesting topic.
> > Are you guys exploring the idea of using PMEM as a cheap slow memory?
> > It is byte-addressable, so, regarding memcg accounting, will you treat
> > it as a memory or a separate resource like swap in v2? How does your
> > memory overcommit model work with such a type of memory?
> I think we (the kernel MM community, not we as in FB) are still some
> ways away from having dynamic/transparent data placement for pmem the
> same way we have for RAM. But I expect the kernel's high-level default
> strategy to be similar: order virtual memory (the data) by access
> frequency and distribute across physical memory/storage accordingly.
> (With pmem being divided into volatile space and filesystem space,
> where volatile space holds colder anon pages (and, if there is still a
> disk, disk cache), and the sizing decisions between them being similar
> as the ones we use for swap and filesystem today).
> I expect cgroup policy to be separate, because to users the
> performance difference matters. We won't want greedy batch
> applications displacing latency sensitive ones from RAM into pmem,
> just like we don't want this displacement into secondary storage
> today. Other than that, there isn't too much difference to users,
> because paging is already transparent - an mmapped() file looks the
> same whether it's backed by RAM, by disk or by pmem. The difference is
> access latencies and the aggregate throughput loss they add up to. So
> I could see pmem cgroup limits and protections (for the volatile space
> portion) the same way we have RAM limits and protections.
> But yeah, I think this is going a bit off topic ;-)

That's really interesting. Thanks for appeasing my curiosity.


  reply	other threads:[~2020-04-21 22:39 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-17  1:06 Jakub Kicinski
2020-04-17  1:06 ` [PATCH 1/3] mm: prepare for swap over-high accounting and penalty calculation Jakub Kicinski
2020-04-17  1:06 ` [PATCH 2/3] mm: move penalty delay clamping out of calculate_high_delay() Jakub Kicinski
2020-04-17  1:06 ` [PATCH 3/3] mm: automatically penalize tasks with high swap use Jakub Kicinski
2020-04-17  7:37   ` Michal Hocko
2020-04-17 23:22     ` Jakub Kicinski
2020-04-17 16:11 ` [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted Shakeel Butt
2020-04-17 16:23   ` Tejun Heo
2020-04-17 17:18     ` Shakeel Butt
2020-04-17 17:36       ` Tejun Heo
2020-04-17 17:51         ` Shakeel Butt
2020-04-17 19:35           ` Tejun Heo
2020-04-17 21:51             ` Shakeel Butt
2020-04-17 22:59               ` Tejun Heo
2020-04-20 16:12                 ` Shakeel Butt
2020-04-20 16:47                   ` Tejun Heo
2020-04-20 17:03                     ` Michal Hocko
2020-04-20 17:06                       ` Tejun Heo
2020-04-21 11:06                         ` Michal Hocko
2020-04-21 14:27                           ` Johannes Weiner
2020-04-21 16:11                             ` Michal Hocko
2020-04-21 16:56                               ` Johannes Weiner
2020-04-22 13:26                                 ` Michal Hocko
2020-04-22 14:15                                   ` Johannes Weiner
2020-04-22 15:43                                     ` Michal Hocko
2020-04-22 17:13                                       ` Johannes Weiner
2020-04-22 18:49                                         ` Michal Hocko
2020-04-23 15:00                                           ` Johannes Weiner
2020-04-24 15:05                                             ` Michal Hocko
2020-04-28 14:24                                               ` Johannes Weiner
2020-04-29  9:55                                                 ` Michal Hocko
2020-04-21 19:09                             ` Shakeel Butt
2020-04-21 21:59                               ` Johannes Weiner
2020-04-21 22:39                                 ` Shakeel Butt [this message]
2020-04-21 15:20                           ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='' \ \ \ \ \ \ \ \ \ \ \
    --subject='Re: [PATCH 0/3] memcg: Slow down swap allocation as the available space gets depleted' \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).