linux-kernel.vger.kernel.org archive mirror
* Augmented Page Reclaim
@ 2021-02-02  8:57 Yu Zhao
  2021-02-02 12:17 ` Matthew Wilcox
  2021-02-09 21:32 ` [page-reclaim] " Jesse Barnes
  0 siblings, 2 replies; 6+ messages in thread
From: Yu Zhao @ 2021-02-02  8:57 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, page-reclaim

======================
Augmented Page Reclaim
======================
We would like to share some work with you and see if there is enough
interest to warrant a run for the mainline. This work is part of the
result of a decade of research and experimentation in memory
overcommit at Google: an augmented page reclaim that, in our
experience, is performant, versatile and, more importantly, simple.

Performance
===========
On Android, our most advanced simulation that generates memory
pressure from realistic user behavior shows 18% fewer low-memory
kills, which in turn reduces cold starts by 16%. This is on top of
per-process reclaim, a predecessor of ``MADV_COLD`` and
``MADV_PAGEOUT``, against background apps.

On Borg (warehouse-scale computers), a similar approach enables us to
identify jobs that underutilize their memory and downsize them
considerably without compromising any of our service level indicators.
Our findings are published in the papers listed below, e.g., 32% of
memory usage on Borg has been idle for at least 2 minutes.

On Chrome OS, our field telemetry reports 96% fewer low-memory tab
discards and 59% fewer OOM kills from fully-utilized devices and no UX
regressions from underutilized devices. Our real-world benchmark that
browses popular websites in multiple tabs demonstrates 51% less CPU
usage from ``kswapd`` and 45% (some) and 52% (full) less PSI on
v5.11-rc6 built from the tree below.
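
For readers who want to compare PSI figures like the ones above on their
own machines, here is a minimal sketch. It assumes the standard
``/proc/pressure/memory`` text format (``some`` and ``full`` lines with
``avg10``, ``avg60``, ``avg300`` and ``total`` fields). The helper names
``parse_psi``, ``read_memory_psi`` and ``percent_reduction`` are
illustrative only and are not part of the tree.

```python
from pathlib import Path


def parse_psi(text):
    """Parse PSI text into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, value = field.partition("=")
            # 'total' is cumulative stall time in microseconds; the avg*
            # fields are percentages over 10 s, 60 s and 300 s windows.
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def read_memory_psi(path="/proc/pressure/memory"):
    """Read memory PSI from a kernel built with CONFIG_PSI (assumed)."""
    return parse_psi(Path(path).read_text())


def percent_reduction(baseline, candidate):
    """Return how much lower 'candidate' is than 'baseline', in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (baseline - candidate) / baseline
```

One possible use is to record the ``total`` stall time before and after
the same benchmark run on each kernel and pass the two deltas to
``percent_reduction``. This is a sketch of the comparison, not the
methodology used for the numbers above.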

Versatility
===========
Userspace can trigger aging and eviction independently via the
``debugfs`` interface [note]_ for working set estimation, proactive
reclaim, far memory tiering, NUMA-aware job scheduling, etc. The
metrics from the interface are easily interpretable, which allows
intuitive provisioning and discoveries like the Borg example above.
For a warehouse-scale computer, the interface is intended to be a
building block of a closed-loop control system, with a machine
learning algorithm being the controller.

Simplicity
==========
The workflow [note]_ is well defined and each step in it has a clear
meaning. There are no magic numbers or heuristics involved but a few
basic data structures that have negligible memory footprint. This
simplicity has served us well as the scale and the diversity of our
workloads constantly grow.

Repo
====
git pull https://linux-mm.googlesource.com/page-reclaim refs/changes/80/1080/1

Gerrit: https://linux-mm-review.googlesource.com/c/page-reclaim/+/1080

.. [note] See ``Documentation/vm/multigen-lru.rst`` in the tree.

FAQ
===
What is the motivation for this work?
-------------------------------------
In our case, DRAM is a major factor in total cost of ownership, and
improving memory overcommit brings a high return on investment.
Moreover, Google-Wide Profiling has been observing high CPU
overhead [note]_ from page reclaim.

Why not try to improve the existing code?
-----------------------------------------
We have tried but concluded the two limiting factors [note]_ in the
existing code are fundamental, and therefore changes made atop them
will not result in substantial gains on any of the aspects above.

What particular workloads does it help?
---------------------------------------
This work optimizes page reclaim for workloads that are not IO bound,
because we find they are the norm on servers and clients in the cloud
era. It would most likely help any workloads that share the common
characteristics [note]_ we observed.

How would it benefit the community?
-----------------------------------
Google is committed to promoting sustainable development of the
community. We hope successful adoptions of this work will steadily
climb over time. To that end, we would be happy to learn about your
workloads and work with you case by case, and we will do our best to
keep the repo fully maintained. For those whose workloads rely on the
existing code, we will make sure you will not be affected in any way.

References
==========
1. `Long-term SLOs for reclaimed cloud computing resources
   <https://research.google/pubs/pub43017/>`_

2. `Profiling a warehouse-scale computer
   <https://research.google/pubs/pub44271/>`_

3. `Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
   <https://research.google/pubs/pub48329/>`_

4. `Software-defined far memory in warehouse-scale computers
   <https://research.google/pubs/pub48551/>`_

5. `Borg: the Next Generation
   <https://research.google/pubs/pub49065/>`_

* Re: Augmented Page Reclaim
  2021-02-02  8:57 Augmented Page Reclaim Yu Zhao
@ 2021-02-02 12:17 ` Matthew Wilcox
  2021-02-02 19:38   ` Yu Zhao
  2021-02-09 21:32 ` [page-reclaim] " Jesse Barnes
  1 sibling, 1 reply; 6+ messages in thread
From: Matthew Wilcox @ 2021-02-02 12:17 UTC (permalink / raw)
  To: Yu Zhao; +Cc: linux-mm, linux-kernel, page-reclaim


It's hard to know which 'note' refers to which reference.  Here's
my attempt to figure that out.

On Tue, Feb 02, 2021 at 01:57:15AM -0700, Yu Zhao wrote:

> Versatility
> ===========
> Userspace can trigger aging and eviction independently via the
> ``debugfs`` interface [note]_ for working set estimation, proactive

1. `Long-term SLOs for reclaimed cloud computing resources
   <https://research.google/pubs/pub43017/>`_

> reclaim, far memory tiering, NUMA-aware job scheduling, etc. The
> metrics from the interface are easily interpretable, which allows
> intuitive provisioning and discoveries like the Borg example above.
> For a warehouse-scale computer, the interface is intended to be a
> building block of a closed-loop control system, with a machine
> learning algorithm being the controller.
> 
> Simplicity
> ==========
> The workflow [note]_ is well defined and each step in it has a clear

2. `Profiling a warehouse-scale computer
   <https://research.google/pubs/pub44271/>`_

> meaning. There are no magic numbers or heuristics involved but a few
> basic data structures that have negligible memory footprint. This
> simplicity has served us well as the scale and the diversity of our
> workloads constantly grow.
[...]
> FAQ
> ===
> What is the motivation for this work?
> -------------------------------------
> In our case, DRAM is a major factor in total cost of ownership, and
> improving memory overcommit brings a high return on investment.
> Moreover, Google-Wide Profiling has been observing the high CPU
> overhead [note]_ from page reclaim.

3. `Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
   <https://research.google/pubs/pub48329/>`_

> Why not try to improve the existing code?
> -----------------------------------------
> We have tried but concluded the two limiting factors [note]_ in the

4. `Software-defined far memory in warehouse-scale computers
   <https://research.google/pubs/pub48551/>`_

> existing code are fundamental, and therefore changes made atop them
> will not result in substantial gains on any of the aspects above.
> 
> What particular workloads does it help?
> ---------------------------------------
> This work optimizes page reclaim for workloads that are not IO bound,
> because we find they are the norm on servers and clients in the cloud
> era. It would most likely help any workloads that share the common
> characteristics [note]_ we observed.

5. `Borg: the Next Generation
   <https://research.google/pubs/pub49065/>`_


* Re: Augmented Page Reclaim
  2021-02-02 12:17 ` Matthew Wilcox
@ 2021-02-02 19:38   ` Yu Zhao
  0 siblings, 0 replies; 6+ messages in thread
From: Yu Zhao @ 2021-02-02 19:38 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-mm, linux-kernel, page-reclaim

On Tue, Feb 02, 2021 at 12:17:08PM +0000, Matthew Wilcox wrote:
> 
> It's hard to know which 'note' refers to which reference.  Here's
> my attempt to figure that out.

Sorry for the trouble. [note]_ links to

.. [note] See ``Documentation/vm/multigen-lru.rst`` in the tree.

which has nothing to do with the references listed at the bottom.

The references are helpful but not required to process the information
in this email or the doc above.

Let me attach PDF files generated from my first email (intro.pdf) and the
doc (man.pdf). They are better formatted.

> 
> On Tue, Feb 02, 2021 at 01:57:15AM -0700, Yu Zhao wrote:
> 
> > Versatility
> > ===========
> > Userspace can trigger aging and eviction independently via the
> > ``debugfs`` interface [note]_ for working set estimation, proactive
> 
> 1. `Long-term SLOs for reclaimed cloud computing resources
>    <https://research.google/pubs/pub43017/>`_
> 
> > reclaim, far memory tiering, NUMA-aware job scheduling, etc. The
> > metrics from the interface are easily interpretable, which allows
> > intuitive provisioning and discoveries like the Borg example above.
> > For a warehouse-scale computer, the interface is intended to be a
> > building block of a closed-loop control system, with a machine
> > learning algorithm being the controller.
> > 
> > Simplicity
> > ==========
> > The workflow [note]_ is well defined and each step in it has a clear
> 
> 2. `Profiling a warehouse-scale computer
>    <https://research.google/pubs/pub44271/>`_
> 
> > meaning. There are no magic numbers or heuristics involved but a few
> > basic data structures that have negligible memory footprint. This
> > simplicity has served us well as the scale and the diversity of our
> > workloads constantly grow.
> [...]
> > FAQ
> > ===
> > What is the motivation for this work?
> > -------------------------------------
> > In our case, DRAM is a major factor in total cost of ownership, and
> > improving memory overcommit brings a high return on investment.
> > Moreover, Google-Wide Profiling has been observing the high CPU
> > overhead [note]_ from page reclaim.
> 
> 3. `Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
>    <https://research.google/pubs/pub48329/>`_
> 
> > Why not try to improve the existing code?
> > -----------------------------------------
> > We have tried but concluded the two limiting factors [note]_ in the
> 
> 4. `Software-defined far memory in warehouse-scale computers
>    <https://research.google/pubs/pub48551/>`_
> 
> > existing code are fundamental, and therefore changes made atop them
> > will not result in substantial gains on any of the aspects above.
> > 
> > What particular workloads does it help?
> > ---------------------------------------
> > This work optimizes page reclaim for workloads that are not IO bound,
> > because we find they are the norm on servers and clients in the cloud
> > era. It would most likely help any workloads that share the common
> > characteristics [note]_ we observed.
> 
> 5. `Borg: the Next Generation
>    <https://research.google/pubs/pub49065/>`_
> 

[-- Attachment #2: intro.pdf --]
[-- Type: application/pdf, Size: 25077 bytes --]

[-- Attachment #3: man.pdf --]
[-- Type: application/pdf, Size: 15996 bytes --]

* Re: [page-reclaim] Augmented Page Reclaim
  2021-02-02  8:57 Augmented Page Reclaim Yu Zhao
  2021-02-02 12:17 ` Matthew Wilcox
@ 2021-02-09 21:32 ` Jesse Barnes
  2021-02-10  7:12   ` Yu Zhao
  1 sibling, 1 reply; 6+ messages in thread
From: Jesse Barnes @ 2021-02-09 21:32 UTC (permalink / raw)
  To: Yu Zhao, Sonny Rao, Jann Horn, Matthew Wilcox
  Cc: linux-mm, Linux Kernel Mailing List, page-reclaim

> ======================
> Augmented Page Reclaim
> ======================
> We would like to share a work with you and see if there is enough
> interest to warrant a run for the mainline. This work is a part of
> result from a decade of research and experimentation in memory
> overcommit at Google: an augmented page reclaim that, in our
> experience, is performant, versatile and, more importantly, simple.

Per discussion on IRC, maybe some additional background would help.

In looking at browser workloads on Chrome OS, we found that reclaim was:
1) too expensive in terms of CPU usage
2) often making poor decisions about what to reclaim

This work was mainly targeted toward improving those things, with an
eye toward interactive performance for browser workloads.

We have a few key tests we use for that, that measure tab switch times
and number of tab discards when under memory pressure, and this
approach significantly improves these (see Yu's data).

We do expect this approach will also be beneficial to cloud workloads,
and so are looking for people to try it out in their environments with
their favorite key tests or workloads.

Thanks,
Jesse

* Re: [page-reclaim] Augmented Page Reclaim
  2021-02-09 21:32 ` [page-reclaim] " Jesse Barnes
@ 2021-02-10  7:12   ` Yu Zhao
  2021-02-10 13:22     ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Yu Zhao @ 2021-02-10  7:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Sonny Rao, Jann Horn, Matthew Wilcox, Jesse Barnes,
	Linux Kernel Mailing List, page-reclaim

On Tue, Feb 09, 2021 at 01:32:58PM -0800, Jesse Barnes wrote:
> > ======================
> > Augmented Page Reclaim
> > ======================
> > We would like to share a work with you and see if there is enough
> > interest to warrant a run for the mainline. This work is a part of
> > result from a decade of research and experimentation in memory
> > overcommit at Google: an augmented page reclaim that, in our
> > experience, is performant, versatile and, more importantly, simple.
> 
> Per discussion on IRC, maybe some additional background would help.

And I'll add more details to the doc included in the tree once I've
finished collecting feedback.

> In looking at browser workloads on Chrome OS, we found that reclaim was:
> 1) too expensive in terms of CPU usage

We have two general metrics for this item: CPU time spent on page
reclaim and (direct) page reclaim latency. CPU usage is important to
everybody but latency is also quite important for phones, laptops,
etc.

> 2) often making poor decisions about what to reclaim

We have another two metrics here: the number of refaults and the
number of false inactive pages. For example, it's bad if pages refault
within a hundred milliseconds after they have been reclaimed. Also
it's bad if reclaim thinks many pages are inactive but later finds
they are actually active.
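
As a rough illustration of the refault metric, the sketch below computes
refaults per reclaimed page between two ``/proc/vmstat`` snapshots. It
assumes the counters ``pgsteal_kswapd``, ``pgsteal_direct`` and
``workingset_refault_anon`` / ``workingset_refault_file`` (or the single
``workingset_refault`` counter on older kernels). The function names are
hypothetical. It does not measure how soon after reclaim a page
refaults, and it does not cover false inactive pages.

```python
REFAULT_KEYS = (
    "workingset_refault_anon",
    "workingset_refault_file",
    "workingset_refault",  # single counter on older kernels
)
RECLAIM_KEYS = ("pgsteal_kswapd", "pgsteal_direct")


def parse_vmstat(text):
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        value = value.strip()
        if value.isdigit():
            counters[name] = int(value)
    return counters


def refault_ratio(before, after):
    """Refaults per reclaimed page between two vmstat snapshots."""
    def delta(keys):
        # Counters missing from a snapshot are treated as zero.
        return sum(after.get(k, 0) - before.get(k, 0) for k in keys)

    reclaimed = delta(RECLAIM_KEYS)
    refaults = delta(REFAULT_KEYS)
    return refaults / reclaimed if reclaimed else 0.0
```

Taking one snapshot before a workload and one after, a ratio that is
lower on one kernel than on another for the same workload would suggest
fewer poor reclaim decisions, under the assumptions stated above.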

> This work was mainly targeted toward improving those things, with an
> eye toward interactive performance for browser workloads.
> 
> We have a few key tests we use for that, that measure tab switch times
> and number of tab discards when under memory pressure, and this
> approach significantly improves these (see Yu's data).

We monitor workload-specific metrics as well. For example, we found
page reclaim also affects battery life, tab switch latency and the
number of janks (pauses when scrolling web pages) on Chrome OS. I
don't want to dump everything here because they seem irrelevant to
most people.

> We do expect this approach will also be beneficial to cloud workloads,
> and so are looking for people to try it out in their environments with
> their favorite key tests or workloads.

And if you are interested in our workload-specific metrics, Android,
Cloud, etc., please feel free to contact us. Any other questions,
concerns and suggestions are also welcome.

* Re: [page-reclaim] Augmented Page Reclaim
  2021-02-10  7:12   ` Yu Zhao
@ 2021-02-10 13:22     ` Michal Hocko
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2021-02-10 13:22 UTC (permalink / raw)
  To: Yu Zhao
  Cc: linux-mm, Sonny Rao, Jann Horn, Matthew Wilcox, Jesse Barnes,
	Linux Kernel Mailing List, page-reclaim

On Wed 10-02-21 00:12:38, Yu Zhao wrote:
> On Tue, Feb 09, 2021 at 01:32:58PM -0800, Jesse Barnes wrote:
> > > ======================
> > > Augmented Page Reclaim
> > > ======================
> > > We would like to share a work with you and see if there is enough
> > > interest to warrant a run for the mainline. This work is a part of
> > > result from a decade of research and experimentation in memory
> > > overcommit at Google: an augmented page reclaim that, in our
> > > experience, is performant, versatile and, more importantly, simple.
> > 
> > Per discussion on IRC, maybe some additional background would help.
> 
> And I'll add more details to the doc included in the tree once I've
> finished collecting feedback.

Please be as specific as possible early.

> > In looking at browser workloads on Chrome OS, we found that reclaim was:
> > 1) too expensive in terms of CPU usage
> 
> We have two general metrics for this item: CPU time spent on page
> reclaim and (direct) page reclaim latency. CPU usage is important to
> everybody but latency is also quite important for phones, laptops,
> etc.

While this is true in general, more details would be more than welcome.
What is the source of the additional overhead and how does your work
address that?

This applies to most of the other areas you are covering here and in the
original cover letter. Especially when you do not plan to build on the
existing code and rather plan to do things considerably differently.

I confess I haven't checked your repository but it would have been much
better to post a patch series.
-- 
Michal Hocko
SUSE Labs
