[RFC IDEA v2 0/6] mm/damon: introduce Access/Contiguity-aware Memory Auto-scaling (ACMA)

From: SeongJae Park <sj@kernel.org>
Cc: SeongJae Park <sj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Xuan Zhuo <xuanzhuo@linux.alibaba.com>,
	damon@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, virtualization@lists.linux.dev,
	cxiaoyi@amazon.com, sieberf@amazon.com, bodeddub@amazon.com,
	jiamingy@amazon.com
Subject: [RFC IDEA v2 0/6] mm/damon: introduce Access/Contiguity-aware Memory Auto-scaling (ACMA)
Date: Sun, 12 May 2024 12:36:51 -0700
Message-ID: <20240512193657.79298-1-sj@kernel.org>

Extend DAMOS for access-aware, gradual allocation of contiguous memory
regions, and implement a module that efficiently and automatically
scales system memory using the feature.

This is not a valid patchset but a summary of the idea, with
pseudo-code-level partial implementation examples.  The implementation
examples are only for helping people understand the idea and how it
would be implemented.  The code is not tested at all.  No attempt has
even been made to compile it.

Motivation: Memory Overcommit Virtual Machine Systems
=====================================================

This work is motivated by an effort to improve the efficiency and
reliability of a method for operating memory over-committed virtual
machine systems.  This section describes a business model of such
systems, an available approach, and its limitations.

Business Model
--------------

The business provides compute/memory resources for users' workloads.
The service provider receives the workload from the user, runs the
workload on their guest virtual machines, and returns the workload's
output to the user.  The provider calculates how much of their resource
is consumed by the workload, and asks the user to pay for the real
usage.

To maximize the host-level memory utilization while providing the
highest performance and lowest price to the user, the provider
overcommits the host's memory and automatically scales the guests'
memory based on their estimation of the workload's true memory demand
at runtime.  To avoid low performance or high price resulting from the
provider's mistakes in the auto-scaling, the user can specify the
minimum and maximum amount of memory for their guest machine.

  User                              Service Provider
                                   ┌─────────────────┐
  workload, min/max memory ──────► │                 │
                                   │      ???        │
  workload output, bill    ◄────── │                 │
                                   └─────────────────┘

Existing Approach
-----------------

It is challenging to estimate the real memory demand of guests from the
host with high accuracy.  Meanwhile, the service provider controls both
the host and the guests.  Therefore they use guest-driven cooperative
management.  The guest reports unnecessary pages to the host, and the
host reallocates the reported pages to other guests.  Specifically,
free pages reporting is being used.

The host-level reuse of the pages is invisible to guests.  Guests can
simply use their pages regardless of the reporting.  If a guest accesses
a page that was reported before, the host detects it via page faults,
and gives the memory back to the guest.

Unless the guest is memory-frugal, only a small amount of the guests'
memory is reported to the host, and the host-level memory utilization
drops.  To make the guests memory-frugal with minimum performance
impact, the guests run access-aware proactive memory reclamation using
DAMON.  The basic structure of the system looks like below.

  ┌─────────────────────────────┐      ┌─────────┐
  │  Guest 1                    │      │ Guest 2 │
  │ ┌─────────────────────────┐ │      │         │
  │ │ DAMON-based ReClamation │ │      │         │
  │ └────────────┬────────────┘ │      │         │
  │              │ Reclaim      │      │         │
  │              ▼              │      │         │
  │ ┌─────────────────────────┐ │      │         │
  │ │  Free pages reporting   │ │      │         │
  │ └────────────┬────────────┘ │      │         │
  │              │              │      │         │
  └──────────────┼──────────────┘      └─────────┘
                 │ Report reclaimed         ▲
                 ▼ (free) pages             │ Alloc Guest 1
  ┌───────────────────────────────┐         │ freed memory
  │            Host               ├─────────┘
  └───────────────────────────────┘

The guest uses 4 KiB-size regular pages by default while the host uses 2
MiB-size regular pages for efficient management of the huge host-level
memory.  Hence, even if a guest reports a 4 KiB page, the host cannot
use it unless its 511 neighbor pages are also reported.  Letting the
guest report every 4 KiB page only increases the reporting overhead.
Hence the free pages reporting is tuned to work in 2 MiB granularity.
To avoid fragmented free pages not being reported, guests also
proactively run memory compaction ('/proc/sys/vm/compact_memory').
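
To illustrate the granularity arithmetic above, below is a minimal
user-space C sketch.  It checks whether the 2 MiB chunk containing a
given 4 KiB guest page is entirely free, which is the condition for the
host being able to reuse it.  The function name and the 'page_free'
array are hypothetical and only for illustration; this is not how free
pages reporting is implemented in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GUEST_PAGE_SIZE	(4UL << 10)	/* 4 KiB guest page */
#define HOST_PAGE_SIZE	(2UL << 20)	/* 2 MiB host page */
/* guest pages per host page: 2 MiB / 4 KiB = 512 */
#define PAGES_PER_CHUNK	(HOST_PAGE_SIZE / GUEST_PAGE_SIZE)

/*
 * Return true if the host-page-sized chunk containing guest page 'pfn'
 * is entirely free.  'page_free[i]' tells whether guest page 'i' is
 * free (hypothetical input for this sketch).
 */
static bool chunk_reportable(const bool *page_free, size_t nr_pages,
		size_t pfn)
{
	size_t start = pfn - pfn % PAGES_PER_CHUNK;
	size_t i;

	assert(pfn < nr_pages);
	if (start + PAGES_PER_CHUNK > nr_pages)
		return false;	/* partial chunk at the end of memory */
	for (i = start; i < start + PAGES_PER_CHUNK; i++) {
		if (!page_free[i])
			return false;	/* one in-use neighbor blocks it */
	}
	return true;
}
```

A single in-use page among the 512 makes the whole chunk useless for the
host, which is why the guests also run compaction.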

The provider further wants to minimize the 'struct page' overhead.  For
that, the guests continuously estimate the real memory demand of the
running workload, and hot-[un]plug memory blocks with
'memory_hotplug.memmap_on_memory' so that 'struct page' objects for
offlined memory blocks can also be deallocated.  The guest kernel is
modified to let user space hot-[un]plug memory blocks, and report the
hot-unplugged memory blocks to the host.  This memory hot-[un]plugging
is also used to enforce the user-specified maximum memory limit.

Limitations
-----------

The approach uses four kernel features, namely free pages reporting,
DAMON-based proactive reclamation, compaction, and memory
hot-[un]plugging.  Efficiently utilizing, from user space, four kernel
features that were not designed to be used together for this specific
case is somewhat challenging.

Memory hot-unplugging is slow and prone to failure.  The problem mainly
comes from the fact that the operation requires isolating and migrating
pages in the block to other blocks, and the operation works in memory
block granularity, which is huge compared to pages.  The minimum amount
of work for doing it is not small, and the probability of meeting
unmovable pages in the huge block is not low.  This makes the
guest-level memory scaling slow and unreliable, which results in low
host-level memory efficiency.

The system-level compaction is not optimized for the reporting purpose
only.  It could consume resources for compacting parts of memory that
will not be reportable to the host for reuse anyway.

Both hot-unplugging and compaction require page isolations and
migrations, which can legitimately fail for various reasons.  The
operations would better be applied to cold pages first, since cold pages
would have lower probability of being pinned or causing performance
impact.  But both hot-unplugging and compaction are access pattern
oblivious.

There is no control over the reported pages.  This helps keep the
system simple, but it makes reusing reported pages unreliable.  Any
reported page can be accessed again by the guest.  And even if only one
page among the reported 512 pages is accessed again, the entire 512
pages need to be returned to the guest.

Both the host and guest systems are under the service provider's
control, but the workload is not.  Occasional host-level memory pressure
is hence inevitable.  The host could avoid such situations by setting a
host-driven hard memory limit on guests.  Methods like balloon drivers
could be used.  However, such existing mechanisms lack an understanding
of the host's different page size, and are access-oblivious.

Design
======

We mitigate the limitations by introducing a way to get ownership of
contiguous memory regions in an access-aware way, implementing a kernel
module that automatically scales the memory of the system in an
access/contiguity-aware way, and replacing the approach with the module.

Access-aware Contiguous Memory Allocation
-----------------------------------------

As mentioned in the limitations section above, doing page isolations and
migrations, which are core operations of the memory scaling usage, in
fine granularity and for cold pages first may bring some improvements.
DAMOS helps applying specific memory operations in an access-aware
manner.  Therefore, we implement the core operations as DAMOS schemes.
For this, we design two new DAMOS actions, namely 'alloc' and 'free'.

'Alloc' DAMOS action migrates in-use pages of the DAMOS target memory
region out of the region, and gets the ownership of the pages.  The
action also receives the granularity of the operation to apply at once.
In implementation, 'alloc_contig_range()' may simply be used.  For each
chunk of contiguous pages of the given granularity that is successfully
allocated, DAMOS does nothing but notify the user of it.  Then the user
can use the pages for their purpose.  In other words, 'alloc' DAMOS
action takes ownership of DAMON-found regions of a specific access
pattern in a specific granularity and passes them to the user.  For
example, the guest of the motivation use case can ask DAMOS to 'alloc'
cold regions in 2 MiB granularity and report those to the host as
unnecessary, to scale the memory down.
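
The per-granularity behavior of 'alloc' can be sketched in user-space C
as below.  This is only an illustration under assumptions: the callback
types and all function names are hypothetical, 'try_alloc' merely
stands in for something like 'alloc_contig_range()', and the demo
callbacks pretend that one chunk contains an unmovable page.  It is not
the interface of the actual patches.

```c
#include <assert.h>
#include <stdbool.h>

/* hypothetical callbacks, both taking a [start, end) address range */
typedef bool (*alloc_chunk_fn)(unsigned long start, unsigned long end);
typedef void (*notify_fn)(unsigned long start, unsigned long end);

/*
 * Try to take ownership of every 'granularity'-aligned chunk lying
 * fully inside [start, end), and notify the user of each success.
 * Returns the number of bytes that were successfully allocated.
 */
static unsigned long alloc_action(unsigned long start, unsigned long end,
		unsigned long granularity, alloc_chunk_fn try_alloc,
		notify_fn notify)
{
	unsigned long addr, allocated = 0;

	assert(granularity > 0);
	/* round up to the first aligned chunk boundary */
	addr = (start + granularity - 1) / granularity * granularity;
	for (; addr + granularity <= end; addr += granularity) {
		if (!try_alloc(addr, addr + granularity))
			continue;	/* e.g., unmovable page; skip chunk */
		notify(addr, addr + granularity);
		allocated += granularity;
	}
	return allocated;
}

/* demo: the chunk containing 'pinned_addr' cannot be allocated */
static unsigned long pinned_addr = 5UL << 20;
static unsigned long nr_notified;

static bool demo_try_alloc(unsigned long start, unsigned long end)
{
	return !(start <= pinned_addr && pinned_addr < end);
}

static void demo_notify(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	nr_notified++;	/* a guest would report the chunk to the host */
}
```

Because a failure affects only one small chunk, a single unmovable page
costs 2 MiB here rather than a whole memory block.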

'Free' DAMOS action returns the ownership of the DAMOS target memory
region to the system.  Same as 'alloc', the user can specify the
granularity of each operation.  Before returning the ownership, DAMOS
notifies the user which pages are going to be returned to the system, so
that the user can safely give up the ownership.  In the page fault based
memory overcommit use case, the user would need to do nothing for such
notifications, though.

Access/Contiguity-aware Memory Auto-scaling (ACMA)
--------------------------------------------------

Using the two new DAMOS actions, we design a kernel module for replacing
the above-mentioned approach.  The module is called
Access/Contiguity-aware Memory Auto-scaling (ACMA).  ACMA receives three
user inputs.  Those are the minimum amount of memory to let the system
use ('min-mem'), the maximum amount of memory to let the system use
('max-mem'), and the acceptable level of memory pressure
('mem-pressure').  'Mem-pressure' is represented by memory pressure
stall information (PSI).  Then, ACMA scales the memory of the system
while meeting the conditions, utilizing three DAMON-based operation
schemes, namely 'reclaim', 'scale-down', and 'scale-up'.

'Reclaim' scheme reclaims memory of the system, coldest pages first,
aiming for the 'mem-pressure' amount of memory pressure.  That is, if
the system's memory pressure level is lower than 'mem-pressure', it
reclaims some of the coldest pages of the system.  The amount of memory
to reclaim is proportional to the distance between the current pressure
level and 'mem-pressure'.  If the memory pressure level becomes higher
than 'mem-pressure', it reduces the amount again in a proportional way
until the memory pressure level becomes same as or lower than
'mem-pressure'.  This can be implemented as a DAMOS scheme of 'pageout'
action with a memory pressure type quota tuning goal of 'mem-pressure'
value.
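
The proportional adjustment described above could look like the below
user-space C sketch.  It is a simplified illustration, not DAMOS' actual
quota auto-tuning code.  The function names, the 'min_quota' floor, and
the unit of the pressure values (any PSI-derived number, as long as the
current and the target values use the same unit) are assumptions made
for this example.

```c
#include <assert.h>
#include <limits.h>

/* quota * diff / target, dividing first if the product would overflow */
static unsigned long scale_quota(unsigned long quota, unsigned long diff,
		unsigned long target)
{
	if (!diff)
		return 0;
	if (quota <= ULONG_MAX / diff)
		return quota * diff / target;
	return quota / target * diff;
}

/*
 * Return the reclaim quota for the next interval, given the last quota
 * and the current and the target memory pressure.
 */
static unsigned long next_reclaim_quota(unsigned long last_quota,
		unsigned long pressure, unsigned long target,
		unsigned long min_quota)
{
	unsigned long adj;

	assert(target > 0);
	if (last_quota < min_quota)
		last_quota = min_quota;
	if (pressure <= target) {
		/* below the goal: reclaim more, proportional to the gap */
		adj = scale_quota(last_quota, target - pressure, target);
		if (last_quota > ULONG_MAX - adj)
			return ULONG_MAX;
		return last_quota + adj;
	}
	/* above the goal: back off proportionally, down to min_quota */
	if (pressure - target >= target)
		return min_quota;
	adj = scale_quota(last_quota, pressure - target, target);
	return last_quota - adj > min_quota ? last_quota - adj : min_quota;
}
```

For example, with a target of 100 and a last quota of 1000, a measured
pressure of 50 grows the quota by half, while a measured pressure of 150
shrinks it by half.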

'Scale-down' scheme scales the memory down aiming for the 'mem-pressure'
memory pressure.  It is implemented as a DAMOS scheme of 'alloc' action
with 2 MiB operation granularity.  Similar to 'reclaim' scheme, it has a
memory pressure type quota tuning goal of 'mem-pressure' target value.
For each allocated 2 MiB contiguous pages, it applies a
vmemmap-remapping based 'struct page' optimization and reports those to
the host so that the host can reuse them.  The scheme has an 'address
range' type scheme filter.  The filter makes the scheme be applied to
only the memory block of the highest physical address that is not
completely allocated/reported, and in the physical address range
starting from 'min-mem' and ending at the end of the physical address
space of the system.  Once the current 'scale-down' target memory block
is entirely allocated/reported, ACMA updates the filter to apply the
action to the next lower-address memory block.
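
The filter target selection for 'scale-down' can be sketched in
user-space C as below.  The 'block_done' array, the 'struct range' type,
and the function name are hypothetical bookkeeping made up for this
illustration.  The sketch also assumes the memory blocks are equally
sized and start from address zero.

```c
#include <assert.h>
#include <stdbool.h>

struct range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/*
 * Find the address range for the 'scale-down' filter: the highest
 * address memory block that is not entirely allocated/reported yet,
 * clipped so that it never goes below 'min_mem'.  'block_done[i]' is
 * true if block 'i' is entirely allocated/reported.  Returns false if
 * no block is left for scaling down.
 */
static bool scale_down_target(const bool *block_done,
		unsigned long nr_blocks, unsigned long block_sz,
		unsigned long min_mem, struct range *out)
{
	unsigned long i;

	assert(block_sz > 0);
	for (i = nr_blocks; i > 0; i--) {
		unsigned long start = (i - 1) * block_sz;
		unsigned long end = start + block_sz;

		if (end <= min_mem)
			break;	/* this and lower blocks are protected */
		if (block_done[i - 1])
			continue;
		out->start = start > min_mem ? start : min_mem;
		out->end = end;
		return true;
	}
	return false;
}
```

With eight 128 MiB blocks and 'min_mem' of 300 MiB, the target starts
from the block at 896 MiB, moves down block by block, and ends with the
clipped range of 300-384 MiB.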

'Scale-up' scheme scales the memory up aiming for the 'mem-pressure'
memory pressure.  It is implemented as a DAMOS scheme of 'free' action
with 2 MiB operation granularity.  It also uses a memory pressure type
quota tuning goal with 'mem-pressure' target value, but it notifies
DAMOS that the aggressiveness of the scheme and the memory pressure are
inversely proportional.  Similar to 'scale-down' scheme, it uses an
'address range' type scheme filter.  The filter makes the scheme be
applied to only the partially or completely 'alloc'-ed and reported
memory block of the lowest starting address in the physical address
range starting from '0' and ending at the 'max-mem' of the address
space.  Inside the to-free-pages notification callback, ACMA cancels the
vmemmap-remapping based 'struct page' optimization for the pages.  Once
the current 'scale-up' target address range is entirely 'free'-ed, ACMA
updates the filter to apply the action to the just-hotplugged memory
block.
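
The 'scale-up' filter target selection is the mirror image of that of
'scale-down'.  Below is a user-space C sketch of it.  The
'block_alloced' array (bytes currently 'alloc'-ed in each block), the
'struct range' type, and the function name are hypothetical and only for
illustration.  Equally sized memory blocks starting from address zero
are assumed.

```c
#include <assert.h>
#include <stdbool.h>

struct range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/*
 * Find the address range for the 'scale-up' filter: the lowest address
 * memory block that is partially or completely 'alloc'-ed, clipped so
 * that it never goes over 'max_mem'.  Returns false if nothing below
 * 'max_mem' is left to be 'free'-ed.
 */
static bool scale_up_target(const unsigned long *block_alloced,
		unsigned long nr_blocks, unsigned long block_sz,
		unsigned long max_mem, struct range *out)
{
	unsigned long i;

	assert(block_sz > 0);
	for (i = 0; i < nr_blocks; i++) {
		unsigned long start = i * block_sz;
		unsigned long end = start + block_sz;

		if (start >= max_mem)
			break;	/* this and higher blocks are off limits */
		if (!block_alloced[i])
			continue;	/* nothing to free in this block */
		out->start = start;
		out->end = end < max_mem ? end : max_mem;
		return true;
	}
	return false;
}
```

Memory above 'max_mem' is never selected, so 'scale-up' cannot grow the
system beyond the limit.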

Overall Picture of ACMA-running System
--------------------------------------

Below illustrates how ACMA's three schemes will be applied to different
address ranges.  The effective memory size of the system starts from
'end' bytes, and automatically changes depending on the real memory
pressure of the system.  Once it becomes lower than 'max-mem' bytes, it
will never get greater than 'max-mem' bytes, since 'scale-up' cannot
make an impact over the limit.  It can also never get lower than
'min-mem' bytes because 'scale-down' does not cross the line.
'Reclaim' scheme makes the system memory-frugal and helps 'scale-down'
by making free pages that can be migration destinations.  Because all
the schemes auto-tune their aggressiveness based on 'mem-pressure', the
system's memory pressure cannot exceed the user-acceptable level.

          Reclaim
             │
  ┌──────────┴────────────┐
  │                       │
  │          scale-down   │
                 │
         ┌───────┴────────┐
         │                │
  0      min-mem max-mem  end (memory address)
  │              │
  └──────┬───────┘
         │
     Scale-up

Ballooning-ACMA Integration
---------------------------

As mentioned in the limitations section above, the host may want to set
up a ballooning-like host-driven guest memory limit, while the existing
implementation is not access/contiguity-aware.  We integrate the
ballooning driver and ACMA for such a case.  We modify ballooning to set
the 'max-mem' of ACMA instead of doing the classic page allocation-based
balloon inflation.  Specifically, we expose a function for setting the
'max-mem' of ACMA, and update virtio-balloon driver's
virtballoon_changed() to use the function to set the effective hard
memory limit of the guest.  In this way, the host can simply use the
classical virtio-balloon interface while the ballooning driver works in
the access/contiguity-aware efficient way.
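
One possible way to translate the balloon target size into 'max-mem' is
sketched in user-space C below.  The function name and the exact mapping
(total memory minus the requested balloon size) are assumptions made for
illustration, and may differ from what the patches do.  The only
borrowed fact is that virtio-balloon expresses its size in 4 KiB units.

```c
#include <assert.h>

/* virtio-balloon expresses the balloon size in 4 KiB units */
#define BALLOON_PAGE_SHIFT	12

/*
 * Translate the balloon size that the host requested ('balloon_pages',
 * in 4 KiB units) into the 'max-mem' value in bytes, for a guest having
 * 'total_mem' bytes of memory.
 */
static unsigned long balloon_to_max_mem(unsigned long total_mem,
		unsigned long balloon_pages)
{
	assert(total_mem > 0);
	/* the balloon cannot take more than what the guest has */
	if (balloon_pages >= (total_mem >> BALLOON_PAGE_SHIFT))
		return 0;
	return total_mem - (balloon_pages << BALLOON_PAGE_SHIFT);
}
```

For a 1 GiB guest, a balloon request of 65536 pages (256 MiB) would
translate to a 'max-mem' of 768 MiB.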

Possible Future Usage of Access-aware Memory Allocation
=======================================================

As mentioned in the motivation section above, this work is motivated by
the memory over-commit VM systems.  However, we believe similar
approaches could be used for more use cases.  We introduce two such
cases below.

Contiguous Memory Allocation
----------------------------

For contiguous memory allocation, a large contiguous memory pool is
required.  Current approaches reserve such regions in early boot time
before the memory is fragmented, or define zones of specific types.  The
pool can be used exclusively for the contiguous memory allocation, or
allow non-contiguous memory allocation to use it under special
conditions such as allowing only movable pages.  The second approach
improves the memory utilization, but sometimes suffers from pages that
are movable by definition but not easily movable in practice, similar to
the memory block-level page migration for memory hot-unplugging that is
described in the limitations section.  Even setting aside the migration
reliability and speed, finding the optimum size of the pool is
challenging.

We could use an ACMA-like approach for dynamically allocating a memory
pool for contiguous memory allocation.  It would be similar to ACMA but
would not report DAMOS-alloc-ed pages to the host.  Instead, it would
use the regions as the contiguous memory allocation pool.

DRAM Consuming Power Saving
---------------------------

DRAM consumes a huge amount of power and accordingly emits a huge amount
of carbon.  On bare-metal machines, we could scale down memory using
ACMA, hot-unplug completely DAMOS-alloc-ed memory blocks, and power off
the DRAM device if the hardware supports such an operation.

Discussion Points
=================

- Are there better existing alternatives for memory over-commit VM
  systems?
- Is it ok to reuse pages reporting infrastructure from ACMA?
- Is it ok to reuse virtio-balloon's interface for ACMA-integration?
- Will access-aware migration bring real benefit?
- Do future usages of access-aware memory allocation make sense?

SeongJae Park (6):
  mm/damon: implement DAMOS actions for access-aware contiguous memory
    allocation
  mm/damon: add the initial part of access/contiguity-aware memory
    auto-scaling module
  mm/page_reporting: implement a function for reporting specific pfn
    range
  mm/damon/acma: implement scale down feature
  mm/damon/acma: implement scale up feature
  drivers/virtio/virtio_balloon: integrate ACMA and ballooning

 drivers/virtio/virtio_balloon.c |  26 ++
 include/linux/damon.h           |  37 +++
 mm/damon/Kconfig                |  10 +
 mm/damon/Makefile               |   1 +
 mm/damon/acma.c                 | 546 ++++++++++++++++++++++++++++++++
 mm/damon/paddr.c                |  93 ++++++
 mm/damon/sysfs-schemes.c        |   4 +
 mm/page_reporting.c             |  27 ++
 8 files changed, 744 insertions(+)
 create mode 100644 mm/damon/acma.c

base-commit: 40475439de721986370c9d26f53596e2bd4e1416
-- 
2.39.2