[RFC IDEA] ACMA: Access/Contiguity-aware Memory Auto-scaling

* [RFC IDEA] ACMA: Access/Contiguity-aware Memory Auto-scaling
@ 2023-11-12 19:51 ` SeongJae Park
  0 siblings, 0 replies; 3+ messages in thread
From: SeongJae Park @ 2023-11-12 19:51 UTC (permalink / raw)
  Cc: damon, linux-mm, linux-kernel

Hello,

I'd like to share an idea for making systems automatically scale up/down memory
in an access/contiguity-awared way.  It is designed for memory efficiency of
free pages reporting-like collaboration based memory oversubscribed virtual
machine systems, but it might also be potentially useful for memory/power
efficiency and memory contiguity of general systems.  There is no
implementation at the moment, but I'd like to hear any comments or concerns
about the idea first if anyone has.  I will also share this in the upcoming
kernel summit DAMON talk[1]'s future plans part.

Background
==========

On memory oversubscribed virtual machine systems, free pages reporting could be
used as a core of the collaborative memory management.  That is, the kernel of
guests report free pages to the host, and the host utilizes the reported pages
for other guests.  When the guest accesses the reported guest-physical page
again, the host knows that via the page fault mechanism, allocates a
host-physical page, and provide it to the guest.

Requirements
------------

For maximizing the memory efficiency of such systems, below properties are
required to guest machines.

1. Being memory frugal.  The guest should use memory only as really needed.
Otherwise, only insufficient amounts of memory are freed and reported to the
host while the guest is wasting the host-physical pages to accommodate not
really needed data.  As a result, the host level memory efficiency is degraded.

2. Report-time contiguity of free pages.  To reduce the overhead of the free
pages reporting, the feature usually works for not every single page but for
contiguous free pages of user-specifiable granularity (e.g., 2 MiB).  Hence,
even if there are many free pages in a guest, if the free pages are not
report-granularity-contiguous, those cannot be reported to the host.

3. Post-report contiguity of free pages.  In some cases, the host's page size
could be different from (usually larger than) that of the guest.  For example,
the host can manage the memory with 2 MiB-sized pages while the guest is using
4 KiB-sized pages.  In this case, the host-guest page mapping works in the
host-side page size.  Hence, even if only one page among a reported contiguous
free pages are allocated again and accessed, the whole reported contiguous
chunk should be returned to the guest.  This kind of ping pong itself could
also consume some resources.

4. Minimizing the metadata for reported pages.  Even though the guests report
free pages, the metadata for the pages (e.g., 'struct page') still exist and
consume memory.  Ideally, guests should have only metadata for really needed
pages.

Possible Approach and Limitations
---------------------------------

There are kernel features that could be used from the guests' user space for
the requirements.  DAMON-based proactive reclamation[2] could be turned on for
being memory frugal with only minimum performance impact.  Proactive compaction
can periodically run for the report-time contiguity of free pages.  Memory
hot-unplugging can be used for freeing the metadata of free pages[3].  The
guest would need to hot-plug the memory blocks again depending on memory
demands.  This may require some changes in the kernel for user-space driven
hot-[un]plugging of memory, and reporting hot-unplugged memory to the host.

This approach could work, but has some limitations.  Firstly, memory
hot-[un]plugging needs isolation/migration of the whole pages in the target
block.  This takes time, and could fail for any page isolation/migration
failures.    Periodic compaction could also partially fail due to page
isolation/migration failures.  It could also waste resources for compacting too
much memory, while required contiguity is only report-granularity.  There is no
way to avoid the compacted regions being defragmented again.  We were unable to
find a good existing solution for the post-report contiguity.  Finally,
efficiently controlling these multiple different kernel features from user
space is complex and challenging.

ACMA: Access/Contiguity-aware Memory Auto-scaling
=================================================

We therefore propose a new kernel feature for the requirements, namely
Access/Contiguity-aware Memory Auto-scaling (ACMA).

Definitions
-----------

ACMA defines a metric called DAMON-detected working set.  This is a set of
memory regions that DAMON has detected access to those regions within a
user-specifiable time interval, say, one minute.

ACMA also defines a new operation called stealing.  It receives a contiguous
memory region as its input, and allocates the pages of the region.  If some
pages in the region are not free, migrate those out.  Hence it could be thought
of a variant, or a wrapper of memory offlining or alloc_contig_range().  If the
allocation is successful, it further reports the region as safe to use to the
host.  ACMA manages the stealing status of each memory block.  If the entire
page of a memory block is stolen, it further hot-unplug the block.

It further defines a new operation called stolen pages returning.  The action
receives an amount of memory size as input.  If there are not-yet-hot-unplugged
stolen pages of the size, it frees the page.  If there are no such stolen pages
but a hot-unplugged stolen memory block, it hot-plugs the block again, closer
to the not-hot-unplugged blocks first.  Then the guest users can allocate pages
of returned ones and access it.  When they access it, the host will notify that
via page fault and assign/map a host-physical page for that.

Workflow
--------

With these definitions, ACMA behaves based on system status as follows.

Phase 0. It periodically monitors the DAMON-based working set size and free
memory size of the system.

Phase 1. If the free memory to the working set size ratio is more than a
threshold (high), say, 2:1 (200%), ACMA steals report-granularity contiguous
non-working set pages in the last not-yet-hot-unplugged memory block, colder
pages first.  The ratio will decrease.

Phase 2. If the free memory to the working set size ratio becomes less than a
threshold (normal), say, 1:1 (100%), ACMA stops stealing and start reclaiming
non-workingset pages, colder pages first.  The ratio will increase.  The
reclamation is continued until the ratio becomes higher than the normal
threshold.

Phase 3. If the non-workingset reclamation is not increasing the ratio and it
becomes less than yet another threshold (low), say, 1:2 (50%), ACMA starts
returning stolen pages until the free memory to the working set ratio becomes
higher than the low threshold.

Expectations
------------

Since ACMA does stealing in phase 1, which does a sort of compaction on its
own, in free pages report-granularity, it does compaction only as much as
really required.  Because the stealing targets colder pages first, it will only
rarely conflict with users of the pages.  Hence less isolation/migration
failures, which results in more stealing success is expected.

Since ACMA-stolen pages are allocated to ACMA, which is in kernel space, no
other in-guest components can use it before ACMA returns those.  Hence,
after-report contiguity is kept, unless working set size, which represents real
memory demand, grows enough to make ACMA work in the phase 3.

Since ACMA does proactive non-workingset cold-pages first reclamation in
phase 2, the guest becomes memory frugal with minimum performance degradation.

Because the phase changes based on free memory to working set size ratio, the
guest system is guaranteed to have only the working set plus normal-high
(100%-200% in this example) working set size proportional free memory.  This
wouldn't be true if the working set size is more than 50% of all available
guest-physical memory of the guest.  In this case, if memory demands continues
increasing, any system has no way but OOM.  The host might be able to detect
this and add more guest-physical memory so that ACMA can hot-plug those
automatically, though.  Because stealing does hot-unplugging of the memory,
'struct page' for only really needed pages are used.

Hence, ACMA provides monitored access pattern based contiguity-aware real
memory demands based memory scaling without unnecessary metadata.

Implementation
--------------

Implementation detail is to-be-discussed, but we could implement ACMA using
DAMOS.  That is, the stealing and stolen pages return operation could be
implemented as a new DAMOS action.  The working set size monitoring can be
natively done with DAMON.  The three phases can each be implemented as a DAMOS
scheme.  The free memory to the working set size ratio based activation and
deactivation of the schemes can be done using the aim-oriented auto-tuning of
DAMOS[4].  We could add PSI goals to the schemes, too.

For example, below DAMOS schemes in the DAMO json input format could be
imagined.  Note that this is not what is currently supported.

    [
        {
            "action": "acma_steal",
            "access_pattern": {
                "sz_bytes": { "min": "2 MB", "max": "max" },
                "nr_accesses": { "max": "0 %" },
                "age": { "min": "1 minute" }
            },
            "auto_tuning_aims": [
                {
                    "metric": "workingset_to_free_mem_ratio",
                    "workingset_min_age": "1 minute", "target": 1.0
                },
                { "metric": "psi_mem_ratio", "target": 0.001, },
            ],
        },
        {
            "action": "pageout",
            "access_pattern": {
                "nr_accesses": { "max": "0 %" },
                "age": { "min": "1 minute" }
            },
            "auto_tuning_aims": [
                {
                    "metric": "free_mem_to_workingset_ratio",
                    "workingset_min_age": "1 minute", "target": 1.0
                },
                { "metric": "psi_mem_ratio", "target": 0.001, },
            ],
        },
        {
            "action": "acma_return",
            "auto_tuning_aims": [
                {
                    "metric": "free_mem_to_workingset_ratio",
                    "workingset_min_age": "1 minute", "target": 0.5
                },
            ],
        }
    ]

Potential Benefits for General Usage
====================================

ACMA is basically designed for memory overcommitted virtual machine systems, as
described above.  However, it could be useful for general systems that memory
can be physically hot-[un]pluggable.  It could help improve memory efficiency
of physical clusters, and save power for unused DRAM or memory devices.

We could also think about extending ACMA to provide a contiguous memory
allocation interface.  Since stolen pages are report-granularity or memory
block-granularity contiguous and isolated from the system's other components,
ACMA could allocate contiguous memory from the stolen memory, without high
latency.  If the report granularity and required contiguous memory allocation
size is same (e.g., 2 MiB default free pages reporting granularity and 2 MiB
hugepages), it would be especially efficient.  In this case, ACMA may stand for
Access-aware Contiguous Memory Allocator.

Request For Comments
====================

This is in very early stage.  No enough survey of related works is done, and no
implementation is made at all.  That said, I hope to share what I'm gonna do,
and get any comment if possible, not to only success, but rather to learn from
you and develop it together, or even fail fast.

Example ACMA Operation Scenario
===============================

Let’s assume a guest using 2MiB size pages.  Each memory block has 9 pages, and
also 1 page for metadata of the 9 pages.  The system has 10 memory blocks, so
200 MiB memory in total.

Let’s represent the state of each page as below.

U: stolen-and-unplugged page
M: metadata of the pages in the page block
S: stolen-but-not-yet-unplugged page
F: Free page
C: Non-free (assigned) cold (non-workingset) page
H: Non-free (assigned) hot (workingset) page

And the proposed system is configured like above implementation example.  To
summarize it again,

* Steal 2MiB-contiguous cold memory in last plugged memory block, when
  free memory to working set rate > 100%
* Reclaim cold pages if free memory to working set rate <= 100%
* Return stolen memory if free memory to working set rate < 50%

The initial state could look like below.

    MFFFFFFFFF MFFFFFFFFF
    MFFFFFFFFF MFFFFFFFFF
    MFFFFFFFFF MFFFFFFFFF
    MFFFFFFFFF MCCCCCCCCC
    MCCCCCCCCC MHHHHHHHHH

    Free mem to working set rate: 63 pages / 9 pages = 700 %

Stealing memory (down-scaling)
------------------------------

Since free memory to workingset ratio is larger than 100%, cold pages stealing
works.  Stolen pages are reported to the host.  As more pages are stolen, the
free memory to workingset ratio decreases.  For example, if hot/cold pages are
stable and four free pages are stolen, the status looks like below.  Note that
ACMA can steal allocated-cold pages, too.

    MFSFSFSSFF MFFFFFFFFF
    MFFFFFFFFF MFFFFFFFFF
    MFFFFFFFFF MFFFFFFFFF
    MFFFFFFFFF MCCCCCCCCC
    MCCCCCCCCC MHHHHHHHHH

    Free mem to working set rate: 59 pages / 9 pages = 655 %

Stealing works for only the last not-yet-unplugged block.  Once all pages of
the block are stolen, the entire block is unplugged.  The metadata for the
block also becomes available to the host.  Stealing continues to the next
block.

    UUUUUUUUUU MFSFFFFFFF
    MFFFFFFFFF MFFFFFFFFF
    MFFFFFFFFF MFFFFFFFFF
    MFFFFFFFFF MCCCCCCCCC
    MCCCCCCCCC MHHHHHHHHH

    Free mem to working set rate: 52 pages / 9 pages = 577 %

Reclamation helps stealing
--------------------------

And the stealing continues... Until free memory to working set rate reaches
100%.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFFFFFFFF MCCCCCCCCC
    MCCCCCCCCC MHHHHHHHHH

    Free mem to working set rate: 9 pages / 9 pages = 100 %

Now stealing stops, and proactive reclamation starts.  It reclaims cold pages,
making those free pages.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFFFFFFFF MCFCCCCCCC
    MCCCCCCCCC MHHHHHHHHH

    Free mem to working set rate: 10 pages / 9 pages = 111 %

Now reclamation is deactivated, and stealing be activated.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFSFFFFFF MCFCCCCCCC
    MCCCCCCCCC MHHHHHHHHH

    Free mem to working set rate: 9 pages / 9 pages = 100 %

Ping pong of reclamation-stealing continues.  Reclaim,

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFSFFFFFF MCFCFCCCCC
    MCCCCCCCCC MHHHHHHHHH

    Free mem to working set rate: 10 pages / 9 pages = 111 %

and then Steal.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFSSFFFFF MCFCFCCCCC
    MCCCCCCCCC MHHHHHHHHH

    Free mem to working set rate: 9 pages / 9 pages = 100 %

Eventually, converges to system having only workingset and
workingset-sufficient amount (workingset size) of free memory.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFFFFFFFF MHHHHHHHHH

    Free mem to working set rate: 9 pages / 9 pages = 100 %

In this state, proactive reclaim is still active, but do nothing since no
allocated cold pages exist.

Stollen pages returning
-----------------------

User could start allocating more pages and accessing those frequently (make
hot).  In other words, working set could increase.  Then free memory to
workingset size ratio decreases.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFFFFFHFF MHHHHHHHHH

    Free mem to working set rate: 8 pages / 10 pages = 80 %

Proactive reclaim is still active, but doesn’t increase the free memory, since
no allocated and cold page exists.  This situation continues until stolen pages
returning threshold is met (free memory to working set 50%).

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFHFHFHFF MHHHHHHHHH

    Free mem to working set rate: 6 pages / 12 pages = 50 %

If the user stops increasing the working set, this could be a stabilized state.

If the user adds one more hot page, the state becomes like below.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    MFFHFHHHFF MHHHHHHHHH

    Free mem to working set rate: 5 pages / 13 pages = 38 %

Now stolen pages returning is activated.  Since there is no
stolen-but-plugged page, it plugs the last unplugged memory block.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU MFFFFFFFFF
    MFFHFHHHFF MHHHHHHHHH

    Free mem to working set rate: 14 pages / 13 pages = 107 %

It increased the free memory to working set ratio to a high level, so returning
and proactive reclamation stops.  Stealing is again activated, decreasing the
free memory to working set ratio.

    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU UUUUUUUUUU
    UUUUUUUUUU MFFFSFFFFF
    MFFHFHHHFF MHHHHHHHHH

    Free mem to working set rate: 13 pages / 13 pages = 100 %

In this way, the system will always have a real working set (hot pages) plus
50-100% of the working set size free memory, and let the host uses the
remaining guest-physical memory.

[1] https://lpc.events/event/17/contributions/1624/
[2] https://docs.kernel.org/admin-guide/mm/damon/reclaim.html
[3] https://docs.kernel.org/admin-guide/mm/memory-hotplug.html#phases-of-memory-hotunplug
[4] https://lore.kernel.org/damon/20231112194607.61399-1-sj@kernel.org/

^ permalink raw reply	[flat|nested] 3+ messages in thread