* [nacked] memcg-introduce-per-memcg-reclaim-interface.patch removed from -mm tree
@ 2020-09-22 17:35 akpm
0 siblings, 0 replies; only message in thread
From: akpm @ 2020-09-22 17:35 UTC (permalink / raw)
To: mm-commits, yang.shi, sjpark, rientjes, mkoutny, mhocko, hannes,
guro, gthelen, shakeelb
The patch titled
Subject: memcg: introduce per-memcg reclaim interface
has been removed from the -mm tree. Its filename was
memcg-introduce-per-memcg-reclaim-interface.patch
This patch was dropped because it was nacked
------------------------------------------------------
From: Shakeel Butt <shakeelb@google.com>
Subject: memcg: introduce per-memcg reclaim interface
Introduce a memcg interface to trigger memory reclaim on a memory cgroup.
Use cases:
----------
1) Per-memcg uswapd:
Usually applications consist of a combination of latency-sensitive
and latency-tolerant tasks, for example tasks serving user requests
vs tasks doing data backup for a database application. At the moment
the kernel does not differentiate between such tasks when the
application hits the memcg limits, so a latency-sensitive
user-facing task can potentially get stuck in high reclaim and be
throttled by the kernel.
Similarly there are cases of single-process applications having two
sets of thread pools where threads from one pool have high scheduling
priority and low latency requirements. One concrete example from our
production is the VMM, which has a high-priority low-latency thread
pool for the VCPUs and a separate thread pool for stats reporting,
I/O emulation, health checks and other managerial operations. The
kernel memory reclaim does not differentiate between a VCPU thread
and a non-latency-sensitive thread, so a VCPU thread can get stuck
in high reclaim.
One way to resolve this issue is to preemptively trigger the memory
reclaim from a latency tolerant task (uswapd) when the application is
near the limits. Finding 'near the limits' situation is an orthogonal
problem.
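As a rough sketch (not part of this patch), a uswapd loop over the
proposed interface could look like the following; the cgroup path,
watermark and chunk size are illustrative assumptions:

```shell
#!/bin/sh
# Hypothetical uswapd loop sketch. CG, WATERMARK and CHUNK are
# illustrative assumptions; finding the right "near the limit"
# watermark is the orthogonal problem mentioned above.
CG=${CG:-/sys/fs/cgroup/myapp}
WATERMARK=${WATERMARK:-$((400 << 20))}   # reclaim above this usage
CHUNK=$((16 << 20))                      # bytes requested per pass

uswapd_once() {
    current=$(cat "$CG/memory.current")
    if [ "$current" -gt "$WATERMARK" ]; then
        # memory.reclaim takes a byte count; the kernel may reclaim
        # somewhat more or less than requested.
        echo "$CHUNK" > "$CG/memory.reclaim"
    fi
}

# The main loop would be: while :; do uswapd_once; sleep 1; done
```

Because the write is issued from the latency-tolerant uswapd task,
the latency-sensitive threads never pay the reclaim cost themselves.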
2) Proactive reclaim:
This is similar to the previous use-case; the difference is that
instead of waiting for the application to get near its limit to
trigger memory reclaim, the memcg is continuously pressured to
reclaim a small amount of memory. This gives a more accurate and
up-to-date working-set estimation, as the LRUs are continuously
sorted, and can potentially provide more deterministic memory
overcommit behavior. The memory overcommit controller can respond
proactively to the changing behavior of the running applications
instead of being reactive.
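A proactive reclaimer could be as simple as the sketch below; the
cgroup path, chunk size and interval are illustrative assumptions,
not part of this patch:

```shell
#!/bin/sh
# Hypothetical proactive reclaimer sketch. CG and PRESSURE are
# illustrative assumptions.
CG=${CG:-/sys/fs/cgroup/myapp}
PRESSURE=$((4 << 20))   # small, continuous request (4M)

proactive_step() {
    # Requesting a small amount on every pass keeps the LRUs sorted;
    # the kernel may over- or under-reclaim relative to the request.
    echo "$PRESSURE" > "$CG/memory.reclaim"
}

# The main loop would be: while :; do proactive_step; sleep 5; done
```

The overcommit controller can then tune PRESSURE and the interval to
trade cpu usage against the amount of memory reclaimed.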
Benefit of user space solution:
-------------------------------
1) More flexibility in who is charged for the cpu cost of the memory
reclaim. For proactive reclaim it makes more sense to centralize the
overhead, while for uswapd it makes more sense for the application to
pay for the cpu cost of the memory reclaim.
2) More flexibility in dedicating resources (like cpu). The memory
overcommit controller can balance the cost between the cpu usage and
the memory reclaimed.
3) Provides a way for applications to keep their LRUs sorted, so
that better reclaim candidates are selected under memory pressure.
This also gives a more accurate and up-to-date notion of the working
set of an application.
Questions:
----------
1) Why memory.high is not enough?
memory.high can be used to trigger reclaim in a memcg and can
potentially be used for the proactive reclaim as well as the uswapd
use cases. However memory.high has a big downside: it can introduce
high reclaim stalls in the target application, as allocations from
the processes or threads of the application can hit the temporary
memory.high limit.
Another issue with memory.high is that it is not delegatable. To
actually use this interface for uswapd, the application has to
introduce another layer of cgroup on whose memory.high it has write
access.
2) Why is uswapd safe from self-induced reclaim?
This is very similar to the scenario of oomd under global memory
pressure. We can use similar mechanisms to protect uswapd from
self-induced reclaim, i.e. memory.min and mlock.
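For instance, uswapd could run in a sibling cgroup whose memory.min
shields its pages, sketched below; the paths and the 16M figure are
illustrative assumptions, not part of this patch:

```shell
#!/bin/sh
# Hypothetical protection setup for a uswapd task, mirroring the
# oomd pattern mentioned above. Paths and sizes are illustrative.
setup_uswapd_protection() {
    helper_cg=$1   # cgroup dedicated to the uswapd task
    pid=$2         # pid of the uswapd task
    # memory.min guarantees this much memory is not reclaimed from
    # the helper cgroup, so the reclaim that uswapd triggers in the
    # application's memcg cannot evict uswapd's own working set.
    echo $((16 << 20)) > "$helper_cg/memory.min"
    echo "$pid" > "$helper_cg/cgroup.procs"
    # The complementary mlock() of uswapd's hot pages has to be done
    # from within the uswapd process itself.
}
```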
Interface options:
------------------
Introducing a very simple memcg interface 'echo 10M > memory.reclaim' to
trigger reclaim in the target memory cgroup.
In the future we might want to reclaim a specific type of memory
from a memcg, so this interface can be extended to allow that, e.g.
$ echo 10M [all|anon|file|kmem] > memory.reclaim
However, that should wait until we have concrete use-cases for such
functionality. Keep things simple for now.
Link: https://lkml.kernel.org/r/20200909215752.1725525-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
Documentation/admin-guide/cgroup-v2.rst | 9 +++++
mm/memcontrol.c | 37 ++++++++++++++++++++++
2 files changed, 46 insertions(+)
--- a/Documentation/admin-guide/cgroup-v2.rst~memcg-introduce-per-memcg-reclaim-interface
+++ a/Documentation/admin-guide/cgroup-v2.rst
@@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back.
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
+ memory.reclaim
+ A write-only file which exists on non-root cgroups.
+
+ This is a simple interface to trigger memory reclaim in the
+ target cgroup. Write the number of bytes to reclaim to this
+ file and the kernel will try to reclaim that much memory.
+ Please note that the kernel can over or under reclaim from
+ the target cgroup.
+
memory.oom.group
A read-write single value file which exists on non-root
cgroups. The default value is "0".
--- a/mm/memcontrol.c~memcg-introduce-per-memcg-reclaim-interface
+++ a/mm/memcontrol.c
@@ -6403,6 +6403,38 @@ static ssize_t memory_oom_group_write(st
return nbytes;
}
+static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+ unsigned long nr_to_reclaim, nr_reclaimed = 0;
+ int err;
+
+ buf = strstrip(buf);
+ err = page_counter_memparse(buf, "", &nr_to_reclaim);
+ if (err)
+ return err;
+
+ while (nr_reclaimed < nr_to_reclaim) {
+ unsigned long reclaimed;
+
+ if (signal_pending(current))
+ break;
+
+ reclaimed = try_to_free_mem_cgroup_pages(memcg,
+ nr_to_reclaim - nr_reclaimed,
+ GFP_KERNEL, true);
+
+ if (!reclaimed && !nr_retries--)
+ break;
+
+ nr_reclaimed += reclaimed;
+ }
+
+ return nbytes;
+}
+
static struct cftype memory_files[] = {
{
.name = "current",
@@ -6455,6 +6487,11 @@ static struct cftype memory_files[] = {
.seq_show = memory_oom_group_show,
.write = memory_oom_group_write,
},
+ {
+ .name = "reclaim",
+ .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+ .write = memory_reclaim,
+ },
{ } /* terminate */
};
_