Re: [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim

* Re: [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim
@ 2020-06-08 22:23 Luigi Semenzato
  2020-06-09  6:19 ` Andrea Righi
  0 siblings, 1 reply; 5+ messages in thread
From: Luigi Semenzato @ 2020-06-08 22:23 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Pavel Machek, linux-kernel, Linux Memory Management List,
	Linux PM, Andrew Morton, Len Brown, Rafael J . Wysocki

Hi Andrea,

1. This mechanism is quite general.  It is possible that, although
hibernation may be an important use, there will be other uses for it.
I suggest leaving the hibernation example and performance analysis,
but not mentioning PM or hibernation in the patch subject.

2. It may be useful to have run_show() return the number of pages
reclaimed in the last attempt.  (I had suggested something similar in
https://lore.kernel.org/linux-mm/CAA25o9SxajRaa+ZyhvTYdaKdXokcrNYXgEUimax4sUJGCmRYLA@mail.gmail.com/).

3. It is not clear how much mm_reclaim/release is going to help.  If
the preloading of the swapped-out pages uses some kind of LIFO order,
and can batch multiple pages, then it might help.  Otherwise demand
paging is likely to be more effective.  If the preloading does indeed
help, it may be useful to explain why in the commit message.

Thanks!

* Re: [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim
  2020-06-08 22:23 [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim Luigi Semenzato
@ 2020-06-09  6:19 ` Andrea Righi
  2020-09-21 15:36   ` Rafael J. Wysocki
  0 siblings, 1 reply; 5+ messages in thread
From: Andrea Righi @ 2020-06-09  6:19 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Pavel Machek, linux-kernel, Linux Memory Management List,
	Linux PM, Andrew Morton, Len Brown, Rafael J . Wysocki

On Mon, Jun 08, 2020 at 03:23:22PM -0700, Luigi Semenzato wrote:
> Hi Andrea,
> 
> 1. This mechanism is quite general.  It is possible that, although
> hibernation may be an important use, there will be other uses for it.
> I suggest leaving the hibernation example and performance analysis,
> but not mentioning PM or hibernation in the patch subject.

I was actually thinking to make this feature even more generic, since
there might be other potential users of this forced "memory reclaim"
feature outside hibernation. So, instead of adding the new sysfs files
under /sys/power/mm_reclaim/, maybe move them to /sys/kernel/mm/ (since
it's more like a mm feature, rather than a PM/hibernation feature).

> 
> 2. It may be useful to have run_show() return the number of pages
> reclaimed in the last attempt.  (I had suggested something similar in
> https://lore.kernel.org/linux-mm/CAA25o9SxajRaa+ZyhvTYdaKdXokcrNYXgEUimax4sUJGCmRYLA@mail.gmail.com/).

I like this idea, I'll add that in the next version.

> 
> 3. It is not clear how much mm_reclaim/release is going to help.  If
> the preloading of the swapped-out pages uses some kind of LIFO order,
> and can batch multiple pages, then it might help.  Otherwise demand
> paging is likely to be more effective.  If the preloading does indeed
> help, it may be useful to explain why in the commit message.

Swap readahead helps a lot in terms of performance if we preload all at
once. But I agree that for the majority of cases on-demand paging just
works fine.

My specific use-case for mm_reclaim/release is to make sure a VM
that is just resumed is immediately "fast" by preloading the swapped-out
pages back to memory all at once.

Without mm_reclaim/release I've been using the trick of running swapoff
followed by a swapon to force all the pages back to memory, but it's
kinda ugly and I was looking for a better way to do this. I've been
trying also the ptrace() + reading all the VMAs via /proc/pid/mem, it
works, but it's not as fast as swapoff+swapon or mm_reclaim/release.

I'll report performance numbers of mm_reclaim/release vs ptrace() +
/proc/pid/mem in the next version of this patch.
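
For readers unfamiliar with the /proc/pid/mem approach mentioned above, here
is a minimal user-space sketch of the idea: walk the readable ranges listed in
/proc/<pid>/maps and read one byte per page through /proc/<pid>/mem so that
swapped-out pages are faulted back in. The helper names (parse_maps_line,
touch_range) are hypothetical and this is not the code used for the
measurements in this thread; reading another process's mem file also requires
ptrace-level permission over the target.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Parse one line of /proc/<pid>/maps ("start-end perms ...").
 * Returns 1 if the line was parsed and the range is readable, 0 otherwise.
 */
int parse_maps_line(const char *line, unsigned long *start, unsigned long *end)
{
	char perms[5];

	if (sscanf(line, "%lx-%lx %4s", start, end, perms) != 3)
		return 0;
	return perms[0] == 'r';
}

/*
 * Read one byte from every page in [start, end) through an already open
 * /proc/<pid>/mem descriptor. Each successful read forces the page to be
 * resident. Returns the number of pages that could be read; pages that
 * fail (e.g. special mappings) are skipped. page_size must be non-zero,
 * e.g. the value of sysconf(_SC_PAGESIZE).
 */
unsigned long touch_range(int mem_fd, unsigned long start, unsigned long end,
			  unsigned long page_size)
{
	unsigned long addr, touched = 0;
	char byte;

	for (addr = start; addr < end; addr += page_size)
		if (pread(mem_fd, &byte, 1, (off_t)addr) == 1)
			touched++;
	return touched;
}
```

A caller would open /proc/<pid>/maps and /proc/<pid>/mem, feed each maps line
to parse_maps_line() and pass the readable ranges to touch_range(). Every page
costs one system call here, which may be part of why this approach is slower
than swapoff+swapon.
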

> 
> Thanks!

Thanks for your review!

-Andrea

* Re: [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim
  2020-06-09  6:19 ` Andrea Righi
@ 2020-09-21 15:36   ` Rafael J. Wysocki
  2020-09-21 16:27     ` Andrea Righi
  0 siblings, 1 reply; 5+ messages in thread
From: Rafael J. Wysocki @ 2020-09-21 15:36 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Luigi Semenzato, Pavel Machek, linux-kernel,
	Linux Memory Management List, Linux PM, Andrew Morton, Len Brown,
	Rafael J . Wysocki

Hi Andrea,

On Tue, Jun 9, 2020 at 8:19 AM Andrea Righi <andrea.righi@canonical.com> wrote:
>
> On Mon, Jun 08, 2020 at 03:23:22PM -0700, Luigi Semenzato wrote:
> > Hi Andrea,
> >
> > 1. This mechanism is quite general.  It is possible that, although
> > hibernation may be an important use, there will be other uses for it.
> > I suggest leaving the hibernation example and performance analysis,
> > but not mentioning PM or hibernation in the patch subject.
>
> I was actually thinking to make this feature even more generic, since
> there might be other potential users of this forced "memory reclaim"
> feature outside hibernation. So, instead of adding the new sysfs files
> under /sys/power/mm_reclaim/, maybe move them to /sys/kernel/mm/ (since
> it's more like a mm feature, rather than a PM/hibernation feature).
>
> >
> > 2. It may be useful to have run_show() return the number of pages
> > reclaimed in the last attempt.  (I had suggested something similar in
> > https://lore.kernel.org/linux-mm/CAA25o9SxajRaa+ZyhvTYdaKdXokcrNYXgEUimax4sUJGCmRYLA@mail.gmail.com/).
>
> I like this idea, I'll add that in the next version.
>
> >
> > 3. It is not clear how much mm_reclaim/release is going to help.  If
> > the preloading of the swapped-out pages uses some kind of LIFO order,
> > and can batch multiple pages, then it might help.  Otherwise demand
> > paging is likely to be more effective.  If the preloading does indeed
> > help, it may be useful to explain why in the commit message.
>
> Swap readahead helps a lot in terms of performance if we preload all at
> once. But I agree that for the majority of cases on-demand paging just
> works fine.
>
> My specific use-case for mm_reclaim/release is to make sure a VM
> that is just resumed is immediately "fast" by preloading the swapped-out
> pages back to memory all at once.
>
> Without mm_reclaim/release I've been using the trick of running swapoff
> followed by a swapon to force all the pages back to memory, but it's
> kinda ugly and I was looking for a better way to do this. I've been
> trying also the ptrace() + reading all the VMAs via /proc/pid/mem, it
> works, but it's not as fast as swapoff+swapon or mm_reclaim/release.
>
> I'll report performance numbers of mm_reclaim/release vs ptrace() +
> /proc/pid/mem in the next version of this patch.

Sorry for the huge delay.

I'm wondering what your vision regarding the use of this mechanism in
practice is?

In the "Testing" part of the changelog you say that "in the
5.7-mm_reclaim case a user-space daemon detects when the system is
idle and triggers the opportunistic memory reclaim via
/sys/power/mm_reclaim/run", but this may not be entirely practical,
because hibernation is not triggered every time the system is idle.

In particular, how much time is required for the opportunistic reclaim
to run before hibernation so as to make a significant difference?

Thanks!


* Re: [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim
  2020-09-21 15:36   ` Rafael J. Wysocki
@ 2020-09-21 16:27     ` Andrea Righi
  0 siblings, 0 replies; 5+ messages in thread
From: Andrea Righi @ 2020-09-21 16:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Luigi Semenzato, Pavel Machek, linux-kernel,
	Linux Memory Management List, Linux PM, Andrew Morton, Len Brown,
	Rafael J . Wysocki

On Mon, Sep 21, 2020 at 05:36:30PM +0200, Rafael J. Wysocki wrote:
...
> > > 3. It is not clear how much mm_reclaim/release is going to help.  If
> > > the preloading of the swapped-out pages uses some kind of LIFO order,
> > > and can batch multiple pages, then it might help.  Otherwise demand
> > > paging is likely to be more effective.  If the preloading does indeed
> > > help, it may be useful to explain why in the commit message.
> >
> > Swap readahead helps a lot in terms of performance if we preload all at
> > once. But I agree that for the majority of cases on-demand paging just
> > works fine.
> >
> > My specific use-case for mm_reclaim/release is to make sure a VM
> > that is just resumed is immediately "fast" by preloading the swapped-out
> > pages back to memory all at once.
> >
> > Without mm_reclaim/release I've been using the trick of running swapoff
> > followed by a swapon to force all the pages back to memory, but it's
> > kinda ugly and I was looking for a better way to do this. I've been
> > trying also the ptrace() + reading all the VMAs via /proc/pid/mem, it
> > works, but it's not as fast as swapoff+swapon or mm_reclaim/release.
> >
> > I'll report performance numbers of mm_reclaim/release vs ptrace() +
> > /proc/pid/mem in the next version of this patch.
> 
> Sorry for the huge delay.
> 
> I'm wondering what your vision regarding the use of this mechanism in
> practice is?
> 
> In the "Testing" part of the changelog you say that "in the
> 5.7-mm_reclaim case a user-space daemon detects when the system is
> idle and triggers the opportunistic memory reclaim via
> /sys/power/mm_reclaim/run", but this may not be entirely practical,
> because hibernation is not triggered every time the system is idle.
> 
> In particular, how much time is required for the opportunistic reclaim
> to run before hibernation so as to make a significant difference?
> 
> Thanks!

Hi Rafael,

the typical use-case for this feature is to hibernate "spot" cloud
instances (low-priority instances that can be stopped at any time to
prioritize more privileged instances, see for example [1]). In this
scenario hibernation can be used as a "nicer" way to stop low priority
instances, instead of shutting them down.

Opportunistic memory reclaim doesn't really reduce the total time to
hibernate: performance-wise, regular hibernation and hibernation with
opportunistic reclaim require pretty much the same time.

But the advantage of opportunistic reclaim is that we can "prepare" a
system for hibernation using some idle time, so when we really need to
hibernate a low priority instance, because a high priority instance
needs to run, hibernation can be significantly faster.

What do you think about it? Do you see a better way to achieve this
goal?

Thanks,
-Andrea

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html


* [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim
  2020-06-01 16:06 [RFC PATCH 0/2] PM: hibernate: " Andrea Righi
@ 2020-06-01 16:06 ` Andrea Righi
  0 siblings, 0 replies; 5+ messages in thread
From: Andrea Righi @ 2020-06-01 16:06 UTC (permalink / raw)
  To: Rafael J . Wysocki, Pavel Machek
  Cc: Len Brown, Andrew Morton, linux-pm, linux-mm, linux-kernel

== Overview ==

When a system is going to be hibernated, the kernel needs to allocate
and dump the content of the entire memory to the resume device (swap) by
creating a "hibernation image".

To make sure this image fits in the available free memory, the kernel
can induce an artificial memory pressure condition that frees up some
pages (e.g., dropping clean page cache pages, writing back dirty page
cache pages, swapping out anonymous memory, etc.).

How hard the kernel pushes to free up memory is determined by
/sys/power/image_size: a smaller size causes more memory to be dropped,
cutting down the amount of I/O required to write the hibernation image;
a larger image size generates more I/O, but the system will likely be
less sluggish at resume, because more caches are restored, reducing the
paging time.

The I/O generated to free up memory, write the hibernation image to disk
and load it back to memory is the main bottleneck of hibernation [1].

== Proposed solution ==

The "opportunistic memory reclaim" feature aims to provide an interface
that lets user space control the artificial memory pressure. With this
feature user space can trigger memory reclaim before the actual
hibernation starts (e.g., if the system has been idle for a certain
amount of time).

This makes it possible to consistently speed up hibernation when needed
(in terms of time to hibernate) by reducing the size of the hibernation
image in advance.

== Interface ==

To accomplish this goal, the following new files are provided in sysfs:

 - /sys/power/mm_reclaim/run
 - /sys/power/mm_reclaim/release

The former can be used to start the memory reclaim by writing a number
representing the desired amount of pages to be reclaimed (with "0" the
kernel will try to reclaim as many pages as possible).

The latter can be used in the same way to force the kernel to pull a
certain amount of swapped out pages back to memory (by writing the
number of pages or "0" to load back to memory as many pages as
possible); this can be useful immediately after resume to speed up the
paging time and get the system back to full speed faster.

Memory reclaim and release can be interrupted by sending a signal to the
process that is writing to /sys/power/mm_reclaim/{run,release} (e.g.,
to set a timeout for the operation).
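
As an illustration of the interface described above, the following is a
minimal user-space sketch that writes a page count to one of the two files
and uses alarm() to bound the operation. It assumes a kernel with this RFC
applied (these sysfs files are proposed here, not a mainline interface); the
function name mm_reclaim_request and the timeout handling are assumptions
made for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: SIGALRM only has to be delivered to the writer. */
static void on_alarm(int sig)
{
	(void)sig;
}

/*
 * Write nr_pages (0 = "as many pages as possible") to a control file such
 * as /sys/power/mm_reclaim/run or /sys/power/mm_reclaim/release, and arm a
 * timer that delivers SIGALRM after timeout_sec seconds. With this patch
 * the kernel-side reclaim loop checks signal_pending() and stops early.
 * Returns 0 on success or a negative errno value.
 */
int mm_reclaim_request(const char *path, unsigned long nr_pages,
		       unsigned int timeout_sec)
{
	struct sigaction sa, old_sa;
	char buf[32];
	int fd, len, ret = 0;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* Install a handler so that SIGALRM does not kill the process. */
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGALRM, &sa, &old_sa);

	len = snprintf(buf, sizeof(buf), "%lu\n", nr_pages);
	alarm(timeout_sec);
	if (write(fd, buf, (size_t)len) < 0)
		ret = -errno;
	alarm(0);	/* cancel the timer if the write finished in time */

	sigaction(SIGALRM, &old_sa, NULL);
	close(fd);
	return ret;
}
```

For example, mm_reclaim_request("/sys/power/mm_reclaim/run", 0, 30) would ask
for as much reclaim as possible but give up after roughly 30 seconds. Since
run_store() in this patch returns normally even when interrupted, the caller
cannot tell from the return value how many pages were reclaimed.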

== Testing ==

Environment:
   - VM (kvm):
     8GB of RAM
     disk speed: 100 MB/s
     8GB swap file on ext4 (/swapfile)

Use case:
  - allocate 85% of memory, wait for 60s almost in idle, then hibernate
    and resume (measuring the time)

Result (average of 10 runs):
                                 5.7-vanilla   5.7-mm_reclaim
                                 -----------   --------------
  [hibernate] image_size=default      51.56s            4.19s
     [resume] image_size=default      26.34s            5.01s
  [hibernate] image_size=0            73.22s            5.36s
     [resume] image_size=0             5.32s            5.26s

NOTE #1: in the 5.7-mm_reclaim case a user-space daemon detects when the
system is idle and triggers the opportunistic memory reclaim via
/sys/power/mm_reclaim/run.

NOTE #2: in the 5.7-mm_reclaim case, after the system is resumed, a
user-space process can (optionally) use /sys/power/mm_reclaim/release to
pre-load back to memory all (or some) of the swapped out pages in order
to have a more responsive system.
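
The daemon mentioned in NOTE #1 is not part of this patch. As a purely
hypothetical sketch of how such a daemon might decide that the system is
idle, the helpers below compare the 1-minute load average from /proc/loadavg
against a threshold; the function names and the idea of using the load
average as the idleness signal are assumptions, not what was used for the
measurements above.

```c
#include <stdio.h>

/*
 * Decide whether the system looks idle from the text of /proc/loadavg,
 * whose first field is the 1-minute load average. max_load is a policy
 * knob chosen by the caller (hypothetical; tune per workload).
 * Returns 1 if idle, 0 if busy or if the text cannot be parsed.
 */
int system_looks_idle(const char *loadavg_text, double max_load)
{
	double load1;

	if (sscanf(loadavg_text, "%lf", &load1) != 1)
		return 0;
	return load1 <= max_load;
}

/* Read /proc/loadavg and apply the check above; returns 0 on any error. */
int system_is_idle_now(double max_load)
{
	char line[128];
	FILE *f = fopen("/proc/loadavg", "r");
	int idle = 0;

	if (!f)
		return 0;
	if (fgets(line, sizeof(line), f))
		idle = system_looks_idle(line, max_load);
	fclose(f);
	return idle;
}
```

A daemon could poll system_is_idle_now() periodically and, once the system
has stayed idle for long enough, write to /sys/power/mm_reclaim/run.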

== Conclusion ==

Opportunistic memory reclaim can provide a significant benefit to those
systems where being able to hibernate quickly is important.

The typical use case is with "spot" cloud instances: low-priority
instances that can be stopped at any time (with prior notice) to
prioritize other more privileged instances [2].

Being able to quickly stop low-priority instances that are mostly idle
for the majority of time can be critical to provide a better quality of
service in the overall cloud infrastructure.

== See also ==

[1] https://lwn.net/Articles/821158/
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
---
 Documentation/ABI/testing/sysfs-power | 38 +++++++++++
 include/linux/swapfile.h              |  1 +
 kernel/power/hibernate.c              | 94 ++++++++++++++++++++++++++-
 mm/swapfile.c                         | 30 +++++++++
 4 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power
index 5e6ead29124c..b33db9816a8c 100644
--- a/Documentation/ABI/testing/sysfs-power
+++ b/Documentation/ABI/testing/sysfs-power
@@ -192,6 +192,44 @@ Description:
 		Reading from this file will display the current value, which is
 		set to 1 MB by default.
 
+What:		/sys/power/mm_reclaim/
+Date:		May 2020
+Contact:	Andrea Righi <andrea.righi@canonical.com>
+Description:
+		The /sys/power/mm_reclaim directory contains all the
+		opportunistic memory reclaim files.
+
+What:		/sys/power/mm_reclaim/run
+Date:		May 2020
+Contact:	Andrea Righi <andrea.righi@canonical.com>
+Description:
+		The /sys/power/mm_reclaim/run file allows user space to trigger
+		opportunistic memory reclaim. When a string representing a
+		non-negative number is written to this file, it will be assumed
+		to represent the amount of pages to be reclaimed (0 is a special
+		value that means "as many pages as possible").
+
+		When opportunistic memory reclaim is started the system will be
+		put into an artificial memory pressure condition and memory
+		will be reclaimed by dropping clean page cache pages, swapping
+		out anonymous pages, etc.
+
+		NOTE: it is possible to interrupt the memory reclaim by sending
+		a signal to the writer of this file.
+
+What:		/sys/power/mm_reclaim/release
+Date:		May 2020
+Contact:	Andrea Righi <andrea.righi@canonical.com>
+Description:
+		Force swapped out pages to be loaded back to memory. When a
+		string representing a non-negative number is written to this
+		file, it will be assumed to represent the amount of pages to be
+		pulled back to memory from the swap device(s) (0 is a special
+		value that means "as many pages as possible").
+
+		NOTE: it is possible to interrupt the memory release by sending
+		a signal to the writer of this file.
+
 What:		/sys/power/autosleep
 Date:		April 2012
 Contact:	Rafael J. Wysocki <rjw@rjwysocki.net>
diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index ac4d0ccd1f7b..6f4144099958 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -9,6 +9,7 @@
 extern spinlock_t swap_lock;
 extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
+extern void swap_unuse(unsigned long pages);
 extern int try_to_unuse_wait(unsigned int type, bool frontswap, bool wait,
 			     unsigned long pages_to_unuse);
 static inline int
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 30bd28d1d418..caa06eb5a09f 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -31,6 +31,7 @@
 #include <linux/genhd.h>
 #include <linux/ktime.h>
 #include <linux/security.h>
+#include <linux/swapfile.h>
 #include <trace/events/power.h>
 
 #include "power.h"
@@ -1150,6 +1151,92 @@ static ssize_t reserved_size_store(struct kobject *kobj,
 
 power_attr(reserved_size);
 
+/*
+ * Try to reclaim some memory in the system, stop when one of the following
+ * conditions occurs:
+ *  - at least "nr_pages" have been reclaimed
+ *  - no more pages can be reclaimed
+ *  - current task explicitly interrupted by a signal (e.g., user space
+ *    timeout)
+ *
+ *  @nr_pages - amount of pages to be reclaimed (0 means "as many pages as
+ *  possible").
+ */
+static void do_mm_reclaim(unsigned long nr_pages)
+{
+	while (nr_pages > 0) {
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			break;
+		reclaimed = shrink_all_memory(nr_pages);
+		if (!reclaimed)
+			break;
+		nr_pages -= min_t(unsigned long, reclaimed, nr_pages);
+	}
+}
+
+static ssize_t run_show(struct kobject *kobj,
+			struct kobj_attribute *attr, char *buf)
+{
+	return -EINVAL;
+}
+
+static ssize_t run_store(struct kobject *kobj,
+			 struct kobj_attribute *attr,
+			 const char *buf, size_t n)
+{
+	unsigned long nr_pages;
+	int ret;
+
+	ret = kstrtoul(buf, 0, &nr_pages);
+	if (ret)
+		return ret;
+	if (!nr_pages)
+		nr_pages = ULONG_MAX;
+	do_mm_reclaim(nr_pages);
+
+	return n;
+}
+
+power_attr(run);
+
+static ssize_t release_show(struct kobject *kobj,
+			    struct kobj_attribute *attr, char *buf)
+{
+	return -EINVAL;
+}
+
+static ssize_t release_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t n)
+{
+	unsigned long nr_pages;
+	int ret;
+
+	ret = kstrtoul(buf, 0, &nr_pages);
+	if (ret)
+		return ret;
+	if (!nr_pages)
+		nr_pages = ULONG_MAX;
+	swap_unuse(nr_pages);
+
+	return n;
+}
+
+power_attr(release);
+
+static struct attribute *mm_reclaim_attrs[] = {
+	&run_attr.attr,
+	&release_attr.attr,
+	NULL,
+};
+
+static struct attribute_group mm_reclaim_attr_group = {
+	.name = "mm_reclaim",
+	.attrs = mm_reclaim_attrs,
+};
+
 static struct attribute * g[] = {
 	&disk_attr.attr,
 	&resume_offset_attr.attr,
@@ -1164,10 +1251,15 @@ static const struct attribute_group attr_group = {
 	.attrs = g,
 };
 
+static const struct attribute_group *attr_groups[] = {
+	&attr_group,
+	&mm_reclaim_attr_group,
+	NULL,
+};
 
 static int __init pm_disk_init(void)
 {
-	return sysfs_create_group(power_kobj, &attr_group);
+	return sysfs_create_groups(power_kobj, attr_groups);
 }
 
 core_initcall(pm_disk_init);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 651471ccf133..7391f122ad73 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1749,6 +1749,36 @@ int free_swap_and_cache(swp_entry_t entry)
 }
 
 #ifdef CONFIG_HIBERNATION
+/*
+ * Force pages to be pulled back to memory from all swap devices.
+ *
+ * @pages - number of pages to be pulled from all swap devices
+ * (0 = all pages from any swap device).
+ */
+void swap_unuse(unsigned long pages)
+{
+	int type;
+
+	spin_lock(&swap_lock);
+	for (type = 0; type < nr_swapfiles; type++) {
+		struct swap_info_struct *sis = swap_info[type];
+		struct block_device *bdev;
+
+		if (!(sis->flags & SWP_WRITEOK))
+			continue;
+		bdev = bdgrab(sis->bdev);
+		if (!bdev)
+			continue;
+		spin_unlock(&swap_lock);
+
+		try_to_unuse_wait(type, false, false, pages);
+
+		bdput(bdev);
+		spin_lock(&swap_lock);
+	}
+	spin_unlock(&swap_lock);
+}
+
 /*
  * Find the swap type that corresponds to given device (if any).
  *
-- 
2.25.1


