From: Andrea Righi
To: "Rafael J . Wysocki", Pavel Machek
Cc: Len Brown, Andrew Morton, linux-pm@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/2] PM: hibernate: introduce opportunistic memory reclaim
Date: Mon, 1 Jun 2020 18:06:36 +0200
Message-Id: <20200601160636.148346-3-andrea.righi@canonical.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20200601160636.148346-1-andrea.righi@canonical.com>
References: <20200601160636.148346-1-andrea.righi@canonical.com>

== Overview ==

When a system is going to be hibernated, the kernel needs to allocate and
dump the content of the entire memory to the resume device (swap) by
creating a "hibernation image".

To make sure this image fits in the available free memory, the kernel can
induce an artificial memory pressure condition that allows it to free up
some pages (e.g., by dropping clean page cache pages, writing back dirty
page cache pages, or swapping out anonymous memory).

How hard the kernel pushes to free up memory is determined by
/sys/power/image_size: a smaller size will cause more memory to be dropped,
cutting down the amount of I/O required to write the hibernation image; a
larger image size, instead, is going to generate more I/O, but the system
will likely be less sluggish at resume, because more caches will be
restored, reducing the paging time.

The I/O generated to free up memory, write the hibernation image to disk
and load it back to memory is the main bottleneck of hibernation [1].

== Proposed solution ==

The "opportunistic memory reclaim" aims to provide an interface that lets
user space control the artificial memory pressure. With this feature
user space can trigger the memory reclaim before the actual hibernation
is started (e.g., if the system is idle for a certain amount of time).

This makes it possible to consistently speed up hibernation (in terms of
time to hibernate) when needed, by reducing the size of the hibernation
image in advance.

== Interface ==

To accomplish this goal the following new files are provided in sysfs:

 - /sys/power/mm_reclaim/run
 - /sys/power/mm_reclaim/release

The former can be used to start the memory reclaim by writing a number
representing the desired amount of pages to be reclaimed (with "0" the
kernel will try to reclaim as many pages as possible).

The latter can be used in the same way to force the kernel to pull a
certain amount of swapped out pages back to memory (by writing the number
of pages, or "0" to load back to memory as many pages as possible); this
can be useful immediately after resume to speed up the paging time and get
the system back to full speed faster.

Memory reclaim and release can be interrupted by sending a signal to the
process that is writing to /sys/power/mm_reclaim/{run,release} (e.g., to
set a timeout for the particular operation).

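As an illustration of the interface described above, the following
user-space sketch writes a page count to one of the two files and uses
alarm() to bound the time spent in the kernel. The helper name
mm_reclaim_write() and the empty SIGALRM handler are assumptions made for
this example; they are not part of the patch. The path is a parameter so
the helper can be exercised against a regular file.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: installed only so that SIGALRM does not kill the process. */
static void on_alarm(int sig)
{
	(void)sig;
}

/*
 * Hypothetical helper: write "pages" (0 = as many as possible) to the file
 * at "path", e.g. /sys/power/mm_reclaim/run, and deliver SIGALRM to the
 * writer after "timeout_sec" seconds (0 disables the timeout).
 *
 * Returns 0 if the whole string was written, -1 otherwise (errno is set).
 */
int mm_reclaim_write(const char *path, unsigned long pages,
		     unsigned int timeout_sec)
{
	struct sigaction sa;
	char buf[32];
	ssize_t ret;
	int fd, len, saved_errno;

	assert(path != NULL);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_alarm;	/* SA_RESTART is deliberately not set */
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGALRM, &sa, NULL) < 0)
		return -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	len = snprintf(buf, sizeof(buf), "%lu\n", pages);

	alarm(timeout_sec);		/* arm the timeout */
	ret = write(fd, buf, (size_t)len);
	saved_errno = errno;
	alarm(0);			/* disarm it once write() has returned */
	close(fd);

	if (ret != (ssize_t)len) {
		errno = (ret < 0) ? saved_errno : EIO;
		return -1;
	}
	return 0;
}
```

A caller could use mm_reclaim_write("/sys/power/mm_reclaim/run", 0, 30) to
reclaim as much as possible for at most 30 seconds. With the store handlers
as posted in this patch, an interrupted operation still returns the full
byte count, so a timeout is not reported to the writer as an error.
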
== Testing ==

Environment:
   - VM (kvm):
     8GB of RAM
     disk speed: 100 MB/s
     8GB swap file on ext4 (/swapfile)

Use case:
  - allocate 85% of memory, wait for 60s while almost idle, then hibernate
    and resume (measuring the time)

Result (average of 10 runs):
                                 5.7-vanilla   5.7-mm_reclaim
                                 -----------   --------------
  [hibernate] image_size=default      51.56s            4.19s
     [resume] image_size=default      26.34s            5.01s
  [hibernate] image_size=0            73.22s            5.36s
     [resume] image_size=0             5.32s            5.26s

NOTE #1: in the 5.7-mm_reclaim case a user-space daemon detects when the
system is idle and triggers the opportunistic memory reclaim via
/sys/power/mm_reclaim/run.

NOTE #2: in the 5.7-mm_reclaim case, after the system is resumed, a
user-space process can (optionally) use /sys/power/mm_reclaim/release to
pre-load back to memory all (or some) of the swapped out pages in order to
have a more responsive system.

== Conclusion ==

Opportunistic memory reclaim can provide a significant benefit to those
systems where being able to hibernate quickly is important.

The typical use case is with "spot" cloud instances: low-priority instances
that can be stopped at any time (with prior notice) to prioritize other,
more privileged instances [2].

Being able to quickly stop low-priority instances that are idle most of
the time can be critical to provide a better quality of service in the
overall cloud infrastructure.

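For reference, the speedups implied by the Testing table can be recomputed
from the reported averages. The small program fragment below is only
arithmetic on the numbers in the table; the struct and function names are
made up for this example. It gives roughly 12.3x and 13.7x for the two
hibernate cases, about 5.3x for resume with the default image_size, and
essentially no change for resume with image_size=0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One row of the Testing table: average times in seconds. */
struct timing {
	const char *label;
	double vanilla_s;	/* 5.7-vanilla */
	double reclaim_s;	/* 5.7-mm_reclaim */
};

const struct timing results[] = {
	{ "[hibernate] image_size=default", 51.56, 4.19 },
	{ "[resume]    image_size=default", 26.34, 5.01 },
	{ "[hibernate] image_size=0",       73.22, 5.36 },
	{ "[resume]    image_size=0",        5.32, 5.26 },
};

/* Ratio of the vanilla time to the mm_reclaim time for one row. */
double speedup(const struct timing *t)
{
	assert(t != NULL && t->reclaim_s > 0.0);
	return t->vanilla_s / t->reclaim_s;
}

/* Print one line per row with its speedup factor. */
void print_speedups(void)
{
	size_t i;

	for (i = 0; i < sizeof(results) / sizeof(results[0]); i++)
		printf("%-32s %6.2fx\n", results[i].label,
		       speedup(&results[i]));
}
```
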
== See also ==

[1] https://lwn.net/Articles/821158/
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html

Signed-off-by: Andrea Righi
---
 Documentation/ABI/testing/sysfs-power | 38 +++++++++++
 include/linux/swapfile.h              |  1 +
 kernel/power/hibernate.c              | 94 ++++++++++++++++++++++++++-
 mm/swapfile.c                         | 30 +++++++++
 4 files changed, 162 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power
index 5e6ead29124c..b33db9816a8c 100644
--- a/Documentation/ABI/testing/sysfs-power
+++ b/Documentation/ABI/testing/sysfs-power
@@ -192,6 +192,44 @@ Description:
 		Reading from this file will display the current value, which is
 		set to 1 MB by default.
 
+What:		/sys/power/mm_reclaim/
+Date:		May 2020
+Contact:	Andrea Righi
+Description:
+		The /sys/power/mm_reclaim directory contains all the
+		opportunistic memory reclaim files.
+
+What:		/sys/power/mm_reclaim/run
+Date:		May 2020
+Contact:	Andrea Righi
+Description:
+		The /sys/power/mm_reclaim/run file allows user space to trigger
+		opportunistic memory reclaim. When a string representing a
+		non-negative number is written to this file, it will be assumed
+		to represent the amount of pages to be reclaimed (0 is a special
+		value that means "as many pages as possible").
+
+		When opportunistic memory reclaim is started the system will be
+		put into an artificial memory pressure condition and memory
+		will be reclaimed by dropping clean page cache pages, swapping
+		out anonymous pages, etc.
+
+		NOTE: it is possible to interrupt the memory reclaim sending a
+		signal to writer of this file.
+
+What:		/sys/power/mm_reclaim/release
+Date:		May 2020
+Contact:	Andrea Righi
+Description:
+		Force swapped out pages to be loaded back to memory. When a
+		string representing a non-negative number is written to this
+		file, it will be assumed to represent the amount of pages to be
+		pulled back to memory from the swap device(s) (0 is a special
+		value that means "as many pages as possible").
+
+		NOTE: it is possible to interrupt the memory release sending a
+		signal to writer of this file.
+
 What:		/sys/power/autosleep
 Date:		April 2012
 Contact:	Rafael J. Wysocki
diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index ac4d0ccd1f7b..6f4144099958 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -9,6 +9,7 @@
 extern spinlock_t swap_lock;
 extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
+extern void swap_unuse(unsigned long pages);
 extern int try_to_unuse_wait(unsigned int type, bool frontswap, bool wait,
 			     unsigned long pages_to_unuse);
 static inline int
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index 30bd28d1d418..caa06eb5a09f 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -31,6 +31,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include "power.h"
@@ -1150,6 +1151,92 @@ static ssize_t reserved_size_store(struct kobject *kobj,
 
 power_attr(reserved_size);
 
+/*
+ * Try to reclaim some memory in the system, stop when one of the following
+ * conditions occurs:
+ *  - at least "nr_pages" have been reclaimed
+ *  - no more pages can be reclaimed
+ *  - current task explicitly interrupted by a signal (e.g., user space
+ *    timeout)
+ *
+ *  @nr_pages - amount of pages to be reclaimed (0 means "as many pages as
+ *  possible").
+ */
+static void do_mm_reclaim(unsigned long nr_pages)
+{
+	while (nr_pages > 0) {
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			break;
+		reclaimed = shrink_all_memory(nr_pages);
+		if (!reclaimed)
+			break;
+		nr_pages -= min_t(unsigned long, reclaimed, nr_pages);
+	}
+}
+
+static ssize_t run_show(struct kobject *kobj,
+			struct kobj_attribute *attr, char *buf)
+{
+	return -EINVAL;
+}
+
+static ssize_t run_store(struct kobject *kobj,
+			 struct kobj_attribute *attr,
+			 const char *buf, size_t n)
+{
+	unsigned long nr_pages;
+	int ret;
+
+	ret = kstrtoul(buf, 0, &nr_pages);
+	if (ret)
+		return ret;
+	if (!nr_pages)
+		nr_pages = ULONG_MAX;
+	do_mm_reclaim(nr_pages);
+
+	return n;
+}
+
+power_attr(run);
+
+static ssize_t release_show(struct kobject *kobj,
+			struct kobj_attribute *attr, char *buf)
+{
+	return -EINVAL;
+}
+
+static ssize_t release_store(struct kobject *kobj,
+			 struct kobj_attribute *attr,
+			 const char *buf, size_t n)
+{
+	unsigned long nr_pages;
+	int ret;
+
+	ret = kstrtoul(buf, 0, &nr_pages);
+	if (ret)
+		return ret;
+	if (!nr_pages)
+		nr_pages = ULONG_MAX;
+	swap_unuse(nr_pages);
+
+	return n;
+}
+
+power_attr(release);
+
+static struct attribute *mm_reclaim_attrs[] = {
+	&run_attr.attr,
+	&release_attr.attr,
+	NULL,
+};
+
+static struct attribute_group mm_reclaim_attr_group = {
+	.name = "mm_reclaim",
+	.attrs = mm_reclaim_attrs,
+};
+
 static struct attribute * g[] = {
 	&disk_attr.attr,
 	&resume_offset_attr.attr,
@@ -1164,10 +1251,15 @@ static const struct attribute_group attr_group = {
 	.attrs = g,
 };
 
+static const struct attribute_group *attr_groups[] = {
+	&attr_group,
+	&mm_reclaim_attr_group,
+	NULL,
+};
 
 static int __init pm_disk_init(void)
 {
-	return sysfs_create_group(power_kobj, &attr_group);
+	return sysfs_create_groups(power_kobj, attr_groups);
 }
 
 core_initcall(pm_disk_init);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 651471ccf133..7391f122ad73 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1749,6 +1749,36 @@ int free_swap_and_cache(swp_entry_t entry)
 }
 
 #ifdef CONFIG_HIBERNATION
+/*
+ * Force pages to be pulled back to memory from all swap devices.
+ *
+ * @nr_pages - number of pages to be pulled from all swap devices
+ * (0 = all pages from any swap device).
+ */
+void swap_unuse(unsigned long pages)
+{
+	int type;
+
+	spin_lock(&swap_lock);
+	for (type = 0; type < nr_swapfiles; type++) {
+		struct swap_info_struct *sis = swap_info[type];
+		struct block_device *bdev;
+
+		if (!(sis->flags & SWP_WRITEOK))
+			continue;
+		bdev = bdgrab(sis->bdev);
+		if (!bdev)
+			continue;
+		spin_unlock(&swap_lock);
+
+		try_to_unuse_wait(type, false, false, pages);
+
+		bdput(sis->bdev);
+		spin_lock(&swap_lock);
+	}
+	spin_unlock(&swap_lock);
+}
+
 /*
  * Find the swap type that corresponds to given device (if any).
  *
-- 
2.25.1
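
The termination conditions of the reclaim loop in do_mm_reclaim() can be
modelled in user space as a rough sketch. Everything below is hypothetical
scaffolding for illustration: fake_shrink() stands in for
shrink_all_memory(), struct fake_pool is an invented model of reclaimable
memory, and the signal_pending() check is left out because it has no direct
user-space equivalent. Only the loop shape and the "0 means ULONG_MAX"
mapping mirror the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Invented model: how much is reclaimable, and how much one call frees. */
struct fake_pool {
	unsigned long reclaimable;
	unsigned long batch;
};

/* Stand-in for shrink_all_memory(): frees at most "batch" pages per call. */
unsigned long fake_shrink(unsigned long nr_to_reclaim, struct fake_pool *pool)
{
	unsigned long got = nr_to_reclaim;

	if (got > pool->batch)
		got = pool->batch;
	if (got > pool->reclaimable)
		got = pool->reclaimable;
	pool->reclaimable -= got;
	return got;
}

/*
 * Model of run_store() + do_mm_reclaim(): returns the total number of
 * pages reclaimed. Stops when the request is satisfied or when a call
 * reclaims nothing.
 */
unsigned long model_reclaim(unsigned long nr_pages, struct fake_pool *pool)
{
	unsigned long total = 0;

	assert(pool != NULL);

	if (!nr_pages)
		nr_pages = ULONG_MAX;	/* "as many pages as possible" */

	while (nr_pages > 0) {
		unsigned long reclaimed = fake_shrink(nr_pages, pool);

		if (!reclaimed)
			break;		/* nothing left to reclaim */
		total += reclaimed;
		/* Clamp so the counter cannot wrap below zero. */
		nr_pages -= reclaimed < nr_pages ? reclaimed : nr_pages;
	}
	return total;
}
```

In this model a request for 50 pages against a pool of 100 stops as soon as
50 pages are freed, while a request of 0 keeps going until a call returns
no pages.
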