linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/5] THP Shrinker
@ 2022-10-26  0:26 alexlzhu
  2022-10-26  0:26 ` [PATCH v5 1/5] mm: add thp_utilization metrics to debugfs alexlzhu
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: alexlzhu @ 2022-10-26  0:26 UTC (permalink / raw)
  To: linux-mm, kernel-team
  Cc: riel, hannes, willy, ningzhang, yuzhao, Alexander Zhu

From: Alexander Zhu <alexlzhu@fb.com>

Transparent Hugepages use a larger page size of 2MB in comparison to
normal sized pages that are 4kb. A larger page size allows for fewer TLB
cache misses and thus more efficient use of the CPU. Using a larger page
size also results in more memory waste, which can hurt performance in some
use cases. THPs are currently enabled in the Linux Kernel by applications
in limited virtual address ranges via the madvise system call.  The THP
shrinker tries to find a balance between increased use of THPs, and
increased use of memory. It shrinks the size of memory by removing the
underutilized THPs that are identified by the thp_utilization scanner. 

In our experiments we have noticed that the least utilized THPs are almost
entirely unutilized.

Sample Output: 

Utilized[0-50]: 1331 680884
Utilized[51-101]: 9 3983
Utilized[102-152]: 3 1187
Utilized[153-203]: 0 0
Utilized[204-255]: 2 539
Utilized[256-306]: 5 1135
Utilized[307-357]: 1 192
Utilized[358-408]: 0 0
Utilized[409-459]: 1 57
Utilized[460-512]: 400 13
Last Scan Time: 223.98s
Last Scan Duration: 70.65s

Above is a sample obtained from one of our test machines when THP is always
enabled. Of the 1331 THPs in this thp_utilization sample that have from
0-50 utilized subpages, we see that there are 680884 free pages. This
comes out to 680884 / (512 * 1331) = 99.91% zero pages in the least
utilized bucket. This represents 680884 * 4KB = 2.7GB memory waste.

Also note that the vast majority of pages are either in the least utilized
[0-50] or most utilized [460-512] buckets. The least utilized THPs are 
responsible for almost all of the memory waste when THP is always 
enabled. Thus by clearing out THPs in the lowest utilization bucket
we extract most of the improvement in CPU efficiency. We have seen 
similar results on our production hosts.

This patchset introduces the THP shrinker we have developed to identify
and split the least utilized THPs. It includes the thp_utilization 
changes that groups anonymous THPs into buckets, the split_huge_page()
changes that identify and zap zero 4KB pages within THPs and the shrinker
changes. It should be noted that the split_huge_page() changes are based
off previous work done by Yu Zhao. 

In the future, we intend to allow additional tuning to the shrinker
based on workload depending on CPU/IO/Memory pressure and the 
amount of anonymous memory. The long term goal is to eventually always 
enable THP for all applications and deprecate madvise entirely.

In production we thus far have observed 2-3% reduction in overall cpu
usage on stateless web servers when THP is always enabled.

Alexander Zhu (5):
  mm: add thp_utilization metrics to debugfs
  mm: changes to split_huge_page() to free zero filled tail pages
  mm: do not remap clean subpages when splitting isolated thp
  mm: add selftests to split_huge_page() to verify unmap/zap of zero
    pages
  mm: THP low utilization shrinker

 Documentation/admin-guide/mm/transhuge.rst    |   9 +
 include/linux/huge_mm.h                       |   9 +
 include/linux/list_lru.h                      |  24 ++
 include/linux/mm_types.h                      |   5 +
 include/linux/rmap.h                          |   2 +-
 include/linux/vm_event_item.h                 |   3 +
 mm/Makefile                                   |   2 +-
 mm/huge_memory.c                              | 155 +++++++++++-
 mm/list_lru.c                                 |  49 ++++
 mm/migrate.c                                  |  73 +++++-
 mm/migrate_device.c                           |   4 +-
 mm/page_alloc.c                               |   6 +
 mm/thp_utilization.c                          | 222 ++++++++++++++++++
 mm/vmstat.c                                   |   3 +
 .../selftests/vm/split_huge_page_test.c       | 115 ++++++++-
 tools/testing/selftests/vm/vm_util.c          |  23 ++
 tools/testing/selftests/vm/vm_util.h          |   3 +
 17 files changed, 689 insertions(+), 18 deletions(-)
 create mode 100644 mm/thp_utilization.c

-- 
2.30.2



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v5 1/5] mm: add thp_utilization metrics to debugfs
  2022-10-26  0:26 [PATCH v5 0/5] THP Shrinker alexlzhu
@ 2022-10-26  0:26 ` alexlzhu
  2022-10-26  0:26 ` [PATCH v5 2/5] mm: changes to split_huge_page() to free zero filled tail pages alexlzhu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: alexlzhu @ 2022-10-26  0:26 UTC (permalink / raw)
  To: linux-mm, kernel-team
  Cc: riel, hannes, willy, ningzhang, yuzhao, Alexander Zhu

From: Alexander Zhu <alexlzhu@fb.com>

This change introduces a tool that scans through all of physical
memory for anonymous THPs and groups them into buckets based
on utilization. It also includes an interface under
/sys/kernel/debug/thp_utilization.

Sample Output:

Utilized[0-50]: 1331 680884
Utilized[51-101]: 9 3983
Utilized[102-152]: 3 1187
Utilized[153-203]: 0 0
Utilized[204-255]: 2 539
Utilized[256-306]: 5 1135
Utilized[307-357]: 1 192
Utilized[358-408]: 0 0
Utilized[409-459]: 1 57
Utilized[460-512]: 400 13
Last Scan Time: 223.98s
Last Scan Duration: 70.65s

This indicates that there are 1331 THPs that have between 0 and 50
utilized (non zero) pages. In total there are 680884 zero pages in
this utilization bucket. THPs in the [0-50] bucket compose 76% of total
THPs, and are responsible for 99% of total zero pages across all
THPs. In other words, the least utilized THPs are responsible for almost
all of the memory waste when THP is always enabled. Similar results
have been observed across production workloads.

The last two lines indicate the timestamp and duration of the most recent
scan through all of physical memory. Here we see that the last scan
occurred 223.98 seconds after boot time and took 70.65 seconds.

Utilization of a THP is defined as the percentage of nonzero
pages in the THP. The worker thread will scan through all
of physical memory and obtain utilization of all anonymous
THPs. It will gather this information by periodically scanning
through all of physical memory for anonymous THPs, group them
into buckets based on utilization, and report utilization
information through debugfs under /sys/kernel/debug/thp_utilization.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v3 to v4
-changed thp_utilization_bucket() function to take folios, saves conversion between page and folio
-added newlines where they were previously missing in v2-v3
-moved the thp utilization code out into its own file under mm/thp_utilization.c
-removed is_anonymous_transparent_hugepage function. Use folio_test_anon and folio_test_trans_huge instead.
-changed thp_number_utilized_pages to use memchr_inv

v1 to v2
-reversed ordering of is_transparent_hugepage and PageAnon in is_anon_transparent_hugepage, page->mapping is only meaningful for user pages

RFC to v1
-Refactored out the code to obtain the thp_utilization_bucket, as that now has to be used in multiple places.

 Documentation/admin-guide/mm/transhuge.rst |   9 +
 mm/Makefile                                |   2 +-
 mm/thp_utilization.c                       | 203 +++++++++++++++++++++
 3 files changed, 213 insertions(+), 1 deletion(-)
 create mode 100644 mm/thp_utilization.c

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 8ee78ec232eb..21d86303c97e 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -304,6 +304,15 @@ To identify what applications are mapping file transparent huge pages, it
 is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
 for each mapping.
 
+The utilization of transparent hugepages can be viewed by reading
+``/sys/kernel/debug/thp_utilization``. The utilization of a THP is defined
+as the ratio of non zero filled 4kb pages to the total number of pages in a
+THP. The buckets are labelled by the range of total utilized 4kb pages with
+one line per utilization bucket. Each line contains the total number of
+THPs in that bucket and the total number of zero filled 4kb pages summed
+over all THPs in that bucket. The last two lines show the timestamp and
+duration respectively of the most recent scan over all of physical memory.
+
 Note that reading the smaps file is expensive and reading it
 frequently will incur overhead.
 
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..5f76dc6ce044 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -95,7 +95,7 @@ obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
-obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o thp_utilization.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
diff --git a/mm/thp_utilization.c b/mm/thp_utilization.c
new file mode 100644
index 000000000000..7b79f8759d12
--- /dev/null
+++ b/mm/thp_utilization.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Copyright (C) 2022  Meta, Inc.
+ *  Authors: Alexander Zhu, Johannes Weiner, Rik van Riel
+ */
+
+#include <linux/mm.h>
+#include <linux/debugfs.h>
+#include <linux/highmem.h>
+/*
+ * The number of utilization buckets THPs will be grouped in
+ * under /sys/kernel/debug/thp_utilization.
+ */
+#define THP_UTIL_BUCKET_NR 10
+/*
+ * The number of hugepages to scan through on each periodic
+ * run of the scanner that generates /sys/kernel/debug/thp_utilization.
+ */
+#define THP_UTIL_SCAN_SIZE 256
+
+static void thp_utilization_workfn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
+
+struct thp_scan_info_bucket {
+	int nr_thps;
+	int nr_zero_pages;
+};
+
+struct thp_scan_info {
+	struct thp_scan_info_bucket buckets[THP_UTIL_BUCKET_NR];
+	struct zone *scan_zone;
+	struct timespec64 last_scan_duration;
+	struct timespec64 last_scan_time;
+	unsigned long pfn;
+};
+
+/*
+ * thp_scan_debugfs is referred to when /sys/kernel/debug/thp_utilization
+ * is opened. thp_scan is used to keep track fo the current scan through
+ * physical memory.
+ */
+static struct thp_scan_info thp_scan_debugfs;
+static struct thp_scan_info thp_scan;
+
+#ifdef CONFIG_DEBUG_FS
+static int thp_utilization_show(struct seq_file *seqf, void *pos)
+{
+	int i;
+	int start;
+	int end;
+
+	for (i = 0; i < THP_UTIL_BUCKET_NR; i++) {
+		start = i * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR;
+		end = (i + 1 == THP_UTIL_BUCKET_NR)
+			   ? HPAGE_PMD_NR
+			   : ((i + 1) * HPAGE_PMD_NR / THP_UTIL_BUCKET_NR - 1);
+		/* The last bucket will need to contain 100 */
+		seq_printf(seqf, "Utilized[%d-%d]: %d %d\n", start, end,
+			   thp_scan_debugfs.buckets[i].nr_thps,
+			   thp_scan_debugfs.buckets[i].nr_zero_pages);
+	}
+
+	seq_printf(seqf, "Last Scan Time: %lu.%02lus\n",
+		   (unsigned long)thp_scan_debugfs.last_scan_time.tv_sec,
+		   (thp_scan_debugfs.last_scan_time.tv_nsec / (NSEC_PER_SEC / 100)));
+
+	seq_printf(seqf, "Last Scan Duration: %lu.%02lus\n",
+		   (unsigned long)thp_scan_debugfs.last_scan_duration.tv_sec,
+		   (thp_scan_debugfs.last_scan_duration.tv_nsec / (NSEC_PER_SEC / 100)));
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(thp_utilization);
+
+static int __init thp_utilization_debugfs(void)
+{
+	debugfs_create_file("thp_utilization", 0200, NULL, NULL,
+			    &thp_utilization_fops);
+	return 0;
+}
+late_initcall(thp_utilization_debugfs);
+#endif
+
+static int thp_utilization_bucket(int num_utilized_pages)
+{
+	int bucket;
+
+	if (num_utilized_pages < 0 || num_utilized_pages > HPAGE_PMD_NR)
+		return -1;
+
+	/* Group THPs into utilization buckets */
+	bucket = num_utilized_pages * THP_UTIL_BUCKET_NR / HPAGE_PMD_NR;
+	return min(bucket, THP_UTIL_BUCKET_NR - 1);
+}
+
+static int thp_number_utilized_pages(struct folio *folio)
+{
+	int thp_nr_utilized_pages = HPAGE_PMD_NR;
+	void *kaddr;
+	int i;
+
+	if (!folio || !folio_test_anon(folio) || !folio_test_transhuge(folio))
+		return -1;
+
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		kaddr = kmap_local_folio(folio, i);
+		if (memchr_inv(kaddr, 0, PAGE_SIZE))
+			thp_nr_utilized_pages--;
+
+		kunmap_local(kaddr);
+	}
+
+	return thp_nr_utilized_pages;
+}
+
+static void thp_scan_next_zone(void)
+{
+	struct timespec64 current_time;
+	bool update_debugfs;
+	/*
+	 * THP utilization worker thread has reached the end
+	 * of the memory zone. Proceed to the next zone.
+	 */
+	thp_scan.scan_zone = next_zone(thp_scan.scan_zone);
+	update_debugfs = !thp_scan.scan_zone;
+	thp_scan.scan_zone = update_debugfs ? (first_online_pgdat())->node_zones
+			: thp_scan.scan_zone;
+	thp_scan.pfn = (thp_scan.scan_zone->zone_start_pfn + HPAGE_PMD_NR - 1)
+			& ~(HPAGE_PMD_SIZE - 1);
+	if (!update_debugfs)
+		return;
+
+	/*
+	 * If the worker has scanned through all of physical memory then
+	 * update information displayed in /sys/kernel/debug/thp_utilization
+	 */
+	ktime_get_ts64(&current_time);
+	thp_scan_debugfs.last_scan_duration = timespec64_sub(current_time,
+							     thp_scan_debugfs.last_scan_time);
+	thp_scan_debugfs.last_scan_time = current_time;
+
+	memcpy(&thp_scan_debugfs.buckets, &thp_scan.buckets, sizeof(thp_scan.buckets));
+	memset(&thp_scan.buckets, 0, sizeof(thp_scan.buckets));
+}
+
+static void thp_util_scan(unsigned long pfn_end)
+{
+	struct page *page = NULL;
+	int bucket, current_pfn, num_utilized_pages;
+	int i;
+	/*
+	 * Scan through each memory zone in chunks of THP_UTIL_SCAN_SIZE
+	 * PFNs every second looking for anonymous THPs.
+	 */
+	for (i = 0; i < THP_UTIL_SCAN_SIZE; i++) {
+		current_pfn = thp_scan.pfn;
+		thp_scan.pfn += HPAGE_PMD_NR;
+		if (current_pfn >= pfn_end)
+			return;
+
+		page = pfn_to_online_page(current_pfn);
+		if (!page)
+			continue;
+
+		num_utilized_pages = thp_number_utilized_pages(page_folio(page));
+		bucket = thp_utilization_bucket(num_utilized_pages);
+		if (bucket < 0)
+			continue;
+
+		thp_scan.buckets[bucket].nr_thps++;
+		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
+	}
+}
+
+static void thp_utilization_workfn(struct work_struct *work)
+{
+	unsigned long pfn_end;
+	/*
+	 * Worker function that scans through all of physical memory
+	 * for anonymous THPs.
+	 */
+	if (!thp_scan.scan_zone)
+		thp_scan.scan_zone = (first_online_pgdat())->node_zones;
+
+	pfn_end = zone_end_pfn(thp_scan.scan_zone);
+	/* If we have reached the end of the zone or end of physical memory
+	 * move on to the next zone. Otherwise, scan the next PFNs in the
+	 * current zone.
+	 */
+	if (!managed_zone(thp_scan.scan_zone) || thp_scan.pfn >= pfn_end)
+		thp_scan_next_zone();
+	else
+		thp_util_scan(pfn_end);
+
+	schedule_delayed_work(&thp_utilization_work, HZ);
+}
+
+static int __init thp_scan_init(void)
+{
+	schedule_delayed_work(&thp_utilization_work, HZ);
+	return 0;
+}
+subsys_initcall(thp_scan_init);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 2/5] mm: changes to split_huge_page() to free zero filled tail pages
  2022-10-26  0:26 [PATCH v5 0/5] THP Shrinker alexlzhu
  2022-10-26  0:26 ` [PATCH v5 1/5] mm: add thp_utilization metrics to debugfs alexlzhu
@ 2022-10-26  0:26 ` alexlzhu
  2022-10-26  0:26 ` [PATCH v5 3/5] mm: do not remap clean subpages when splitting isolated thp alexlzhu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: alexlzhu @ 2022-10-26  0:26 UTC (permalink / raw)
  To: linux-mm, kernel-team
  Cc: riel, hannes, willy, ningzhang, yuzhao, Alexander Zhu

From: Alexander Zhu <alexlzhu@fb.com>

Currently, when /sys/kernel/mm/transparent_hugepage/enabled=always is set
there are a large number of transparent hugepages that are almost entirely
zero filled.  This is mentioned in a number of previous patchsets
including:
https://lore.kernel.org/all/20210731063938.1391602-1-yuzhao@google.com/
https://lore.kernel.org/all/
1635422215-99394-1-git-send-email-ningzhang@linux.alibaba.com/

Currently, split_huge_page() does not have a way to identify zero filled
pages within the THP. Thus these zero pages get remapped and continue to
create memory waste. In this patch, we identify and free tail pages that
are zero filled in split_huge_page(). In this way, we avoid mapping these
pages back into page table entries and can free up unused memory within
THPs. This is based off the previously mentioned patchset by Yu Zhao.
However, we chose to free anonymous zero tail pages whenever they are
encountered instead of only on reclaim or migration.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v4 to v5
-split out split_huge_page changes into three different patches. One for zapping zero pages, one for not remapping zero pages, and one for self tests.

 include/linux/vm_event_item.h |  3 +++
 mm/huge_memory.c              | 37 +++++++++++++++++++++++++++++++++++
 mm/vmstat.c                   |  3 +++
 3 files changed, 43 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3518dba1e02f..3618b10ddec9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -111,6 +111,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 		THP_SPLIT_PUD,
 #endif
+		THP_SPLIT_FREE,
+		THP_SPLIT_UNMAP,
+		THP_SPLIT_REMAP_READONLY_ZERO_PAGE,
 		THP_ZERO_PAGE_ALLOC,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cc4a5f4791e..363218e8a22a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2496,6 +2496,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
+	LIST_HEAD(pages_to_free);
+	int nr_pages_to_free = 0;
 	int i;
 
 	/* complete memcg works before add pages to LRU */
@@ -2572,6 +2574,34 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 			continue;
 		unlock_page(subpage);
 
+		/*
+		 * If a tail page has only two references left, one inherited
+		 * from the isolation of its head and the other from
+		 * lru_add_page_tail() which we are about to drop, it means this
+		 * tail page was concurrently zapped. Then we can safely free it
+		 * and save page reclaim or migration the trouble of trying it.
+		 */
+		if (list && page_ref_freeze(subpage, 2)) {
+			VM_BUG_ON_PAGE(PageLRU(subpage), subpage);
+			VM_BUG_ON_PAGE(PageCompound(subpage), subpage);
+			VM_BUG_ON_PAGE(page_mapped(subpage), subpage);
+
+			ClearPageActive(subpage);
+			ClearPageUnevictable(subpage);
+			list_move(&subpage->lru, &pages_to_free);
+			nr_pages_to_free++;
+			continue;
+		}
+
+		/*
+		 * If a tail page has only one reference left, it will be freed
+		 * by the call to free_page_and_swap_cache below. Since zero
+		 * subpages are no longer remapped, there will only be one
+		 * reference left in cases outside of reclaim or migration.
+		 */
+		if (page_ref_count(subpage) == 1)
+			nr_pages_to_free++;
+
 		/*
 		 * Subpages may be freed if there wasn't any mapping
 		 * like if add_to_swap() is running on a lru page that
@@ -2581,6 +2611,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		 */
 		free_page_and_swap_cache(subpage);
 	}
+
+	if (!nr_pages_to_free)
+		return;
+
+	mem_cgroup_uncharge_list(&pages_to_free);
+	free_unref_page_list(&pages_to_free);
+	count_vm_events(THP_SPLIT_FREE, nr_pages_to_free);
 }
 
 /* Racy check whether the huge page can be split */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b2371d745e00..3d802eb6754d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1359,6 +1359,9 @@ const char * const vmstat_text[] = {
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	"thp_split_pud",
 #endif
+	"thp_split_free",
+	"thp_split_unmap",
+	"thp_split_remap_readonly_zero_page",
 	"thp_zero_page_alloc",
 	"thp_zero_page_alloc_failed",
 	"thp_swpout",
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 3/5] mm: do not remap clean subpages when splitting isolated thp
  2022-10-26  0:26 [PATCH v5 0/5] THP Shrinker alexlzhu
  2022-10-26  0:26 ` [PATCH v5 1/5] mm: add thp_utilization metrics to debugfs alexlzhu
  2022-10-26  0:26 ` [PATCH v5 2/5] mm: changes to split_huge_page() to free zero filled tail pages alexlzhu
@ 2022-10-26  0:26 ` alexlzhu
  2022-10-26  0:26 ` [PATCH v5 4/5] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages alexlzhu
  2022-10-26  0:26 ` [PATCH v5 5/5] mm: THP low utilization shrinker alexlzhu
  4 siblings, 0 replies; 6+ messages in thread
From: alexlzhu @ 2022-10-26  0:26 UTC (permalink / raw)
  To: linux-mm, kernel-team
  Cc: riel, hannes, willy, ningzhang, yuzhao, Alexander Zhu

From: Alexander Zhu <alexlzhu@fb.com>

Changes to avoid remap on zero pages that are free'd in split_huge_page().
Pages are not remapped except in the case of userfaultfd. In the case
of userfaultfd we remap to the shared zero page, similar to what is
done by KSM.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v1 to v2

-Only trigger the unmap_clean/zap in split_huge_page on anonymous THPs. We cannot zap zero pages for file THPs.

RFC to v1
-Added support to map to the read only zero page when splitting a THP registered with userfaultfd. 

 include/linux/rmap.h |  2 +-
 mm/huge_memory.c     |  8 ++---
 mm/migrate.c         | 73 +++++++++++++++++++++++++++++++++++++++-----
 mm/migrate_device.c  |  4 +--
 4 files changed, 73 insertions(+), 14 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..3f83bbcf1333 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -428,7 +428,7 @@ int folio_mkclean(struct folio *);
 int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages, pgoff_t pgoff,
 		      struct vm_area_struct *vma);
 
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked);
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked, bool unmap_clean);
 
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 363218e8a22a..f68a353e0adf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2373,7 +2373,7 @@ static void unmap_folio(struct folio *folio)
 		try_to_unmap(folio, ttu_flags | TTU_IGNORE_MLOCK);
 }
 
-static void remap_page(struct folio *folio, unsigned long nr)
+static void remap_page(struct folio *folio, unsigned long nr, bool unmap_clean)
 {
 	int i = 0;
 
@@ -2381,7 +2381,7 @@ static void remap_page(struct folio *folio, unsigned long nr)
 	if (!folio_test_anon(folio))
 		return;
 	for (;;) {
-		remove_migration_ptes(folio, folio, true);
+		remove_migration_ptes(folio, folio, true, unmap_clean);
 		i += folio_nr_pages(folio);
 		if (i >= nr)
 			break;
@@ -2560,7 +2560,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 	local_irq_enable();
 
-	remap_page(folio, nr);
+	remap_page(folio, nr, PageAnon(head));
 
 	if (PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
@@ -2789,7 +2789,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		if (mapping)
 			xas_unlock(&xas);
 		local_irq_enable();
-		remap_page(folio, folio_nr_pages(folio));
+		remap_page(folio, folio_nr_pages(folio), false);
 		ret = -EBUSY;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 1379e1912772..bc96a084d925 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -30,6 +30,7 @@
 #include <linux/writeback.h>
 #include <linux/mempolicy.h>
 #include <linux/vmalloc.h>
+#include <linux/vm_event_item.h>
 #include <linux/security.h>
 #include <linux/backing-dev.h>
 #include <linux/compaction.h>
@@ -168,13 +169,62 @@ void putback_movable_pages(struct list_head *l)
 	}
 }
 
+static bool try_to_unmap_clean(struct page_vma_mapped_walk *pvmw, struct page *page)
+{
+	void *addr;
+	bool dirty;
+	pte_t newpte;
+
+	VM_BUG_ON_PAGE(PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(pte_present(*pvmw->pte), page);
+
+	if (PageMlocked(page) || (pvmw->vma->vm_flags & VM_LOCKED))
+		return false;
+
+	/*
+	 * The pmd entry mapping the old thp was flushed and the pte mapping
+	 * this subpage has been non present. Therefore, this subpage is
+	 * inaccessible. We don't need to remap it if it contains only zeros.
+	 */
+	addr = kmap_local_page(page);
+	dirty = memchr_inv(addr, 0, PAGE_SIZE);
+	kunmap_local(addr);
+
+	if (dirty)
+		return false;
+
+	pte_clear_not_present_full(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, false);
+
+	if (userfaultfd_armed(pvmw->vma)) {
+		newpte = pte_mkspecial(pfn_pte(page_to_pfn(ZERO_PAGE(pvmw->address)),
+					       pvmw->vma->vm_page_prot));
+		ptep_clear_flush(pvmw->vma, pvmw->address, pvmw->pte);
+		set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte);
+		dec_mm_counter(pvmw->vma->vm_mm, MM_ANONPAGES);
+		count_vm_event(THP_SPLIT_REMAP_READONLY_ZERO_PAGE);
+		return true;
+	}
+
+	dec_mm_counter(pvmw->vma->vm_mm, mm_counter(page));
+	count_vm_event(THP_SPLIT_UNMAP);
+	return true;
+}
+
+struct rmap_walk_arg {
+	struct folio *folio;
+	bool unmap_clean;
+};
+
 /*
  * Restore a potential migration pte to a working pte entry
  */
 static bool remove_migration_pte(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long addr, void *old)
+		struct vm_area_struct *vma, unsigned long addr, void *arg)
 {
-	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+	struct rmap_walk_arg *rmap_walk_arg = arg;
+	DEFINE_FOLIO_VMA_WALK(pvmw, rmap_walk_arg->folio, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		rmap_t rmap_flags = RMAP_NONE;
@@ -197,6 +247,8 @@ static bool remove_migration_pte(struct folio *folio,
 			continue;
 		}
 #endif
+		if (rmap_walk_arg->unmap_clean && try_to_unmap_clean(&pvmw, new))
+			continue;
 
 		folio_get(folio);
 		pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
@@ -272,13 +324,20 @@ static bool remove_migration_pte(struct folio *folio,
  * Get rid of all migration entries and replace them by
  * references to the indicated page.
  */
-void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
+void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked, bool unmap_clean)
 {
+	struct rmap_walk_arg rmap_walk_arg = {
+		.folio = src,
+		.unmap_clean = unmap_clean,
+	};
+
 	struct rmap_walk_control rwc = {
 		.rmap_one = remove_migration_pte,
-		.arg = src,
+		.arg = &rmap_walk_arg,
 	};
 
+	VM_BUG_ON_FOLIO(unmap_clean && src != dst, src);
+
 	if (locked)
 		rmap_walk_locked(dst, &rwc);
 	else
@@ -872,7 +931,7 @@ static int writeout(struct address_space *mapping, struct folio *folio)
 	 * At this point we know that the migration attempt cannot
 	 * be successful.
 	 */
-	remove_migration_ptes(folio, folio, false);
+	remove_migration_ptes(folio, folio, false, false);
 
 	rc = mapping->a_ops->writepage(&folio->page, &wbc);
 
@@ -1128,7 +1187,7 @@ static int __unmap_and_move(struct folio *src, struct folio *dst,
 
 	if (page_was_mapped)
 		remove_migration_ptes(src,
-			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+			rc == MIGRATEPAGE_SUCCESS ? dst : src, false, false);
 
 out_unlock_both:
 	folio_unlock(dst);
@@ -1338,7 +1397,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 
 	if (page_was_mapped)
 		remove_migration_ptes(src,
-			rc == MIGRATEPAGE_SUCCESS ? dst : src, false);
+			rc == MIGRATEPAGE_SUCCESS ? dst : src, false, false);
 
 unlock_put_anon:
 	folio_unlock(dst);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6fa682eef7a0..6508a083d7fd 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -421,7 +421,7 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
 			continue;
 
 		folio = page_folio(page);
-		remove_migration_ptes(folio, folio, false);
+		remove_migration_ptes(folio, folio, false, false);
 
 		src_pfns[i] = 0;
 		folio_unlock(folio);
@@ -847,7 +847,7 @@ void migrate_device_finalize(unsigned long *src_pfns,
 
 		src = page_folio(page);
 		dst = page_folio(newpage);
-		remove_migration_ptes(src, dst, false);
+		remove_migration_ptes(src, dst, false, false);
 		folio_unlock(src);
 
 		if (is_zone_device_page(page))
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 4/5] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages
  2022-10-26  0:26 [PATCH v5 0/5] THP Shrinker alexlzhu
                   ` (2 preceding siblings ...)
  2022-10-26  0:26 ` [PATCH v5 3/5] mm: do not remap clean subpages when splitting isolated thp alexlzhu
@ 2022-10-26  0:26 ` alexlzhu
  2022-10-26  0:26 ` [PATCH v5 5/5] mm: THP low utilization shrinker alexlzhu
  4 siblings, 0 replies; 6+ messages in thread
From: alexlzhu @ 2022-10-26  0:26 UTC (permalink / raw)
  To: linux-mm, kernel-team
  Cc: riel, hannes, willy, ningzhang, yuzhao, Alexander Zhu

From: Alexander Zhu <alexlzhu@fb.com>

Self tests to verify the RssAnon value to make sure zero
pages are not remapped except in the case of userfaultfd.
Also includes a self test for the userfaultfd use case.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v1 to v2
-Modified split_huge_page self test based off more recent changes. 

RFC to v1
Also added a self test to verify that userfaultfd change is working.

 .../selftests/vm/split_huge_page_test.c       | 115 +++++++++++++++++-
 tools/testing/selftests/vm/vm_util.c          |  23 ++++
 tools/testing/selftests/vm/vm_util.h          |   3 +
 3 files changed, 140 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c
index 76e1c36dd9e5..42f0e79a4508 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -16,6 +16,9 @@
 #include <sys/mount.h>
 #include <malloc.h>
 #include <stdbool.h>
+#include <sys/syscall.h> /* Definition of SYS_* constants */
+#include <linux/userfaultfd.h>
+#include <sys/ioctl.h>
 #include "vm_util.h"
 
 uint64_t pagesize;
@@ -88,6 +91,115 @@ static void write_debugfs(const char *fmt, ...)
 	}
 }
 
+static char *allocate_zero_filled_hugepage(size_t len)
+{
+	char *result;
+	size_t i;
+
+	result = memalign(pmd_pagesize, len);
+	if (!result) {
+		printf("Fail to allocate memory\n");
+		exit(EXIT_FAILURE);
+	}
+
+	madvise(result, len, MADV_HUGEPAGE);
+
+	for (i = 0; i < len; i++)
+		result[i] = (char)0;
+
+	return result;
+}
+
+static void verify_rss_anon_split_huge_page_all_zeroes(char *one_page, int nr_hpages, size_t len)
+{
+	uint64_t rss_anon_before, rss_anon_after;
+	size_t i;
+
+	if (!check_huge_anon(one_page, 4, pmd_pagesize)) {
+		printf("No THP is allocated\n");
+		exit(EXIT_FAILURE);
+	}
+
+	rss_anon_before = rss_anon();
+	if (!rss_anon_before) {
+		printf("No RssAnon is allocated before split\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* split all THPs */
+	write_debugfs(PID_FMT, getpid(), (uint64_t)one_page,
+		      (uint64_t)one_page + len);
+
+	for (i = 0; i < len; i++)
+		if (one_page[i] != (char)0) {
+			printf("%ld byte corrupted\n", i);
+			exit(EXIT_FAILURE);
+		}
+
+	if (!check_huge_anon(one_page, 0, pmd_pagesize)) {
+		printf("Still AnonHugePages not split\n");
+		exit(EXIT_FAILURE);
+	}
+
+	rss_anon_after = rss_anon();
+	if (rss_anon_after >= rss_anon_before) {
+		printf("Incorrect RssAnon value. Before: %ld After: %ld\n",
+		       rss_anon_before, rss_anon_after);
+		exit(EXIT_FAILURE);
+	}
+}
+
+void split_pmd_zero_pages(void)
+{
+	char *one_page;
+	int nr_hpages = 4;
+	size_t len = nr_hpages * pmd_pagesize;
+
+	one_page = allocate_zero_filled_hugepage(len);
+	verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+	printf("Split zero filled huge pages successful\n");
+	free(one_page);
+}
+
+void split_pmd_zero_pages_uffd(void)
+{
+	char *one_page;
+	int nr_hpages = 4;
+	size_t len = nr_hpages * pmd_pagesize;
+	long uffd; /* userfaultfd file descriptor */
+	struct uffdio_api uffdio_api;
+	struct uffdio_register uffdio_register;
+
+	/* Create and enable userfaultfd object. */
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd == -1) {
+		perror("userfaultfd");
+		exit(1);
+	}
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
+		perror("ioctl-UFFDIO_API");
+		exit(1);
+	}
+
+	one_page = allocate_zero_filled_hugepage(len);
+
+	uffdio_register.range.start = (unsigned long)one_page;
+	uffdio_register.range.len = len;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
+		perror("ioctl-UFFDIO_REGISTER");
+		exit(1);
+	}
+
+	verify_rss_anon_split_huge_page_all_zeroes(one_page, nr_hpages, len);
+	printf("Split zero filled huge pages with uffd successful\n");
+	free(one_page);
+}
+
 void split_pmd_thp(void)
 {
 	char *one_page;
@@ -121,7 +233,6 @@ void split_pmd_thp(void)
 			exit(EXIT_FAILURE);
 		}
 
-
 	if (check_huge_anon(one_page, 0, pmd_pagesize)) {
 		printf("Still AnonHugePages not split\n");
 		exit(EXIT_FAILURE);
@@ -301,6 +412,8 @@ int main(int argc, char **argv)
 	pageshift = ffs(pagesize) - 1;
 	pmd_pagesize = read_pmd_pagesize();
 
+	split_pmd_zero_pages();
+	split_pmd_zero_pages_uffd();
 	split_pmd_thp();
 	split_pte_mapped_thp();
 	split_file_backed_thp();
diff --git a/tools/testing/selftests/vm/vm_util.c b/tools/testing/selftests/vm/vm_util.c
index f11f8adda521..72f3edc64aaf 100644
--- a/tools/testing/selftests/vm/vm_util.c
+++ b/tools/testing/selftests/vm/vm_util.c
@@ -6,6 +6,7 @@
 
 #define PMD_SIZE_FILE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
 #define SMAP_FILE_PATH "/proc/self/smaps"
+#define STATUS_FILE_PATH "/proc/self/status"
 #define MAX_LINE_LENGTH 500
 
 uint64_t pagemap_get_entry(int fd, char *start)
@@ -72,6 +73,28 @@ uint64_t read_pmd_pagesize(void)
 	return strtoul(buf, NULL, 10);
 }
 
+uint64_t rss_anon(void)
+{
+	uint64_t rss_anon = 0;
+	int ret;
+	FILE *fp;
+	char buffer[MAX_LINE_LENGTH];
+
+	fp = fopen(STATUS_FILE_PATH, "r");
+	if (!fp)
+		ksft_exit_fail_msg("%s: Failed to open file %s\n", __func__, STATUS_FILE_PATH);
+
+	if (!check_for_pattern(fp, "RssAnon:", buffer, sizeof(buffer)))
+		goto err_out;
+
+	if (sscanf(buffer, "RssAnon:%10ld kB", &rss_anon) != 1)
+		ksft_exit_fail_msg("Reading status error\n");
+
+err_out:
+	fclose(fp);
+	return rss_anon;
+}
+
 bool __check_huge(void *addr, char *pattern, int nr_hpages,
 		  uint64_t hpage_size)
 {
diff --git a/tools/testing/selftests/vm/vm_util.h b/tools/testing/selftests/vm/vm_util.h
index 5c35de454e08..dd1885f66097 100644
--- a/tools/testing/selftests/vm/vm_util.h
+++ b/tools/testing/selftests/vm/vm_util.h
@@ -1,12 +1,15 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <stdint.h>
 #include <stdbool.h>
+#include <stddef.h>
+#include <stdio.h>
 
 uint64_t pagemap_get_entry(int fd, char *start);
 bool pagemap_is_softdirty(int fd, char *start);
 void clear_softdirty(void);
 bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
 uint64_t read_pmd_pagesize(void);
+uint64_t rss_anon(void);
 bool check_huge_anon(void *addr, int nr_hpages, uint64_t hpage_size);
 bool check_huge_file(void *addr, int nr_hpages, uint64_t hpage_size);
 bool check_huge_shmem(void *addr, int nr_hpages, uint64_t hpage_size);
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v5 5/5] mm: THP low utilization shrinker
  2022-10-26  0:26 [PATCH v5 0/5] THP Shrinker alexlzhu
                   ` (3 preceding siblings ...)
  2022-10-26  0:26 ` [PATCH v5 4/5] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages alexlzhu
@ 2022-10-26  0:26 ` alexlzhu
  4 siblings, 0 replies; 6+ messages in thread
From: alexlzhu @ 2022-10-26  0:26 UTC (permalink / raw)
  To: linux-mm, kernel-team
  Cc: riel, hannes, willy, ningzhang, yuzhao, Alexander Zhu

From: Alexander Zhu <alexlzhu@fb.com>

This patch introduces a shrinker that will remove THPs in the lowest
utilization bucket. As previously mentioned, we have observed that
almost all of the memory waste when THPs are always enabled
is contained in the lowest utilization bucket. The shrinker will
add these THPs to a list_lru and split anonymous THPs based off
information from kswapd. It requires the changes from
thp_utilization to identify the least utilized THPs, and the
changes to split_huge_page to identify and free zero pages
within THPs.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v4 to v5
-fixed bug with lru_to_folio, was corrupting the folio 
-fixed bug with memchr_inv in mm/thp_utilization. zero page should mean !memchr_inv(kaddr, 0, PAGE_SIZE)  


v3 to v4

-added some comments regardling trylock
-change the relock to be unconditional in low_util_free_page
-only expose can_shrink_thp, abstract the thp_utilization and bucket logic to be private to mm/thp_utilization.c

v2 to v3
-put_page() after trylock_page in low_util_free_page. put() to be called after get() call 
-removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.    
-moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
-moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru  
-removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed. 

v1 to v2
-Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
-Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page() and the THP Shrinker
-Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. That leads to deadlock as split_huge_page calls on_each_cpu_mask 
-Changed list_lru_shrink_walk to list_lru_shrink_walk_irq. 

RFC to v1
-Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing, we have found that there are an almost trivial number of THPs in the middle range of buckets that account for most of the memory waste. 
-Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru. 
-Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC. 

 include/linux/huge_mm.h  |   9 ++++
 include/linux/list_lru.h |  24 +++++++++
 include/linux/mm_types.h |   5 ++
 mm/huge_memory.c         | 110 ++++++++++++++++++++++++++++++++++++++-
 mm/list_lru.c            |  49 +++++++++++++++++
 mm/page_alloc.c          |   6 +++
 mm/thp_utilization.c     |  21 +++++++-
 7 files changed, 221 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a1341fdcf666..1745c94eb103 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -178,6 +178,8 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
+bool can_shrink_thp(struct folio *folio);
+
 void prep_transhuge_page(struct page *page);
 void free_transhuge_page(struct page *page);
 
@@ -189,6 +191,8 @@ static inline int split_huge_page(struct page *page)
 }
 void deferred_split_huge_page(struct page *page);
 
+void add_underutilized_thp(struct page *page);
+
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze, struct folio *folio);
 
@@ -302,6 +306,11 @@ static inline struct list_head *page_deferred_list(struct page *page)
 	return &page[2].deferred_list;
 }
 
+static inline struct list_head *page_underutilized_thp_list(struct page *page)
+{
+	return &page[3].underutilized_thp_list;
+}
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b35968ee9fb5..c2cf146ea880 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
  */
 bool list_lru_add(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_add_page: add an element to the lru list's tail
+ * @list_lru: the lru pointer
+ * @page: the page containing the item
+ * @item: the item to be deleted.
+ *
+ * This function works the same as list_lru_add in terms of list
+ * manipulation. Used for non slab objects contained in the page.
+ *
+ * Return value: true if the list was updated, false otherwise
+ */
+bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
 /**
  * list_lru_del: delete an element to the lru list
  * @list_lru: the lru pointer
@@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
  */
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_del_page: delete an element to the lru list
+ * @list_lru: the lru pointer
+ * @page: the page containing the item
+ * @item: the item to be deleted.
+ *
+ * This function works the same as list_lru_del in terms of list
+ * manipulation. Used for non slab objects contained in the page.
+ *
+ * Return value: true if the list was updated, false otherwise
+ */
+bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
 /**
  * list_lru_count_one: return the number of objects currently held by @lru
  * @lru: the lru pointer.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..da1d1cf42158 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,6 +152,11 @@ struct page {
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
+		struct { /* Third tail page of compound page */
+			unsigned long _compound_pad_3; /* compound_head */
+			unsigned long _compound_pad_4;
+			struct list_head underutilized_thp_list;
+		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f68a353e0adf..4e6d3e88a4bf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -71,6 +71,8 @@ static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
+static struct list_lru huge_low_util_page_lru;
+
 bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
 			bool smaps, bool in_pf, bool enforce_sysfs)
 {
@@ -234,6 +236,53 @@ static struct shrinker huge_zero_page_shrinker = {
 	.seeks = DEFAULT_SEEKS,
 };
 
+static enum lru_status low_util_free_page(struct list_head *item,
+					  struct list_lru_one *lru,
+					  spinlock_t *lru_lock,
+					  void *cb_arg)
+{
+	struct folio *folio = page_folio(list_entry(item, struct page, underutilized_thp_list));
+	struct page *head = &folio->page;
+
+	if (get_page_unless_zero(head)) {
+		/* Inverse lock order from add_underutilized_thp() */
+		if (!trylock_page(head)) {
+			put_page(head);
+			return LRU_SKIP;
+		}
+		list_lru_isolate(lru, item);
+		spin_unlock_irq(lru_lock);
+		if (can_shrink_thp(folio))
+			split_huge_page(head);
+		spin_lock_irq(lru_lock);
+		unlock_page(head);
+		put_page(head);
+	}
+
+	return LRU_REMOVED_RETRY;
+}
+
+static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
+						     struct shrink_control *sc)
+{
+	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
+}
+
+static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
+						    struct shrink_control *sc)
+{
+	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
+							sc, low_util_free_page, NULL);
+}
+
+static struct shrinker huge_low_util_page_shrinker = {
+	.count_objects = shrink_huge_low_util_page_count,
+	.scan_objects = shrink_huge_low_util_page_scan,
+	.seeks = DEFAULT_SEEKS,
+	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
+		SHRINKER_NONSLAB,
+};
+
 #ifdef CONFIG_SYSFS
 static ssize_t enabled_show(struct kobject *kobj,
 			    struct kobj_attribute *attr, char *buf)
@@ -485,6 +534,9 @@ static int __init hugepage_init(void)
 	if (err)
 		goto err_slab;
 
+	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
+	if (err)
+		goto err_low_util_shrinker;
 	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
 	if (err)
 		goto err_hzp_shrinker;
@@ -492,6 +544,9 @@ static int __init hugepage_init(void)
 	if (err)
 		goto err_split_shrinker;
 
+	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
+	if (err)
+		goto err_low_util_list_lru;
 	/*
 	 * By default disable transparent hugepages on smaller systems,
 	 * where the extra memory used could hurt more than TLB overhead
@@ -508,10 +563,14 @@ static int __init hugepage_init(void)
 
 	return 0;
 err_khugepaged:
+	list_lru_destroy(&huge_low_util_page_lru);
+err_low_util_list_lru:
 	unregister_shrinker(&deferred_split_shrinker);
 err_split_shrinker:
 	unregister_shrinker(&huge_zero_page_shrinker);
 err_hzp_shrinker:
+	unregister_shrinker(&huge_low_util_page_shrinker);
+err_low_util_shrinker:
 	khugepaged_destroy();
 err_slab:
 	hugepage_exit_sysfs(hugepage_kobj);
@@ -586,6 +645,7 @@ void prep_transhuge_page(struct page *page)
 	 */
 
 	INIT_LIST_HEAD(page_deferred_list(page));
+	INIT_LIST_HEAD(page_underutilized_thp_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
@@ -2451,8 +2511,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 			 LRU_GEN_MASK | LRU_REFS_MASK));
 
 	/* ->mapping in first tail page is compound_mapcount */
-	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
-			page_tail);
+	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
 	page_tail->private = 0;
@@ -2660,6 +2719,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	struct folio *folio = page_folio(page);
 	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
+	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int extra_pins, ret;
@@ -2767,6 +2827,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			list_del(page_deferred_list(&folio->page));
 		}
 		spin_unlock(&ds_queue->split_queue_lock);
+		/* Frozen refs lock out additions, test can be lockless */
+		if (!list_empty(underutilized_thp_list))
+			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
+					  underutilized_thp_list);
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
@@ -2809,6 +2873,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 void free_transhuge_page(struct page *page)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
+	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -2817,6 +2882,13 @@ void free_transhuge_page(struct page *page)
 		list_del(page_deferred_list(page));
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	/* A dead page cannot be re-added to the THP shrinker, test can be lockless */
+	if (!list_empty(underutilized_thp_list))
+		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
+
+	if (PageLRU(page))
+		__folio_clear_lru_flags(page_folio(page));
+
 	free_compound_page(page);
 }
 
@@ -2857,6 +2929,40 @@ void deferred_split_huge_page(struct page *page)
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 }
 
+void add_underutilized_thp(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+	VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+	if (PageSwapCache(page))
+		return;
+
+	/*
+	 * Need to take a reference on the page to prevent the page from getting free'd from
+	 * under us while we are adding the THP to the shrinker.
+	 */
+	if (!get_page_unless_zero(page))
+		return;
+
+	if (is_huge_zero_page(page))
+		goto out_put;
+
+	/* Stabilize page->memcg to allocate and add to the same list */
+	lock_page(page);
+
+#ifdef CONFIG_MEMCG_KMEM
+	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
+		goto out_unlock;
+#endif
+
+	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
+
+out_unlock:
+	unlock_page(page);
+out_put:
+	put_page(page);
+}
+
 static unsigned long deferred_split_count(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
diff --git a/mm/list_lru.c b/mm/list_lru.c
index a05e5bef3b40..8cc56a84b554 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
 
+bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
+{
+	int nid = page_to_nid(page);
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l;
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
+	spin_lock_irqsave(&nlru->lock, flags);
+	if (list_empty(item)) {
+		memcg = page_memcg(page);
+		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+		list_add_tail(item, &l->list);
+		/* Set shrinker bit if the first element was added */
+		if (!l->nr_items++)
+			set_shrinker_bit(memcg, nid,
+					 lru_shrinker_id(lru));
+		nlru->nr_items++;
+		spin_unlock_irqrestore(&nlru->lock, flags);
+		return true;
+	}
+	spin_unlock_irqrestore(&nlru->lock, flags);
+	return false;
+}
+EXPORT_SYMBOL_GPL(list_lru_add_page);
+
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
@@ -160,6 +186,29 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
+bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
+{
+	int nid = page_to_nid(page);
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l;
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
+	spin_lock_irqsave(&nlru->lock, flags);
+	if (!list_empty(item)) {
+		memcg = page_memcg(page);
+		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+		list_del_init(item);
+		l->nr_items--;
+		nlru->nr_items--;
+		spin_unlock_irqrestore(&nlru->lock, flags);
+		return true;
+	}
+	spin_unlock_irqrestore(&nlru->lock, flags);
+	return false;
+}
+EXPORT_SYMBOL_GPL(list_lru_del_page);
+
 void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
 {
 	list_del_init(item);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e20ade858e71..31380526a9f4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
 		 * deferred_list.next -- ignore value.
 		 */
 		break;
+	case 3:
+		/*
+		 * the third tail page: ->mapping is
+		 * underutilized_thp_list.next -- ignore value.
+		 */
+		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {
 			bad_page(page, "corrupted mapping in tail page");
diff --git a/mm/thp_utilization.c b/mm/thp_utilization.c
index 7b79f8759d12..0cb18f122c57 100644
--- a/mm/thp_utilization.c
+++ b/mm/thp_utilization.c
@@ -98,13 +98,16 @@ static int thp_number_utilized_pages(struct folio *folio)
 	int thp_nr_utilized_pages = HPAGE_PMD_NR;
 	void *kaddr;
 	int i;
+	bool zero_page;
 
 	if (!folio || !folio_test_anon(folio) || !folio_test_transhuge(folio))
 		return -1;
 
 	for (i = 0; i < folio_nr_pages(folio); i++) {
 		kaddr = kmap_local_folio(folio, i);
-		if (memchr_inv(kaddr, 0, PAGE_SIZE))
+		zero_page = !memchr_inv(kaddr, 0, PAGE_SIZE);
+
+		if (zero_page)
 			thp_nr_utilized_pages--;
 
 		kunmap_local(kaddr);
@@ -113,6 +116,19 @@ static int thp_number_utilized_pages(struct folio *folio)
 	return thp_nr_utilized_pages;
 }
 
+bool can_shrink_thp(struct folio *folio)
+{
+	int bucket, num_utilized_pages;
+
+	if (!folio || !folio_test_anon(folio) || !folio_test_transhuge(folio))
+		return false;
+
+	num_utilized_pages = thp_number_utilized_pages(folio);
+	bucket = thp_utilization_bucket(num_utilized_pages);
+
+	return bucket >= 0 && bucket < THP_UTIL_BUCKET_NR - 1;
+}
+
 static void thp_scan_next_zone(void)
 {
 	struct timespec64 current_time;
@@ -167,6 +183,9 @@ static void thp_util_scan(unsigned long pfn_end)
 		if (bucket < 0)
 			continue;
 
+		if (bucket < THP_UTIL_BUCKET_NR - 1)
+			add_underutilized_thp(page);
+
 		thp_scan.buckets[bucket].nr_thps++;
 		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
 	}
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-10-26  0:29 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-26  0:26 [PATCH v5 0/5] THP Shrinker alexlzhu
2022-10-26  0:26 ` [PATCH v5 1/5] mm: add thp_utilization metrics to debugfs alexlzhu
2022-10-26  0:26 ` [PATCH v5 2/5] mm: changes to split_huge_page() to free zero filled tail pages alexlzhu
2022-10-26  0:26 ` [PATCH v5 3/5] mm: do not remap clean subpages when splitting isolated thp alexlzhu
2022-10-26  0:26 ` [PATCH v5 4/5] mm: add selftests to split_huge_page() to verify unmap/zap of zero pages alexlzhu
2022-10-26  0:26 ` [PATCH v5 5/5] mm: THP low utilization shrinker alexlzhu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).