* [PATCH 0/7] Split huge pages to any lower order pages and selftests.
@ 2020-11-19 16:05 Zi Yan
From: Zi Yan @ 2020-11-19 16:05 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox, Kirill A . Shutemov
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, linux-kselftest,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

With Matthew's THP in pagecache patches[1], we will be able to handle any size
pagecache THPs, but currently split_huge_page can only split a THP to order-0
pages. This can easily erase the benefit of having pagecache THPs, since
operations like truncate might want to keep pages larger than order-0. In
response, here are the patches adding support for splitting a THP to any lower
order pages. In addition, this patchset prepares for my PUD THP patchset[2],
since splitting a PUD THP into multiple PMD THPs can be handled by the
split_huge_page_to_list_to_order function added here, which avoids much of the
redundant code that replicating split_huge_page for PUD THPs would require.

To help test splitting huge pages, I added a new debugfs interface
at <debugfs>/split_huge_pages_in_range_pid, so developers can split THPs in a
given virtual address range of a process with a given pid by writing
"<pid>,<vaddr_start>,<vaddr_end>,<to_order>" to the interface. I also added a
new test program to test 1) splitting PMD THPs, 2) splitting PTE-mapped THPs,
3) splitting pagecache THPs to any lower order, 4) truncating a pagecache
THP to a page with a lower order, and 5) punching holes in a pagecache THP to
cause it to be split into lower order THPs.
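
For example (the pid and addresses below are made up for illustration, and
debugfs is assumed to be mounted at /sys/kernel/debug), splitting every THP
mapped in a 16 MB range of process 1234 down to order-4 pages would be
triggered with:

    echo "1234,0x700000000000,0x700001000000,4" > \
        /sys/kernel/debug/split_huge_pages_in_range_pid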

The patchset is on top of Matthew's pagecache/next tree[3].

* Patch 1 is cherry-picked from Matthew's recent xarray fix [4] just to make sure
  Patches 3 to 7 can run without problems. I will let Matthew decide how it
  should get picked up.
* Patch 2 is self-contained and can be merged if it looks OK.

Comments and/or suggestions are welcome.

ChangeLog
===
From RFC:
1. Fixed debugfs to handle splitting PTE-mapped THPs properly and added stats
   for split THPs.
2. Added a new test case for splitting PTE-mapped THPs. Each of the four PTEs
   points to a different subpage from four THPs, and the test uses kpageflags
   to check whether a PTE points to a THP or not (AnonHugePages from smaps
   does not show PTE-mapped THPs).
3. mem_cgroup_split_huge_fixup() takes order instead of nr.
4. split_page_owner takes old_order and new_order instead of nr and new_order.
5. Corrected __split_page_owner declaration and fixed its implementation when
   splitting a THP to a new order.
6. Renamed left to remaining in truncate_inode_partial_page().
7. Use VM_BUG_ON instead of WARN_ONCE when splitting a THP to the unsupported
   order-1 and when splitting anonymous THPs to non-zero orders.
8. Added punching holes in a file as a new pagecache THP split test case, which
   uncovered an xarray bug.


[1] https://lore.kernel.org/linux-mm/20201029193405.29125-1-willy@infradead.org/
[2] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi.yan@sent.com/
[3] https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/next
[4] https://git.infradead.org/users/willy/xarray.git


Matthew Wilcox (Oracle) (1):
  XArray: Fix splitting to non-zero orders

Zi Yan (6):
  mm: huge_memory: add new debugfs interface to trigger split huge page
    on any page range.
  mm: memcg: make memcg huge page split support any order split.
  mm: page_owner: add support for splitting to any order in split
    page_owner.
  mm: thp: split huge page to any lower order pages.
  mm: truncate: split thp to a non-zero order if possible.
  mm: huge_memory: enable debugfs to split huge pages to any order.

 include/linux/huge_mm.h                       |   8 +
 include/linux/memcontrol.h                    |   5 +-
 include/linux/page_owner.h                    |  10 +-
 lib/test_xarray.c                             |  26 +-
 lib/xarray.c                                  |   4 +-
 mm/huge_memory.c                              | 219 ++++++--
 mm/internal.h                                 |   1 +
 mm/memcontrol.c                               |   6 +-
 mm/migrate.c                                  |   2 +-
 mm/page_alloc.c                               |   2 +-
 mm/page_owner.c                               |  13 +-
 mm/swap.c                                     |   1 -
 mm/truncate.c                                 |  29 +-
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   1 +
 .../selftests/vm/split_huge_page_test.c       | 479 ++++++++++++++++++
 16 files changed, 742 insertions(+), 65 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

--
2.28.0



* [PATCH 1/7] XArray: Fix splitting to non-zero orders
From: Zi Yan @ 2020-11-19 16:05 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox, Kirill A . Shutemov
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, linux-kselftest,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Zi Yan

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Splitting an order-4 entry into order-2 entries would leave the array
containing pointers to 000040008000c000 instead of 000044448888cccc.
This is a one-character fix, but the test suite is also enhanced to check
this case.

Reported-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 lib/test_xarray.c | 26 ++++++++++++++------------
 lib/xarray.c      |  4 ++--
 2 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 8294f43f4981..8b1c318189ce 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1530,24 +1530,24 @@ static noinline void check_store_range(struct xarray *xa)
 
 #ifdef CONFIG_XARRAY_MULTI
 static void check_split_1(struct xarray *xa, unsigned long index,
-							unsigned int order)
+				unsigned int order, unsigned int new_order)
 {
-	XA_STATE(xas, xa, index);
-	void *entry;
-	unsigned int i = 0;
+	XA_STATE_ORDER(xas, xa, index, new_order);
+	unsigned int i;
 
 	xa_store_order(xa, index, order, xa, GFP_KERNEL);
 
 	xas_split_alloc(&xas, xa, order, GFP_KERNEL);
 	xas_lock(&xas);
 	xas_split(&xas, xa, order);
+	for (i = 0; i < (1 << order); i += (1 << new_order))
+		__xa_store(xa, index + i, xa_mk_index(index + i), 0);
 	xas_unlock(&xas);
 
-	xa_for_each(xa, index, entry) {
-		XA_BUG_ON(xa, entry != xa);
-		i++;
+	for (i = 0; i < (1 << order); i++) {
+		unsigned int val = index + (i & ~((1 << new_order) - 1));
+		XA_BUG_ON(xa, xa_load(xa, index + i) != xa_mk_index(val));
 	}
-	XA_BUG_ON(xa, i != 1 << order);
 
 	xa_set_mark(xa, index, XA_MARK_0);
 	XA_BUG_ON(xa, !xa_get_mark(xa, index, XA_MARK_0));
@@ -1557,14 +1557,16 @@ static void check_split_1(struct xarray *xa, unsigned long index,
 
 static noinline void check_split(struct xarray *xa)
 {
-	unsigned int order;
+	unsigned int order, new_order;
 
 	XA_BUG_ON(xa, !xa_empty(xa));
 
 	for (order = 1; order < 2 * XA_CHUNK_SHIFT; order++) {
-		check_split_1(xa, 0, order);
-		check_split_1(xa, 1UL << order, order);
-		check_split_1(xa, 3UL << order, order);
+		for (new_order = 0; new_order < order; new_order++) {
+			check_split_1(xa, 0, order, new_order);
+			check_split_1(xa, 1UL << order, order, new_order);
+			check_split_1(xa, 3UL << order, order, new_order);
+		}
 	}
 }
 #else
diff --git a/lib/xarray.c b/lib/xarray.c
index fc70e37c4c17..74915ba018c4 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1012,7 +1012,7 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
 
 	do {
 		unsigned int i;
-		void *sibling;
+		void *sibling = NULL;
 		struct xa_node *node;
 
 		node = kmem_cache_alloc(radix_tree_node_cachep, gfp);
@@ -1022,7 +1022,7 @@ void xas_split_alloc(struct xa_state *xas, void *entry, unsigned int order,
 		for (i = 0; i < XA_CHUNK_SIZE; i++) {
 			if ((i & mask) == 0) {
 				RCU_INIT_POINTER(node->slots[i], entry);
-				sibling = xa_mk_sibling(0);
+				sibling = xa_mk_sibling(i);
 			} else {
 				RCU_INIT_POINTER(node->slots[i], sibling);
 			}
-- 
2.28.0



* [PATCH 2/7] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.
From: Zi Yan @ 2020-11-19 16:06 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox, Kirill A . Shutemov
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, linux-kselftest,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Huge pages in the process with the given pid and within the given virtual
address range are split. This is used to test the split huge page function.
In addition, a testing program is added to tools/testing/selftests/vm that
utilizes the interface by splitting PMD THPs and PTE-mapped THPs.
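
As a usage sketch (pid and addresses are made up for illustration), splitting
every THP mapped in a 16 MB range of process 1234 looks like:

    echo "1234,0x700000000000,0x700001000000" > \
        /sys/kernel/debug/split_huge_pages_in_range_pid

At this point in the series the interface always splits to order-0; a later
patch extends it with an optional <to_order> field.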

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c                              |  98 ++++++
 mm/internal.h                                 |   1 +
 mm/migrate.c                                  |   2 +-
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   1 +
 .../selftests/vm/split_huge_page_test.c       | 313 ++++++++++++++++++
 6 files changed, 415 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1bf51d3f2f2d..88d8b7fce5d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include <linux/mm.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/highmem.h>
@@ -2934,10 +2935,107 @@ static int split_huge_pages_set(void *data, u64 val)
 DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
 		"%llu\n");
 
+static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
+		const char __user *buf, size_t count, loff_t *ppops)
+{
+	static DEFINE_MUTEX(mutex);
+	ssize_t ret;
+	char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+	int pid;
+	unsigned long vaddr_start, vaddr_end, addr;
+	nodemask_t task_nodes;
+	struct mm_struct *mm;
+	unsigned long total = 0, split = 0;
+
+	ret = mutex_lock_interruptible(&mutex);
+	if (ret)
+		return ret;
+
+	ret = -EFAULT;
+
+	memset(input_buf, 0, 80);
+	if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
+		goto out;
+
+	input_buf[79] = '\0';
+	ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
+	if (ret != 3) {
+		ret = -EINVAL;
+		goto out;
+	}
+	vaddr_start &= PAGE_MASK;
+	vaddr_end &= PAGE_MASK;
+
+	ret = strlen(input_buf);
+	pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
+		 pid, vaddr_start, vaddr_end);
+
+	mm = find_mm_struct(pid, &task_nodes);
+	if (IS_ERR(mm)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	mmap_read_lock(mm);
+	/*
+	 * always increase addr by PAGE_SIZE, since we could have a PTE page
+	 * table filled with PTE-mapped THPs, each of which is distinct.
+	 */
+	for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
+		struct vm_area_struct *vma = find_vma(mm, addr);
+		unsigned int follflags;
+		struct page *page;
+
+		if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+			break;
+
+		/* FOLL_DUMP to ignore special (like zero) pages */
+		follflags = FOLL_GET | FOLL_DUMP;
+		page = follow_page(vma, addr, follflags);
+
+		if (IS_ERR(page))
+			break;
+		if (!page)
+			break;
+
+		if (!is_transparent_hugepage(page))
+			continue;
+
+		total++;
+		if (!can_split_huge_page(compound_head(page), NULL))
+			continue;
+
+		if (!trylock_page(page))
+			continue;
+
+		if (!split_huge_page(page))
+			split++;
+
+		unlock_page(page);
+		put_page(page);
+	}
+	mmap_read_unlock(mm);
+	mmput(mm);
+
+	pr_debug("%lu of %lu THP split\n", split, total);
+out:
+	mutex_unlock(&mutex);
+	return ret;
+
+}
+
+static const struct file_operations split_huge_pages_in_range_pid_fops = {
+	.owner	 = THIS_MODULE,
+	.write	 = split_huge_pages_in_range_pid_write,
+	.llseek  = no_llseek,
+};
+
 static int __init split_huge_pages_debugfs(void)
 {
 	debugfs_create_file("split_huge_pages", 0200, NULL, NULL,
 			    &split_huge_pages_fops);
+	debugfs_create_file("split_huge_pages_in_range_pid", 0200, NULL, NULL,
+			    &split_huge_pages_in_range_pid_fops);
 	return 0;
 }
 late_initcall(split_huge_pages_debugfs);
diff --git a/mm/internal.h b/mm/internal.h
index fbebc3ff288c..b94b2d96e47a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -627,4 +627,5 @@ struct migration_target_control {
 
 bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
 void page_cache_free_page(struct address_space *mapping, struct page *page);
+struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 6dfc7ea08f78..8fb328f9e180 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1853,7 +1853,7 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
 	return nr_pages ? -EFAULT : 0;
 }
 
-static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
+struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
 {
 	struct task_struct *task;
 	struct mm_struct *mm;
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index c8deddc81e7a..da92ded5a27c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -23,3 +23,4 @@ write_to_hugetlbfs
 hmm-tests
 memfd_secret
 local_config.*
+split_huge_page_test
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 9ab98946fbf2..1cc4e5b76dac 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -43,6 +43,7 @@ TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
+TEST_GEN_FILES += split_huge_page_test
 
 ifeq ($(ARCH),x86_64)
 CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32)
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c
new file mode 100644
index 000000000000..cd2ced8c1261
--- /dev/null
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -0,0 +1,313 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include "numa.h"
+#include <unistd.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <malloc.h>
+#include <stdbool.h>
+
+uint64_t pagesize;
+unsigned int pageshift;
+uint64_t pmd_pagesize;
+
+#define PMD_SIZE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
+#define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid"
+#define SMAP_PATH "/proc/self/smaps"
+#define INPUT_MAX 80
+
+#define PFN_MASK     ((1UL<<55)-1)
+#define KPF_THP      (1UL<<22)
+
+int is_backed_by_thp(char *vaddr, int pagemap_file, int kpageflags_file)
+{
+	uint64_t paddr;
+	uint64_t page_flags;
+
+	if (pagemap_file) {
+		pread(pagemap_file, &paddr, sizeof(paddr),
+			((long)vaddr >> pageshift) * sizeof(paddr));
+
+		if (kpageflags_file) {
+			pread(kpageflags_file, &page_flags, sizeof(page_flags),
+				(paddr & PFN_MASK) * sizeof(page_flags));
+
+			return !!(page_flags & KPF_THP);
+		}
+	}
+	return 0;
+}
+
+
+static uint64_t read_pmd_pagesize(void)
+{
+	int fd;
+	char buf[20];
+	ssize_t num_read;
+
+	fd = open(PMD_SIZE_PATH, O_RDONLY);
+	if (fd == -1) {
+		perror("Open hpage_pmd_size failed");
+		exit(EXIT_FAILURE);
+	}
+	num_read = read(fd, buf, 19);
+	if (num_read < 1) {
+		close(fd);
+		perror("Read hpage_pmd_size failed");
+		exit(EXIT_FAILURE);
+	}
+	buf[num_read] = '\0';
+	close(fd);
+
+	return strtoul(buf, NULL, 10);
+}
+
+static int write_file(const char *path, const char *buf, size_t buflen)
+{
+	int fd;
+	ssize_t numwritten;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return 0;
+
+	numwritten = write(fd, buf, buflen - 1);
+	close(fd);
+	if (numwritten < 1)
+		return 0;
+
+	return (unsigned int) numwritten;
+}
+
+static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end)
+{
+	char input[INPUT_MAX];
+	int ret;
+
+	ret = snprintf(input, INPUT_MAX, "%d,0x%lx,0x%lx", pid, vaddr_start,
+			vaddr_end);
+	if (ret >= INPUT_MAX) {
+		printf("%s: Debugfs input is too long\n", __func__);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!write_file(SPLIT_DEBUGFS, input, ret + 1)) {
+		perror(SPLIT_DEBUGFS);
+		exit(EXIT_FAILURE);
+	}
+}
+
+#define MAX_LINE_LENGTH 500
+
+static bool check_for_pattern(FILE *fp, const char *pattern, char *buf)
+{
+	while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) {
+		if (!strncmp(buf, pattern, strlen(pattern)))
+			return true;
+	}
+	return false;
+}
+
+static uint64_t check_huge(void *addr)
+{
+	uint64_t thp = 0;
+	int ret;
+	FILE *fp;
+	char buffer[MAX_LINE_LENGTH];
+	char addr_pattern[MAX_LINE_LENGTH];
+
+	ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "%08lx-",
+		       (unsigned long) addr);
+	if (ret >= MAX_LINE_LENGTH) {
+		printf("%s: Pattern is too long\n", __func__);
+		exit(EXIT_FAILURE);
+	}
+
+
+	fp = fopen(SMAP_PATH, "r");
+	if (!fp) {
+		printf("%s: Failed to open file %s\n", __func__, SMAP_PATH);
+		exit(EXIT_FAILURE);
+	}
+	if (!check_for_pattern(fp, addr_pattern, buffer))
+		goto err_out;
+
+	/*
+	 * Fetch the AnonHugePages: in the same block and check the number of
+	 * hugepages.
+	 */
+	if (!check_for_pattern(fp, "AnonHugePages:", buffer))
+		goto err_out;
+
+	if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1) {
+		printf("Reading smap error\n");
+		exit(EXIT_FAILURE);
+	}
+
+err_out:
+	fclose(fp);
+	return thp;
+}
+
+void split_pmd_thp(void)
+{
+	char *one_page;
+	size_t len = 4 * pmd_pagesize;
+	uint64_t thp_size;
+	size_t i;
+
+	one_page = memalign(pmd_pagesize, len);
+
+	madvise(one_page, len, MADV_HUGEPAGE);
+
+	for (i = 0; i < len; i++)
+		one_page[i] = (char)i;
+
+	thp_size = check_huge(one_page);
+	if (!thp_size) {
+		printf("No THP is allocated");
+		exit(EXIT_FAILURE);
+	}
+
+	/* split all possible huge pages */
+	write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len);
+
+	for (i = 0; i < len; i++)
+		if (one_page[i] != (char)i) {
+			printf("%ld byte corrupted\n", i);
+			exit(EXIT_FAILURE);
+		}
+
+
+	thp_size = check_huge(one_page);
+	if (thp_size) {
+		printf("Still %ld kB AnonHugePages not split\n", thp_size);
+		exit(EXIT_FAILURE);
+	}
+
+	printf("Split huge pages successful\n");
+	free(one_page);
+}
+
+void split_pte_mapped_thp(void)
+{
+	char *one_page, *pte_mapped, *pte_mapped2;
+	size_t len = 4 * pmd_pagesize;
+	uint64_t thp_size;
+	size_t i;
+	const char *pagemap_template = "/proc/%d/pagemap";
+	const char *kpageflags_proc = "/proc/kpageflags";
+	char pagemap_proc[255];
+	int pagemap_fd;
+	int kpageflags_fd;
+
+	if (snprintf(pagemap_proc, 255, pagemap_template, getpid()) < 0) {
+		perror("get pagemap proc error");
+		exit(EXIT_FAILURE);
+	}
+	pagemap_fd = open(pagemap_proc, O_RDONLY);
+
+	if (pagemap_fd == -1) {
+		perror("read pagemap:");
+		exit(EXIT_FAILURE);
+	}
+
+	kpageflags_fd = open(kpageflags_proc, O_RDONLY);
+
+	if (kpageflags_fd == -1) {
+		perror("read kpageflags:");
+		exit(EXIT_FAILURE);
+	}
+
+	one_page = mmap((void *)(1UL << 30), len, PROT_READ | PROT_WRITE,
+			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+
+	madvise(one_page, len, MADV_HUGEPAGE);
+
+	for (i = 0; i < len; i++)
+		one_page[i] = (char)i;
+
+	thp_size = check_huge(one_page);
+	if (!thp_size) {
+		printf("No THP is allocated");
+		exit(EXIT_FAILURE);
+	}
+
+	pte_mapped = mremap(one_page, pagesize, pagesize, MREMAP_MAYMOVE);
+
+	for (i = 1; i < 4; i++) {
+		pte_mapped2 = mremap(one_page + pmd_pagesize * i + pagesize * i,
+				     pagesize, pagesize,
+				     MREMAP_MAYMOVE|MREMAP_FIXED,
+				     pte_mapped + pagesize * i);
+		if (pte_mapped2 == (char *)-1) {
+			perror("mremap failed");
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	/* smap does not show THPs after mremap, use kpageflags instead */
+	thp_size = 0;
+	for (i = 0; i < pagesize * 4; i++)
+		if (i % pagesize == 0 &&
+		    is_backed_by_thp(&pte_mapped[i], pagemap_fd, kpageflags_fd))
+			thp_size++;
+
+	if (thp_size != 4) {
+		printf("Some THPs are missing during mremap\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* split all possible huge pages */
+	write_debugfs(getpid(), (uint64_t)pte_mapped,
+		      (uint64_t)pte_mapped + pagesize * 4);
+
+	/* smap does not show THPs after mremap, use kpageflags instead */
+	thp_size = 0;
+	for (i = 0; i < pagesize * 4; i++) {
+		if (pte_mapped[i] != (char)i) {
+			printf("%ld byte corrupted\n", i);
+			exit(EXIT_FAILURE);
+		}
+		if (i % pagesize == 0 &&
+		    is_backed_by_thp(&pte_mapped[i], pagemap_fd, kpageflags_fd))
+			thp_size++;
+	}
+
+	if (thp_size) {
+		printf("Still %ld THPs not split\n", thp_size);
+		exit(EXIT_FAILURE);
+	}
+
+	printf("Split PTE-mapped huge pages successful\n");
+	munmap(one_page, len);
+	close(pagemap_fd);
+	close(kpageflags_fd);
+}
+
+int main(int argc, char **argv)
+{
+	if (geteuid() != 0) {
+		printf("Please run the benchmark as root\n");
+		exit(EXIT_FAILURE);
+	}
+
+	pagesize = getpagesize();
+	pageshift = ffs(pagesize) - 1;
+	pmd_pagesize = read_pmd_pagesize();
+
+	split_pmd_thp();
+	split_pte_mapped_thp();
+
+	return 0;
+}
-- 
2.28.0



* [PATCH 3/7] mm: memcg: make memcg huge page split support any order split.
From: Zi Yan @ 2020-11-19 16:06 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox, Kirill A . Shutemov
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, linux-kselftest,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

mem_cgroup_split_huge_fixup() sets memcg information for the pages after the
split. A new parameter, new_order, is added to tell it the new page order
(always 0 for now). This prepares for upcoming changes to support splitting
huge pages to any lower order.

Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Roman Gushchin <guro@fb.com>
---
 include/linux/memcontrol.h | 5 +++--
 mm/huge_memory.c           | 2 +-
 mm/memcontrol.c            | 6 +++---
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a8d5daf95988..39707feae505 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1062,7 +1062,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void mem_cgroup_split_huge_fixup(struct page *head);
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_order);
 #endif
 
 #else /* CONFIG_MEMCG */
@@ -1396,7 +1396,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	return 0;
 }
 
-static inline void mem_cgroup_split_huge_fixup(struct page *head)
+static inline void mem_cgroup_split_huge_fixup(struct page *head,
+					       unsigned int new_order)
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88d8b7fce5d7..d7ab5cac5851 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2428,7 +2428,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	lruvec = mem_cgroup_page_lruvec(head, pgdat);
 
 	/* complete memcg works before add pages to LRU */
-	mem_cgroup_split_huge_fixup(head);
+	mem_cgroup_split_huge_fixup(head, 0);
 
 	if (PageAnon(head) && PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index de5869dd354d..4521ed3a51b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3223,15 +3223,15 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
  * Because tail pages are not marked as "used", set it. We're under
  * pgdat->lru_lock and migration entries setup in all page mappings.
  */
-void mem_cgroup_split_huge_fixup(struct page *head)
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_order)
 {
 	struct mem_cgroup *memcg = page_memcg(head);
-	int i;
+	int i, new_nr = 1 << new_order;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (i = 1; i < thp_nr_pages(head); i++) {
+	for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
 		css_get(&memcg->css);
 		head[i].memcg_data = (unsigned long)memcg;
 	}
-- 
2.28.0



* [PATCH 4/7] mm: page_owner: add support for splitting to any order in split page_owner.
From: Zi Yan @ 2020-11-19 16:06 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox, Kirill A . Shutemov
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, linux-kselftest,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

This adds a new_order parameter to set the new page order in page owner, and
uses old_order instead of nr so that the parameters are consistent. It
prepares for upcoming changes to support splitting huge pages to any lower
order.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/page_owner.h | 10 ++++++----
 mm/huge_memory.c           |  3 ++-
 mm/page_alloc.c            |  2 +-
 mm/page_owner.c            | 13 +++++++------
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
index 3468794f83d2..9caaed51403c 100644
--- a/include/linux/page_owner.h
+++ b/include/linux/page_owner.h
@@ -11,7 +11,8 @@ extern struct page_ext_operations page_owner_ops;
 extern void __reset_page_owner(struct page *page, unsigned int order);
 extern void __set_page_owner(struct page *page,
 			unsigned int order, gfp_t gfp_mask);
-extern void __split_page_owner(struct page *page, unsigned int nr);
+extern void __split_page_owner(struct page *page, unsigned int old_order,
+			unsigned int new_order);
 extern void __copy_page_owner(struct page *oldpage, struct page *newpage);
 extern void __set_page_owner_migrate_reason(struct page *page, int reason);
 extern void __dump_page_owner(struct page *page);
@@ -31,10 +32,11 @@ static inline void set_page_owner(struct page *page,
 		__set_page_owner(page, order, gfp_mask);
 }
 
-static inline void split_page_owner(struct page *page, unsigned int nr)
+static inline void split_page_owner(struct page *page, unsigned int old_order,
+			unsigned int new_order)
 {
 	if (static_branch_unlikely(&page_owner_inited))
-		__split_page_owner(page, nr);
+		__split_page_owner(page, old_order, new_order);
 }
 static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
 {
@@ -60,7 +62,7 @@ static inline void set_page_owner(struct page *page,
 {
 }
 static inline void split_page_owner(struct page *page,
-			unsigned int order)
+			unsigned int old_order, unsigned int new_order)
 {
 }
 static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7ab5cac5851..aae7405a0989 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2422,6 +2422,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
+	unsigned int order = thp_order(head);
 	unsigned int nr = thp_nr_pages(head);
 	int i;
 
@@ -2458,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	ClearPageCompound(head);
 
-	split_page_owner(head, nr);
+	split_page_owner(head, order, 0);
 
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63d8d8b72c10..414f26950190 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3297,7 +3297,7 @@ void split_page(struct page *page, unsigned int order)
 
 	for (i = 1; i < (1 << order); i++)
 		set_page_refcounted(page + i);
-	split_page_owner(page, 1 << order);
+	split_page_owner(page, order, 0);
 }
 EXPORT_SYMBOL_GPL(split_page);
 
diff --git a/mm/page_owner.c b/mm/page_owner.c
index b735a8eafcdb..00a679a1230b 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -204,19 +204,20 @@ void __set_page_owner_migrate_reason(struct page *page, int reason)
 	page_owner->last_migrate_reason = reason;
 }
 
-void __split_page_owner(struct page *page, unsigned int nr)
+void __split_page_owner(struct page *page, unsigned int old_order,
+			unsigned int new_order)
 {
-	int i;
-	struct page_ext *page_ext = lookup_page_ext(page);
+	int i, old_nr = 1 << old_order, new_nr = 1 << new_order;
+	struct page_ext *page_ext = lookup_page_ext(page);
 	struct page_owner *page_owner;
 
 	if (unlikely(!page_ext))
 		return;
 
-	for (i = 0; i < nr; i++) {
+	for (i = 0; i < old_nr; i += new_nr) {
+		page_ext = lookup_page_ext(page + i);
 		page_owner = get_page_owner(page_ext);
-		page_owner->order = 0;
-		page_ext = page_ext_next(page_ext);
+		page_owner->order = new_order;
 	}
 }
 
-- 
2.28.0



* [PATCH 5/7] mm: thp: split huge page to any lower order pages.
From: Zi Yan @ 2020-11-19 16:06 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox, Kirill A . Shutemov
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, linux-kselftest,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

To split a THP to any lower order pages, we need to reform THPs on the
subpages at the given order and add page refcounts based on the new page
order. We also need to reinitialize page_deferred_list after removing the
page from the split_queue; otherwise a subsequent split will see list
corruption when checking page_deferred_list again.

This has many uses, such as minimizing the number of pages left after
truncating a pagecache THP. Anonymous THPs can still only be split to
order-0, as before, until we add support for any size anonymous THPs.
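
As an illustration only (not part of this patch), a caller that already holds
a reference on a pagecache THP could use the new helper roughly like the
sketch below; the helper name and the locking requirements come from this
patch, while the wrapper function and the choice of order-4 are hypothetical:

	/* sketch: split a pinned pagecache THP to order-4 (64 KB with 4 KB pages) */
	static int split_to_order_4(struct page *page)
	{
		int ret = -EBUSY;

		if (!trylock_page(page))
			return ret;
		/* tail pages go to the LRU list because list == NULL */
		ret = split_huge_page_to_list_to_order(page, NULL, 4);
		unlock_page(page);
		return ret;
	}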

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h |   8 +++
 mm/huge_memory.c        | 119 +++++++++++++++++++++++++++++-----------
 mm/swap.c               |   1 -
 3 files changed, 96 insertions(+), 32 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7723deda33e2..0c856f805617 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -182,6 +182,8 @@ bool is_transparent_hugepage(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order);
 static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
@@ -385,6 +387,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	return 0;
 }
+static inline int
+split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order)
+{
+	return 0;
+}
 static inline int split_huge_page(struct page *page)
 {
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aae7405a0989..cc70f70862d8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2325,12 +2325,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 
 static void unmap_page(struct page *page)
 {
-	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK |
-		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_RMAP_LOCKED;
 	bool unmap_success;
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	if (thp_order(page) >= HPAGE_PMD_ORDER)
+		ttu_flags |= TTU_SPLIT_HUGE_PMD;
+
 	if (PageAnon(page))
 		ttu_flags |= TTU_SPLIT_FREEZE;
 
@@ -2338,21 +2340,23 @@ static void unmap_page(struct page *page)
 	VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
-static void remap_page(struct page *page, unsigned int nr)
+static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
 {
-	int i;
-	if (PageTransHuge(page)) {
+	unsigned int i;
+
+	if (thp_nr_pages(page) == nr) {
 		remove_migration_ptes(page, page, true);
 	} else {
-		for (i = 0; i < nr; i++)
+		for (i = 0; i < nr; i += new_nr)
 			remove_migration_ptes(page + i, page + i, true);
 	}
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
-		struct lruvec *lruvec, struct list_head *list)
+		struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
 {
 	struct page *page_tail = head + tail;
+	unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
 
 	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
 
@@ -2376,6 +2380,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_64BIT
 			 (1L << PG_arch_2) |
 #endif
+			 compound_head_flag |
 			 (1L << PG_dirty)));
 
 	/* ->mapping in first tail page is compound_mapcount */
@@ -2384,7 +2389,10 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
 
-	/* Page flags must be visible before we make the page non-compound. */
+	/*
+	 * Page flags must be visible before we make the page non-compound or
+	 * a compound page in new_order.
+	 */
 	smp_wmb();
 
 	/*
@@ -2394,10 +2402,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * which needs correct compound_head().
 	 */
 	clear_compound_head(page_tail);
+	if (new_order) {
+		prep_compound_page(page_tail, new_order);
+		thp_prep(page_tail);
+	}
 
 	/* Finally unfreeze refcount. Additional reference from page cache. */
-	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
-					  PageSwapCache(head)));
+	page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
+					   PageSwapCache(head)) ?
+						thp_nr_pages(page_tail) : 0));
 
 	if (page_is_young(head))
 		set_page_young(page_tail);
@@ -2415,7 +2428,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+		pgoff_t end, unsigned long flags, unsigned int new_order)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2424,12 +2437,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned long offset = 0;
 	unsigned int order = thp_order(head);
 	unsigned int nr = thp_nr_pages(head);
+	unsigned int new_nr = 1 << new_order;
 	int i;
 
 	lruvec = mem_cgroup_page_lruvec(head, pgdat);
 
 	/* complete memcg works before add pages to LRU */
-	mem_cgroup_split_huge_fixup(head, 0);
+	mem_cgroup_split_huge_fixup(head, new_order);
 
 	if (PageAnon(head) && PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
@@ -2439,46 +2453,54 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	for (i = nr - 1; i >= 1; i--) {
-		__split_huge_page_tail(head, i, lruvec, list);
+	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
+		__split_huge_page_tail(head, i, lruvec, list, new_order);
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
 			__delete_from_page_cache(head + i, NULL);
 			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
-				shmem_uncharge(head->mapping->host, 1);
+				shmem_uncharge(head->mapping->host, new_nr);
 			put_page(head + i);
 		} else if (!PageAnon(page)) {
 			__xa_store(&head->mapping->i_pages, head[i].index,
 					head + i, 0);
 		} else if (swap_cache) {
+			/*
+			 * split anonymous THPs (including swapped out ones) to
+			 * non-zero order not supported
+			 */
+			VM_BUG_ON(new_order);
 			__xa_store(&swap_cache->i_pages, offset + i,
 					head + i, 0);
 		}
 	}
 
-	ClearPageCompound(head);
+	if (!new_order)
+		ClearPageCompound(head);
+	else
+		set_compound_order(head, new_order);
 
-	split_page_owner(head, order, 0);
+	split_page_owner(head, order, new_order);
 
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
 		/* Additional pin to swap cache */
 		if (PageSwapCache(head)) {
-			page_ref_add(head, 2);
+			page_ref_add(head, 1 + new_nr);
 			xa_unlock(&swap_cache->i_pages);
 		} else {
 			page_ref_inc(head);
 		}
 	} else {
 		/* Additional pin to page cache */
-		page_ref_add(head, 2);
+		page_ref_add(head, 1 + new_nr);
 		xa_unlock(&head->mapping->i_pages);
 	}
 
 	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 
-	remap_page(head, nr);
+	remap_page(head, nr, new_nr);
 
 	if (PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
@@ -2486,7 +2508,14 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		split_swap_cluster(entry);
 	}
 
-	for (i = 0; i < nr; i++) {
+	/*
+	 * set page to its compound_head when split to THPs, so that GUP pin and
+	 * PG_locked are transferred to the right after-split page
+	 */
+	if (new_order)
+		page = compound_head(page);
+
+	for (i = 0; i < nr; i += new_nr) {
 		struct page *subpage = head + i;
 		if (subpage == page)
 			continue;
@@ -2604,37 +2633,61 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
  *
+ * See split_huge_page_to_list_to_order() for more details.
+ *
+ * Returns 0 if the hugepage is split successfully.
+ * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
+ * us.
+ */
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+	return split_huge_page_to_list_to_order(page, list, 0);
+}
+
+/*
+ * This function splits huge page into pages in @new_order. @page can point to
+ * any subpage of huge page to split. Split doesn't change the position of
+ * @page.
+ *
  * Only caller must hold pin on the @page, otherwise split fails with -EBUSY.
  * The huge page must be locked.
  *
  * If @list is null, tail pages will be added to LRU list, otherwise, to @list.
  *
- * Both head page and tail pages will inherit mapping, flags, and so on from
- * the hugepage.
+ * Pages in new_order will inherit mapping, flags, and so on from the hugepage.
  *
- * GUP pin and PG_locked transferred to @page. Rest subpages can be freed if
- * they are not mapped.
+ * GUP pin and PG_locked transferred to @page or the compound page @page belongs
+ * to. Rest subpages can be freed if they are not mapped.
  *
  * Returns 0 if the hugepage is split successfully.
  * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
  * us.
  */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order)
 {
 	struct page *head = compound_head(page);
 	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
-	XA_STATE(xas, &head->mapping->i_pages, head->index);
+	/* reset xarray order to new order after split */
+	XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
 	unsigned long flags;
 	pgoff_t end;
 
+	VM_BUG_ON(thp_order(head) <= new_order);
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
 	VM_BUG_ON_PAGE(!PageLocked(head), head);
 	VM_BUG_ON_PAGE(!PageCompound(head), head);
 
+	/* Cannot split THP to order-1 (no order-1 THPs) */
+	VM_BUG_ON(new_order == 1);
+
+	/* Splitting anonymous THPs to non-zero order is not supported */
+	VM_BUG_ON(PageAnon(head) && new_order);
+
 	if (PageWriteback(head))
 		return -EBUSY;
 
@@ -2720,18 +2773,22 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
 		if (!list_empty(page_deferred_list(head))) {
 			ds_queue->split_queue_len--;
-			list_del(page_deferred_list(head));
+			list_del_init(page_deferred_list(head));
 		}
 		spin_unlock(&ds_queue->split_queue_lock);
 		if (mapping) {
 			if (PageSwapBacked(head))
 				__dec_lruvec_page_state(head, NR_SHMEM_THPS);
-			else
+			else if (!new_order)
+				/*
+				 * Decrease THP stats only if split to normal
+				 * pages
+				 */
 				__mod_lruvec_page_state(head, NR_FILE_THPS,
 						-thp_nr_pages(head));
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end, flags, new_order);
 		ret = 0;
 	} else {
 		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
@@ -2746,7 +2803,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 fail:		if (mapping)
 			xas_unlock(&xas);
 		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
-		remap_page(head, thp_nr_pages(head));
+		remap_page(head, thp_nr_pages(head), 1);
 		ret = -EBUSY;
 	}
 
diff --git a/mm/swap.c b/mm/swap.c
index b667870c8a0b..8f53ab593438 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -983,7 +983,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 		       struct lruvec *lruvec, struct list_head *list)
 {
 	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-- 
2.28.0



* [PATCH 6/7] mm: truncate: split thp to a non-zero order if possible.
From: Zi Yan @ 2020-11-19 16:06 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox, Kirill A . Shutemov
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, linux-kselftest,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

To minimize the number of pages left after a truncation, we do not need to
split a THP all the way down to order-0 when truncating it. The THP has at
most three parts: the part before the offset, the part to be truncated, and
the part remaining at the end. Use the non-zero minimum of the three sizes to
decide what order to split the THP to, as in the worked example below.
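
As an illustrative worked example (the numbers are made up): with 4 KB base
pages and a 2 MB THP, truncating the file so that only the first 52 KB of the
THP is kept gives offset = 52 KB, length = 2 MB - 52 KB and remaining = 0. The
non-zero minimum is 52 KB, so new_order = ilog2(52 KB / 4 KB) = ilog2(13) = 3,
and the THP is split into order-3 (32 KB) pages instead of 512 order-0 pages.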

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/truncate.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 20bd17538ec2..2e93d702f2c6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -237,7 +237,9 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
 bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
 {
 	loff_t pos = page_offset(page);
-	unsigned int offset, length;
+	unsigned int offset, length, remaining, min_subpage_size = PAGE_SIZE;
+	unsigned int new_order;
+
 
 	if (pos < start)
 		offset = start - pos;
@@ -248,6 +250,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
 		length = length - offset;
 	else
 		length = end + 1 - pos - offset;
+	remaining = thp_size(page) - offset - length;
 
 	wait_on_page_writeback(page);
 	if (length == thp_size(page)) {
@@ -267,7 +270,29 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
 		do_invalidatepage(page, offset, length);
 	if (!PageTransHuge(page))
 		return true;
-	return split_huge_page(page) == 0;
+
+	/*
+	 * find the non-zero minimum of offset, length, and remaining and use it
+	 * to decide the new order of the page after split
+	 */
+	if (offset && remaining)
+		min_subpage_size = min_t(unsigned int,
+					 min_t(unsigned int, offset, length),
+					 remaining);
+	else if (!offset)
+		min_subpage_size = min_t(unsigned int, length, remaining);
+	else /* remaining == 0 */
+		min_subpage_size = min_t(unsigned int, length, offset);
+
+	min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
+
+	new_order = ilog2(min_subpage_size/PAGE_SIZE);
+
+	/* order-1 THP not supported, downgrade to order-0 */
+	if (new_order == 1)
+		new_order = 0;
+
+	return split_huge_page_to_list_to_order(page, NULL, new_order) == 0;
 }
 
 /*
-- 
2.28.0



* [PATCH 7/7] mm: huge_memory: enable debugfs to split huge pages to any order.
From: Zi Yan @ 2020-11-19 16:06 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox, Kirill A . Shutemov
  Cc: Roman Gushchin, Andrew Morton, linux-kernel, linux-kselftest,
	Yang Shi, Michal Hocko, John Hubbard, Ralph Campbell,
	David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

The debugfs interface now accepts an optional <to_order> field, so it can be
used to test split_huge_page_to_list_to_order for pagecache THPs. Also add
test cases for split_huge_page_to_list_to_order exercised via debugfs,
truncating a file, and punching holes in a file.
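
For reference, a sketch of how the test might be built and run (assuming the
series is applied and a filesystem that supports pagecache THPs is mounted at
/mnt/thp_fs, which is what the test expects):

    make -C tools/testing/selftests/vm split_huge_page_test
    sudo ./tools/testing/selftests/vm/split_huge_page_test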

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c                              |  13 +-
 .../selftests/vm/split_huge_page_test.c       | 192 ++++++++++++++++--
 2 files changed, 186 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cc70f70862d8..d6ce7be65fb2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2999,7 +2999,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
 	static DEFINE_MUTEX(mutex);
 	ssize_t ret;
 	char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
-	int pid;
+	int pid, to_order = 0;
 	unsigned long vaddr_start, vaddr_end, addr;
 	nodemask_t task_nodes;
 	struct mm_struct *mm;
@@ -3016,8 +3016,9 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
 		goto out;
 
 	input_buf[79] = '\0';
-	ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);
-	if (ret != 3) {
+	ret = sscanf(input_buf, "%d,0x%lx,0x%lx,%d", &pid, &vaddr_start, &vaddr_end, &to_order);
+	/* splitting to order-1 is not supported (no order-1 THPs) */
+	if ((ret != 3 && ret != 4) || to_order == 1) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -3025,8 +3026,8 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
 	vaddr_end &= PAGE_MASK;
 
 	ret = strlen(input_buf);
-	pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
-		 pid, vaddr_start, vaddr_end);
+	pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx], to order: %d\n",
+		 pid, vaddr_start, vaddr_end, to_order);
 
 	mm = find_mm_struct(pid, &task_nodes);
 	if (IS_ERR(mm)) {
@@ -3066,7 +3067,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
 		if (!trylock_page(page))
 			continue;
 
-		if (!split_huge_page(page))
+		if (!split_huge_page_to_list_to_order(page, NULL, to_order))
 			split++;
 
 		unlock_page(page);
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c
index cd2ced8c1261..bfd35ae9cfd2 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -16,6 +16,7 @@
 #include <sys/wait.h>
 #include <malloc.h>
 #include <stdbool.h>
+#include <time.h>
 
 uint64_t pagesize;
 unsigned int pageshift;
@@ -24,6 +25,7 @@ uint64_t pmd_pagesize;
 #define PMD_SIZE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
 #define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid"
 #define SMAP_PATH "/proc/self/smaps"
+#define THP_FS_PATH "/mnt/thp_fs"
 #define INPUT_MAX 80
 
 #define PFN_MASK     ((1UL<<55)-1)
@@ -89,19 +91,20 @@ static int write_file(const char *path, const char *buf, size_t buflen)
 	return (unsigned int) numwritten;
 }
 
-static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end)
+static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end, int order)
 {
 	char input[INPUT_MAX];
 	int ret;
 
-	ret = snprintf(input, INPUT_MAX, "%d,0x%lx,0x%lx", pid, vaddr_start,
-			vaddr_end);
+	ret = snprintf(input, INPUT_MAX, "%d,0x%lx,0x%lx,%d", pid, vaddr_start,
+			vaddr_end, order);
 	if (ret >= INPUT_MAX) {
 		printf("%s: Debugfs input is too long\n", __func__);
 		exit(EXIT_FAILURE);
 	}
 
-	if (!write_file(SPLIT_DEBUGFS, input, ret + 1)) {
+	/* order == 1 is an invalid input that should be detected. */
+	if (order != 1 && !write_file(SPLIT_DEBUGFS, input, ret + 1)) {
 		perror(SPLIT_DEBUGFS);
 		exit(EXIT_FAILURE);
 	}
@@ -118,7 +121,7 @@ static bool check_for_pattern(FILE *fp, const char *pattern, char *buf)
 	return false;
 }
 
-static uint64_t check_huge(void *addr)
+static uint64_t check_huge(void *addr, const char *prefix)
 {
 	uint64_t thp = 0;
 	int ret;
@@ -143,13 +146,13 @@ static uint64_t check_huge(void *addr)
 		goto err_out;
 
 	/*
-	 * Fetch the AnonHugePages: in the same block and check the number of
+	 * Fetch the @prefix in the same block and check the number of
 	 * hugepages.
 	 */
-	if (!check_for_pattern(fp, "AnonHugePages:", buffer))
+	if (!check_for_pattern(fp, prefix, buffer))
 		goto err_out;
 
-	if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1) {
+	if (sscanf(&buffer[strlen(prefix)], "%10ld kB", &thp) != 1) {
 		printf("Reading smap error\n");
 		exit(EXIT_FAILURE);
 	}
@@ -173,14 +176,14 @@ void split_pmd_thp(void)
 	for (i = 0; i < len; i++)
 		one_page[i] = (char)i;
 
-	thp_size = check_huge(one_page);
+	thp_size = check_huge(one_page, "AnonHugePages:");
 	if (!thp_size) {
 		printf("No THP is allocatd");
 		exit(EXIT_FAILURE);
 	}
 
 	/* split all possible huge pages */
-	write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len);
+	write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len, 0);
 
 	for (i = 0; i < len; i++)
 		if (one_page[i] != (char)i) {
@@ -189,7 +192,7 @@ void split_pmd_thp(void)
 		}
 
 
-	thp_size = check_huge(one_page);
+	thp_size = check_huge(one_page, "AnonHugePages:");
 	if (thp_size) {
 		printf("Still %ld kB AnonHugePages not split\n", thp_size);
 		exit(EXIT_FAILURE);
@@ -237,7 +240,7 @@ void split_pte_mapped_thp(void)
 	for (i = 0; i < len; i++)
 		one_page[i] = (char)i;
 
-	thp_size = check_huge(one_page);
+	thp_size = check_huge(one_page, "AnonHugePages:");
 	if (!thp_size) {
 		printf("No THP is allocatd");
 		exit(EXIT_FAILURE);
@@ -270,7 +273,7 @@ void split_pte_mapped_thp(void)
 
 	/* split all possible huge pages */
 	write_debugfs(getpid(), (uint64_t)pte_mapped,
-		      (uint64_t)pte_mapped + pagesize * 4);
+		      (uint64_t)pte_mapped + pagesize * 4, 0);
 
 	/* smap does not show THPs after mremap, use kpageflags instead */
 	thp_size = 0;
@@ -295,19 +298,182 @@ void split_pte_mapped_thp(void)
 	close(kpageflags_fd);
 }
 
+void create_pagecache_thp_and_fd(size_t fd_size, int *fd, char **addr)
+{
+	const char testfile[] = THP_FS_PATH "/test";
+	size_t i;
+	int dummy;
+
+	srand(time(NULL));
+
+	*fd = open(testfile, O_CREAT | O_RDWR, 0664);
+	if (*fd == -1) {
+		perror("Failed to create a file at "THP_FS_PATH);
+		exit(EXIT_FAILURE);
+	}
+
+	for (i = 0; i < fd_size; i++) {
+		unsigned char byte = (unsigned char)i;
+
+		write(*fd, &byte, sizeof(byte));
+	}
+	close(*fd);
+	sync();
+	*fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
+	if (*fd == -1) {
+		perror("open drop_caches");
+		exit(EXIT_FAILURE);
+	}
+	if (write(*fd, "3", 1) != 1) {
+		perror("write to drop_caches");
+		exit(EXIT_FAILURE);
+	}
+	close(*fd);
+
+	*fd = open(testfile, O_RDWR);
+	if (*fd == -1) {
+		perror("Failed to open a file at "THP_FS_PATH);
+		exit(EXIT_FAILURE);
+	}
+
+	*addr = mmap(NULL, fd_size, PROT_READ|PROT_WRITE, MAP_SHARED, *fd, 0);
+	if (*addr == (char *)-1) {
+		perror("cannot mmap");
+		exit(1);
+	}
+	madvise(*addr, fd_size, MADV_HUGEPAGE);
+
+	for (size_t i = 0; i < fd_size; i++)
+		dummy += *(*addr + i);
+
+	if (!check_huge(*addr, "FilePmdMapped:")) {
+		printf("No pagecache THP generated, please mount a filesystem "
+		       "supporting pagecache THP at "THP_FS_PATH"\n");
+		exit(EXIT_FAILURE);
+	}
+}
+
+void split_thp_in_pagecache_to_order(size_t fd_size, int order)
+{
+	int fd;
+	char *addr;
+	size_t i;
+
+	create_pagecache_thp_and_fd(fd_size, &fd, &addr);
+
+	printf("split %ld kB pagecache page to order %d ... ", fd_size >> 10, order);
+	write_debugfs(getpid(), (uint64_t)addr, (uint64_t)addr + fd_size, order);
+
+	for (i = 0; i < fd_size; i++)
+		if (*(addr + i) != (char)i) {
+			printf("%lu byte corrupted in the file\n", i);
+			exit(EXIT_FAILURE);
+		}
+
+	close(fd);
+	printf("done\n");
+}
+
+void truncate_thp_in_pagecache_to_order(size_t fd_size, int order)
+{
+	int fd;
+	char *addr;
+	size_t i;
+
+	create_pagecache_thp_and_fd(fd_size, &fd, &addr);
+
+	printf("truncate %ld kB pagecache page to size %lu kB ... ", fd_size >> 10, 4UL << order);
+	ftruncate(fd, pagesize << order);
+
+	for (i = 0; i < (pagesize << order); i++)
+		if (*(addr + i) != (char)i) {
+			printf("%lu byte corrupted in the file\n", i);
+			exit(EXIT_FAILURE);
+		}
+
+	close(fd);
+	printf("done\n");
+}
+
+void punch_hole_in_pagecache_thp(size_t fd_size, off_t offset[], off_t len[], int n)
+{
+	int fd, j;
+	char *addr;
+	size_t i;
+
+	create_pagecache_thp_and_fd(fd_size, &fd, &addr);
+
+	for (j = 0; j < n; j++) {
+		printf("addr: %lx, punch a hole at offset %ld kB with len %ld kB ... ",
+			addr, offset[j] >> 10, len[j] >> 10);
+		fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, offset[j], len[j]);
+		printf("done\n");
+	}
+
+	for (i = 0; i < fd_size; i++) {
+		int in_hole = 0;
+
+		for (j = 0; j < n; j++)
+			if (i >= offset[j] && i <= (offset[j] + len[j])) {
+				in_hole = 1;
+				break;
+			}
+
+		if (in_hole)
+			continue;
+		if (*(addr + i) != (char)i) {
+			printf("%lu byte corrupted in the file\n", i);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	close(fd);
+
+}
+
 int main(int argc, char **argv)
 {
+	int i;
+	size_t fd_size;
+	off_t offset[2], len[2];
+
 	if (geteuid() != 0) {
 		printf("Please run the benchmark as root\n");
 		exit(EXIT_FAILURE);
 	}
 
+	setbuf(stdout, NULL);
+
 	pagesize = getpagesize();
 	pageshift = ffs(pagesize) - 1;
 	pmd_pagesize = read_pmd_pagesize();
+	fd_size = 2 * pmd_pagesize;
 
 	split_pmd_thp();
 	split_pte_mapped_thp();
 
+	for (i = 8; i >= 0; i--)
+		split_thp_in_pagecache_to_order(fd_size, i);
+
+	/*
+	 * when i is 1, the truncate code in the kernel should create order-0
+	 * pages instead of order-1 THPs, since order-1 THPs are not supported.
+	 * No error is expected.
+	 */
+	for (i = 8; i >= 0; i--)
+		truncate_thp_in_pagecache_to_order(fd_size, i);
+
+	offset[0] = 123 * pagesize;
+	offset[1] = 4 * pagesize;
+	len[0] = 200 * pagesize;
+	len[1] = 16 * pagesize;
+	punch_hole_in_pagecache_thp(fd_size, offset, len, 2);
+
+	offset[0] = 259 * pagesize;
+	offset[1] = 33 * pagesize;
+	len[0] = 129 * pagesize;
+	len[1] = 16 * pagesize;
+	punch_hole_in_pagecache_thp(fd_size, offset, len, 2);
+
 	return 0;
 }
-- 
2.28.0


