linux-mm.kvack.org archive mirror
* [RFC PATCH 0/6] Split huge pages to any lower order pages.
@ 2020-11-11 20:40 Zi Yan
  2020-11-11 20:40 ` [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range Zi Yan
                   ` (5 more replies)
  0 siblings, 6 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-11 20:40 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Hi all,

With Matthew's THP in pagecache patches[1], we will be able to handle
pagecache THPs of any size, but currently split_huge_page can only split
a THP into order-0 pages. This can easily erase the benefit of having
pagecache THPs when operations like truncate want to keep pages larger
than order-0. In response, here are patches that add support for
splitting a THP into pages of any lower order. In addition, this
patchset prepares for my PUD THP patchset[2], since splitting a PUD THP
into multiple PMD THPs can be handled by the
split_huge_page_to_list_to_order function added by this patchset, which
avoids a lot of redundant code that would otherwise come from
replicating split_huge_page for PUD THPs.
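
As a rough illustration (not a complete example; "page" is assumed to be
a locked pagecache THP and error handling is omitted), a kernel caller
could ask for a specific lower order like this:

	int err = -EBUSY;

	/* split a locked pagecache THP into order-4 (64 KB) pieces */
	if (can_split_huge_page(page, NULL))
		err = split_huge_page_to_list_to_order(page, NULL, 4);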

The patchset is on top of Matthew's pagecache/next tree[3].

To ease testing of the split_huge_page functions, I added a new debugfs
interface at <debugfs>/split_huge_pages_in_range_pid, so developers can
split THPs in a given virtual address range of the process with the
given pid by writing "<pid>,<vaddr_start>,<vaddr_end>,<to_order>" to the
interface. I also added a new test program that tests 1) splitting PMD
THPs, 2) splitting pagecache THPs to any lower order, and 3) truncating
a pagecache THP to a page of a lower order.
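
For example, a process can ask to split all THPs in one of its own
mappings down to order 4 with something like the sketch below (addr and
len stand for the mapping to split, error checking is omitted; this is
essentially what the new selftest's write_debugfs() helper does):

	char cmd[80];
	int fd = open("/sys/kernel/debug/split_huge_pages_in_range_pid",
		      O_WRONLY);

	snprintf(cmd, sizeof(cmd), "%d,%lx,%lx,%d", getpid(),
		 (unsigned long)addr, (unsigned long)addr + len, 4);
	write(fd, cmd, strlen(cmd));
	close(fd);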

Suggestions and comments are welcome. Thanks.


[1] https://lore.kernel.org/linux-mm/20201029193405.29125-1-willy@infradead.org/
[2] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi.yan@sent.com/
[3] https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/heads/next

Zi Yan (6):
  mm: huge_memory: add new debugfs interface to trigger split huge page
    on any page range.
  mm: memcg: make memcg huge page split support any order split.
  mm: page_owner: add support for splitting to any order in split
    page_owner.
  mm: thp: add support for split huge page to any lower order pages.
  mm: truncate: split thp to a non-zero order if possible.
  mm: huge_memory: enable debugfs to split huge pages to any order.

 include/linux/huge_mm.h                       |   8 +
 include/linux/memcontrol.h                    |   5 +-
 include/linux/page_owner.h                    |   7 +-
 mm/huge_memory.c                              | 177 ++++++++++--
 mm/internal.h                                 |   1 +
 mm/memcontrol.c                               |   4 +-
 mm/migrate.c                                  |   2 +-
 mm/page_alloc.c                               |   2 +-
 mm/page_owner.c                               |   6 +-
 mm/swap.c                                     |   1 -
 mm/truncate.c                                 |  22 +-
 tools/testing/selftests/vm/Makefile           |   1 +
 .../selftests/vm/split_huge_page_test.c       | 255 ++++++++++++++++++
 13 files changed, 453 insertions(+), 38 deletions(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

--
2.28.0



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.
  2020-11-11 20:40 [RFC PATCH 0/6] Split huge pages to any lower order pages Zi Yan
@ 2020-11-11 20:40 ` Zi Yan
  2020-11-12 22:22   ` Ralph Campbell
  2020-11-16 16:06   ` Kirill A. Shutemov
  2020-11-11 20:40 ` [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split Zi Yan
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-11 20:40 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Huge pages in the process with the given pid and within the given
virtual address range are split. The interface is used to test the
split_huge_page function. In addition, a test program is added to
tools/testing/selftests/vm that exercises the interface by splitting
PMD THPs.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c                              |  98 +++++++++++
 mm/internal.h                                 |   1 +
 mm/migrate.c                                  |   2 +-
 tools/testing/selftests/vm/Makefile           |   1 +
 .../selftests/vm/split_huge_page_test.c       | 161 ++++++++++++++++++
 5 files changed, 262 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 207ebca8c654..c4fead5ead31 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -7,6 +7,7 @@
 
 #include <linux/mm.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/sched/coredump.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/highmem.h>
@@ -2935,10 +2936,107 @@ static int split_huge_pages_set(void *data, u64 val)
 DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
 		"%llu\n");
 
+static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
+		const char __user *buf, size_t count, loff_t *ppos)
+{
+	static DEFINE_MUTEX(mutex);
+	ssize_t ret;
+	char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
+	int pid;
+	unsigned long vaddr_start, vaddr_end, addr;
+	nodemask_t task_nodes;
+	struct mm_struct *mm;
+
+	ret = mutex_lock_interruptible(&mutex);
+	if (ret)
+		return ret;
+
+	ret = -EFAULT;
+
+	memset(input_buf, 0, 80);
+	if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
+		goto out;
+
+	input_buf[79] = '\0';
+	ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);
+	if (ret != 3) {
+		ret = -EINVAL;
+		goto out;
+	}
+	vaddr_start &= PAGE_MASK;
+	vaddr_end &= PAGE_MASK;
+
+	ret = strlen(input_buf);
+	pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
+		 pid, vaddr_start, vaddr_end);
+
+	mm = find_mm_struct(pid, &task_nodes);
+	if (IS_ERR(mm)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	mmap_read_lock(mm);
+	for (addr = vaddr_start; addr < vaddr_end;) {
+		struct vm_area_struct *vma = find_vma(mm, addr);
+		unsigned int follflags;
+		struct page *page;
+
+		if (!vma || addr < vma->vm_start || !vma_migratable(vma))
+			break;
+
+		/* FOLL_DUMP to ignore special (like zero) pages */
+		follflags = FOLL_GET | FOLL_DUMP;
+		page = follow_page(vma, addr, follflags);
+
+		if (IS_ERR(page))
+			break;
+		if (!page)
+			break;
+
+		if (!is_transparent_hugepage(page))
+			goto next;
+
+		if (!can_split_huge_page(page, NULL))
+			goto next;
+
+		if (!trylock_page(page))
+			goto next;
+
+		addr += page_size(page) - PAGE_SIZE;
+
+		/* reset addr if split fails */
+		if (split_huge_page(page))
+			addr -= (page_size(page) - PAGE_SIZE);
+
+		unlock_page(page);
+next:
+		/* next page */
+		addr += page_size(page);
+		put_page(page);
+	}
+	mmap_read_unlock(mm);
+
+
+	mmput(mm);
+out:
+	mutex_unlock(&mutex);
+	return ret;
+
+}
+
+static const struct file_operations split_huge_pages_in_range_pid_fops = {
+	.owner	 = THIS_MODULE,
+	.write	 = split_huge_pages_in_range_pid_write,
+	.llseek  = no_llseek,
+};
+
 static int __init split_huge_pages_debugfs(void)
 {
 	debugfs_create_file("split_huge_pages", 0200, NULL, NULL,
 			    &split_huge_pages_fops);
+	debugfs_create_file("split_huge_pages_in_range_pid", 0200, NULL, NULL,
+			    &split_huge_pages_in_range_pid_fops);
 	return 0;
 }
 late_initcall(split_huge_pages_debugfs);
diff --git a/mm/internal.h b/mm/internal.h
index 3ea43642b99d..fd841a38830f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -624,4 +624,5 @@ struct migration_target_control {
 
 bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
 void page_cache_free_page(struct address_space *mapping, struct page *page);
+struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index a50bbb0e029b..e35654d1087d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1851,7 +1851,7 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
 	return nr_pages ? -EFAULT : 0;
 }
 
-static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
+struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
 {
 	struct task_struct *task;
 	struct mm_struct *mm;
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 62fb15f286ee..d9ead0cdd3e9 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -42,6 +42,7 @@ TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
+TEST_GEN_FILES += split_huge_page_test
 
 ifeq ($(ARCH),x86_64)
 CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32)
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c
new file mode 100644
index 000000000000..c8a32ae9e13a
--- /dev/null
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include "numa.h"
+#include <unistd.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <malloc.h>
+#include <stdbool.h>
+
+#define PAGE_4KB (4096UL)
+#define PAGE_2MB (512UL*PAGE_4KB)
+#define PAGE_1GB (512UL*PAGE_2MB)
+
+#define PRESENT_MASK (1UL<<63)
+#define SWAPPED_MASK (1UL<<62)
+#define PAGE_TYPE_MASK (1UL<<61)
+#define PFN_MASK     ((1UL<<55)-1)
+
+#define KPF_THP      (1UL<<22)
+#define KPF_PUD_THP      (1UL<<27)
+
+#define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid"
+#define SMAP_PATH "/proc/self/smaps"
+#define INPUT_MAX 80
+
+static int write_file(const char *path, const char *buf, size_t buflen)
+{
+	int fd;
+	ssize_t numwritten;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return 0;
+
+	numwritten = write(fd, buf, buflen - 1);
+	close(fd);
+	if (numwritten < 1)
+		return 0;
+
+	return (unsigned int) numwritten;
+}
+
+static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end)
+{
+	char input[INPUT_MAX];
+	int ret;
+
+	ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx", pid, vaddr_start,
+			vaddr_end);
+	if (ret >= INPUT_MAX) {
+		printf("%s: Debugfs input is too long\n", __func__);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!write_file(SPLIT_DEBUGFS, input, ret + 1)) {
+		perror(SPLIT_DEBUGFS);
+		exit(EXIT_FAILURE);
+	}
+}
+
+#define MAX_LINE_LENGTH 500
+
+static bool check_for_pattern(FILE *fp, char *pattern, char *buf)
+{
+	while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) {
+		if (!strncmp(buf, pattern, strlen(pattern)))
+			return true;
+	}
+	return false;
+}
+
+static uint64_t check_huge(void *addr)
+{
+	uint64_t thp = 0;
+	int ret;
+	FILE *fp;
+	char buffer[MAX_LINE_LENGTH];
+	char addr_pattern[MAX_LINE_LENGTH];
+
+	ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "%08lx-",
+		       (unsigned long) addr);
+	if (ret >= MAX_LINE_LENGTH) {
+		printf("%s: Pattern is too long\n", __func__);
+		exit(EXIT_FAILURE);
+	}
+
+
+	fp = fopen(SMAP_PATH, "r");
+	if (!fp) {
+		printf("%s: Failed to open file %s\n", __func__, SMAP_PATH);
+		exit(EXIT_FAILURE);
+	}
+	if (!check_for_pattern(fp, addr_pattern, buffer))
+		goto err_out;
+
+	/*
+	 * Fetch the AnonHugePages: in the same block and check the number of
+	 * hugepages.
+	 */
+	if (!check_for_pattern(fp, "AnonHugePages:", buffer))
+		goto err_out;
+
+	if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1) {
+		printf("Reading smap error\n");
+		exit(EXIT_FAILURE);
+	}
+
+err_out:
+	fclose(fp);
+	return thp;
+}
+
+void split_pmd_thp(void)
+{
+	char *one_page;
+	size_t len = 4 * PAGE_2MB;
+	uint64_t thp_size;
+
+	one_page = memalign(PAGE_1GB, len);
+
+	madvise(one_page, len, MADV_HUGEPAGE);
+
+	memset(one_page, 1, len);
+
+	thp_size = check_huge(one_page);
+	if (!thp_size) {
+		printf("No THP is allocated\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* split all possible huge pages */
+	write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len);
+
+	*one_page = 0;
+
+	thp_size = check_huge(one_page);
+	if (thp_size) {
+		printf("Still %ld kB AnonHugePages not split\n", thp_size);
+		exit(EXIT_FAILURE);
+	}
+
+	printf("Split huge pages successfully\n");
+	free(one_page);
+}
+
+int main(int argc, char **argv)
+{
+	split_pmd_thp();
+
+	return 0;
+}
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.
  2020-11-11 20:40 [RFC PATCH 0/6] Split huge pages to any lower order pages Zi Yan
  2020-11-11 20:40 ` [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range Zi Yan
@ 2020-11-11 20:40 ` Zi Yan
  2020-11-12 17:58   ` Ralph Campbell
  2020-11-14  0:23   ` Roman Gushchin
  2020-11-11 20:40 ` [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner Zi Yan
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-11 20:40 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

It reads thp_nr_pages and splits to provided new_nr. It prepares for
upcoming changes to support split huge page to any lower order.
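
For example, when an order-9 THP is split into order-4 pages, new_nr is
16 and the fixup loop sets memcg_data on head[16], head[32], ..., i.e.
on the head page of each new order-4 compound page, taking one css
reference per new page instead of one per order-0 subpage.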

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/memcontrol.h | 5 +++--
 mm/huge_memory.c           | 2 +-
 mm/memcontrol.c            | 4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0f4dd7829fb2..b3bac79ceed6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-void mem_cgroup_split_huge_fixup(struct page *head);
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr);
 #endif
 
 #else /* CONFIG_MEMCG */
@@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	return 0;
 }
 
-static inline void mem_cgroup_split_huge_fixup(struct page *head)
+static inline void mem_cgroup_split_huge_fixup(struct page *head,
+					       unsigned int new_nr)
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c4fead5ead31..f599f5b9bf7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	lruvec = mem_cgroup_page_lruvec(head, pgdat);
 
 	/* complete memcg works before add pages to LRU */
-	mem_cgroup_split_huge_fixup(head);
+	mem_cgroup_split_huge_fixup(head, 1);
 
 	if (PageAnon(head) && PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 33f632689cee..e9705ba6bbcc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
  * Because tail pages are not marked as "used", set it. We're under
  * pgdat->lru_lock and migration entries setup in all page mappings.
  */
-void mem_cgroup_split_huge_fixup(struct page *head)
+void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
 {
 	struct mem_cgroup *memcg = page_memcg(head);
 	int i;
@@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 	if (mem_cgroup_disabled())
 		return;
 
-	for (i = 1; i < thp_nr_pages(head); i++) {
+	for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
 		css_get(&memcg->css);
 		head[i].memcg_data = (unsigned long)memcg;
 	}
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-11 20:40 [RFC PATCH 0/6] Split huge pages to any lower order pages Zi Yan
  2020-11-11 20:40 ` [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range Zi Yan
  2020-11-11 20:40 ` [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split Zi Yan
@ 2020-11-11 20:40 ` Zi Yan
  2020-11-12 17:57   ` Ralph Campbell
                     ` (3 more replies)
  2020-11-11 20:40 ` [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages Zi Yan
                   ` (2 subsequent siblings)
  5 siblings, 4 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-11 20:40 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

It adds a new_order parameter to set new page order in page owner.
It prepares for upcoming changes to support split huge page to any lower
order.
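
For example, when an order-9 THP is split into order-4 pages, the
intended call is split_page_owner(head, 512, 4), so that page owner
afterwards reports 32 order-4 pages instead of 512 order-0 pages.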

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/page_owner.h | 7 ++++---
 mm/huge_memory.c           | 2 +-
 mm/page_alloc.c            | 2 +-
 mm/page_owner.c            | 6 +++---
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
index 3468794f83d2..215cbb159568 100644
--- a/include/linux/page_owner.h
+++ b/include/linux/page_owner.h
@@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
 		__set_page_owner(page, order, gfp_mask);
 }
 
-static inline void split_page_owner(struct page *page, unsigned int nr)
+static inline void split_page_owner(struct page *page, unsigned int nr,
+			unsigned int new_order)
 {
 	if (static_branch_unlikely(&page_owner_inited))
-		__split_page_owner(page, nr);
+		__split_page_owner(page, nr, new_order);
 }
 static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
 {
@@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
 {
 }
 static inline void split_page_owner(struct page *page,
-			unsigned int order)
+			unsigned int nr, unsigned int new_order)
 {
 }
 static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f599f5b9bf7f..8b7d771ee962 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	ClearPageCompound(head);
 
-	split_page_owner(head, nr);
+	split_page_owner(head, nr, 1);
 
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d77220615fd5..a9eead0e091a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
 
 	for (i = 1; i < (1 << order); i++)
 		set_page_refcounted(page + i);
-	split_page_owner(page, 1 << order);
+	split_page_owner(page, 1 << order, 1);
 }
 EXPORT_SYMBOL_GPL(split_page);
 
diff --git a/mm/page_owner.c b/mm/page_owner.c
index b735a8eafcdb..2b7f7e9056dc 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason)
 	page_owner->last_migrate_reason = reason;
 }
 
-void __split_page_owner(struct page *page, unsigned int nr)
+void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order)
 {
 	int i;
 	struct page_ext *page_ext = lookup_page_ext(page);
@@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr)
 	if (unlikely(!page_ext))
 		return;
 
-	for (i = 0; i < nr; i++) {
+	for (i = 0; i < nr; i += (1 << new_order)) {
 		page_owner = get_page_owner(page_ext);
-		page_owner->order = 0;
+		page_owner->order = new_order;
 		page_ext = page_ext_next(page_ext);
 	}
 }
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.
  2020-11-11 20:40 [RFC PATCH 0/6] Split huge pages to any lower order pages Zi Yan
                   ` (2 preceding siblings ...)
  2020-11-11 20:40 ` [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner Zi Yan
@ 2020-11-11 20:40 ` Zi Yan
  2020-11-12 22:01   ` Ralph Campbell
  2020-11-14  0:52   ` Roman Gushchin
  2020-11-11 20:40 ` [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible Zi Yan
  2020-11-11 20:40 ` [RFC PATCH 6/6] mm: huge_memory: enable debugfs to split huge pages to any order Zi Yan
  5 siblings, 2 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-11 20:40 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

To split a THP to any lower order pages, we need to reform THPs on
subpages at given order and add page refcount based on the new page
order. Also we need to reinitialize page_deferred_list after removing
the page from the split_queue, otherwise a subsequent split will see
list corruption when checking the page_deferred_list again.

It has many uses, like minimizing the number of pages after
truncating a pagecache THP. For anonymous THPs, we can only split them
to order-0 like before until we add support for any size anonymous THPs.
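
For example, splitting an order-9 pagecache THP into order-4 pages
produces 32 order-4 compound pages: each new tail head is reformed with
prep_compound_page(page_tail, 4), the original head has its compound
order lowered via set_compound_order(head, 4), and each tail's refcount
is unfrozen to 1 + 16 (the extra 16 being page cache references for its
subpages) instead of the 1 + 1 an order-0 split uses.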

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/huge_mm.h |  8 +++++
 mm/huge_memory.c        | 78 +++++++++++++++++++++++++++++------------
 mm/swap.c               |  1 -
 3 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 60a907a19f7d..9819cd9b4619 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order);
 static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
@@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	return 0;
 }
+static inline int
+split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order)
+{
+	return 0;
+}
 static inline int split_huge_page(struct page *page)
 {
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8b7d771ee962..88f50da40c9b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 static void unmap_page(struct page *page)
 {
 	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
-		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
+		TTU_RMAP_LOCKED;
 	bool unmap_success;
 
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 
+	if (thp_order(page) >= HPAGE_PMD_ORDER)
+		ttu_flags |= TTU_SPLIT_HUGE_PMD;
+
 	if (PageAnon(page))
 		ttu_flags |= TTU_SPLIT_FREEZE;
 
@@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
 	VM_BUG_ON_PAGE(!unmap_success, page);
 }
 
-static void remap_page(struct page *page, unsigned int nr)
+static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
 {
 	int i;
-	if (PageTransHuge(page)) {
+	if (thp_nr_pages(page) == nr) {
 		remove_migration_ptes(page, page, true);
 	} else {
-		for (i = 0; i < nr; i++)
+		for (i = 0; i < nr; i += new_nr)
 			remove_migration_ptes(page + i, page + i, true);
 	}
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
-		struct lruvec *lruvec, struct list_head *list)
+		struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
 {
 	struct page *page_tail = head + tail;
+	unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
 
 	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
 
@@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 #ifdef CONFIG_64BIT
 			 (1L << PG_arch_2) |
 #endif
+			 compound_head_flag |
 			 (1L << PG_dirty)));
 
 	/* ->mapping in first tail page is compound_mapcount */
@@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * which needs correct compound_head().
 	 */
 	clear_compound_head(page_tail);
+	if (new_order) {
+		prep_compound_page(page_tail, new_order);
+		thp_prep(page_tail);
+	}
 
 	/* Finally unfreeze refcount. Additional reference from page cache. */
-	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
-					  PageSwapCache(head)));
+	page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
+					   PageSwapCache(head)) ?
+						thp_nr_pages(page_tail) : 0));
 
 	if (page_is_young(head))
 		set_page_young(page_tail);
@@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+		pgoff_t end, unsigned long flags, unsigned int new_order)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	struct address_space *swap_cache = NULL;
 	unsigned long offset = 0;
 	unsigned int nr = thp_nr_pages(head);
+	unsigned int new_nr = 1 << new_order;
 	int i;
 
 	lruvec = mem_cgroup_page_lruvec(head, pgdat);
 
 	/* complete memcg works before add pages to LRU */
-	mem_cgroup_split_huge_fixup(head, 1);
+	mem_cgroup_split_huge_fixup(head, new_nr);
 
 	if (PageAnon(head) && PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
@@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		xa_lock(&swap_cache->i_pages);
 	}
 
-	for (i = nr - 1; i >= 1; i--) {
-		__split_huge_page_tail(head, i, lruvec, list);
+	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
+		__split_huge_page_tail(head, i, lruvec, list, new_order);
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
 			__delete_from_page_cache(head + i, NULL);
 			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
-				shmem_uncharge(head->mapping->host, 1);
+				shmem_uncharge(head->mapping->host, new_nr);
 			put_page(head + i);
 		} else if (!PageAnon(page)) {
 			__xa_store(&head->mapping->i_pages, head[i].index,
@@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		}
 	}
 
-	ClearPageCompound(head);
+	if (!new_order)
+		ClearPageCompound(head);
+	else
+		set_compound_order(head, new_order);
 
-	split_page_owner(head, nr, 1);
+	split_page_owner(head, nr, new_nr);
 
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
 		/* Additional pin to swap cache */
 		if (PageSwapCache(head)) {
-			page_ref_add(head, 2);
+			page_ref_add(head, 1 + new_nr);
 			xa_unlock(&swap_cache->i_pages);
 		} else {
 			page_ref_inc(head);
 		}
 	} else {
 		/* Additional pin to page cache */
-		page_ref_add(head, 2);
+		page_ref_add(head, 1 + new_nr);
 		xa_unlock(&head->mapping->i_pages);
 	}
 
 	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 
-	remap_page(head, nr);
+	remap_page(head, nr, new_nr);
 
 	if (PageSwapCache(head)) {
 		swp_entry_t entry = { .val = page_private(head) };
@@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		split_swap_cluster(entry);
 	}
 
-	for (i = 0; i < nr; i++) {
+	for (i = 0; i < nr; i += new_nr) {
 		struct page *subpage = head + i;
 		if (subpage == page)
 			continue;
@@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
  * us.
  */
 int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+	return split_huge_page_to_list_to_order(page, list, 0);
+}
+
+int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order)
 {
 	struct page *head = compound_head(page);
 	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
 	struct deferred_split *ds_queue = get_deferred_split_queue(head);
-	XA_STATE(xas, &head->mapping->i_pages, head->index);
+	/* reset xarray order to new order after split */
+	XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int count, mapcount, extra_pins, ret;
 	unsigned long flags;
 	pgoff_t end;
 
+	VM_BUG_ON(thp_order(head) <= new_order);
 	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
 	VM_BUG_ON_PAGE(!PageLocked(head), head);
 	VM_BUG_ON_PAGE(!PageCompound(head), head);
 
+	if (new_order == 1) {
+		WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
+		return -EINVAL;
+	}
+
+	if (PageAnon(head) && new_order) {
+		WARN_ONCE(1, "Split anonymous THP to non-zero order not supported");
+		return -EINVAL;
+	}
+
 	if (PageWriteback(head))
 		return -EBUSY;
 
@@ -2720,18 +2752,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
 		if (!list_empty(page_deferred_list(head))) {
 			ds_queue->split_queue_len--;
-			list_del(page_deferred_list(head));
+			list_del_init(page_deferred_list(head));
 		}
 		spin_unlock(&ds_queue->split_queue_lock);
 		if (mapping) {
 			if (PageSwapBacked(head))
 				__dec_lruvec_page_state(head, NR_SHMEM_THPS);
-			else
+			else if (!new_order)
 				__mod_lruvec_page_state(head, NR_FILE_THPS,
 						-thp_nr_pages(head));
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end, flags, new_order);
 		ret = 0;
 	} else {
 		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
@@ -2746,7 +2778,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 fail:		if (mapping)
 			xas_unlock(&xas);
 		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
-		remap_page(head, thp_nr_pages(head));
+		remap_page(head, thp_nr_pages(head), 0);
 		ret = -EBUSY;
 	}
 
diff --git a/mm/swap.c b/mm/swap.c
index 14c3bac607a6..6c33e6165597 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -983,7 +983,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
 		       struct lruvec *lruvec, struct list_head *list)
 {
 	VM_BUG_ON_PAGE(!PageHead(page), page);
-	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
 	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
 	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
 
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible.
  2020-11-11 20:40 [RFC PATCH 0/6] Split huge pages to any lower order pages Zi Yan
                   ` (3 preceding siblings ...)
  2020-11-11 20:40 ` [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages Zi Yan
@ 2020-11-11 20:40 ` Zi Yan
  2020-11-12 22:08   ` Ralph Campbell
  2020-11-11 20:40 ` [RFC PATCH 6/6] mm: huge_memory: enable debugfs to split huge pages to any order Zi Yan
  5 siblings, 1 reply; 36+ messages in thread
From: Zi Yan @ 2020-11-11 20:40 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

When truncating a THP, we do not need to split it all the way down to
order-0 to minimize the number of resulting pages. The THP has at most
three parts: the part before the truncated range, the part to be
truncated, and the part left at the end. Use the non-zero minimum of
their sizes to decide what order to split the THP to.
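
For example, when a 2 MB THP is truncated so that only its first 64 KB
remain, offset = 64 KB, length = 1984 KB and left = 0, so the non-zero
minimum is 64 KB and the THP is split into order-4 (64 KB) pages rather
than 512 order-0 pages.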

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/truncate.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 20bd17538ec2..6d8e3c6115bc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
 bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
 {
 	loff_t pos = page_offset(page);
-	unsigned int offset, length;
+	unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;
 
 	if (pos < start)
 		offset = start - pos;
@@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
 		length = length - offset;
 	else
 		length = end + 1 - pos - offset;
+	left = thp_size(page) - offset - length;
 
 	wait_on_page_writeback(page);
 	if (length == thp_size(page)) {
@@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
 		do_invalidatepage(page, offset, length);
 	if (!PageTransHuge(page))
 		return true;
-	return split_huge_page(page) == 0;
+
+	/*
+	 * find the non-zero minimum of offset, length, and left and use it to
+	 * decide the new order of the page after split
+	 */
+	if (offset && left)
+		min_subpage_size = min_t(unsigned int,
+					 min_t(unsigned int, offset, length),
+					 left);
+	else if (!offset)
+		min_subpage_size = min_t(unsigned int, length, left);
+	else /* !left */
+		min_subpage_size = min_t(unsigned int, length, offset);
+
+	min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
+
+	return split_huge_page_to_list_to_order(page, NULL,
+				ilog2(min_subpage_size/PAGE_SIZE)) == 0;
 }
 
 /*
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [RFC PATCH 6/6] mm: huge_memory: enable debugfs to split huge pages to any order.
  2020-11-11 20:40 [RFC PATCH 0/6] Split huge pages to any lower order pages Zi Yan
                   ` (4 preceding siblings ...)
  2020-11-11 20:40 ` [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible Zi Yan
@ 2020-11-11 20:40 ` Zi Yan
  5 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-11 20:40 UTC (permalink / raw)
  To: linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans, Zi Yan

From: Zi Yan <ziy@nvidia.com>

Extend the debugfs interface with an optional <to_order> field so it can
be used to test split_huge_page_to_list_to_order for pagecache THPs.
Also add test cases that exercise split_huge_page_to_list_to_order via
both debugfs and file truncation.
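
For example, truncate_thp_in_pagecache_to_order(4) creates a 4 MB test
file backed by pagecache THPs (assuming a THP-capable filesystem is
mounted at /mnt/thp_fs, the path the test expects), ftruncate()s it to
64 KB, and then writes to the remaining bytes, exercising the order-4
split path added in the previous patch.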

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c                              |  13 +--
 .../selftests/vm/split_huge_page_test.c       | 102 +++++++++++++++++-
 2 files changed, 105 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88f50da40c9b..b7470607a08b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2974,7 +2974,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
 	static DEFINE_MUTEX(mutex);
 	ssize_t ret;
 	char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
-	int pid;
+	int pid, to_order = 0;
 	unsigned long vaddr_start, vaddr_end, addr;
 	nodemask_t task_nodes;
 	struct mm_struct *mm;
@@ -2990,8 +2990,9 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
 		goto out;
 
 	input_buf[79] = '\0';
-	ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);
-	if (ret != 3) {
+	ret = sscanf(input_buf, "%d,%lx,%lx,%d", &pid, &vaddr_start, &vaddr_end, &to_order);
+	/* order-1 THPs do not exist, so refuse to split to order-1 */
+	if ((ret != 3 && ret != 4) || to_order == 1) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -2999,8 +3000,8 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
 	vaddr_end &= PAGE_MASK;
 
 	ret = strlen(input_buf);
-	pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
-		 pid, vaddr_start, vaddr_end);
+	pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx], to order: %d\n",
+		 pid, vaddr_start, vaddr_end, to_order);
 
 	mm = find_mm_struct(pid, &task_nodes);
 	if (IS_ERR(mm)) {
@@ -3038,7 +3039,7 @@ static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
 		addr += page_size(page) - PAGE_SIZE;
 
 		/* reset addr if split fails */
-		if (split_huge_page(page))
+		if (split_huge_page_to_list_to_order(page, NULL, to_order))
 			addr -= (page_size(page) - PAGE_SIZE);
 
 		unlock_page(page);
diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c
index c8a32ae9e13a..bcbc5a9d327c 100644
--- a/tools/testing/selftests/vm/split_huge_page_test.c
+++ b/tools/testing/selftests/vm/split_huge_page_test.c
@@ -16,6 +16,7 @@
 #include <sys/wait.h>
 #include <malloc.h>
 #include <stdbool.h>
+#include <time.h>
 
 #define PAGE_4KB (4096UL)
 #define PAGE_2MB (512UL*PAGE_4KB)
@@ -31,6 +32,7 @@
 
 #define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid"
 #define SMAP_PATH "/proc/self/smaps"
+#define THP_FS_PATH "/mnt/thp_fs"
 #define INPUT_MAX 80
 
 static int write_file(const char *path, const char *buf, size_t buflen)
@@ -50,13 +52,13 @@ static int write_file(const char *path, const char *buf, size_t buflen)
 	return (unsigned int) numwritten;
 }
 
-static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end)
+static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end, int order)
 {
 	char input[INPUT_MAX];
 	int ret;
 
-	ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx", pid, vaddr_start,
-			vaddr_end);
+	ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx,%d", pid, vaddr_start,
+			vaddr_end, order);
 	if (ret >= INPUT_MAX) {
 		printf("%s: Debugfs input is too long\n", __func__);
 		exit(EXIT_FAILURE);
@@ -139,7 +141,7 @@ void split_pmd_thp(void)
 	}
 
 	/* split all possible huge pages */
-	write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len);
+	write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len, 0);
 
 	*one_page = 0;
 
@@ -153,9 +155,101 @@ void split_pmd_thp(void)
 	free(one_page);
 }
 
+void create_pagecache_thp_and_fd(size_t fd_size, int *fd, char **addr)
+{
+	const char testfile[] = THP_FS_PATH "/test";
+	size_t i;
+	int dummy;
+
+	srand(time(NULL));
+
+	*fd = open(testfile, O_CREAT | O_RDWR, 0664);
+
+	for (i = 0; i < fd_size; i++) {
+		unsigned char byte = rand();
+
+		write(*fd, &byte, sizeof(byte));
+	}
+	close(*fd);
+	sync();
+	*fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
+	if (*fd == -1) {
+		perror("open drop_caches");
+		exit(EXIT_FAILURE);
+	}
+	if (write(*fd, "3", 1) != 1) {
+		perror("write to drop_caches");
+		exit(EXIT_FAILURE);
+	}
+	close(*fd);
+
+	*fd = open(testfile, O_RDWR);
+
+	*addr = mmap(NULL, fd_size, PROT_READ|PROT_WRITE, MAP_SHARED, *fd, 0);
+	if (*addr == (char *)-1) {
+		perror("cannot mmap");
+		exit(1);
+	}
+	madvise(*addr, fd_size, MADV_HUGEPAGE);
+
+	for (size_t i = 0; i < fd_size; i++)
+		dummy += *(*addr + i);
+}
+
+void split_thp_in_pagecache_to_order(int order)
+{
+	int fd;
+	char *addr;
+	size_t fd_size = 2 * PAGE_2MB, i;
+
+	create_pagecache_thp_and_fd(fd_size, &fd, &addr);
+
+	printf("split %ld kB pagecache page to order %d ... ", fd_size >> 10, order);
+	write_debugfs(getpid(), (uint64_t)addr, (uint64_t)addr + fd_size, order);
+
+	for (i = 0; i < fd_size; i++)
+		*(addr + i) = (char)i;
+
+	close(fd);
+	printf("done\n");
+}
+
+void truncate_thp_in_pagecache_to_order(int order)
+{
+	int fd;
+	char *addr;
+	size_t fd_size = 2 * PAGE_2MB, i;
+
+	create_pagecache_thp_and_fd(fd_size, &fd, &addr);
+
+	printf("truncate %ld kB pagecache page to size %lu kB ... ", fd_size >> 10, 4UL << order);
+	ftruncate(fd, PAGE_4KB << order);
+
+	for (i = 0; i < (PAGE_4KB << order); i++)
+		*(addr + i) = (char)i;
+
+	close(fd);
+	printf("done\n");
+}
+
 int main(int argc, char **argv)
 {
+	int i;
+
+	if (geteuid() != 0) {
+		printf("Please run the test as root\n");
+		exit(EXIT_FAILURE);
+	}
+
 	split_pmd_thp();
 
+	for (i = 8; i >= 0; i--)
+		if (i != 1)
+			split_thp_in_pagecache_to_order(i);
+
+	for (i = 8; i >= 0; i--)
+		if (i != 1)
+			truncate_thp_in_pagecache_to_order(i);
+
 	return 0;
 }
-- 
2.28.0



^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-11 20:40 ` [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner Zi Yan
@ 2020-11-12 17:57   ` Ralph Campbell
  2020-11-12 17:59     ` Zi Yan
  2020-11-14  0:15   ` Roman Gushchin
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Ralph Campbell @ 2020-11-12 17:57 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	David Nellans


On 11/11/20 12:40 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It adds a new_order parameter to set new page order in page owner.
> It prepares for upcoming changes to support split huge page to any lower
> order.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>

Except for a minor fix below, you can add:
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   include/linux/page_owner.h | 7 ++++---
>   mm/huge_memory.c           | 2 +-
>   mm/page_alloc.c            | 2 +-
>   mm/page_owner.c            | 6 +++---
>   4 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
> index 3468794f83d2..215cbb159568 100644
> --- a/include/linux/page_owner.h
> +++ b/include/linux/page_owner.h
> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>   		__set_page_owner(page, order, gfp_mask);
>   }
>   
> -static inline void split_page_owner(struct page *page, unsigned int nr)
> +static inline void split_page_owner(struct page *page, unsigned int nr,
> +			unsigned int new_order)
>   {
>   	if (static_branch_unlikely(&page_owner_inited))
> -		__split_page_owner(page, nr);
> +		__split_page_owner(page, nr, new_order);
>   }
>   static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
>   {
> @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
>   {
>   }
>   static inline void split_page_owner(struct page *page,
> -			unsigned int order)
> +			unsigned int nr, unsigned int new_order)
>   {
>   }
>   static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f599f5b9bf7f..8b7d771ee962 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   
>   	ClearPageCompound(head);
>   
> -	split_page_owner(head, nr);
> +	split_page_owner(head, nr, 1);

Shouldn't this be 0, not 1?
(new_order not new_nr).

>   	/* See comment in __split_huge_page_tail() */
>   	if (PageAnon(head)) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d77220615fd5..a9eead0e091a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
>   
>   	for (i = 1; i < (1 << order); i++)
>   		set_page_refcounted(page + i);
> -	split_page_owner(page, 1 << order);
> +	split_page_owner(page, 1 << order, 1);

Ditto, 0.

>   }
>   EXPORT_SYMBOL_GPL(split_page);

> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index b735a8eafcdb..2b7f7e9056dc 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason)
>   	page_owner->last_migrate_reason = reason;
>   }
>   
> -void __split_page_owner(struct page *page, unsigned int nr)
> +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order)
>   {
>   	int i;
>   	struct page_ext *page_ext = lookup_page_ext(page);
> @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr)
>   	if (unlikely(!page_ext))
>   		return;
>   
> -	for (i = 0; i < nr; i++) {
> +	for (i = 0; i < nr; i += (1 << new_order)) {
>   		page_owner = get_page_owner(page_ext);
> -		page_owner->order = 0;
> +		page_owner->order = new_order;
>   		page_ext = page_ext_next(page_ext);
>   	}
>   }
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.
  2020-11-11 20:40 ` [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split Zi Yan
@ 2020-11-12 17:58   ` Ralph Campbell
  2020-11-12 18:00     ` Zi Yan
  2020-11-14  0:23   ` Roman Gushchin
  1 sibling, 1 reply; 36+ messages in thread
From: Ralph Campbell @ 2020-11-12 17:58 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	David Nellans


On 11/11/20 12:40 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It reads thp_nr_pages and splits to provided new_nr. It prepares for
> upcoming changes to support split huge page to any lower order.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>

Looks OK to me.
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   include/linux/memcontrol.h | 5 +++--
>   mm/huge_memory.c           | 2 +-
>   mm/memcontrol.c            | 4 ++--
>   3 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0f4dd7829fb2..b3bac79ceed6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>   }
>   
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void mem_cgroup_split_huge_fixup(struct page *head);
> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr);
>   #endif
>   
>   #else /* CONFIG_MEMCG */
> @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>   	return 0;
>   }
>   
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +static inline void mem_cgroup_split_huge_fixup(struct page *head,
> +					       unsigned int new_nr)
>   {
>   }
>   
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c4fead5ead31..f599f5b9bf7f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   	lruvec = mem_cgroup_page_lruvec(head, pgdat);
>   
>   	/* complete memcg works before add pages to LRU */
> -	mem_cgroup_split_huge_fixup(head);
> +	mem_cgroup_split_huge_fixup(head, 1);
>   
>   	if (PageAnon(head) && PageSwapCache(head)) {
>   		swp_entry_t entry = { .val = page_private(head) };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 33f632689cee..e9705ba6bbcc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>    * Because tail pages are not marked as "used", set it. We're under
>    * pgdat->lru_lock and migration entries setup in all page mappings.
>    */
> -void mem_cgroup_split_huge_fixup(struct page *head)
> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
>   {
>   	struct mem_cgroup *memcg = page_memcg(head);
>   	int i;
> @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>   	if (mem_cgroup_disabled())
>   		return;
>   
> -	for (i = 1; i < thp_nr_pages(head); i++) {
> +	for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
>   		css_get(&memcg->css);
>   		head[i].memcg_data = (unsigned long)memcg;
>   	}
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-12 17:57   ` Ralph Campbell
@ 2020-11-12 17:59     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-12 17:59 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, David Nellans


On 12 Nov 2020, at 12:57, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> It adds a new_order parameter to set new page order in page owner.
>> It prepares for upcoming changes to support split huge page to any lower
>> order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> Except for a minor fix below, you can add:
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

Thanks.

>
>> ---
>>   include/linux/page_owner.h | 7 ++++---
>>   mm/huge_memory.c           | 2 +-
>>   mm/page_alloc.c            | 2 +-
>>   mm/page_owner.c            | 6 +++---
>>   4 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
>> index 3468794f83d2..215cbb159568 100644
>> --- a/include/linux/page_owner.h
>> +++ b/include/linux/page_owner.h
>> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>>   		__set_page_owner(page, order, gfp_mask);
>>   }
>>  -static inline void split_page_owner(struct page *page, unsigned int nr)
>> +static inline void split_page_owner(struct page *page, unsigned int nr,
>> +			unsigned int new_order)
>>   {
>>   	if (static_branch_unlikely(&page_owner_inited))
>> -		__split_page_owner(page, nr);
>> +		__split_page_owner(page, nr, new_order);
>>   }
>>   static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
>>   {
>> @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
>>   {
>>   }
>>   static inline void split_page_owner(struct page *page,
>> -			unsigned int order)
>> +			unsigned int nr, unsigned int new_order)
>>   {
>>   }
>>   static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index f599f5b9bf7f..8b7d771ee962 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>    	ClearPageCompound(head);
>>  -	split_page_owner(head, nr);
>> +	split_page_owner(head, nr, 1);
>
> Shouldn't this be 0, not 1?
> (new_order not new_nr).
>

Yes, I forgot to fix the call site after I change the function signature. Thanks.

>>   	/* See comment in __split_huge_page_tail() */
>>   	if (PageAnon(head)) {
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d77220615fd5..a9eead0e091a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
>>    	for (i = 1; i < (1 << order); i++)
>>   		set_page_refcounted(page + i);
>> -	split_page_owner(page, 1 << order);
>> +	split_page_owner(page, 1 << order, 1);
>
> Ditto, 0.
>

Sure, will fix this too.


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.
  2020-11-12 17:58   ` Ralph Campbell
@ 2020-11-12 18:00     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-12 18:00 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, David Nellans


On 12 Nov 2020, at 12:58, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> It reads thp_nr_pages and splits to provided new_nr. It prepares for
>> upcoming changes to support split huge page to any lower order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> Looks OK to me.
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

Thanks.

—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.
  2020-11-11 20:40 ` [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages Zi Yan
@ 2020-11-12 22:01   ` Ralph Campbell
  2020-11-12 22:20     ` Zi Yan
  2020-11-14  0:52   ` Roman Gushchin
  1 sibling, 1 reply; 36+ messages in thread
From: Ralph Campbell @ 2020-11-12 22:01 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	David Nellans


On 11/11/20 12:40 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> To split a THP to any lower order pages, we need to reform THPs on
> subpages at given order and add page refcount based on the new page
> order. Also we need to reinitialize page_deferred_list after removing
> the page from the split_queue, otherwise a subsequent split will see
> list corruption when checking the page_deferred_list again.
> 
> It has many uses, like minimizing the number of pages after
> truncating a pagecache THP. For anonymous THPs, we can only split them
> to order-0 like before until we add support for any size anonymous THPs.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   include/linux/huge_mm.h |  8 +++++
>   mm/huge_memory.c        | 78 +++++++++++++++++++++++++++++------------
>   mm/swap.c               |  1 -
>   3 files changed, 63 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 60a907a19f7d..9819cd9b4619 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
>   
>   bool can_split_huge_page(struct page *page, int *pextra_pins);
>   int split_huge_page_to_list(struct page *page, struct list_head *list);
> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +		unsigned int new_order);
>   static inline int split_huge_page(struct page *page)
>   {
>   	return split_huge_page_to_list(page, NULL);
> @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list)
>   {
>   	return 0;
>   }
> +static inline int
> +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
> +		unsigned int new_order)
> +{
> +	return 0;
> +}
>   static inline int split_huge_page(struct page *page)
>   {
>   	return 0;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8b7d771ee962..88f50da40c9b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>   static void unmap_page(struct page *page)
>   {
>   	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
> -		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> +		TTU_RMAP_LOCKED;
>   	bool unmap_success;
>   
>   	VM_BUG_ON_PAGE(!PageHead(page), page);
>   
> +	if (thp_order(page) >= HPAGE_PMD_ORDER)
> +		ttu_flags |= TTU_SPLIT_HUGE_PMD;
> +
>   	if (PageAnon(page))
>   		ttu_flags |= TTU_SPLIT_FREEZE;
>   
> @@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
>   	VM_BUG_ON_PAGE(!unmap_success, page);
>   }
>   
> -static void remap_page(struct page *page, unsigned int nr)
> +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
>   {
>   	int i;

Use unsigned int i?
Maybe a blank line here and the {}'s around if/else aren't needed.

> -	if (PageTransHuge(page)) {
> +	if (thp_nr_pages(page) == nr) {
>   		remove_migration_ptes(page, page, true);
>   	} else {
> -		for (i = 0; i < nr; i++)
> +		for (i = 0; i < nr; i += new_nr)
>   			remove_migration_ptes(page + i, page + i, true);
>   	}
>   }
>   
>   static void __split_huge_page_tail(struct page *head, int tail,
> -		struct lruvec *lruvec, struct list_head *list)
> +		struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
>   {
>   	struct page *page_tail = head + tail;
> +	unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>   
>   	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
>   
> @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>   #ifdef CONFIG_64BIT
>   			 (1L << PG_arch_2) |
>   #endif
> +			 compound_head_flag |
>   			 (1L << PG_dirty)));
>   
>   	/* ->mapping in first tail page is compound_mapcount */
> @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
>   	 * which needs correct compound_head().
>   	 */
>   	clear_compound_head(page_tail);
> +	if (new_order) {
> +		prep_compound_page(page_tail, new_order);
> +		thp_prep(page_tail);
> +	}
>   
>   	/* Finally unfreeze refcount. Additional reference from page cache. */
> -	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
> -					  PageSwapCache(head)));
> +	page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
> +					   PageSwapCache(head)) ?
> +						thp_nr_pages(page_tail) : 0));
>   
>   	if (page_is_young(head))
>   		set_page_young(page_tail);
> @@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>   }
>   
>   static void __split_huge_page(struct page *page, struct list_head *list,
> -		pgoff_t end, unsigned long flags)
> +		pgoff_t end, unsigned long flags, unsigned int new_order)
>   {
>   	struct page *head = compound_head(page);
>   	pg_data_t *pgdat = page_pgdat(head);
> @@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   	struct address_space *swap_cache = NULL;
>   	unsigned long offset = 0;
>   	unsigned int nr = thp_nr_pages(head);
> +	unsigned int new_nr = 1 << new_order;
>   	int i;
>   
>   	lruvec = mem_cgroup_page_lruvec(head, pgdat);
>   
>   	/* complete memcg works before add pages to LRU */
> -	mem_cgroup_split_huge_fixup(head, 1);
> +	mem_cgroup_split_huge_fixup(head, new_nr);
>   
>   	if (PageAnon(head) && PageSwapCache(head)) {
>   		swp_entry_t entry = { .val = page_private(head) };
> @@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   		xa_lock(&swap_cache->i_pages);
>   	}
>   
> -	for (i = nr - 1; i >= 1; i--) {
> -		__split_huge_page_tail(head, i, lruvec, list);
> +	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
> +		__split_huge_page_tail(head, i, lruvec, list, new_order);
>   		/* Some pages can be beyond i_size: drop them from page cache */
>   		if (head[i].index >= end) {
>   			ClearPageDirty(head + i);
>   			__delete_from_page_cache(head + i, NULL);
>   			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
> -				shmem_uncharge(head->mapping->host, 1);
> +				shmem_uncharge(head->mapping->host, new_nr);
>   			put_page(head + i);
>   		} else if (!PageAnon(page)) {
>   			__xa_store(&head->mapping->i_pages, head[i].index,
> @@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   		}
>   	}
>   
> -	ClearPageCompound(head);
> +	if (!new_order)
> +		ClearPageCompound(head);
> +	else
> +		set_compound_order(head, new_order);
>   
> -	split_page_owner(head, nr, 1);
> +	split_page_owner(head, nr, new_nr);

This needs to be "new_order" instead of "new_nr".

>   	/* See comment in __split_huge_page_tail() */
>   	if (PageAnon(head)) {
>   		/* Additional pin to swap cache */
>   		if (PageSwapCache(head)) {
> -			page_ref_add(head, 2);
> +			page_ref_add(head, 1 + new_nr);
>   			xa_unlock(&swap_cache->i_pages);
>   		} else {
>   			page_ref_inc(head);
>   		}
>   	} else {
>   		/* Additional pin to page cache */
> -		page_ref_add(head, 2);
> +		page_ref_add(head, 1 + new_nr);
>   		xa_unlock(&head->mapping->i_pages);
>   	}
>   
>   	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>   
> -	remap_page(head, nr);
> +	remap_page(head, nr, new_nr);
>   
>   	if (PageSwapCache(head)) {
>   		swp_entry_t entry = { .val = page_private(head) };
> @@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>   		split_swap_cluster(entry);
>   	}
>   
> -	for (i = 0; i < nr; i++) {
> +	for (i = 0; i < nr; i += new_nr) {
>   		struct page *subpage = head + i;
>   		if (subpage == page)
>   			continue;
> @@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
>    * us.
>    */
>   int split_huge_page_to_list(struct page *page, struct list_head *list)
> +{
> +	return split_huge_page_to_list_to_order(page, list, 0);
> +}
> +
> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +				     unsigned int new_order)
>   {
>   	struct page *head = compound_head(page);
>   	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
>   	struct deferred_split *ds_queue = get_deferred_split_queue(head);
> -	XA_STATE(xas, &head->mapping->i_pages, head->index);
> +	/* reset xarray order to new order after split */
> +	XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order);
>   	struct anon_vma *anon_vma = NULL;
>   	struct address_space *mapping = NULL;
>   	int count, mapcount, extra_pins, ret;
>   	unsigned long flags;
>   	pgoff_t end;
>   
> +	VM_BUG_ON(thp_order(head) <= new_order);
>   	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
>   	VM_BUG_ON_PAGE(!PageLocked(head), head);
>   	VM_BUG_ON_PAGE(!PageCompound(head), head);
>   
> +	if (new_order == 1) {
> +		WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
> +		return -EINVAL;
> +	}
> +
> +	if (PageAnon(head) && new_order) {
> +		WARN_ONCE(1, "Split anonymous THP to non-zero order not support");
> +		return -EINVAL;
> +	}
> +
>   	if (PageWriteback(head))
>   		return -EBUSY;
>   
> @@ -2720,18 +2752,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>   	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
>   		if (!list_empty(page_deferred_list(head))) {
>   			ds_queue->split_queue_len--;
> -			list_del(page_deferred_list(head));
> +			list_del_init(page_deferred_list(head));
>   		}
>   		spin_unlock(&ds_queue->split_queue_lock);
>   		if (mapping) {
>   			if (PageSwapBacked(head))
>   				__dec_lruvec_page_state(head, NR_SHMEM_THPS);
> -			else
> +			else if (!new_order)
>   				__mod_lruvec_page_state(head, NR_FILE_THPS,
>   						-thp_nr_pages(head));
>   		}
>   
> -		__split_huge_page(page, list, end, flags);
> +		__split_huge_page(page, list, end, flags, new_order);
>   		ret = 0;
>   	} else {
>   		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
> @@ -2746,7 +2778,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>   fail:		if (mapping)
>   			xas_unlock(&xas);
>   		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
> -		remap_page(head, thp_nr_pages(head));
> +		remap_page(head, thp_nr_pages(head), 0);

Shouldn't this be "1" instead of zero?
remap_page() takes new_nr.

>   		ret = -EBUSY;
>   	}
>   
> diff --git a/mm/swap.c b/mm/swap.c
> index 14c3bac607a6..6c33e6165597 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -983,7 +983,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>   		       struct lruvec *lruvec, struct list_head *list)
>   {
>   	VM_BUG_ON_PAGE(!PageHead(page), page);
> -	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>   	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>   	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
>   
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible.
  2020-11-11 20:40 ` [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible Zi Yan
@ 2020-11-12 22:08   ` Ralph Campbell
  2020-11-12 22:37     ` Zi Yan
  0 siblings, 1 reply; 36+ messages in thread
From: Ralph Campbell @ 2020-11-12 22:08 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	David Nellans


On 11/11/20 12:40 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> To minimize the number of pages after a truncation, when truncating a
> THP, we do not need to split it all the way down to order-0. The THP has
> at most three parts, the part before offset, the part to be truncated,
> the part left at the end. Use the non-zero minimum of them to decide
> what order we split the THP to.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/truncate.c | 22 ++++++++++++++++++++--
>   1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 20bd17538ec2..6d8e3c6115bc 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
>   bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
>   {
>   	loff_t pos = page_offset(page);
> -	unsigned int offset, length;
> +	unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;

Maybe use "remaining" instead of "left" since I think of the latter as the length of the
left side (offset).
  
>   	if (pos < start)
>   		offset = start - pos;
> @@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
>   		length = length - offset;
>   	else
>   		length = end + 1 - pos - offset;
> +	left = thp_size(page) - offset - length;
>   
>   	wait_on_page_writeback(page);
>   	if (length == thp_size(page)) {
> @@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
>   		do_invalidatepage(page, offset, length);
>   	if (!PageTransHuge(page))
>   		return true;
> -	return split_huge_page(page) == 0;
> +
> +	/*
> +	 * find the non-zero minimum of offset, length, and left and use it to
> +	 * decide the new order of the page after split
> +	 */
> +	if (offset && left)
> +		min_subpage_size = min_t(unsigned int,
> +					 min_t(unsigned int, offset, length),
> +					 left);
> +	else if (!offset)
> +		min_subpage_size = min_t(unsigned int, length, left);
> +	else /* !left */
> +		min_subpage_size = min_t(unsigned int, length, offset);
> +
> +	min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
> +
> +	return split_huge_page_to_list_to_order(page, NULL,
> +				ilog2(min_subpage_size/PAGE_SIZE)) == 0;
>   }

What if "min_subpage_size" is 1/2 the THP but offset isn't aligned to 1/2?
Splitting the page in half wouldn't result in a page that could be freed
but maybe splitting to 1/4 would (assuming the THP is at least 8x PAGE_SIZE).


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.
  2020-11-12 22:01   ` Ralph Campbell
@ 2020-11-12 22:20     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-12 22:20 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, David Nellans

[-- Attachment #1: Type: text/plain, Size: 11337 bytes --]

On 12 Nov 2020, at 17:01, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> To split a THP to any lower order pages, we need to reform THPs on
>> subpages at given order and add page refcount based on the new page
>> order. Also we need to reinitialize page_deferred_list after removing
>> the page from the split_queue, otherwise a subsequent split will see
>> list corruption when checking the page_deferred_list again.
>>
>> It has many uses, like minimizing the number of pages after
>> truncating a pagecache THP. For anonymous THPs, we can only split them
>> to order-0 like before until we add support for any size anonymous THPs.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   include/linux/huge_mm.h |  8 +++++
>>   mm/huge_memory.c        | 78 +++++++++++++++++++++++++++++------------
>>   mm/swap.c               |  1 -
>>   3 files changed, 63 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 60a907a19f7d..9819cd9b4619 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
>>    bool can_split_huge_page(struct page *page, int *pextra_pins);
>>   int split_huge_page_to_list(struct page *page, struct list_head *list);
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order);
>>   static inline int split_huge_page(struct page *page)
>>   {
>>   	return split_huge_page_to_list(page, NULL);
>> @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list)
>>   {
>>   	return 0;
>>   }
>> +static inline int
>> +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
>> +		unsigned int new_order)
>> +{
>> +	return 0;
>> +}
>>   static inline int split_huge_page(struct page *page)
>>   {
>>   	return 0;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8b7d771ee962..88f50da40c9b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>   static void unmap_page(struct page *page)
>>   {
>>   	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
>> -		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
>> +		TTU_RMAP_LOCKED;
>>   	bool unmap_success;
>>    	VM_BUG_ON_PAGE(!PageHead(page), page);
>>  +	if (thp_order(page) >= HPAGE_PMD_ORDER)
>> +		ttu_flags |= TTU_SPLIT_HUGE_PMD;
>> +
>>   	if (PageAnon(page))
>>   		ttu_flags |= TTU_SPLIT_FREEZE;
>>  @@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
>>   	VM_BUG_ON_PAGE(!unmap_success, page);
>>   }
>>  -static void remap_page(struct page *page, unsigned int nr)
>> +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
>>   {
>>   	int i;
>
> Use unsigned int i?
> Maybe a blank line here and the {}'s around if/else aren't needed.
>
>> -	if (PageTransHuge(page)) {
>> +	if (thp_nr_pages(page) == nr) {
>>   		remove_migration_ptes(page, page, true);
>>   	} else {
>> -		for (i = 0; i < nr; i++)
>> +		for (i = 0; i < nr; i += new_nr)
>>   			remove_migration_ptes(page + i, page + i, true);
>>   	}
>>   }
>>    static void __split_huge_page_tail(struct page *head, int tail,
>> -		struct lruvec *lruvec, struct list_head *list)
>> +		struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
>>   {
>>   	struct page *page_tail = head + tail;
>> +	unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>>    	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
>>  @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>   #ifdef CONFIG_64BIT
>>   			 (1L << PG_arch_2) |
>>   #endif
>> +			 compound_head_flag |
>>   			 (1L << PG_dirty)));
>>    	/* ->mapping in first tail page is compound_mapcount */
>> @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>   	 * which needs correct compound_head().
>>   	 */
>>   	clear_compound_head(page_tail);
>> +	if (new_order) {
>> +		prep_compound_page(page_tail, new_order);
>> +		thp_prep(page_tail);
>> +	}
>>    	/* Finally unfreeze refcount. Additional reference from page cache. */
>> -	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
>> -					  PageSwapCache(head)));
>> +	page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
>> +					   PageSwapCache(head)) ?
>> +						thp_nr_pages(page_tail) : 0));
>>    	if (page_is_young(head))
>>   		set_page_young(page_tail);
>> @@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>   }
>>    static void __split_huge_page(struct page *page, struct list_head *list,
>> -		pgoff_t end, unsigned long flags)
>> +		pgoff_t end, unsigned long flags, unsigned int new_order)
>>   {
>>   	struct page *head = compound_head(page);
>>   	pg_data_t *pgdat = page_pgdat(head);
>> @@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>   	struct address_space *swap_cache = NULL;
>>   	unsigned long offset = 0;
>>   	unsigned int nr = thp_nr_pages(head);
>> +	unsigned int new_nr = 1 << new_order;
>>   	int i;
>>    	lruvec = mem_cgroup_page_lruvec(head, pgdat);
>>    	/* complete memcg works before add pages to LRU */
>> -	mem_cgroup_split_huge_fixup(head, 1);
>> +	mem_cgroup_split_huge_fixup(head, new_nr);
>>    	if (PageAnon(head) && PageSwapCache(head)) {
>>   		swp_entry_t entry = { .val = page_private(head) };
>> @@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>   		xa_lock(&swap_cache->i_pages);
>>   	}
>>  -	for (i = nr - 1; i >= 1; i--) {
>> -		__split_huge_page_tail(head, i, lruvec, list);
>> +	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
>> +		__split_huge_page_tail(head, i, lruvec, list, new_order);
>>   		/* Some pages can be beyond i_size: drop them from page cache */
>>   		if (head[i].index >= end) {
>>   			ClearPageDirty(head + i);
>>   			__delete_from_page_cache(head + i, NULL);
>>   			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
>> -				shmem_uncharge(head->mapping->host, 1);
>> +				shmem_uncharge(head->mapping->host, new_nr);
>>   			put_page(head + i);
>>   		} else if (!PageAnon(page)) {
>>   			__xa_store(&head->mapping->i_pages, head[i].index,
>> @@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>   		}
>>   	}
>>  -	ClearPageCompound(head);
>> +	if (!new_order)
>> +		ClearPageCompound(head);
>> +	else
>> +		set_compound_order(head, new_order);
>>  -	split_page_owner(head, nr, 1);
>> +	split_page_owner(head, nr, new_nr);
>
> This needs to be "new_order" instead of "new_nr".
>
>>   	/* See comment in __split_huge_page_tail() */
>>   	if (PageAnon(head)) {
>>   		/* Additional pin to swap cache */
>>   		if (PageSwapCache(head)) {
>> -			page_ref_add(head, 2);
>> +			page_ref_add(head, 1 + new_nr);
>>   			xa_unlock(&swap_cache->i_pages);
>>   		} else {
>>   			page_ref_inc(head);
>>   		}
>>   	} else {
>>   		/* Additional pin to page cache */
>> -		page_ref_add(head, 2);
>> +		page_ref_add(head, 1 + new_nr);
>>   		xa_unlock(&head->mapping->i_pages);
>>   	}
>>    	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>>  -	remap_page(head, nr);
>> +	remap_page(head, nr, new_nr);
>>    	if (PageSwapCache(head)) {
>>   		swp_entry_t entry = { .val = page_private(head) };
>> @@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>   		split_swap_cluster(entry);
>>   	}
>>  -	for (i = 0; i < nr; i++) {
>> +	for (i = 0; i < nr; i += new_nr) {
>>   		struct page *subpage = head + i;
>>   		if (subpage == page)
>>   			continue;
>> @@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
>>    * us.
>>    */
>>   int split_huge_page_to_list(struct page *page, struct list_head *list)
>> +{
>> +	return split_huge_page_to_list_to_order(page, list, 0);
>> +}
>> +
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +				     unsigned int new_order)
>>   {
>>   	struct page *head = compound_head(page);
>>   	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
>>   	struct deferred_split *ds_queue = get_deferred_split_queue(head);
>> -	XA_STATE(xas, &head->mapping->i_pages, head->index);
>> +	/* reset xarray order to new order after split */
>> +	XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order);
>>   	struct anon_vma *anon_vma = NULL;
>>   	struct address_space *mapping = NULL;
>>   	int count, mapcount, extra_pins, ret;
>>   	unsigned long flags;
>>   	pgoff_t end;
>>  +	VM_BUG_ON(thp_order(head) <= new_order);
>>   	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
>>   	VM_BUG_ON_PAGE(!PageLocked(head), head);
>>   	VM_BUG_ON_PAGE(!PageCompound(head), head);
>>  +	if (new_order == 1) {
>> +		WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (PageAnon(head) && new_order) {
>> +		WARN_ONCE(1, "Split anonymous THP to non-zero order not support");
>> +		return -EINVAL;
>> +	}
>> +
>>   	if (PageWriteback(head))
>>   		return -EBUSY;
>>  @@ -2720,18 +2752,18 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>   	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
>>   		if (!list_empty(page_deferred_list(head))) {
>>   			ds_queue->split_queue_len--;
>> -			list_del(page_deferred_list(head));
>> +			list_del_init(page_deferred_list(head));
>>   		}
>>   		spin_unlock(&ds_queue->split_queue_lock);
>>   		if (mapping) {
>>   			if (PageSwapBacked(head))
>>   				__dec_lruvec_page_state(head, NR_SHMEM_THPS);
>> -			else
>> +			else if (!new_order)
>>   				__mod_lruvec_page_state(head, NR_FILE_THPS,
>>   						-thp_nr_pages(head));
>>   		}
>>  -		__split_huge_page(page, list, end, flags);
>> +		__split_huge_page(page, list, end, flags, new_order);
>>   		ret = 0;
>>   	} else {
>>   		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
>> @@ -2746,7 +2778,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>   fail:		if (mapping)
>>   			xas_unlock(&xas);
>>   		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
>> -		remap_page(head, thp_nr_pages(head));
>> +		remap_page(head, thp_nr_pages(head), 0);
>
> Shouldn't this be "1" instead of zero?
> remap_page() takes new_nr.
>
>>   		ret = -EBUSY;
>>   	}
>>  diff --git a/mm/swap.c b/mm/swap.c
>> index 14c3bac607a6..6c33e6165597 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -983,7 +983,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>>   		       struct lruvec *lruvec, struct list_head *list)
>>   {
>>   	VM_BUG_ON_PAGE(!PageHead(page), page);
>> -	VM_BUG_ON_PAGE(PageCompound(page_tail), page);
>>   	VM_BUG_ON_PAGE(PageLRU(page_tail), page);
>>   	lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
>>

Thanks for catching all these. Will fix them in the next version.


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.
  2020-11-11 20:40 ` [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range Zi Yan
@ 2020-11-12 22:22   ` Ralph Campbell
  2020-11-12 22:38     ` Zi Yan
  2020-11-16 16:06   ` Kirill A. Shutemov
  1 sibling, 1 reply; 36+ messages in thread
From: Ralph Campbell @ 2020-11-12 22:22 UTC (permalink / raw)
  To: Zi Yan, linux-mm, Matthew Wilcox
  Cc: Kirill A . Shutemov, Roman Gushchin, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	David Nellans


On 11/11/20 12:40 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Huge pages in the process with the given pid and virtual address range
> are split. It is used to test split huge page function. In addition,
> a testing program is added to tools/testing/selftests/vm to utilize the
> interface by splitting PMD THPs.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>   mm/huge_memory.c                              |  98 +++++++++++
>   mm/internal.h                                 |   1 +
>   mm/migrate.c                                  |   2 +-
>   tools/testing/selftests/vm/Makefile           |   1 +
>   .../selftests/vm/split_huge_page_test.c       | 161 ++++++++++++++++++
>   5 files changed, 262 insertions(+), 1 deletion(-)
>   create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c

Don't forget to update ".gitignore" to include "split_huge_page_test".
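
For example, a one-line addition to tools/testing/selftests/vm/.gitignore
next to the other test binaries would do:

split_huge_page_test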


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible.
  2020-11-12 22:08   ` Ralph Campbell
@ 2020-11-12 22:37     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-12 22:37 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, David Nellans

[-- Attachment #1: Type: text/plain, Size: 3290 bytes --]

On 12 Nov 2020, at 17:08, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> To minimize the number of pages after a truncation, when truncating a
>> THP, we do not need to split it all the way down to order-0. The THP has
>> at most three parts, the part before offset, the part to be truncated,
>> the part left at the end. Use the non-zero minimum of them to decide
>> what order we split the THP to.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/truncate.c | 22 ++++++++++++++++++++--
>>   1 file changed, 20 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 20bd17538ec2..6d8e3c6115bc 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -237,7 +237,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page)
>>   bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
>>   {
>>   	loff_t pos = page_offset(page);
>> -	unsigned int offset, length;
>> +	unsigned int offset, length, left, min_subpage_size = PAGE_SIZE;
>
> Maybe use "remaining" instead of "left" since I think of the latter as the length of the
> left side (offset).

Sure. Will change the name.

>
>>   	if (pos < start)
>>   		offset = start - pos;
>> @@ -248,6 +248,7 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
>>   		length = length - offset;
>>   	else
>>   		length = end + 1 - pos - offset;
>> +	left = thp_size(page) - offset - length;
>>    	wait_on_page_writeback(page);
>>   	if (length == thp_size(page)) {
>> @@ -267,7 +268,24 @@ bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end)
>>   		do_invalidatepage(page, offset, length);
>>   	if (!PageTransHuge(page))
>>   		return true;
>> -	return split_huge_page(page) == 0;
>> +
>> +	/*
>> +	 * find the non-zero minimum of offset, length, and left and use it to
>> +	 * decide the new order of the page after split
>> +	 */
>> +	if (offset && left)
>> +		min_subpage_size = min_t(unsigned int,
>> +					 min_t(unsigned int, offset, length),
>> +					 left);
>> +	else if (!offset)
>> +		min_subpage_size = min_t(unsigned int, length, left);
>> +	else /* !left */
>> +		min_subpage_size = min_t(unsigned int, length, offset);
>> +
>> +	min_subpage_size = max_t(unsigned int, PAGE_SIZE, min_subpage_size);
>> +
>> +	return split_huge_page_to_list_to_order(page, NULL,
>> +				ilog2(min_subpage_size/PAGE_SIZE)) == 0;
>>   }
>
> What if "min_subpage_size" is 1/2 the THP but offset isn't aligned to 1/2?
> Splitting the page in half wouldn't result in a page that could be freed
> but maybe splitting to 1/4 would (assuming the THP is at least 8x PAGE_SIZE).

Is that possible? The whole THP is divided into three parts: offset, length,
and remaining (renamed from left). If offset is not aligned to 1/2, it is
either greater than 1/2 or smaller than 1/2. If it is the former, length and
remaining are both smaller than 1/2, so min_subpage_size cannot be 1/2. If it
is the latter, min_subpage_size cannot be 1/2 either, because min_subpage_size
is the smallest non-zero value of offset, length, and remaining. Let me know
if I missed anything.
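
Just to illustrate, here is a quick userspace sketch of the same computation
(only the min/max logic mirrors the patch; PAGE_SIZE, THP_SIZE, and the helper
are made up for the example, assuming 4KB pages and a 2MB THP):

#include <stdio.h>

#define PAGE_SIZE	4096u
#define THP_SIZE	(512u * PAGE_SIZE)	/* 2MB THP */

/* mirrors the min_subpage_size logic in truncate_inode_partial_page() */
static unsigned int split_order(unsigned int offset, unsigned int length)
{
	unsigned int remaining = THP_SIZE - offset - length;
	unsigned int min_subpage_size;

	if (offset && remaining)
		min_subpage_size = offset < length ?
			(offset < remaining ? offset : remaining) :
			(length < remaining ? length : remaining);
	else if (!offset)
		min_subpage_size = length < remaining ? length : remaining;
	else	/* !remaining */
		min_subpage_size = length < offset ? length : offset;

	if (min_subpage_size < PAGE_SIZE)
		min_subpage_size = PAGE_SIZE;

	/* ilog2() stand-in (gcc/clang builtin) */
	return 31 - __builtin_clz(min_subpage_size / PAGE_SIZE);
}

int main(void)
{
	/* truncate from one page past the middle of the THP to its end */
	unsigned int offset = 257u * PAGE_SIZE;
	unsigned int length = THP_SIZE - offset;	/* remaining == 0 */

	/* prints 7: the smallest non-zero part is 255 pages, so order-7 */
	printf("split order = %u\n", split_order(offset, length));
	return 0;
}

So with any misaligned offset the smallest non-zero part is already below
half of the THP and the resulting split order is below half as well.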

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.
  2020-11-12 22:22   ` Ralph Campbell
@ 2020-11-12 22:38     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-12 22:38 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, David Nellans

[-- Attachment #1: Type: text/plain, Size: 1049 bytes --]

On 12 Nov 2020, at 17:22, Ralph Campbell wrote:

> On 11/11/20 12:40 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Huge pages in the process with the given pid and virtual address range
>> are split. It is used to test split huge page function. In addition,
>> a testing program is added to tools/testing/selftests/vm to utilize the
>> interface by splitting PMD THPs.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>   mm/huge_memory.c                              |  98 +++++++++++
>>   mm/internal.h                                 |   1 +
>>   mm/migrate.c                                  |   2 +-
>>   tools/testing/selftests/vm/Makefile           |   1 +
>>   .../selftests/vm/split_huge_page_test.c       | 161 ++++++++++++++++++
>>   5 files changed, 262 insertions(+), 1 deletion(-)
>>   create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>
> Don't forget to update ".gitignore" to include "split_huge_page_test".

Sure. Thanks for pointing this out.

—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-11 20:40 ` [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner Zi Yan
  2020-11-12 17:57   ` Ralph Campbell
@ 2020-11-14  0:15   ` Roman Gushchin
  2020-11-14  1:08     ` Zi Yan
  2020-11-16 16:25   ` Kirill A. Shutemov
  2020-11-17 21:10   ` Matthew Wilcox
  3 siblings, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2020-11-14  0:15 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It adds a new_order parameter to set new page order in page owner.
> It prepares for upcoming changes to support split huge page to any lower
> order.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/page_owner.h | 7 ++++---
>  mm/huge_memory.c           | 2 +-
>  mm/page_alloc.c            | 2 +-
>  mm/page_owner.c            | 6 +++---
>  4 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
> index 3468794f83d2..215cbb159568 100644
> --- a/include/linux/page_owner.h
> +++ b/include/linux/page_owner.h
> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>  		__set_page_owner(page, order, gfp_mask);
>  }
>  
> -static inline void split_page_owner(struct page *page, unsigned int nr)
> +static inline void split_page_owner(struct page *page, unsigned int nr,
> +			unsigned int new_order)
>  {
>  	if (static_branch_unlikely(&page_owner_inited))
> -		__split_page_owner(page, nr);
> +		__split_page_owner(page, nr, new_order);
>  }
>  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
>  {
> @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
>  {
>  }
>  static inline void split_page_owner(struct page *page,
> -			unsigned int order)
> +			unsigned int nr, unsigned int new_order)

With the addition of the new argument it's a bit hard to understand
what the function is supposed to do. It seems like nr is always
1 << page_order(page), is that right? Maybe we can pass old_order and
new_order? Or just the page and the new order?
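
E.g. just a sketch of the interface I mean (the body is unchanged from your
patch, with nr derived from old_order):

void __split_page_owner(struct page *page, unsigned int old_order,
			unsigned int new_order)
{
	int i;
	struct page_ext *page_ext = lookup_page_ext(page);
	struct page_owner *page_owner;

	if (unlikely(!page_ext))
		return;

	for (i = 0; i < (1 << old_order); i += (1 << new_order)) {
		page_owner = get_page_owner(page_ext);
		page_owner->order = new_order;
		page_ext = page_ext_next(page_ext);
	}
}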

>  {
>  }
>  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f599f5b9bf7f..8b7d771ee962 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  
>  	ClearPageCompound(head);
>  
> -	split_page_owner(head, nr);
> +	split_page_owner(head, nr, 1);
>  
>  	/* See comment in __split_huge_page_tail() */
>  	if (PageAnon(head)) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d77220615fd5..a9eead0e091a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
>  
>  	for (i = 1; i < (1 << order); i++)
>  		set_page_refcounted(page + i);
> -	split_page_owner(page, 1 << order);
> +	split_page_owner(page, 1 << order, 1);
>  }
>  EXPORT_SYMBOL_GPL(split_page);
>  
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index b735a8eafcdb..2b7f7e9056dc 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason)
>  	page_owner->last_migrate_reason = reason;
>  }
>  
> -void __split_page_owner(struct page *page, unsigned int nr)
> +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order)
>  {
>  	int i;
>  	struct page_ext *page_ext = lookup_page_ext(page);
> @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr)
>  	if (unlikely(!page_ext))
>  		return;
>  
> -	for (i = 0; i < nr; i++) {
> +	for (i = 0; i < nr; i += (1 << new_order)) {
>  		page_owner = get_page_owner(page_ext);
> -		page_owner->order = 0;
> +		page_owner->order = new_order;
>  		page_ext = page_ext_next(page_ext);

I believe there cannot be any leftovers because nr is always a power of 2.
Is that true? Converting the nr argument to an order (if possible) would make it obvious.

Other than that the patch looks good to me.

Thanks!


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.
  2020-11-11 20:40 ` [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split Zi Yan
  2020-11-12 17:58   ` Ralph Campbell
@ 2020-11-14  0:23   ` Roman Gushchin
  2020-11-14  0:56     ` Zi Yan
  1 sibling, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2020-11-14  0:23 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Wed, Nov 11, 2020 at 03:40:04PM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It reads thp_nr_pages and splits to provided new_nr. It prepares for
> upcoming changes to support split huge page to any lower order.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/memcontrol.h | 5 +++--
>  mm/huge_memory.c           | 2 +-
>  mm/memcontrol.c            | 4 ++--
>  3 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0f4dd7829fb2..b3bac79ceed6 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -void mem_cgroup_split_huge_fixup(struct page *head);
> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr);
>  #endif
>  
>  #else /* CONFIG_MEMCG */
> @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  	return 0;
>  }
>  
> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
> +static inline void mem_cgroup_split_huge_fixup(struct page *head,
> +					       unsigned int new_nr)
>  {
>  }
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c4fead5ead31..f599f5b9bf7f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	lruvec = mem_cgroup_page_lruvec(head, pgdat);
>  
>  	/* complete memcg works before add pages to LRU */
> -	mem_cgroup_split_huge_fixup(head);
> +	mem_cgroup_split_huge_fixup(head, 1);
>  
>  	if (PageAnon(head) && PageSwapCache(head)) {
>  		swp_entry_t entry = { .val = page_private(head) };
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 33f632689cee..e9705ba6bbcc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>   * Because tail pages are not marked as "used", set it. We're under
>   * pgdat->lru_lock and migration entries setup in all page mappings.
>   */
> -void mem_cgroup_split_huge_fixup(struct page *head)
> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)

I'd go with unsigned int new_order, then it's obvious that we can split
the original page without any leftovers.
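
I.e. something like (sketch only, same body as in your patch with new_nr
derived from new_order):

void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_order)
{
	struct mem_cgroup *memcg = page_memcg(head);
	unsigned int new_nr = 1 << new_order;
	int i;

	if (mem_cgroup_disabled())
		return;

	for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
		css_get(&memcg->css);
		head[i].memcg_data = (unsigned long)memcg;
	}
}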

Other than that the patch looks good!
Acked-by: Roman Gushchin <guro@fb.com>

Thanks!

>  {
>  	struct mem_cgroup *memcg = page_memcg(head);
>  	int i;
> @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  	if (mem_cgroup_disabled())
>  		return;
>  
> -	for (i = 1; i < thp_nr_pages(head); i++) {
> +	for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
>  		css_get(&memcg->css);
>  		head[i].memcg_data = (unsigned long)memcg;
>  	}
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.
  2020-11-11 20:40 ` [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages Zi Yan
  2020-11-12 22:01   ` Ralph Campbell
@ 2020-11-14  0:52   ` Roman Gushchin
  2020-11-14  1:00     ` Zi Yan
  1 sibling, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2020-11-14  0:52 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Wed, Nov 11, 2020 at 03:40:06PM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> To split a THP to any lower order pages, we need to reform THPs on
> subpages at given order and add page refcount based on the new page
> order. Also we need to reinitialize page_deferred_list after removing
> the page from the split_queue, otherwise a subsequent split will see
> list corruption when checking the page_deferred_list again.
> 
> It has many uses, like minimizing the number of pages after
> truncating a pagecache THP. For anonymous THPs, we can only split them
> to order-0 like before until we add support for any size anonymous THPs.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/huge_mm.h |  8 +++++
>  mm/huge_memory.c        | 78 +++++++++++++++++++++++++++++------------
>  mm/swap.c               |  1 -
>  3 files changed, 63 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 60a907a19f7d..9819cd9b4619 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
>  
>  bool can_split_huge_page(struct page *page, int *pextra_pins);
>  int split_huge_page_to_list(struct page *page, struct list_head *list);
> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +		unsigned int new_order);
>  static inline int split_huge_page(struct page *page)
>  {
>  	return split_huge_page_to_list(page, NULL);
> @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list)
>  {
>  	return 0;
>  }
> +static inline int
> +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
> +		unsigned int new_order)

It was
int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
		unsigned int new_order);
above, so the stub name here doesn't match.


> +{
> +	return 0;
> +}
>  static inline int split_huge_page(struct page *page)
>  {
>  	return 0;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8b7d771ee962..88f50da40c9b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>  static void unmap_page(struct page *page)
>  {
>  	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
> -		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
> +		TTU_RMAP_LOCKED;
>  	bool unmap_success;
>  
>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>  
> +	if (thp_order(page) >= HPAGE_PMD_ORDER)
> +		ttu_flags |= TTU_SPLIT_HUGE_PMD;
> +
>  	if (PageAnon(page))
>  		ttu_flags |= TTU_SPLIT_FREEZE;
>  
> @@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
>  	VM_BUG_ON_PAGE(!unmap_success, page);
>  }
>  
> -static void remap_page(struct page *page, unsigned int nr)
> +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
>  {
>  	int i;
> -	if (PageTransHuge(page)) {
> +	if (thp_nr_pages(page) == nr) {
>  		remove_migration_ptes(page, page, true);
>  	} else {
> -		for (i = 0; i < nr; i++)
> +		for (i = 0; i < nr; i += new_nr)
>  			remove_migration_ptes(page + i, page + i, true);
>  	}
>  }
>  
>  static void __split_huge_page_tail(struct page *head, int tail,
> -		struct lruvec *lruvec, struct list_head *list)
> +		struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
>  {
>  	struct page *page_tail = head + tail;
> +	unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>  
>  	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
>  
> @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  #ifdef CONFIG_64BIT
>  			 (1L << PG_arch_2) |
>  #endif
> +			 compound_head_flag |
>  			 (1L << PG_dirty)));
>  
>  	/* ->mapping in first tail page is compound_mapcount */
> @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  	 * which needs correct compound_head().
>  	 */
>  	clear_compound_head(page_tail);
> +	if (new_order) {
> +		prep_compound_page(page_tail, new_order);
> +		thp_prep(page_tail);
> +	}
>  
>  	/* Finally unfreeze refcount. Additional reference from page cache. */
> -	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
> -					  PageSwapCache(head)));
> +	page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
> +					   PageSwapCache(head)) ?
> +						thp_nr_pages(page_tail) : 0));
>  
>  	if (page_is_young(head))
>  		set_page_young(page_tail);
> @@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  }
>  
>  static void __split_huge_page(struct page *page, struct list_head *list,
> -		pgoff_t end, unsigned long flags)
> +		pgoff_t end, unsigned long flags, unsigned int new_order)
>  {
>  	struct page *head = compound_head(page);
>  	pg_data_t *pgdat = page_pgdat(head);
> @@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	struct address_space *swap_cache = NULL;
>  	unsigned long offset = 0;
>  	unsigned int nr = thp_nr_pages(head);
> +	unsigned int new_nr = 1 << new_order;
>  	int i;
>  
>  	lruvec = mem_cgroup_page_lruvec(head, pgdat);
>  
>  	/* complete memcg works before add pages to LRU */
> -	mem_cgroup_split_huge_fixup(head, 1);
> +	mem_cgroup_split_huge_fixup(head, new_nr);
>  
>  	if (PageAnon(head) && PageSwapCache(head)) {
>  		swp_entry_t entry = { .val = page_private(head) };
> @@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  		xa_lock(&swap_cache->i_pages);
>  	}
>  
> -	for (i = nr - 1; i >= 1; i--) {
> -		__split_huge_page_tail(head, i, lruvec, list);
> +	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
> +		__split_huge_page_tail(head, i, lruvec, list, new_order);
>  		/* Some pages can be beyond i_size: drop them from page cache */
>  		if (head[i].index >= end) {
>  			ClearPageDirty(head + i);
>  			__delete_from_page_cache(head + i, NULL);
>  			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
> -				shmem_uncharge(head->mapping->host, 1);
> +				shmem_uncharge(head->mapping->host, new_nr);
>  			put_page(head + i);
>  		} else if (!PageAnon(page)) {
>  			__xa_store(&head->mapping->i_pages, head[i].index,
> @@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  		}
>  	}
>  
> -	ClearPageCompound(head);
> +	if (!new_order)
> +		ClearPageCompound(head);
> +	else
> +		set_compound_order(head, new_order);
>  
> -	split_page_owner(head, nr, 1);
> +	split_page_owner(head, nr, new_nr);
>  
>  	/* See comment in __split_huge_page_tail() */
>  	if (PageAnon(head)) {
>  		/* Additional pin to swap cache */
>  		if (PageSwapCache(head)) {
> -			page_ref_add(head, 2);
> +			page_ref_add(head, 1 + new_nr);
>  			xa_unlock(&swap_cache->i_pages);
>  		} else {
>  			page_ref_inc(head);
>  		}
>  	} else {
>  		/* Additional pin to page cache */
> -		page_ref_add(head, 2);
> +		page_ref_add(head, 1 + new_nr);
>  		xa_unlock(&head->mapping->i_pages);
>  	}
>  
>  	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>  
> -	remap_page(head, nr);
> +	remap_page(head, nr, new_nr);
>  
>  	if (PageSwapCache(head)) {
>  		swp_entry_t entry = { .val = page_private(head) };
> @@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  		split_swap_cluster(entry);
>  	}
>  
> -	for (i = 0; i < nr; i++) {
> +	for (i = 0; i < nr; i += new_nr) {
>  		struct page *subpage = head + i;
>  		if (subpage == page)
>  			continue;
> @@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
>   * us.
>   */
>  int split_huge_page_to_list(struct page *page, struct list_head *list)
> +{
> +	return split_huge_page_to_list_to_order(page, list, 0);
> +}
> +
> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +				     unsigned int new_order)
>  {
>  	struct page *head = compound_head(page);
>  	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
>  	struct deferred_split *ds_queue = get_deferred_split_queue(head);
> -	XA_STATE(xas, &head->mapping->i_pages, head->index);
> +	/* reset xarray order to new order after split */
> +	XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order);
>  	struct anon_vma *anon_vma = NULL;
>  	struct address_space *mapping = NULL;
>  	int count, mapcount, extra_pins, ret;
>  	unsigned long flags;
>  	pgoff_t end;
>  
> +	VM_BUG_ON(thp_order(head) <= new_order);
>  	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
>  	VM_BUG_ON_PAGE(!PageLocked(head), head);
>  	VM_BUG_ON_PAGE(!PageCompound(head), head);
>  
> +	if (new_order == 1) {
> +		WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
> +		return -EINVAL;
> +	}
> +
> +	if (PageAnon(head) && new_order) {
> +		WARN_ONCE(1, "Split anonymous THP to non-zero order not support");
> +		return -EINVAL;
> +	}

I'd convert those into VM_BUG_ON()s. It's unlikely they will be hit at
arbitrary moments by random users.
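
I.e. something along these lines (sketch):

	/* no order-1 THPs and no non-zero-order split of anon THPs yet */
	VM_BUG_ON(new_order == 1);
	VM_BUG_ON(PageAnon(head) && new_order);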

Thanks!


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split.
  2020-11-14  0:23   ` Roman Gushchin
@ 2020-11-14  0:56     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-14  0:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

[-- Attachment #1: Type: text/plain, Size: 3160 bytes --]

On 13 Nov 2020, at 19:23, Roman Gushchin wrote:

> On Wed, Nov 11, 2020 at 03:40:04PM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> It reads thp_nr_pages and splits to provided new_nr. It prepares for
>> upcoming changes to support split huge page to any lower order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  include/linux/memcontrol.h | 5 +++--
>>  mm/huge_memory.c           | 2 +-
>>  mm/memcontrol.c            | 4 ++--
>>  3 files changed, 6 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 0f4dd7829fb2..b3bac79ceed6 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -1105,7 +1105,7 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>>  }
>>
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -void mem_cgroup_split_huge_fixup(struct page *head);
>> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr);
>>  #endif
>>
>>  #else /* CONFIG_MEMCG */
>> @@ -1451,7 +1451,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>>  	return 0;
>>  }
>>
>> -static inline void mem_cgroup_split_huge_fixup(struct page *head)
>> +static inline void mem_cgroup_split_huge_fixup(struct page *head,
>> +					       unsigned int new_nr)
>>  {
>>  }
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index c4fead5ead31..f599f5b9bf7f 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2429,7 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>  	lruvec = mem_cgroup_page_lruvec(head, pgdat);
>>
>>  	/* complete memcg works before add pages to LRU */
>> -	mem_cgroup_split_huge_fixup(head);
>> +	mem_cgroup_split_huge_fixup(head, 1);
>>
>>  	if (PageAnon(head) && PageSwapCache(head)) {
>>  		swp_entry_t entry = { .val = page_private(head) };
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 33f632689cee..e9705ba6bbcc 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3247,7 +3247,7 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
>>   * Because tail pages are not marked as "used", set it. We're under
>>   * pgdat->lru_lock and migration entries setup in all page mappings.
>>   */
>> -void mem_cgroup_split_huge_fixup(struct page *head)
>> +void mem_cgroup_split_huge_fixup(struct page *head, unsigned int new_nr)
>
> I'd go with unsigned int new_order, then it's obvious that we can split
> the original page without any leftovers.

Makes sense. Will change it.

>
> Other than that the patch looks good!
> Acked-by: Roman Gushchin <guro@fb.com>

Thanks.

>>  {
>>  	struct mem_cgroup *memcg = page_memcg(head);
>>  	int i;
>> @@ -3255,7 +3255,7 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>>  	if (mem_cgroup_disabled())
>>  		return;
>>
>> -	for (i = 1; i < thp_nr_pages(head); i++) {
>> +	for (i = new_nr; i < thp_nr_pages(head); i += new_nr) {
>>  		css_get(&memcg->css);
>>  		head[i].memcg_data = (unsigned long)memcg;
>>  	}
>> -- 
>> 2.28.0
>>


—
Best Regards,
Yan Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages.
  2020-11-14  0:52   ` Roman Gushchin
@ 2020-11-14  1:00     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-14  1:00 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

[-- Attachment #1: Type: text/plain, Size: 9629 bytes --]

On 13 Nov 2020, at 19:52, Roman Gushchin wrote:

> On Wed, Nov 11, 2020 at 03:40:06PM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> To split a THP to any lower order pages, we need to reform THPs on
>> subpages at given order and add page refcount based on the new page
>> order. Also we need to reinitialize page_deferred_list after removing
>> the page from the split_queue, otherwise a subsequent split will see
>> list corruption when checking the page_deferred_list again.
>>
>> It has many uses, like minimizing the number of pages after
>> truncating a pagecache THP. For anonymous THPs, we can only split them
>> to order-0 like before until we add support for any size anonymous THPs.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  include/linux/huge_mm.h |  8 +++++
>>  mm/huge_memory.c        | 78 +++++++++++++++++++++++++++++------------
>>  mm/swap.c               |  1 -
>>  3 files changed, 63 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 60a907a19f7d..9819cd9b4619 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -189,6 +189,8 @@ bool is_transparent_hugepage(struct page *page);
>>
>>  bool can_split_huge_page(struct page *page, int *pextra_pins);
>>  int split_huge_page_to_list(struct page *page, struct list_head *list);
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order);
>>  static inline int split_huge_page(struct page *page)
>>  {
>>  	return split_huge_page_to_list(page, NULL);
>> @@ -396,6 +398,12 @@ split_huge_page_to_list(struct page *page, struct list_head *list)
>>  {
>>  	return 0;
>>  }
>> +static inline int
>> +split_huge_page_to_order_to_list(struct page *page, struct list_head *list,
>> +		unsigned int new_order)
>
> It was
> int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> 		unsigned int new_order);
> above.

Right. It should be split_huge_page_to_list_to_order. Will fix it.
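
I.e. the stub will just be renamed to match the declaration:

static inline int
split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
		unsigned int new_order)
{
	return 0;
}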

>
>> +{
>> +	return 0;
>> +}
>>  static inline int split_huge_page(struct page *page)
>>  {
>>  	return 0;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8b7d771ee962..88f50da40c9b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2327,11 +2327,14 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
>>  static void unmap_page(struct page *page)
>>  {
>>  	enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS |
>> -		TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
>> +		TTU_RMAP_LOCKED;
>>  	bool unmap_success;
>>
>>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>>
>> +	if (thp_order(page) >= HPAGE_PMD_ORDER)
>> +		ttu_flags |= TTU_SPLIT_HUGE_PMD;
>> +
>>  	if (PageAnon(page))
>>  		ttu_flags |= TTU_SPLIT_FREEZE;
>>
>> @@ -2339,21 +2342,22 @@ static void unmap_page(struct page *page)
>>  	VM_BUG_ON_PAGE(!unmap_success, page);
>>  }
>>
>> -static void remap_page(struct page *page, unsigned int nr)
>> +static void remap_page(struct page *page, unsigned int nr, unsigned int new_nr)
>>  {
>>  	int i;
>> -	if (PageTransHuge(page)) {
>> +	if (thp_nr_pages(page) == nr) {
>>  		remove_migration_ptes(page, page, true);
>>  	} else {
>> -		for (i = 0; i < nr; i++)
>> +		for (i = 0; i < nr; i += new_nr)
>>  			remove_migration_ptes(page + i, page + i, true);
>>  	}
>>  }
>>
>>  static void __split_huge_page_tail(struct page *head, int tail,
>> -		struct lruvec *lruvec, struct list_head *list)
>> +		struct lruvec *lruvec, struct list_head *list, unsigned int new_order)
>>  {
>>  	struct page *page_tail = head + tail;
>> +	unsigned long compound_head_flag = new_order ? (1L << PG_head) : 0;
>>
>>  	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
>>
>> @@ -2377,6 +2381,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>  #ifdef CONFIG_64BIT
>>  			 (1L << PG_arch_2) |
>>  #endif
>> +			 compound_head_flag |
>>  			 (1L << PG_dirty)));
>>
>>  	/* ->mapping in first tail page is compound_mapcount */
>> @@ -2395,10 +2400,15 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>  	 * which needs correct compound_head().
>>  	 */
>>  	clear_compound_head(page_tail);
>> +	if (new_order) {
>> +		prep_compound_page(page_tail, new_order);
>> +		thp_prep(page_tail);
>> +	}
>>
>>  	/* Finally unfreeze refcount. Additional reference from page cache. */
>> -	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
>> -					  PageSwapCache(head)));
>> +	page_ref_unfreeze(page_tail, 1 + ((!PageAnon(head) ||
>> +					   PageSwapCache(head)) ?
>> +						thp_nr_pages(page_tail) : 0));
>>
>>  	if (page_is_young(head))
>>  		set_page_young(page_tail);
>> @@ -2416,7 +2426,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>  }
>>
>>  static void __split_huge_page(struct page *page, struct list_head *list,
>> -		pgoff_t end, unsigned long flags)
>> +		pgoff_t end, unsigned long flags, unsigned int new_order)
>>  {
>>  	struct page *head = compound_head(page);
>>  	pg_data_t *pgdat = page_pgdat(head);
>> @@ -2424,12 +2434,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>  	struct address_space *swap_cache = NULL;
>>  	unsigned long offset = 0;
>>  	unsigned int nr = thp_nr_pages(head);
>> +	unsigned int new_nr = 1 << new_order;
>>  	int i;
>>
>>  	lruvec = mem_cgroup_page_lruvec(head, pgdat);
>>
>>  	/* complete memcg works before add pages to LRU */
>> -	mem_cgroup_split_huge_fixup(head, 1);
>> +	mem_cgroup_split_huge_fixup(head, new_nr);
>>
>>  	if (PageAnon(head) && PageSwapCache(head)) {
>>  		swp_entry_t entry = { .val = page_private(head) };
>> @@ -2439,14 +2450,14 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>  		xa_lock(&swap_cache->i_pages);
>>  	}
>>
>> -	for (i = nr - 1; i >= 1; i--) {
>> -		__split_huge_page_tail(head, i, lruvec, list);
>> +	for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
>> +		__split_huge_page_tail(head, i, lruvec, list, new_order);
>>  		/* Some pages can be beyond i_size: drop them from page cache */
>>  		if (head[i].index >= end) {
>>  			ClearPageDirty(head + i);
>>  			__delete_from_page_cache(head + i, NULL);
>>  			if (IS_ENABLED(CONFIG_SHMEM) && PageSwapBacked(head))
>> -				shmem_uncharge(head->mapping->host, 1);
>> +				shmem_uncharge(head->mapping->host, new_nr);
>>  			put_page(head + i);
>>  		} else if (!PageAnon(page)) {
>>  			__xa_store(&head->mapping->i_pages, head[i].index,
>> @@ -2457,28 +2468,31 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>  		}
>>  	}
>>
>> -	ClearPageCompound(head);
>> +	if (!new_order)
>> +		ClearPageCompound(head);
>> +	else
>> +		set_compound_order(head, new_order);
>>
>> -	split_page_owner(head, nr, 1);
>> +	split_page_owner(head, nr, new_nr);
>>
>>  	/* See comment in __split_huge_page_tail() */
>>  	if (PageAnon(head)) {
>>  		/* Additional pin to swap cache */
>>  		if (PageSwapCache(head)) {
>> -			page_ref_add(head, 2);
>> +			page_ref_add(head, 1 + new_nr);
>>  			xa_unlock(&swap_cache->i_pages);
>>  		} else {
>>  			page_ref_inc(head);
>>  		}
>>  	} else {
>>  		/* Additional pin to page cache */
>> -		page_ref_add(head, 2);
>> +		page_ref_add(head, 1 + new_nr);
>>  		xa_unlock(&head->mapping->i_pages);
>>  	}
>>
>>  	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>>
>> -	remap_page(head, nr);
>> +	remap_page(head, nr, new_nr);
>>
>>  	if (PageSwapCache(head)) {
>>  		swp_entry_t entry = { .val = page_private(head) };
>> @@ -2486,7 +2500,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>  		split_swap_cluster(entry);
>>  	}
>>
>> -	for (i = 0; i < nr; i++) {
>> +	for (i = 0; i < nr; i += new_nr) {
>>  		struct page *subpage = head + i;
>>  		if (subpage == page)
>>  			continue;
>> @@ -2620,21 +2634,39 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
>>   * us.
>>   */
>>  int split_huge_page_to_list(struct page *page, struct list_head *list)
>> +{
>> +	return split_huge_page_to_list_to_order(page, list, 0);
>> +}
>> +
>> +int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +				     unsigned int new_order)
>>  {
>>  	struct page *head = compound_head(page);
>>  	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(head);
>> -	XA_STATE(xas, &head->mapping->i_pages, head->index);
>> +	/* reset xarray order to new order after split */
>> +	XA_STATE_ORDER(xas, &head->mapping->i_pages, head->index, new_order);
>>  	struct anon_vma *anon_vma = NULL;
>>  	struct address_space *mapping = NULL;
>>  	int count, mapcount, extra_pins, ret;
>>  	unsigned long flags;
>>  	pgoff_t end;
>>
>> +	VM_BUG_ON(thp_order(head) <= new_order);
>>  	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
>>  	VM_BUG_ON_PAGE(!PageLocked(head), head);
>>  	VM_BUG_ON_PAGE(!PageCompound(head), head);
>>
>> +	if (new_order == 1) {
>> +		WARN_ONCE(1, "Cannot split THP to order-1 (no order-1 THPs)");
>> +		return -EINVAL;
>> +	}
>> +
>> +	if (PageAnon(head) && new_order) {
>> +		WARN_ONCE(1, "Split anonymous THP to non-zero order not support");
>> +		return -EINVAL;
>> +	}
>
> I'd convert those into VM_BUG_ON()'s. Unlikely they will be hit at arbitrary moments
> by random users.

Sure. Will change them.
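
Something like this (untested sketch of the two converted checks):

        /* order-1 THPs do not exist: not enough tail pages for THP metadata */
        VM_BUG_ON(new_order == 1);
        /* splitting anonymous THPs to a non-zero order is not supported yet */
        VM_BUG_ON_PAGE(PageAnon(head) && new_order, head);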

Thanks.


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-14  0:15   ` Roman Gushchin
@ 2020-11-14  1:08     ` Zi Yan
  2020-11-14  1:38       ` Roman Gushchin
  0 siblings, 1 reply; 36+ messages in thread
From: Zi Yan @ 2020-11-14  1:08 UTC (permalink / raw)
  To: Roman Gushchin, Matthew Wilcox
  Cc: linux-mm, Kirill A . Shutemov, Andrew Morton, linux-kernel,
	linux-kselftest, Yang Shi, Michal Hocko, John Hubbard,
	Ralph Campbell, David Nellans

On 13 Nov 2020, at 19:15, Roman Gushchin wrote:

> On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> It adds a new_order parameter to set new page order in page owner.
>> It prepares for upcoming changes to support split huge page to any lower
>> order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  include/linux/page_owner.h | 7 ++++---
>>  mm/huge_memory.c           | 2 +-
>>  mm/page_alloc.c            | 2 +-
>>  mm/page_owner.c            | 6 +++---
>>  4 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
>> index 3468794f83d2..215cbb159568 100644
>> --- a/include/linux/page_owner.h
>> +++ b/include/linux/page_owner.h
>> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>>  		__set_page_owner(page, order, gfp_mask);
>>  }
>>
>> -static inline void split_page_owner(struct page *page, unsigned int nr)
>> +static inline void split_page_owner(struct page *page, unsigned int nr,
>> +			unsigned int new_order)
>>  {
>>  	if (static_branch_unlikely(&page_owner_inited))
>> -		__split_page_owner(page, nr);
>> +		__split_page_owner(page, nr, new_order);
>>  }
>>  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
>>  {
>> @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
>>  {
>>  }
>>  static inline void split_page_owner(struct page *page,
>> -			unsigned int order)
>> +			unsigned int nr, unsigned int new_order)
>
> With the addition of the new argument it's a bit hard to understand
> what the function is supposed to do. It seems like nr == page_order(page),
> is it right? Maybe we can pass old_order and new_order? Or just the page
> and the new order?

Yeah, it is a bit confusing. Please see more below.

>
>>  {
>>  }
>>  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index f599f5b9bf7f..8b7d771ee962 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>>
>>  	ClearPageCompound(head);
>>
>> -	split_page_owner(head, nr);
>> +	split_page_owner(head, nr, 1);
>>
>>  	/* See comment in __split_huge_page_tail() */
>>  	if (PageAnon(head)) {
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index d77220615fd5..a9eead0e091a 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
>>
>>  	for (i = 1; i < (1 << order); i++)
>>  		set_page_refcounted(page + i);
>> -	split_page_owner(page, 1 << order);
>> +	split_page_owner(page, 1 << order, 1);
>>  }
>>  EXPORT_SYMBOL_GPL(split_page);
>>
>> diff --git a/mm/page_owner.c b/mm/page_owner.c
>> index b735a8eafcdb..2b7f7e9056dc 100644
>> --- a/mm/page_owner.c
>> +++ b/mm/page_owner.c
>> @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason)
>>  	page_owner->last_migrate_reason = reason;
>>  }
>>
>> -void __split_page_owner(struct page *page, unsigned int nr)
>> +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order)
>>  {
>>  	int i;
>>  	struct page_ext *page_ext = lookup_page_ext(page);
>> @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr)
>>  	if (unlikely(!page_ext))
>>  		return;
>>
>> -	for (i = 0; i < nr; i++) {
>> +	for (i = 0; i < nr; i += (1 << new_order)) {
>>  		page_owner = get_page_owner(page_ext);
>> -		page_owner->order = 0;
>> +		page_owner->order = new_order;
>>  		page_ext = page_ext_next(page_ext);
>
> I believe there cannot be any leftovers because nr is always a power of 2.
> Is it true? Converting nr argument to order (if it's possible) will make
> it obvious.

Right. nr = thp_nr_pages(head), which is a power of 2. There would not be
any leftover.

Matthew recently converted split_page_owner to take nr instead of order.[1]
But I am not sure why, since it seems to me that two call sites
(__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c)
can pass the order information.


[1]https://lore.kernel.org/linux-mm/20200908195539.25896-4-willy@infradead.org/


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-14  1:08     ` Zi Yan
@ 2020-11-14  1:38       ` Roman Gushchin
  2020-11-17 21:05         ` Matthew Wilcox
  0 siblings, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2020-11-14  1:38 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, linux-mm, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
> On 13 Nov 2020, at 19:15, Roman Gushchin wrote:
> 
> > On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
> > > From: Zi Yan <ziy@nvidia.com>
> > > 
> > > It adds a new_order parameter to set new page order in page owner.
> > > It prepares for upcoming changes to support split huge page to any lower
> > > order.
> > > 
> > > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > > ---
> > >  include/linux/page_owner.h | 7 ++++---
> > >  mm/huge_memory.c           | 2 +-
> > >  mm/page_alloc.c            | 2 +-
> > >  mm/page_owner.c            | 6 +++---
> > >  4 files changed, 9 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
> > > index 3468794f83d2..215cbb159568 100644
> > > --- a/include/linux/page_owner.h
> > > +++ b/include/linux/page_owner.h
> > > @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
> > >  		__set_page_owner(page, order, gfp_mask);
> > >  }
> > > 
> > > -static inline void split_page_owner(struct page *page, unsigned int nr)
> > > +static inline void split_page_owner(struct page *page, unsigned int nr,
> > > +			unsigned int new_order)
> > >  {
> > >  	if (static_branch_unlikely(&page_owner_inited))
> > > -		__split_page_owner(page, nr);
> > > +		__split_page_owner(page, nr, new_order);
> > >  }
> > >  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
> > >  {
> > > @@ -60,7 +61,7 @@ static inline void set_page_owner(struct page *page,
> > >  {
> > >  }
> > >  static inline void split_page_owner(struct page *page,
> > > -			unsigned int order)
> > > +			unsigned int nr, unsigned int new_order)
> > 
> > With the addition of the new argument it's a bit hard to understand
> > what the function is supposed to do. It seems like nr == page_order(page),
> > is it right? Maybe we can pass old_order and new_order? Or just the page
> > and the new order?
> 
> Yeah, it is a bit confusing. Please see more below.
> 
> > 
> > >  {
> > >  }
> > >  static inline void copy_page_owner(struct page *oldpage, struct page *newpage)
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index f599f5b9bf7f..8b7d771ee962 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -2459,7 +2459,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> > > 
> > >  	ClearPageCompound(head);
> > > 
> > > -	split_page_owner(head, nr);
> > > +	split_page_owner(head, nr, 1);
> > > 
> > >  	/* See comment in __split_huge_page_tail() */
> > >  	if (PageAnon(head)) {
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index d77220615fd5..a9eead0e091a 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3284,7 +3284,7 @@ void split_page(struct page *page, unsigned int order)
> > > 
> > >  	for (i = 1; i < (1 << order); i++)
> > >  		set_page_refcounted(page + i);
> > > -	split_page_owner(page, 1 << order);
> > > +	split_page_owner(page, 1 << order, 1);
> > >  }
> > >  EXPORT_SYMBOL_GPL(split_page);
> > > 
> > > diff --git a/mm/page_owner.c b/mm/page_owner.c
> > > index b735a8eafcdb..2b7f7e9056dc 100644
> > > --- a/mm/page_owner.c
> > > +++ b/mm/page_owner.c
> > > @@ -204,7 +204,7 @@ void __set_page_owner_migrate_reason(struct page *page, int reason)
> > >  	page_owner->last_migrate_reason = reason;
> > >  }
> > > 
> > > -void __split_page_owner(struct page *page, unsigned int nr)
> > > +void __split_page_owner(struct page *page, unsigned int nr, unsigned int new_order)
> > >  {
> > >  	int i;
> > >  	struct page_ext *page_ext = lookup_page_ext(page);
> > > @@ -213,9 +213,9 @@ void __split_page_owner(struct page *page, unsigned int nr)
> > >  	if (unlikely(!page_ext))
> > >  		return;
> > > 
> > > -	for (i = 0; i < nr; i++) {
> > > +	for (i = 0; i < nr; i += (1 << new_order)) {
> > >  		page_owner = get_page_owner(page_ext);
> > > -		page_owner->order = 0;
> > > +		page_owner->order = new_order;
> > >  		page_ext = page_ext_next(page_ext);
> > 
> > I believe there cannot be any leftovers because nr is always a power of 2.
> > Is it true? Converting nr argument to order (if it's possible) will make
> > it obvious.
> 
> > Right. nr = thp_nr_pages(head), which is a power of 2. There would not be any
> > leftover.
> 
> > Matthew recently converted split_page_owner to take nr instead of order.[1]
> > But I am not sure why, since it seems to me that two call sites
> > (__split_huge_page in mm/huge_memory.c and split_page in mm/page_alloc.c)
> > can pass the order information.

Yeah, I'm not sure why either. Maybe Matthew has some input here?
You can also pass new_nr, but IMO orders look so much better here.

Thanks!


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.
  2020-11-11 20:40 ` [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range Zi Yan
  2020-11-12 22:22   ` Ralph Campbell
@ 2020-11-16 16:06   ` Kirill A. Shutemov
  2020-11-16 17:26     ` Zi Yan
  1 sibling, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2020-11-16 16:06 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, Ralph Campbell, David Nellans

On Wed, Nov 11, 2020 at 03:40:03PM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Huge pages in the process with the given pid and virtual address range
> are split. It is used to test split huge page function. In addition,
> a testing program is added to tools/testing/selftests/vm to utilize the
> interface by splitting PMD THPs.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c                              |  98 +++++++++++
>  mm/internal.h                                 |   1 +
>  mm/migrate.c                                  |   2 +-
>  tools/testing/selftests/vm/Makefile           |   1 +
>  .../selftests/vm/split_huge_page_test.c       | 161 ++++++++++++++++++
>  5 files changed, 262 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 207ebca8c654..c4fead5ead31 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -7,6 +7,7 @@
>  
>  #include <linux/mm.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/sched/coredump.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/highmem.h>
> @@ -2935,10 +2936,107 @@ static int split_huge_pages_set(void *data, u64 val)
>  DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
>  		"%llu\n");
>  
> +static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
> +		const char __user *buf, size_t count, loff_t *ppops)
> +{
> +	static DEFINE_MUTEX(mutex);
> +	ssize_t ret;
> +	char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
> +	int pid;
> +	unsigned long vaddr_start, vaddr_end, addr;
> +	nodemask_t task_nodes;
> +	struct mm_struct *mm;
> +
> +	ret = mutex_lock_interruptible(&mutex);
> +	if (ret)
> +		return ret;
> +
> +	ret = -EFAULT;
> +
> +	memset(input_buf, 0, 80);
> +	if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
> +		goto out;
> +
> +	input_buf[80] = '\0';

Hm. Out-of-buffer access?

> +	ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);

Why hex without 0x prefix?

> +	if (ret != 3) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	vaddr_start &= PAGE_MASK;
> +	vaddr_end &= PAGE_MASK;
> +
> +	ret = strlen(input_buf);
> +	pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
> +		 pid, vaddr_start, vaddr_end);
> +
> +	mm = find_mm_struct(pid, &task_nodes);

I don't follow why you need nodemask.

> +	if (IS_ERR(mm)) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	mmap_read_lock(mm);
> +	for (addr = vaddr_start; addr < vaddr_end;) {
> +		struct vm_area_struct *vma = find_vma(mm, addr);
> +		unsigned int follflags;
> +		struct page *page;
> +
> +		if (!vma || addr < vma->vm_start || !vma_migratable(vma))
> +			break;
> +
> +		/* FOLL_DUMP to ignore special (like zero) pages */
> +		follflags = FOLL_GET | FOLL_DUMP;
> +		page = follow_page(vma, addr, follflags);
> +
> +		if (IS_ERR(page))
> +			break;
> +		if (!page)
> +			break;
> +
> +		if (!is_transparent_hugepage(page))
> +			goto next;
> +
> +		if (!can_split_huge_page(page, NULL))
> +			goto next;
> +
> +		if (!trylock_page(page))
> +			goto next;
> +
> +		addr += page_size(page) - PAGE_SIZE;

Who said it was mapped as huge? mremap() allows constructing a PTE page
table filled with PTE-mapped THPs, each of them distinct.

> +
> +		/* reset addr if split fails */
> +		if (split_huge_page(page))
> +			addr -= (page_size(page) - PAGE_SIZE);
> +
> +		unlock_page(page);
> +next:
> +		/* next page */
> +		addr += page_size(page);

Isn't this the second time addr is incremented, if split_huge_page() succeeds?

> +		put_page(page);
> +	}
> +	mmap_read_unlock(mm);
> +
> +
> +	mmput(mm);
> +out:
> +	mutex_unlock(&mutex);
> +	return ret;
> +
> +}
> +
> +static const struct file_operations split_huge_pages_in_range_pid_fops = {
> +	.owner	 = THIS_MODULE,
> +	.write	 = split_huge_pages_in_range_pid_write,
> +	.llseek  = no_llseek,
> +};
> +
>  static int __init split_huge_pages_debugfs(void)
>  {
>  	debugfs_create_file("split_huge_pages", 0200, NULL, NULL,
>  			    &split_huge_pages_fops);
> +	debugfs_create_file("split_huge_pages_in_range_pid", 0200, NULL, NULL,
> +			    &split_huge_pages_in_range_pid_fops);
>  	return 0;
>  }
>  late_initcall(split_huge_pages_debugfs);
> diff --git a/mm/internal.h b/mm/internal.h
> index 3ea43642b99d..fd841a38830f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -624,4 +624,5 @@ struct migration_target_control {
>  
>  bool truncate_inode_partial_page(struct page *page, loff_t start, loff_t end);
>  void page_cache_free_page(struct address_space *mapping, struct page *page);
> +struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes);
>  #endif	/* __MM_INTERNAL_H */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index a50bbb0e029b..e35654d1087d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1851,7 +1851,7 @@ static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
>  	return nr_pages ? -EFAULT : 0;
>  }
>  
> -static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
> +struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
>  {
>  	struct task_struct *task;
>  	struct mm_struct *mm;
> diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> index 62fb15f286ee..d9ead0cdd3e9 100644
> --- a/tools/testing/selftests/vm/Makefile
> +++ b/tools/testing/selftests/vm/Makefile
> @@ -42,6 +42,7 @@ TEST_GEN_FILES += on-fault-limit
>  TEST_GEN_FILES += thuge-gen
>  TEST_GEN_FILES += transhuge-stress
>  TEST_GEN_FILES += userfaultfd
> +TEST_GEN_FILES += split_huge_page_test
>  
>  ifeq ($(ARCH),x86_64)
>  CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh $(CC) ../x86/trivial_32bit_program.c -m32)
> diff --git a/tools/testing/selftests/vm/split_huge_page_test.c b/tools/testing/selftests/vm/split_huge_page_test.c
> new file mode 100644
> index 000000000000..c8a32ae9e13a
> --- /dev/null
> +++ b/tools/testing/selftests/vm/split_huge_page_test.c
> @@ -0,0 +1,161 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define _GNU_SOURCE
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include "numa.h"
> +#include <unistd.h>
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <string.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <fcntl.h>
> +#include <sys/mman.h>
> +#include <sys/time.h>
> +#include <sys/wait.h>
> +#include <malloc.h>
> +#include <stdbool.h>
> +
> +#define PAGE_4KB (4096UL)
> +#define PAGE_2MB (512UL*PAGE_4KB)
> +#define PAGE_1GB (512UL*PAGE_2MB)
> +
> +#define PRESENT_MASK (1UL<<63)
> +#define SWAPPED_MASK (1UL<<62)
> +#define PAGE_TYPE_MASK (1UL<<61)
> +#define PFN_MASK     ((1UL<<55)-1)
> +
> +#define KPF_THP      (1UL<<22)
> +#define KPF_PUD_THP      (1UL<<27)
> +
> +#define SPLIT_DEBUGFS "/sys/kernel/debug/split_huge_pages_in_range_pid"
> +#define SMAP_PATH "/proc/self/smaps"
> +#define INPUT_MAX 80
> +
> +static int write_file(const char *path, const char *buf, size_t buflen)
> +{
> +	int fd;
> +	ssize_t numwritten;
> +
> +	fd = open(path, O_WRONLY);
> +	if (fd == -1)
> +		return 0;
> +
> +	numwritten = write(fd, buf, buflen - 1);
> +	close(fd);
> +	if (numwritten < 1)
> +		return 0;
> +
> +	return (unsigned int) numwritten;
> +}
> +
> +static void write_debugfs(int pid, uint64_t vaddr_start, uint64_t vaddr_end)
> +{
> +	char input[INPUT_MAX];
> +	int ret;
> +
> +	ret = snprintf(input, INPUT_MAX, "%d,%lx,%lx", pid, vaddr_start,
> +			vaddr_end);
> +	if (ret >= INPUT_MAX) {
> +		printf("%s: Debugfs input is too long\n", __func__);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	if (!write_file(SPLIT_DEBUGFS, input, ret + 1)) {
> +		perror(SPLIT_DEBUGFS);
> +		exit(EXIT_FAILURE);
> +	}
> +}
> +
> +#define MAX_LINE_LENGTH 500
> +
> +static bool check_for_pattern(FILE *fp, char *pattern, char *buf)
> +{
> +	while (fgets(buf, MAX_LINE_LENGTH, fp) != NULL) {
> +		if (!strncmp(buf, pattern, strlen(pattern)))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static uint64_t check_huge(void *addr)
> +{
> +	uint64_t thp = 0;
> +	int ret;
> +	FILE *fp;
> +	char buffer[MAX_LINE_LENGTH];
> +	char addr_pattern[MAX_LINE_LENGTH];
> +
> +	ret = snprintf(addr_pattern, MAX_LINE_LENGTH, "%08lx-",
> +		       (unsigned long) addr);
> +	if (ret >= MAX_LINE_LENGTH) {
> +		printf("%s: Pattern is too long\n", __func__);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +
> +	fp = fopen(SMAP_PATH, "r");
> +	if (!fp) {
> +		printf("%s: Failed to open file %s\n", __func__, SMAP_PATH);
> +		exit(EXIT_FAILURE);
> +	}
> +	if (!check_for_pattern(fp, addr_pattern, buffer))
> +		goto err_out;
> +
> +	/*
> +	 * Fetch the AnonHugePages: in the same block and check the number of
> +	 * hugepages.
> +	 */
> +	if (!check_for_pattern(fp, "AnonHugePages:", buffer))
> +		goto err_out;
> +
> +	if (sscanf(buffer, "AnonHugePages:%10ld kB", &thp) != 1) {
> +		printf("Reading smap error\n");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +err_out:
> +	fclose(fp);
> +	return thp;
> +}
> +
> +void split_pmd_thp(void)
> +{
> +	char *one_page;
> +	size_t len = 4 * PAGE_2MB;
> +	uint64_t thp_size;
> +
> +	one_page = memalign(PAGE_1GB, len);
> +
> +	madvise(one_page, len, MADV_HUGEPAGE);
> +
> +	memset(one_page, 1, len);
> +
> +	thp_size = check_huge(one_page);
> +	if (!thp_size) {
> +		printf("No THP is allocatd");
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	/* split all possible huge pages */
> +	write_debugfs(getpid(), (uint64_t)one_page, (uint64_t)one_page + len);
> +
> +	*one_page = 0;
> +
> +	thp_size = check_huge(one_page);
> +	if (thp_size) {
> +		printf("Still %ld kB AnonHugePages not split\n", thp_size);
> +		exit(EXIT_FAILURE);
> +	}
> +
> +	printf("Split huge pages successful\n");
> +	free(one_page);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	split_pmd_thp();
> +
> +	return 0;
> +}
> -- 
> 2.28.0
> 
> 

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-11 20:40 ` [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner Zi Yan
  2020-11-12 17:57   ` Ralph Campbell
  2020-11-14  0:15   ` Roman Gushchin
@ 2020-11-16 16:25   ` Kirill A. Shutemov
  2020-11-16 17:27     ` Zi Yan
  2020-11-17 21:10   ` Matthew Wilcox
  3 siblings, 1 reply; 36+ messages in thread
From: Kirill A. Shutemov @ 2020-11-16 16:25 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, Ralph Campbell, David Nellans

On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> It adds a new_order parameter to set new page order in page owner.
> It prepares for upcoming changes to support split huge page to any lower
> order.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  include/linux/page_owner.h | 7 ++++---
>  mm/huge_memory.c           | 2 +-
>  mm/page_alloc.c            | 2 +-
>  mm/page_owner.c            | 6 +++---
>  4 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
> index 3468794f83d2..215cbb159568 100644
> --- a/include/linux/page_owner.h
> +++ b/include/linux/page_owner.h
> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>  		__set_page_owner(page, order, gfp_mask);
>  }
>  
> -static inline void split_page_owner(struct page *page, unsigned int nr)
> +static inline void split_page_owner(struct page *page, unsigned int nr,
> +			unsigned int new_order)
>  {
>  	if (static_branch_unlikely(&page_owner_inited))
> -		__split_page_owner(page, nr);
> +		__split_page_owner(page, nr, new_order);

Hm. Where do you correct the __split_page_owner() declaration? I don't see it.

-- 
 Kirill A. Shutemov


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range.
  2020-11-16 16:06   ` Kirill A. Shutemov
@ 2020-11-16 17:26     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-16 17:26 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, Ralph Campbell, David Nellans


On 16 Nov 2020, at 11:06, Kirill A. Shutemov wrote:

> On Wed, Nov 11, 2020 at 03:40:03PM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Huge pages in the process with the given pid and virtual address range
>> are split. It is used to test split huge page function. In addition,
>> a testing program is added to tools/testing/selftests/vm to utilize the
>> interface by splitting PMD THPs.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/huge_memory.c                              |  98 +++++++++++
>>  mm/internal.h                                 |   1 +
>>  mm/migrate.c                                  |   2 +-
>>  tools/testing/selftests/vm/Makefile           |   1 +
>>  .../selftests/vm/split_huge_page_test.c       | 161 ++++++++++++++++++
>>  5 files changed, 262 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/vm/split_huge_page_test.c
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 207ebca8c654..c4fead5ead31 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -7,6 +7,7 @@
>>
>>  #include <linux/mm.h>
>>  #include <linux/sched.h>
>> +#include <linux/sched/mm.h>
>>  #include <linux/sched/coredump.h>
>>  #include <linux/sched/numa_balancing.h>
>>  #include <linux/highmem.h>
>> @@ -2935,10 +2936,107 @@ static int split_huge_pages_set(void *data, u64 val)
>>  DEFINE_DEBUGFS_ATTRIBUTE(split_huge_pages_fops, NULL, split_huge_pages_set,
>>  		"%llu\n");
>>
>> +static ssize_t split_huge_pages_in_range_pid_write(struct file *file,
>> +		const char __user *buf, size_t count, loff_t *ppops)
>> +{
>> +	static DEFINE_MUTEX(mutex);
>> +	ssize_t ret;
>> +	char input_buf[80]; /* hold pid, start_vaddr, end_vaddr */
>> +	int pid;
>> +	unsigned long vaddr_start, vaddr_end, addr;
>> +	nodemask_t task_nodes;
>> +	struct mm_struct *mm;
>> +
>> +	ret = mutex_lock_interruptible(&mutex);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = -EFAULT;
>> +
>> +	memset(input_buf, 0, 80);
>> +	if (copy_from_user(input_buf, buf, min_t(size_t, count, 80)))
>> +		goto out;
>> +
>> +	input_buf[80] = '\0';
>
> Hm. Out-of-buffer access?

Sorry. Will fix it.
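
Something like this should avoid the out-of-bounds write (untested sketch):

        memset(input_buf, 0, sizeof(input_buf));
        /* leave room for the terminating NUL */
        if (copy_from_user(input_buf, buf,
                           min_t(size_t, count, sizeof(input_buf) - 1)))
                goto out;

        input_buf[sizeof(input_buf) - 1] = '\0';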

>
>> +	ret = sscanf(input_buf, "%d,%lx,%lx", &pid, &vaddr_start, &vaddr_end);
>
> Why hex without 0x prefix?

No particular reason. Let me add the prefix.
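
i.e. something like the following, with the selftest updated to write the 0x
prefix as well (untested):

        ret = sscanf(input_buf, "%d,0x%lx,0x%lx", &pid, &vaddr_start, &vaddr_end);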

>
>> +	if (ret != 3) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +	vaddr_start &= PAGE_MASK;
>> +	vaddr_end &= PAGE_MASK;
>> +
>> +	ret = strlen(input_buf);
>> +	pr_debug("split huge pages in pid: %d, vaddr: [%lx - %lx]\n",
>> +		 pid, vaddr_start, vaddr_end);
>> +
>> +	mm = find_mm_struct(pid, &task_nodes);
>
> I don't follow why you need nodemask.

I don’t need it. I just reuse the find_mm_struct function from
mm/migrate.c.

>
>> +	if (IS_ERR(mm)) {
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>> +	mmap_read_lock(mm);
>> +	for (addr = vaddr_start; addr < vaddr_end;) {
>> +		struct vm_area_struct *vma = find_vma(mm, addr);
>> +		unsigned int follflags;
>> +		struct page *page;
>> +
>> +		if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>> +			break;
>> +
>> +		/* FOLL_DUMP to ignore special (like zero) pages */
>> +		follflags = FOLL_GET | FOLL_DUMP;
>> +		page = follow_page(vma, addr, follflags);
>> +
>> +		if (IS_ERR(page))
>> +			break;
>> +		if (!page)
>> +			break;
>> +
>> +		if (!is_transparent_hugepage(page))
>> +			goto next;
>> +
>> +		if (!can_split_huge_page(page, NULL))
>> +			goto next;
>> +
>> +		if (!trylock_page(page))
>> +			goto next;
>> +
>> +		addr += page_size(page) - PAGE_SIZE;
>
> Who said it was mapped as huge? mremap() allows to construct an PTE page
> table that filled with PTE-mapped THPs, each of them distinct.

I forgot about this. I was trying to be smart and skip the rest of the
subpages after splitting a THP. I will always increase addr by PAGE_SIZE
to handle this situation.

>> +
>> +		/* reset addr if split fails */
>> +		if (split_huge_page(page))
>> +			addr -= (page_size(page) - PAGE_SIZE);
>> +
>> +		unlock_page(page);
>> +next:
>> +		/* next page */
>> +		addr += page_size(page);
>
> Isn't it the second time if split_huge_page() succeed.

If split_huge_page() succeeds, page_size(page) becomes PAGE_SIZE, and addr
was already advanced by (THP size - PAGE_SIZE) above, so addr should now be
at the end of the original THP.

Anyway, I will change the code to something like:

        /*
         * always increase addr by PAGE_SIZE, since we could have a PTE page
         * table filled with PTE-mapped THPs, each of which is distinct.
         */
        for (addr = vaddr_start; addr < vaddr_end; addr += PAGE_SIZE) {
				
				...

                if (!trylock_page(page))
                        continue;

                split_huge_page(page);

                unlock_page(page);
                put_page(page);
        }
        mmap_read_unlock(mm);


Thanks for reviewing the patch.

—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-16 16:25   ` Kirill A. Shutemov
@ 2020-11-16 17:27     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-16 17:27 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: linux-mm, Matthew Wilcox, Kirill A . Shutemov, Roman Gushchin,
	Andrew Morton, linux-kernel, linux-kselftest, Yang Shi,
	Michal Hocko, John Hubbard, Ralph Campbell, David Nellans


On 16 Nov 2020, at 11:25, Kirill A. Shutemov wrote:

> On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> It adds a new_order parameter to set new page order in page owner.
>> It prepares for upcoming changes to support split huge page to any lower
>> order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  include/linux/page_owner.h | 7 ++++---
>>  mm/huge_memory.c           | 2 +-
>>  mm/page_alloc.c            | 2 +-
>>  mm/page_owner.c            | 6 +++---
>>  4 files changed, 9 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/page_owner.h b/include/linux/page_owner.h
>> index 3468794f83d2..215cbb159568 100644
>> --- a/include/linux/page_owner.h
>> +++ b/include/linux/page_owner.h
>> @@ -31,10 +31,11 @@ static inline void set_page_owner(struct page *page,
>>  		__set_page_owner(page, order, gfp_mask);
>>  }
>>
>> -static inline void split_page_owner(struct page *page, unsigned int nr)
>> +static inline void split_page_owner(struct page *page, unsigned int nr,
>> +			unsigned int new_order)
>>  {
>>  	if (static_branch_unlikely(&page_owner_inited))
>> -		__split_page_owner(page, nr);
>> +		__split_page_owner(page, nr, new_order);
>
> Hm. Where do you correct __split_page_owner() declaration. I don't see it.

I missed it. Will add it. Thanks.
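
Something like this in include/linux/page_owner.h, next to the inline wrapper
(sketch; assuming the extern declaration stays where it is today):

        extern void __split_page_owner(struct page *page, unsigned int nr,
                                       unsigned int new_order);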

—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-14  1:38       ` Roman Gushchin
@ 2020-11-17 21:05         ` Matthew Wilcox
  2020-11-17 21:12           ` Zi Yan
  0 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2020-11-17 21:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zi Yan, linux-mm, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
> On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
> > Matthew recently converted split_page_owner to take nr instead of order.[1]
> > But I am not
> > sure why, since it seems to me that two call sites (__split_huge_page in
> > mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
> > information.
> 
> Yeah, I'm not sure why too. Maybe Matthew has some input here?
> You can also pass new_nr, but IMO orders look so much better here.

If only I'd written that information in the changelog ... oh wait, I did!

    mm/page_owner: change split_page_owner to take a count
    
    The implementation of split_page_owner() prefers a count rather than the
    old order of the page.  When we support a variable size THP, we won't
    have the order at this point, but we will have the number of pages.
    So change the interface to what the caller and callee would prefer.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-11 20:40 ` [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner Zi Yan
                     ` (2 preceding siblings ...)
  2020-11-16 16:25   ` Kirill A. Shutemov
@ 2020-11-17 21:10   ` Matthew Wilcox
  2020-11-17 21:13     ` Zi Yan
  3 siblings, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2020-11-17 21:10 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-mm, Kirill A . Shutemov, Roman Gushchin, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
> -	for (i = 0; i < nr; i++) {
> +	for (i = 0; i < nr; i += (1 << new_order)) {
>  		page_owner = get_page_owner(page_ext);
> -		page_owner->order = 0;
> +		page_owner->order = new_order;
>  		page_ext = page_ext_next(page_ext);
>  	}

This doesn't do what you're hoping it will.  It's going to set ->order to
new_order for the first N pages instead of every 1/N pages.

You'll need to do something like

		page_ext = lookup_page_ext(page + i);

or add a new page_ext_add(page_ext, 1 << new_order);


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-17 21:05         ` Matthew Wilcox
@ 2020-11-17 21:12           ` Zi Yan
  2020-11-17 21:22             ` Matthew Wilcox
  0 siblings, 1 reply; 36+ messages in thread
From: Zi Yan @ 2020-11-17 21:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Roman Gushchin, linux-mm, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans


On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:

> On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
>> On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
>>> Matthew recently converted split_page_owner to take nr instead of order.[1]
>>> But I am not
>>> sure why, since it seems to me that two call sites (__split_huge_page in
>>> mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
>>> information.
>>
>> Yeah, I'm not sure why too. Maybe Matthew has some input here?
>> You can also pass new_nr, but IMO orders look so much better here.
>
> If only I'd written that information in the changelog ... oh wait, I did!
>
>     mm/page_owner: change split_page_owner to take a count
>
>     The implementation of split_page_owner() prefers a count rather than the
>     old order of the page.  When we support a variable size THP, we won't
>     have the order at this point, but we will have the number of pages.
>     So change the interface to what the caller and callee would prefer.

There are two callers, split_page in mm/page_alloc.c and __split_huge_page in
mm/huge_memory.c. The former has the page order. The latter has the page order
information before __split_huge_page_tail is called, so we can do
old_order = thp_order(head) instead of nr = thp_nr_pages(head) and use old_order.
What am I missing there?

Thanks.

—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-17 21:10   ` Matthew Wilcox
@ 2020-11-17 21:13     ` Zi Yan
  0 siblings, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-17 21:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, Kirill A . Shutemov, Roman Gushchin, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans


On 17 Nov 2020, at 16:10, Matthew Wilcox wrote:

> On Wed, Nov 11, 2020 at 03:40:05PM -0500, Zi Yan wrote:
>> -	for (i = 0; i < nr; i++) {
>> +	for (i = 0; i < nr; i += (1 << new_order)) {
>>  		page_owner = get_page_owner(page_ext);
>> -		page_owner->order = 0;
>> +		page_owner->order = new_order;
>>  		page_ext = page_ext_next(page_ext);
>>  	}
>
> This doesn't do what you're hoping it will.  It's going to set ->order to
> new_order for the first N pages instead of every 1/N pages.
>
> You'll need to do something like
>
> 		page_ext = lookup_page_ext(page + i);

Will use this. Thanks.
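
So the loop would become something like (untested sketch):

        for (i = 0; i < nr; i += (1 << new_order)) {
                /* look up the page_ext of every (1 << new_order)-th subpage */
                page_ext = lookup_page_ext(page + i);
                if (unlikely(!page_ext))
                        continue;
                page_owner = get_page_owner(page_ext);
                page_owner->order = new_order;
        }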

>
> or add a new page_ext_add(page_ext, 1 << new_order);


—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-17 21:12           ` Zi Yan
@ 2020-11-17 21:22             ` Matthew Wilcox
  2020-11-17 21:25               ` Zi Yan
  2020-11-17 21:35               ` Roman Gushchin
  0 siblings, 2 replies; 36+ messages in thread
From: Matthew Wilcox @ 2020-11-17 21:22 UTC (permalink / raw)
  To: Zi Yan
  Cc: Roman Gushchin, linux-mm, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
> On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
> 
> > On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
> >> On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
> >>> Matthew recently converted split_page_owner to take nr instead of order.[1]
> >>> But I am not
> >>> sure why, since it seems to me that two call sites (__split_huge_page in
> >>> mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
> >>> information.
> >>
> >> Yeah, I'm not sure why too. Maybe Matthew has some input here?
> >> You can also pass new_nr, but IMO orders look so much better here.
> >
> > If only I'd written that information in the changelog ... oh wait, I did!
> >
> >     mm/page_owner: change split_page_owner to take a count
> >
> >     The implementation of split_page_owner() prefers a count rather than the
> >     old order of the page.  When we support a variable size THP, we won't
> >     have the order at this point, but we will have the number of pages.
> >     So change the interface to what the caller and callee would prefer.
> 
> There are two callers, split_page in mm/page_alloc.c and __split_huge_page in
> mm/huge_memory.c. The former has the page order. The latter has the page order
> information before __split_huge_page_tail is called, so we can do
> old_order = thp_order(head) instead of nr = thp_nr_page(head) and use old_order.
> What am I missing there?

Sure, we could also do that.  But what I wrote was true at the time I
wrote it.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-17 21:22             ` Matthew Wilcox
@ 2020-11-17 21:25               ` Zi Yan
  2020-11-17 21:35               ` Roman Gushchin
  1 sibling, 0 replies; 36+ messages in thread
From: Zi Yan @ 2020-11-17 21:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Roman Gushchin, linux-mm, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans


On 17 Nov 2020, at 16:22, Matthew Wilcox wrote:

> On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
>> On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
>>
>>> On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
>>>> On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
>>>>> Matthew recently converted split_page_owner to take nr instead of order.[1]
>>>>> But I am not
>>>>> sure why, since it seems to me that two call sites (__split_huge_page in
>>>>> mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
>>>>> information.
>>>>
>>>> Yeah, I'm not sure why too. Maybe Matthew has some input here?
>>>> You can also pass new_nr, but IMO orders look so much better here.
>>>
>>> If only I'd written that information in the changelog ... oh wait, I did!
>>>
>>>     mm/page_owner: change split_page_owner to take a count
>>>
>>>     The implementation of split_page_owner() prefers a count rather than the
>>>     old order of the page.  When we support a variable size THP, we won't
>>>     have the order at this point, but we will have the number of pages.
>>>     So change the interface to what the caller and callee would prefer.
>>
>> There are two callers, split_page in mm/page_alloc.c and __split_huge_page in
>> mm/huge_memory.c. The former has the page order. The latter has the page order
>> information before __split_huge_page_tail is called, so we can do
>> old_order = thp_order(head) instead of nr = thp_nr_page(head) and use old_order.
>> What am I missing there?
>
> Sure, we could also do that.  But what I wrote was true at the time I
> wrote it.

Got it. Thanks. Will change it to use old_order to make split_page_owner parameters
look more consistent.
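
A sketch of what I have in mind (untested; the callers would pass
thp_order(head) and split_page()'s order argument respectively):

        static inline void split_page_owner(struct page *page,
                        unsigned int old_order, unsigned int new_order)
        {
                if (static_branch_unlikely(&page_owner_inited))
                        __split_page_owner(page, old_order, new_order);
        }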

—
Best Regards,
Yan Zi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-17 21:22             ` Matthew Wilcox
  2020-11-17 21:25               ` Zi Yan
@ 2020-11-17 21:35               ` Roman Gushchin
  2020-11-17 21:43                 ` Matthew Wilcox
  1 sibling, 1 reply; 36+ messages in thread
From: Roman Gushchin @ 2020-11-17 21:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Zi Yan, linux-mm, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Tue, Nov 17, 2020 at 09:22:55PM +0000, Matthew Wilcox wrote:
> On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
> > On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
> > 
> > > On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
> > >> On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
> > >>> Matthew recently converted split_page_owner to take nr instead of order.[1]
> > >>> But I am not
> > >>> sure why, since it seems to me that two call sites (__split_huge_page in
> > >>> mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
> > >>> information.
> > >>
> > >> Yeah, I'm not sure why too. Maybe Matthew has some input here?
> > >> You can also pass new_nr, but IMO orders look so much better here.
> > >
> > > If only I'd written that information in the changelog ... oh wait, I did!
> > >
> > >     mm/page_owner: change split_page_owner to take a count
> > >
> > >     The implementation of split_page_owner() prefers a count rather than the
> > >     old order of the page.  When we support a variable size THP, we won't
> > >     have the order at this point, but we will have the number of pages.
> > >     So change the interface to what the caller and callee would prefer.
> > 
> > There are two callers, split_page in mm/page_alloc.c and __split_huge_page in
> > mm/huge_memory.c. The former has the page order. The latter has the page order
> > information before __split_huge_page_tail is called, so we can do
> > old_order = thp_order(head) instead of nr = thp_nr_page(head) and use old_order.
> > What am I missing there?
> 
> Sure, we could also do that.  But what I wrote was true at the time I
> wrote it.

Sure, I was asking whether you're OK with going back to orders, or whether there
are better ideas. I'm sorry if that wasn't clear and came across differently.

It just seems to me that when a function takes both nr and order (as in Zi's last
version), I'd expect nr to be a number of pages of the given order, or something
like that. So I'd avoid mixing them. Orders are slightly better when nr is always
a power of two; it's just more obvious from looking at the code.

Thanks!


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner.
  2020-11-17 21:35               ` Roman Gushchin
@ 2020-11-17 21:43                 ` Matthew Wilcox
  0 siblings, 0 replies; 36+ messages in thread
From: Matthew Wilcox @ 2020-11-17 21:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Zi Yan, linux-mm, Kirill A . Shutemov, Andrew Morton,
	linux-kernel, linux-kselftest, Yang Shi, Michal Hocko,
	John Hubbard, Ralph Campbell, David Nellans

On Tue, Nov 17, 2020 at 01:35:37PM -0800, Roman Gushchin wrote:
> On Tue, Nov 17, 2020 at 09:22:55PM +0000, Matthew Wilcox wrote:
> > On Tue, Nov 17, 2020 at 04:12:03PM -0500, Zi Yan wrote:
> > > On 17 Nov 2020, at 16:05, Matthew Wilcox wrote:
> > > 
> > > > On Fri, Nov 13, 2020 at 05:38:01PM -0800, Roman Gushchin wrote:
> > > >> On Fri, Nov 13, 2020 at 08:08:58PM -0500, Zi Yan wrote:
> > > >>> Matthew recently converted split_page_owner to take nr instead of order.[1]
> > > >>> But I am not
> > > >>> sure why, since it seems to me that two call sites (__split_huge_page in
> > > >>> mm/huge_memory.c and split_page in mm/page_alloc.c) can pass the order
> > > >>> information.
> > > >>
> > > >> Yeah, I'm not sure why too. Maybe Matthew has some input here?
> > > >> You can also pass new_nr, but IMO orders look so much better here.
> > > >
> > > > If only I'd written that information in the changelog ... oh wait, I did!
> > > >
> > > >     mm/page_owner: change split_page_owner to take a count
> > > >
> > > >     The implementation of split_page_owner() prefers a count rather than the
> > > >     old order of the page.  When we support a variable size THP, we won't
> > > >     have the order at this point, but we will have the number of pages.
> > > >     So change the interface to what the caller and callee would prefer.
> > > 
> > > There are two callers, split_page in mm/page_alloc.c and __split_huge_page in
> > > mm/huge_memory.c. The former has the page order. The latter has the page order
> > > information before __split_huge_page_tail is called, so we can do
> > > old_order = thp_order(head) instead of nr = thp_nr_page(head) and use old_order.
> > > What am I missing there?
> > 
> > Sure, we could also do that.  But what I wrote was true at the time I
> > wrote it.
> 
> Sure, I was asking about if you're ok with going back to orders or there are better
> ideas. I'm sorry if it wasn't clear and sounded differently.
> 
> It just seems to me than a function is taking nr and order (as in Zi's last version),
> I'd expect that it's a number of pages of given order, or something like this.
> So I'd avoid mixing them. Orders are slightly better if nr is always a power of two,
> it's just more obvious from looking at the code.

I think it's awkward no matter which way round we do it.

If we pass old_order, new_order then we create extra work for both caller
and callee.

If we pass old_nr, new_order, it looks weird for humans.

At the end of the day, I'm not that invested in which we do.


^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2020-11-17 21:43 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-11 20:40 [RFC PATCH 0/6] Split huge pages to any lower order pages Zi Yan
2020-11-11 20:40 ` [RFC PATCH 1/6] mm: huge_memory: add new debugfs interface to trigger split huge page on any page range Zi Yan
2020-11-12 22:22   ` Ralph Campbell
2020-11-12 22:38     ` Zi Yan
2020-11-16 16:06   ` Kirill A. Shutemov
2020-11-16 17:26     ` Zi Yan
2020-11-11 20:40 ` [RFC PATCH 2/6] mm: memcg: make memcg huge page split support any order split Zi Yan
2020-11-12 17:58   ` Ralph Campbell
2020-11-12 18:00     ` Zi Yan
2020-11-14  0:23   ` Roman Gushchin
2020-11-14  0:56     ` Zi Yan
2020-11-11 20:40 ` [RFC PATCH 3/6] mm: page_owner: add support for splitting to any order in split page_owner Zi Yan
2020-11-12 17:57   ` Ralph Campbell
2020-11-12 17:59     ` Zi Yan
2020-11-14  0:15   ` Roman Gushchin
2020-11-14  1:08     ` Zi Yan
2020-11-14  1:38       ` Roman Gushchin
2020-11-17 21:05         ` Matthew Wilcox
2020-11-17 21:12           ` Zi Yan
2020-11-17 21:22             ` Matthew Wilcox
2020-11-17 21:25               ` Zi Yan
2020-11-17 21:35               ` Roman Gushchin
2020-11-17 21:43                 ` Matthew Wilcox
2020-11-16 16:25   ` Kirill A. Shutemov
2020-11-16 17:27     ` Zi Yan
2020-11-17 21:10   ` Matthew Wilcox
2020-11-17 21:13     ` Zi Yan
2020-11-11 20:40 ` [RFC PATCH 4/6] mm: thp: add support for split huge page to any lower order pages Zi Yan
2020-11-12 22:01   ` Ralph Campbell
2020-11-12 22:20     ` Zi Yan
2020-11-14  0:52   ` Roman Gushchin
2020-11-14  1:00     ` Zi Yan
2020-11-11 20:40 ` [RFC PATCH 5/6] mm: truncate: split thp to a non-zero order if possible Zi Yan
2020-11-12 22:08   ` Ralph Campbell
2020-11-12 22:37     ` Zi Yan
2020-11-11 20:40 ` [RFC PATCH 6/6] mm: huge_memory: enable debugfs to split huge pages to any order Zi Yan
