* [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table
@ 2022-09-27 16:29 Chih-En Lin
  2022-09-27 16:29 ` [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table Chih-En Lin
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

Currently, copy-on-write is only applied to the mapped memory; during
fork, the child process still copies the entire page table from the
parent. When the parent has a large page table allocated, this copy can
cost a lot of time and memory. For example, the memory usage of a
process after forking with 1 GB of mapped memory is as follows:

              DEFAULT FORK
          parent         child
VmRSS:   1049688 kB    1048688 kB
VmPTE:      2096 kB       2096 kB
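
For reference, the VmPTE number follows almost entirely from the x86_64
paging layout (assuming 4 KB pages, 8-byte entries, and 512 entries per
PTE table):

  1 GB / 4 KB per page          = 262,144 PTEs
  262,144 PTEs / 512 per table  =     512 PTE tables
  512 tables * 4 KB per table   =    2048 kB

so nearly all of the 2096 kB of VmPTE is PTE tables that fork() has to
duplicate even though no data page is copied.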

This patch introduces copy-on-write (COW) for the PTE-level page tables.
COW PTE improves performance in situations where the user needs many
copies of a program running in isolated environments. Feedback-based
fuzzers (e.g., AFL) and serverless/microservice frameworks are two major
examples. For instance, COW PTE achieves a 9.3x throughput increase when
fuzzing SQLite with AFL. Since COW PTE only boosts performance in some
cases, the patch adds a new sysctl, vm.cow_pte, which takes a process ID
(PID) as input so the user can enable COW PTE for a specific process.
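
As a usage sketch (hypothetical launcher code, error handling omitted;
the only interface assumed is the vm.cow_pte sysctl added by this
series), a process can opt itself in before forking its workers:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/cow_pte", "w");

	if (!f)
		return 1;
	/* vm.cow_pte takes a PID; COW PTE takes effect at the next fork(). */
	fprintf(f, "%d\n", getpid());
	fclose(f);

	if (fork() == 0) {
		/* Child now shares the parent's PTE tables copy-on-write. */
		_exit(0);
	}
	return 0;
}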

To handle the page table state of each process that shares a PTE table,
the patch introduces the concept of COW PTE table ownership. The
implementation records the address of the PMD entry that maps the PTE
table as the table's owner. This makes it possible to maintain the
state of the COW PTE tables, such as the RSS and pgtable_bytes. Some
PTE tables (e.g., those containing pinned pages) still need to be
copied immediately for consistency with the current COW logic. As a
result, a flag, COW_PTE_OWNER_EXCLUSIVE, indicating whether a PTE table
is exclusive (i.e., only one task owns it at a time), is encoded in the
table's owner pointer. Every time a PTE table is copied during fork,
the owner pointer (and thus the exclusive flag) is checked to determine
whether the PTE table can be shared across processes.
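
A condensed sketch of that fork-time decision (based on the
copy_pmd_range() changes in patch 9; locking, RSS fix-ups, and edge
cases are omitted, so treat it as pseudocode rather than the literal
patch, with src_mm/dst_mm being the parent's and child's mm):

	/* For each PMD entry during a fork with MMF_COW_PTE set: */
	if (pmd_cow_pte_exclusive(src_pmd)) {
		/* e.g. pinned pages: do the normal PTE-level copy. */
		copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd, addr, next);
	} else {
		/* The first sharer becomes the owner of the PTE table. */
		if (cow_pte_owner_is_same(src_pmd, NULL))
			set_cow_pte_owner(src_pmd, src_pmd);
		/* Write-protect the PMD entry so a later write faults. */
		pmdp_set_wrprotect(src_mm, addr, src_pmd);
		/* Take a reference for the child and share the table. */
		pmd_get_pte(src_pmd);
		set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
	}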

This patch uses a refcount to track the shared PTE table's lifetime.
Forking with COW PTE increases the refcount; a refcount of 1 means the
page table is not currently shared with another process (but may be
shared later). When someone writes to the shared PTE table, the write
fault breaks COW PTE. If the shared PTE table's refcount is one, the
faulting process reuses the shared PTE table. Otherwise, the process
decreases the refcount and either copies the entries to a new PTE table
or drops all of them, transferring ownership if it currently owns the
shared PTE table.
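
A condensed sketch of the write-fault path (based on break_cow_pte() and
cow_pte_fallback() from patches 4, 5 and 7; again pseudocode, with
locking and the VMA-overlap handling left out):

	pmd_t cowed_entry = *pmd;

	if (cow_pte_count(&cowed_entry) == 1) {
		/* Last user: reuse the table in place and make it writable. */
		cow_pte_fallback(vma, pmd, addr);
		return;
	}
	pmd_clear(pmd);
	/* Copy the shared entries into a fresh PTE table for this mm. */
	copy_cow_pte_range(vma, pmd, &cowed_entry, start, end);
	/* If we were the owner, clear ownership and its accounting. */
	if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
		set_cow_pte_owner(&cowed_entry, NULL);
		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
		mm_dec_nr_ptes(mm);
	}
	/* Drop our reference on the old shared table. */
	pmd_put_pte(vma, &cowed_entry, addr, false);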

After applying COW to PTE, the memory usage after forking is as follows:

                 COW PTE
          parent         child
VmRSS:	 1049968 kB       2576 kB
VmPTE:	    2096 kB         44 kB

The results show that this patch significantly decreases memory usage.
Other improvements such as lower fork latency and page fault latency,
which are the major benefits, are discussed later.

Real-world applications
=======================

We ran fuzzing and VM cloning benchmarks. The experiments compare the
normal fork with the fork with COW PTE.

With AFL (LLVM mode) and SQLite, COW PTE (503.67 execs/sec) achieves a
9.3x throughput increase over the normal fork version (53.86 execs/sec).

                   fork
     execs_per_sec     unix_time        time
count    26.000000  2.600000e+01   26.000000
mean     53.861538  1.663145e+09   84.423077
std       3.715063  5.911357e+01   59.113567
min      35.980000  1.663145e+09    0.000000
25%      54.440000  1.663145e+09   32.250000
50%      54.610000  1.663145e+09   82.000000
75%      54.837500  1.663145e+09  140.750000
max      55.600000  1.663145e+09  178.000000

                 COW PTE
     execs_per_sec     unix_time        time
count    36.000000  3.600000e+01   36.000000
mean    503.674444  1.663146e+09   88.916667
std      81.805271  5.369191e+01   53.691912
min      84.910000  1.663146e+09    0.000000
25%     472.952500  1.663146e+09   44.500000
50%     504.700000  1.663146e+09   89.000000
75%     553.367500  1.663146e+09  133.250000
max     568.270000  1.663146e+09  178.000000

With TriforceAFL, which does kernel fuzzing with QEMU, COW PTE
(124.31 execs/sec) achieves a 1.3x throughput increase over the
normal fork version (96.44 execs/sec).

                   fork
     execs_per_sec     unix_time        time
count    18.000000  1.800000e+01   18.000000
mean     96.436667  1.663146e+09   84.388889
std      25.260184  6.601795e+01   66.017947
min       6.590000  1.663146e+09    0.000000
25%      91.025000  1.663146e+09   21.250000
50%     100.350000  1.663146e+09   92.000000
75%     111.247500  1.663146e+09  146.750000
max     122.260000  1.663146e+09  169.000000

                 COW PTE
     execs_per_sec     unix_time        time
count    22.000000  2.200000e+01   22.000000
mean    124.305455  1.663147e+09   90.409091
std      32.508728  6.033846e+01   60.338457
min       6.590000  1.663146e+09    0.000000
25%     113.227500  1.663146e+09   26.250000
50%     122.435000  1.663147e+09  112.000000
75%     145.792500  1.663147e+09  141.500000
max     161.280000  1.663147e+09  168.000000

Comparison with uffd
====================

In the RFC v1 discussion, David Hildenbrand mentioned that uffd-wp is a
new way of snapshotting in QEMU. There is some overlap between the uffd
and fork use cases, such as database snapshotting, so the following
microbenchmarks also measure the overhead of uffd-wp and uffd-copy-page.

To be fair in terms of CPU usage, the uffd handlers are pinned to the
same core as the main thread. The uffd-wp case simulates the work QEMU
does with uffd-wp: it stores the page that caused the fault into a
memory buffer and then removes the write protection for that page. The
uffd-copy-page case allocates new memory and replaces the original page
that caused the fault.
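
For concreteness, the uffd-wp handler in the benchmark is essentially
the following loop (a simplified sketch; the userfaultfd setup with
UFFDIO_REGISTER_MODE_WP, the initial UFFDIO_WRITEPROTECT call, the
4 KB page-size assumption, and error handling are glossed over):

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Resolve uffd-wp faults: save the old page, then drop write protection. */
static void handle_wp_faults(int uffd, char *snapshot, unsigned long base)
{
	struct uffd_msg msg;

	for (;;) {
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT ||
		    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP))
			continue;

		unsigned long addr = msg.arg.pagefault.address & ~0xfffUL;
		struct uffdio_writeprotect wp = {
			.range = { .start = addr, .len = 4096 },
			.mode  = 0,	/* clear WP and wake the faulting thread */
		};

		/* Keep a copy of the page, as a snapshotting VMM would. */
		memcpy(snapshot + (addr - base), (void *)addr, 4096);
		ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	}
}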

Microbenchmark - syscall/registering latency
=============================================

We ran microbenchmarks to measure the latency of the fork syscall or of
registering uffd, with mapped memory sizes ranging from 0 to 512 MB, for
the use cases that focus on lowering startup time (e.g., serverless
frameworks). The results show that the latency of a normal fork and of
registering uffd-wp reaches 10 ms and 3.9 ms respectively, while the
latency of registering uffd-copy-page is around 0.007 ms. The latency of
a fork with COW PTE is around 0.625 ms beyond 200 MB, which is
significantly lower than the normal fork and uffd-wp. In short, with
512 MB of mapped memory, COW PTE decreases latency by 93% relative to
the normal fork and by 83% relative to uffd-wp.

Microbenchmark - page fault latency
====================================

We conducted some microbenchmarks to measure page fault latency with
different patterns of accesses to a 512 MB memory buffer after forking
or registering uffd.

In the first experiment, the program accesses the entire 512 MB memory
by writing to all the pages consecutively. The experiment is done with
normal fork, fork with COW PTE, uffd-wp, and uffd-copy-page and
calculates the average latency of a single access. The results show
that the page fault latency of COW PTE (0.000045 ms) is 59.5x lower than
that of uffd-wp (0.002676 ms). The low uffd-wp performance is probably
due to the cost of switching between kernel and user mode. What is more
interesting is that COW PTE also improves the average page fault latency
over the normal fork: 0.000045 ms versus 0.000742 ms, a 16.5x reduction.
Here are the raw numbers:

Page fault - Access to the entire 512 MB memory
fork mean: 0.000742 ms
COW PTE mean: 0.000045 ms
uffd (wp) mean: 0.002676 ms
uffd (copy-page) mean: 0.008667 ms

The second experiment simulates real-world applications with sparse
accesses. The program randomly accesses the memory by writing to one
random page 1 million times and calculates the average access time.
Since the numbers for fork and COW PTE are too close to each other, we
cannot simply conclude which one is faster, so we ran both 100 times to
get the averages. The results show that COW PTE (0.000027 ms) is similar
to the normal fork (0.000028 ms) and is 2.3x faster than uffd-wp
(0.000060 ms).

Page fault - Random access
fork mean: 0.000028 ms
COW PTE mean: 0.000027 ms
uffd (wp) mean: 0.000060 ms
uffd (copy-page) mean: 0.002363 ms

All the tests were run with QEMU and the kernel was built with the
x86_64 default config.

Summary
=======

In summary, COW PTE reduces the memory footprint of processes and
lowers the initialization and page fault latency for various
applications. This matters for frameworks that require very low startup
latency (e.g., serverless frameworks) or high-throughput, short-lived
child processes (e.g., testing and fuzzing).

This patch is based on the paper "On-demand-fork: a microsecond fork
for memory-intensive and latency-sensitive applications" [1] from
Purdue University.

Any comments and suggestions are welcome.

Thanks,
Chih-En Lin

---

TODO list:
- Handle the file-backed and shmem with reclaim.
- Handle OOM, KSM, page table walker, and migration.
- Deal with TLB flush in the break COW PTE handler.

RFC v1 -> RFC v2
- Change the clone flag method to sysctl with PID.
- Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
  MMF_COW_PTE_READY, for the sysctl.
- Change the owner pointer to use the folio padding.
- Handle all the VMAs that cover the PTE table when doing the break COW PTE.
- Remove the self-defined refcount to use the _refcount for the page
  table page.
- Add the exclusive flag to let the page table be owned by only one task
  in some situations.
- Invalidate address range MMU notifier and start the write_seqcount
  when doing the break COW PTE.
- Handle the swap cache and swapoff.

RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/

[1] https://dl.acm.org/doi/10.1145/3447786.3456258

This patch is based on v6.0-rc5.

---

Chih-En Lin (9):
  mm: Add new mm flags for Copy-On-Write PTE table
  mm: pgtable: Add sysctl to enable COW PTE
  mm, pgtable: Add ownership to PTE table
  mm: Add COW PTE fallback functions
  mm, pgtable: Add a refcount to PTE table
  mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag
  mm: Add the break COW PTE handler
  mm: Handle COW PTE with reclaim algorithm
  mm: Introduce Copy-On-Write PTE table

 include/linux/mm.h             |   2 +
 include/linux/mm_types.h       |   5 +-
 include/linux/pgtable.h        | 140 +++++++++++++
 include/linux/rmap.h           |   2 +
 include/linux/sched/coredump.h |   8 +-
 kernel/fork.c                  |   5 +
 kernel/sysctl.c                |   8 +
 mm/Makefile                    |   2 +-
 mm/cow_pte.c                   |  39 ++++
 mm/gup.c                       |  13 +-
 mm/memory.c                    | 360 ++++++++++++++++++++++++++++++++-
 mm/mmap.c                      |   3 +
 mm/mremap.c                    |   3 +
 mm/page_vma_mapped.c           |   5 +
 mm/rmap.c                      |   2 +-
 mm/swapfile.c                  |   1 +
 mm/vmscan.c                    |   1 +
 17 files changed, 587 insertions(+), 12 deletions(-)
 create mode 100644 mm/cow_pte.c

-- 
2.37.3



* [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 17:23   ` Nadav Amit
  2022-09-27 16:29 ` [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE Chih-En Lin
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

Add MMF_COW_PTE{, _READY} flags to prepare the subsequent
implementation of Copy-On-Write for the page table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/sched/coredump.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 4d0a5be28b70f..f03ff69c90c8c 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -84,7 +84,13 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
+#define MMF_COW_PTE_READY	29
+#define MMF_COW_PTE_READY_MASK	(1 << MMF_COW_PTE_READY)
+
+#define MMF_COW_PTE		30
+#define MMF_COW_PTE_MASK	(1 << MMF_COW_PTE)
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-				 MMF_DISABLE_THP_MASK)
+				 MMF_DISABLE_THP_MASK | MMF_COW_PTE_MASK)
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
-- 
2.37.3



* [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
  2022-09-27 16:29 ` [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 17:27   ` Nadav Amit
  2022-09-27 21:22   ` John Hubbard
  2022-09-27 16:29 ` [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table Chih-En Lin
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

Add a new sysctl, vm.cow_pte, which sets the MMF_COW_PTE_READY flag to
enable copy-on-write (COW) for the PTE page table at the next fork.

Since there is a time gap between enabling COW PTE via the sysctl and
the actual fork, we use two states to distinguish a task that wants to
do COW PTE from one that is already doing it.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/pgtable.h |  6 ++++++
 kernel/fork.c           |  5 +++++
 kernel/sysctl.c         |  8 ++++++++
 mm/Makefile             |  2 +-
 mm/cow_pte.c            | 39 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100644 mm/cow_pte.c

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 014ee8f0fbaab..d03d01aefe989 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -937,6 +937,12 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 	__ptep_modify_prot_commit(vma, addr, ptep, pte);
 }
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
+
+int cow_pte_handler(struct ctl_table *table, int write, void *buffer,
+		    size_t *lenp, loff_t *ppos);
+
+extern int sysctl_cow_pte_pid;
+
 #endif /* CONFIG_MMU */
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 8a9e92068b150..6981944a7c6ec 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2671,6 +2671,11 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 			trace = 0;
 	}
 
+	if (current->mm && test_bit(MMF_COW_PTE_READY, &current->mm->flags)) {
+		clear_bit(MMF_COW_PTE_READY, &current->mm->flags);
+		set_bit(MMF_COW_PTE, &current->mm->flags);
+	}
+
 	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 205d605cacc5b..c4f54412ae3a9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2360,6 +2360,14 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= mmap_min_addr_handler,
 	},
+	{
+		.procname	= "cow_pte",
+		.data		= &sysctl_cow_pte_pid,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= cow_pte_handler,
+		.extra1		= SYSCTL_ZERO,
+	},
 #endif
 #ifdef CONFIG_NUMA
 	{
diff --git a/mm/Makefile b/mm/Makefile
index 9a564f8364035..7a568d5066ee6 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,7 +40,7 @@ mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o
+			   pgtable-generic.o rmap.o vmalloc.o cow_pte.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/cow_pte.c b/mm/cow_pte.c
new file mode 100644
index 0000000000000..4e50aa4294ce7
--- /dev/null
+++ b/mm/cow_pte.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/sysctl.h>
+#include <linux/pgtable.h>
+#include <linux/sched.h>
+#include <linux/sched/coredump.h>
+#include <linux/pid.h>
+
+/* sysctl will write to this variable */
+int sysctl_cow_pte_pid = -1;
+
+static void set_cow_pte_task(void)
+{
+	struct pid *pid;
+	struct task_struct *task;
+
+	pid = find_get_pid(sysctl_cow_pte_pid);
+	if (!pid) {
+		pr_info("pid %d does not exist\n", sysctl_cow_pte_pid);
+		sysctl_cow_pte_pid = -1;
+		return;
+	}
+	task = get_pid_task(pid, PIDTYPE_PID);
+	if (!test_bit(MMF_COW_PTE, &task->mm->flags))
+		set_bit(MMF_COW_PTE_READY, &task->mm->flags);
+	sysctl_cow_pte_pid = -1;
+}
+
+int cow_pte_handler(struct ctl_table *table, int write, void *buffer,
+		    size_t *lenp, loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (write && !ret)
+		set_cow_pte_task();
+
+	return ret;
+}
-- 
2.37.3



* [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
  2022-09-27 16:29 ` [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table Chih-En Lin
  2022-09-27 16:29 ` [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 17:30   ` Nadav Amit
  2022-09-27 16:29 ` [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions Chih-En Lin
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

Introduce ownership for the PTE table. The address of the PMD entry is
used to track ownership, identifying which process may update its page
table state from the COWed PTE table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h |  5 ++++-
 include/linux/pgtable.h  | 10 ++++++++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd3..965523dcca3b8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2289,6 +2289,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 		return false;
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
+	page->cow_pte_owner = NULL;
 	return true;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cf97f3884fda2..42798b59cec4e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,7 +152,10 @@ struct page {
 			struct list_head deferred_list;
 		};
 		struct {	/* Page table pages */
-			unsigned long _pt_pad_1;	/* compound_head */
+			union {
+				unsigned long _pt_pad_1; /* compound_head */
+				pmd_t *cow_pte_owner; /* cow pte: pmd */
+			};
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index d03d01aefe989..9dca787a3f4dd 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -615,6 +615,16 @@ static inline int pte_unused(pte_t pte)
 }
 #endif
 
+static inline void set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
+{
+	smp_store_release(&pmd_page(*pmd)->cow_pte_owner, owner);
+}
+
+static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
+{
+	return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
-- 
2.37.3



* [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (2 preceding siblings ...)
  2022-09-27 16:29 ` [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 17:51   ` Nadav Amit
  2022-09-27 16:29 ` [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table Chih-En Lin
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

The lifetime of a COWed PTE table is handled by a reference count. When
a process wants to write to a COWed PTE table whose refcount is 1, it
reuses the shared PTE table instead of copying it.

Since only the owner updates its page table state, the fallback function
also needs to handle the case where a non-owner COWed PTE table falls
back to a normal PTE table.

This commit prepares for the following implementation of the reference
count for COW PTE.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/pgtable.h |  3 ++
 mm/memory.c             | 93 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9dca787a3f4dd..25c1e5c42fdf3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -615,6 +615,9 @@ static inline int pte_unused(pte_t pte)
 }
 #endif
 
+void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
+		      unsigned long addr);
+
 static inline void set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
 {
 	smp_store_release(&pmd_page(*pmd)->cow_pte_owner, owner);
diff --git a/mm/memory.c b/mm/memory.c
index 4ba73f5aa8bb7..d29f84801f3cd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -509,6 +509,37 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 			add_mm_counter(mm, i, rss[i]);
 }
 
+static void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
+			       pmd_t *pmdp, unsigned long addr,
+			       unsigned long end, bool inc_dec)
+{
+	int rss[NR_MM_COUNTERS];
+	spinlock_t *ptl;
+	pte_t *orig_ptep, *ptep;
+	struct page *page;
+
+	init_rss_vec(rss);
+
+	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	orig_ptep = ptep;
+	arch_enter_lazy_mmu_mode();
+	do {
+		if (pte_none(*ptep))
+			continue;
+
+		page = vm_normal_page(vma, addr, *ptep);
+		if (page) {
+			if (inc_dec)
+				rss[mm_counter(page)]++;
+			else
+				rss[mm_counter(page)]--;
+		}
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(orig_ptep, ptl);
+	add_mm_rss_vec(mm, rss);
+}
+
 /*
  * This function is called to print an error when a bad pte
  * is found. For example, we might have a PFN-mapped pte in
@@ -2817,6 +2848,68 @@ int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
 
+/**
+ * cow_pte_fallback - reuse the shared PTE table
+ * @vma: vma that covers the shared PTE table
+ * @pmd: pmd index maps to the shared PTE table
+ * @addr: the address that triggers the break COW
+ *
+ * Reuse the shared (COW) PTE table when the refcount is equal to one.
+ * @addr needs to be in the range of the shared PTE table that @vma and
+ * @pmd mapped to it.
+ *
+ * COW PTE fallback to normal PTE:
+ * - two state here
+ *   - After break child :   [parent, rss=1, ref=1, write=NO , owner=parent]
+ *                        to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
+ *   - After break parent:   [child , rss=0, ref=1, write=NO , owner=NULL  ]
+ *                        to [child , rss=1, ref=1, write=YES, owner=NULL  ]
+ */
+void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
+		      unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct *prev = vma->vm_prev;
+	struct vm_area_struct *next = vma->vm_next;
+	unsigned long start, end;
+	pmd_t new;
+
+	VM_WARN_ON(pmd_write(*pmd));
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	/*
+	 * If this pmd is not the owner, it needs to increase the RSS,
+	 * since only the owner holds the RSS state for the COW PTE.
+	 */
+	if (!cow_pte_owner_is_same(pmd, pmd)) {
+		/* The part of address range is covered by previous. */
+		if (start < vma->vm_start && prev && start < prev->vm_end) {
+			cow_pte_rss(mm, prev, pmd,
+				    start, prev->vm_end, true /* inc */);
+			start = vma->vm_start;
+		}
+		/* The part of address range is covered by next. */
+		if (end > vma->vm_end && next && end > next->vm_start) {
+			cow_pte_rss(mm, next, pmd,
+				    next->vm_start, end, true /* inc */);
+			end = vma->vm_end;
+		}
+		cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
+
+		mm_inc_nr_ptes(mm);
+		/* Memory barrier here is the same as pmd_install(). */
+		smp_wmb();
+		pmd_populate(mm, pmd, pmd_page(*pmd));
+	}
+
+	/* Reuse the pte page */
+	set_cow_pte_owner(pmd, NULL);
+	new = pmd_mkwrite(*pmd);
+	set_pmd_at(mm, addr, pmd, new);
+}
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry which was
  * read non-atomically.  Before making any commitment, on those architectures
-- 
2.37.3



* [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (3 preceding siblings ...)
  2022-09-27 16:29 ` [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 17:59   ` Nadav Amit
  2022-09-27 16:29 ` [RFC PATCH v2 6/9] mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag Chih-En Lin
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

Reuse the _refcount in struct page for page table pages to track the
number of processes referencing a COWed PTE table. Before decreasing
the refcount, check whether it is one; if so, the shared PTE table is
reused instead.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/mm.h      |  1 +
 include/linux/pgtable.h | 28 ++++++++++++++++++++++++++++
 mm/memory.c             |  1 +
 3 files changed, 30 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 965523dcca3b8..bfe6a8c7ab9ed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2290,6 +2290,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
 	page->cow_pte_owner = NULL;
+	set_page_count(page, 1);
 	return true;
 }
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 25c1e5c42fdf3..8b497d7d800ed 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -9,6 +9,7 @@
 #ifdef CONFIG_MMU
 
 #include <linux/mm_types.h>
+#include <linux/page_ref.h>
 #include <linux/bug.h>
 #include <linux/errno.h>
 #include <asm-generic/pgtable_uffd.h>
@@ -628,6 +629,33 @@ static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
 	return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;
 }
 
+static inline int pmd_get_pte(pmd_t *pmd)
+{
+	return page_ref_inc_return(pmd_page(*pmd));
+}
+
+/*
+ * If the COW PTE refcount is 1, instead of decreasing the counter,
+ * clear write protection of the corresponding PMD entry and reset
+ * the COW PTE owner to reuse the table.
+ * But if the reuse parameter is false, do nothing. This helps us
+ * handle a PTE table that has already been dealt with.
+ */
+static inline int pmd_put_pte(struct vm_area_struct *vma, pmd_t *pmd,
+			      unsigned long addr, bool reuse)
+{
+	if (!page_ref_add_unless(pmd_page(*pmd), -1, 1) && reuse) {
+		cow_pte_fallback(vma, pmd, addr);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int cow_pte_count(pmd_t *pmd)
+{
+	return page_count(pmd_page(*pmd));
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
diff --git a/mm/memory.c b/mm/memory.c
index d29f84801f3cd..3e66e229f4169 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2875,6 +2875,7 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
 	pmd_t new;
 
 	VM_WARN_ON(pmd_write(*pmd));
+	VM_WARN_ON(cow_pte_count(pmd) != 1);
 
 	start = addr & PMD_MASK;
 	end = (addr + PMD_SIZE) & PMD_MASK;
-- 
2.37.3



* [RFC PATCH v2 6/9] mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (4 preceding siblings ...)
  2022-09-27 16:29 ` [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 16:29 ` [RFC PATCH v2 7/9] mm: Add the break COW PTE handler Chih-En Lin
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

In the present COW logic for physical pages, there are situations
(e.g., pinned pages) where the pages cannot be shared. To keep COW PTE
consistent with the current logic, introduce the COW_PTE_OWNER_EXCLUSIVE
flag to avoid doing COW on the PTE table during fork(). The exclusive
flag is currently used in the following case:

- GUP pinning does not mix with COW of the physical page; currently, no
  COW is done on pages that GUP is working on. Follow the same rule here.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/pgtable.h | 18 ++++++++++++++++++
 mm/gup.c                | 13 +++++++++++--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8b497d7d800ed..9b08a3361d490 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -656,6 +656,24 @@ static inline int cow_pte_count(pmd_t *pmd)
 	return page_count(pmd_page(*pmd));
 }
 
+/* Keep the first bit clear. See more detail in the comments of struct page. */
+#define COW_PTE_OWNER_EXCLUSIVE ((pmd_t *) 0x02UL)
+
+static inline void pmd_cow_pte_mkexclusive(pmd_t *pmd)
+{
+	set_cow_pte_owner(pmd, COW_PTE_OWNER_EXCLUSIVE);
+}
+
+static inline bool pmd_cow_pte_exclusive(pmd_t *pmd)
+{
+	return cow_pte_owner_is_same(pmd, COW_PTE_OWNER_EXCLUSIVE);
+}
+
+static inline void pmd_cow_pte_clear_mkexclusive(pmd_t *pmd)
+{
+	set_cow_pte_owner(pmd, NULL);
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf4874605..4949c8d42a400 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -634,6 +634,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		mark_page_accessed(page);
 	}
 out:
+	/*
+	 * We don't share the PTE when any other pinned page exists. And
+	 * let the exclusive flag stick around until the table is freed.
+	 */
+	pmd_cow_pte_mkexclusive(pmd);
 	pte_unmap_unlock(ptep, ptl);
 	return page;
 no_page:
@@ -932,6 +937,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	pte = pte_offset_map(pmd, address);
 	if (pte_none(*pte))
 		goto unmap;
+	pmd_cow_pte_clear_mkexclusive(pmd);
 	*vma = get_gate_vma(mm);
 	if (!page)
 		goto out;
@@ -2764,8 +2770,11 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
 			if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
 					 PMD_SHIFT, next, flags, pages, nr))
 				return 0;
-		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
-			return 0;
+		} else {
+			if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
+				return 0;
+			pmd_cow_pte_mkexclusive(&pmd);
+		}
 	} while (pmdp++, addr = next, addr != end);
 
 	return 1;
-- 
2.37.3



* [RFC PATCH v2 7/9] mm: Add the break COW PTE handler
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (5 preceding siblings ...)
  2022-09-27 16:29 ` [RFC PATCH v2 6/9] mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 18:15   ` Nadav Amit
  2022-09-27 16:29 ` [RFC PATCH v2 8/9] mm: Handle COW PTE with reclaim algorithm Chih-En Lin
  2022-09-27 16:29 ` [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table Chih-En Lin
  8 siblings, 1 reply; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

To handle a write fault on a COW PTE, introduce the helper function
handle_cow_pte(). The function provides two behaviors. One is breaking
COW by decreasing the refcount, pgtable_bytes, and RSS. The other is
copying all the entries of the shared PTE table through a wrapper around
copy_pte_range().

Also, add wrapper functions to help determine whether a PTE table is
COWed or COW-available.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/pgtable.h |  75 +++++++++++++++++
 mm/memory.c             | 179 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 254 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9b08a3361d490..85255f5223ae3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -10,6 +10,7 @@
 
 #include <linux/mm_types.h>
 #include <linux/page_ref.h>
+#include <linux/sched/coredump.h> /* For MMF_COW_PTE flag */
 #include <linux/bug.h>
 #include <linux/errno.h>
 #include <asm-generic/pgtable_uffd.h>
@@ -674,6 +675,42 @@ static inline void pmd_cow_pte_clear_mkexclusive(pmd_t *pmd)
 	set_cow_pte_owner(pmd, NULL);
 }
 
+static inline unsigned long get_pmd_start_edge(struct vm_area_struct *vma,
+						unsigned long addr)
+{
+	unsigned long start = addr & PMD_MASK;
+
+	if (start < vma->vm_start)
+		start = vma->vm_start;
+
+	return start;
+}
+
+static inline unsigned long get_pmd_end_edge(struct vm_area_struct *vma,
+						unsigned long addr)
+{
+	unsigned long end = (addr + PMD_SIZE) & PMD_MASK;
+
+	if (end > vma->vm_end)
+		end = vma->vm_end;
+
+	return end;
+}
+
+static inline bool is_cow_pte_available(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	if (!vma || !pmd)
+		return false;
+	if (!test_bit(MMF_COW_PTE, &vma->vm_mm->flags))
+		return false;
+	if (pmd_cow_pte_exclusive(pmd))
+		return false;
+	return true;
+}
+
+int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+		    bool alloc);
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
@@ -1002,6 +1039,44 @@ int cow_pte_handler(struct ctl_table *table, int write, void *buffer,
 
 extern int sysctl_cow_pte_pid;
 
+static inline bool __is_pte_table_cowing(struct vm_area_struct *vma, pmd_t *pmd,
+				       unsigned long addr)
+{
+	if (!vma)
+		return false;
+	if (!pmd) {
+		pgd_t *pgd;
+		p4d_t *p4d;
+		pud_t *pud;
+
+		if (addr == 0)
+			return false;
+
+		pgd = pgd_offset(vma->vm_mm, addr);
+		if (pgd_none_or_clear_bad(pgd))
+			return false;
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none_or_clear_bad(p4d))
+			return false;
+		pud = pud_offset(p4d, addr);
+		if (pud_none_or_clear_bad(pud))
+			return false;
+		pmd = pmd_offset(pud, addr);
+	}
+	if (!test_bit(MMF_COW_PTE, &vma->vm_mm->flags))
+		return false;
+	if (pmd_none(*pmd) || pmd_write(*pmd))
+		return false;
+	if (pmd_cow_pte_exclusive(pmd))
+		return false;
+	return true;
+}
+
+static inline bool is_pte_table_cowing(struct vm_area_struct *vma, pmd_t *pmd)
+{
+	return __is_pte_table_cowing(vma, pmd, 0UL);
+}
+
 #endif /* CONFIG_MMU */
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 3e66e229f4169..4cf3f74fb183f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2911,6 +2911,185 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pmd_at(mm, addr, pmd, new);
 }
 
+static inline int copy_cow_pte_range(struct vm_area_struct *vma,
+				     pmd_t *dst_pmd, pmd_t *src_pmd,
+				     unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range;
+	int ret;
+	bool is_cow;
+
+	is_cow = is_cow_mapping(vma->vm_flags);
+	if (is_cow) {
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
+					0, vma, mm, start, end);
+		mmu_notifier_invalidate_range_start(&range);
+		mmap_assert_write_locked(mm);
+		raw_write_seqcount_begin(&mm->write_protect_seq);
+	}
+
+	ret = copy_pte_range(vma, vma, dst_pmd, src_pmd, start, end);
+
+	if (is_cow) {
+		raw_write_seqcount_end(&mm->write_protect_seq);
+		mmu_notifier_invalidate_range_end(&range);
+	}
+
+	return ret;
+}
+
+/*
+ * Break COW PTE, two state here:
+ *   - After fork :   [parent, rss=1, ref=2, write=NO , owner=parent]
+ *                 to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
+ *                    COW PTE become [ref=1, write=NO , owner=NULL  ]
+ *                    [child , rss=0, ref=2, write=NO , owner=parent]
+ *                 to [child , rss=1, ref=1, write=YES, owner=NULL  ]
+ *                    COW PTE become [ref=1, write=NO , owner=parent]
+ *   NOTE
+ *     - Copy the COW PTE to new PTE.
+ *     - Clear the owner of COW PTE and set PMD entry writable when it is owner.
+ *     - Increase RSS if it is not owner.
+ */
+static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+			 unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long pte_start, pte_end;
+	unsigned long start, end;
+	struct vm_area_struct *prev = vma->vm_prev;
+	struct vm_area_struct *next = vma->vm_next;
+	pmd_t cowed_entry = *pmd;
+
+	if (cow_pte_count(&cowed_entry) == 1) {
+		cow_pte_fallback(vma, pmd, addr);
+		return 1;
+	}
+
+	pte_start = start = addr & PMD_MASK;
+	pte_end = end = (addr + PMD_SIZE) & PMD_MASK;
+
+	pmd_clear(pmd);
+	/*
+	 * If the vma does not cover the entire address range of the PTE table,
+	 * it should check the previous and next.
+	 */
+	if (start < vma->vm_start && prev) {
+		/* The part of address range is covered by previous. */
+		if (start < prev->vm_end)
+			copy_cow_pte_range(prev, pmd, &cowed_entry,
+					   start, prev->vm_end);
+		start = vma->vm_start;
+	}
+	if (end > vma->vm_end && next) {
+		/* The part of address range is covered by next. */
+		if (end > next->vm_start)
+			copy_cow_pte_range(next, pmd, &cowed_entry,
+					   next->vm_start, end);
+		end = vma->vm_end;
+	}
+	if (copy_cow_pte_range(vma, pmd, &cowed_entry, start, end))
+		return -ENOMEM;
+
+	/*
+	 * Here, it is the owner, so clear the ownership. To keep RSS state and
+	 * page table bytes correct, it needs to decrease them.
+	 * Also, handle the address range issue here.
+	 */
+	if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
+		set_cow_pte_owner(&cowed_entry, NULL);
+		if (pte_start < vma->vm_start && prev &&
+		    pte_start < prev->vm_end)
+			cow_pte_rss(mm, vma->vm_prev, pmd,
+				    pte_start, prev->vm_end, false /* dec */);
+		if (pte_end > vma->vm_end && next &&
+		    pte_end > next->vm_start)
+			cow_pte_rss(mm, vma->vm_next, pmd,
+				    next->vm_start, pte_end, false /* dec */);
+		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
+		mm_dec_nr_ptes(mm);
+	}
+
+	/* Already handled it, don't reuse cowed table. */
+	pmd_put_pte(vma, &cowed_entry, addr, false);
+
+	VM_BUG_ON(cow_pte_count(pmd) != 1);
+
+	return 0;
+}
+
+static int zap_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		       unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long start, end;
+
+	if (pmd_put_pte(vma, pmd, addr, true)) {
+		/* fallback, reuse pgtable */
+		return 1;
+	}
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	/*
+	 * If PMD entry is owner, clear the ownership,
+	 * and decrease RSS state and pgtable_bytes.
+	 */
+	if (cow_pte_owner_is_same(pmd, pmd)) {
+		set_cow_pte_owner(pmd, NULL);
+		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
+		mm_dec_nr_ptes(mm);
+	}
+
+	pmd_clear(pmd);
+	return 0;
+}
+
+/**
+ * handle_cow_pte - Break COW PTE, copy/dereference the shared PTE table
+ * @vma: target vma want to break COW
+ * @pmd: pmd index that maps to the shared PTE table
+ * @addr: the address trigger the break COW
+ * @alloc: copy PTE table if alloc is true, otherwise dereference
+ *
+ * The address needs to be in the range of the PTE table that the pmd index
+ * mapped. If pmd is NULL, it will get the pmd from vma and check it is COWing.
+ */
+int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+		    bool alloc)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	struct mm_struct *mm = vma->vm_mm;
+	int ret = 0;
+
+	if (!pmd) {
+		pgd = pgd_offset(mm, addr);
+		if (pgd_none_or_clear_bad(pgd))
+			return 0;
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none_or_clear_bad(p4d))
+			return 0;
+		pud = pud_offset(p4d, addr);
+		if (pud_none_or_clear_bad(pud))
+			return 0;
+		pmd = pmd_offset(pud, addr);
+	}
+
+	if (!is_pte_table_cowing(vma, pmd))
+		return 0;
+
+	if (alloc)
+		ret = break_cow_pte(vma, pmd, addr);
+	else
+		ret = zap_cow_pte(vma, pmd, addr);
+
+	return ret;
+}
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry which was
  * read non-atomically.  Before making any commitment, on those architectures
-- 
2.37.3



* [RFC PATCH v2 8/9] mm: Handle COW PTE with reclaim algorithm
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (6 preceding siblings ...)
  2022-09-27 16:29 ` [RFC PATCH v2 7/9] mm: Add the break COW PTE handler Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 16:29 ` [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table Chih-En Lin
  8 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

To avoid the PFRA reclaiming pages that reside in a COWed PTE table,
break COW when rmap is used to unmap the page from all processes.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/rmap.h | 2 ++
 mm/page_vma_mapped.c | 5 +++++
 mm/rmap.c            | 2 +-
 mm/swapfile.c        | 1 +
 mm/vmscan.c          | 1 +
 5 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b89b4b86951f8..5c7e3bedc068b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -312,6 +312,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Break COW PTE during the walking */
+#define PVMW_COW_PTE		(1 << 2)
 
 struct page_vma_mapped_walk {
 	unsigned long pfn;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8e9e574d535aa..5008957bbe4a7 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -251,6 +251,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
+
+		/* TODO: Is breaking COW PTE here correct? */
+		if (pvmw->flags & PVMW_COW_PTE)
+			handle_cow_pte(vma, pvmw->pmd, pvmw->address, false);
+
 		if (!map_pte(pvmw))
 			goto next_pte;
 this_pte:
diff --git a/mm/rmap.c b/mm/rmap.c
index 93d5a6f793d20..8f737cb44e48a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1477,7 +1477,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_COW_PTE);
 	pte_t pteval;
 	struct page *subpage;
 	bool anon_exclusive, ret = true;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1fdccd2f1422e..ef4d3d81a824b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1916,6 +1916,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
+		handle_cow_pte(vma, pmd, addr, false);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2b1431352dcd..030fad3d310d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1822,6 +1822,7 @@ static unsigned int shrink_page_list(struct list_head *page_list,
 		/*
 		 * The folio is mapped into the page tables of one or more
 		 * processes. Try to unmap it here.
+		 * It will write to the page tables, so break COW PTE here.
 		 */
 		if (folio_mapped(folio)) {
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
-- 
2.37.3



* [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (7 preceding siblings ...)
  2022-09-27 16:29 ` [RFC PATCH v2 8/9] mm: Handle COW PTE with reclaim algorithm Chih-En Lin
@ 2022-09-27 16:29 ` Chih-En Lin
  2022-09-27 18:38   ` Nadav Amit
  8 siblings, 1 reply; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 16:29 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng,
	Chih-En Lin

This patch adds the Copy-On-Write (COW) mechanism to the PTE table.
To enable the COW page table, use the sysctl vm.cow_pte with the
corresponding PID. This sets the MMF_COW_PTE_READY flag on the process
to enable COW PTE during the next fork.

The MMF_COW_PTE flag distinguishes the normal page table from the COW
one. Moreover, it is difficult to tell when the entire page table has
left the COW state, so the MMF_COW_PTE flag is never cleared once set.

Since the page table memory is distinct for each process in kernel
space, the address of the PMD entry is used as the PTE table's
ownership to identify which process needs to update the page table
state. In other words, only the owner updates the shared (COWed) PTE
table state, such as the RSS and pgtable_bytes.

Some PTE tables (e.g., pinned pages that reside in the table) still need
to be copied immediately for consistency with the current COW logic. As
a result, a flag, COW_PTE_OWNER_EXCLUSIVE, indicating whether a PTE
table is exclusive (i.e., only one task owns it at a time) is added to
the table’s owner pointer. Every time a PTE table is copied during the
fork, the owner pointer (and thus the exclusive flag) will be checked to
determine whether the PTE table can be shared across processes.

A reference count is used to track the lifetime of the COWed PTE table.
Forking with COW PTE increases the refcount. When someone writes to the
COWed PTE table, the write fault breaks COW PTE. If the COWed PTE
table's refcount is one, the faulting process reuses the COWed PTE
table. Otherwise, the process decreases the refcount and either copies
the entries to a new PTE table or drops all of them, transferring
ownership if it currently owns the COWed PTE table.

If the PTE table were COWed every time a PMD entry is touched, the
reference count of the COWed PTE table could not be kept correct.
Because a VMA's address range may only partially overlap a PTE table,
the copy routine walks the page table per VMA, so a single COW fork
could increase the COWed PTE table's reference count multiple times,
while it should only be increased once for the child's reference. For
example, if two adjacent VMAs both map into the range covered by one
PTE table, the copy routine visits that table twice. To solve this,
before doing the COW, check that the destination PMD entry does not
already exist and that the source PTE table's reference count is more
than one.

This patch modifies the page table copying path to do the basic COW.
For breaking COW, it modifies the page fault, page table zapping,
unmapping, and remapping paths.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/memory.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++----
 mm/mmap.c   |  3 ++
 mm/mremap.c |  3 ++
 3 files changed, 87 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4cf3f74fb183f..c532448b5e086 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -250,6 +250,9 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
+		VM_BUG_ON(cow_pte_count(pmd) != 1);
+		if (!pmd_cow_pte_exclusive(pmd))
+			VM_BUG_ON(!cow_pte_owner_is_same(pmd, NULL));
 		free_pte_range(tlb, pmd, addr);
 	} while (pmd++, addr = next, addr != end);
 
@@ -1006,7 +1009,12 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	 * in the parent and the child
 	 */
 	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		/*
+		 * If parent's PTE table is COWing, keep it as it is.
+		 * Don't set wrprotect to that table.
+		 */
+		if (!__is_pte_table_cowing(src_vma, NULL, addr))
+			ptep_set_wrprotect(src_mm, addr, src_pte);
 		pte = pte_wrprotect(pte);
 	}
 	VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page));
@@ -1197,11 +1205,64 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 				continue;
 			/* fall through */
 		}
-		if (pmd_none_or_clear_bad(src_pmd))
-			continue;
-		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
-				   addr, next))
-			return -ENOMEM;
+
+		if (is_cow_pte_available(src_vma, src_pmd)) {
+			/*
+			 * Setting wrprotect to pmd entry will trigger
+			 * pmd_bad() for normal PTE table. Skip the bad
+			 * checking here.
+			 */
+			if (pmd_none(*src_pmd))
+				continue;
+
+			/* Skip if the PTE already COW this time. */
+			if (!pmd_none(*dst_pmd) && !pmd_write(*dst_pmd))
+				continue;
+
+			/*
+			 * If PTE doesn't have an owner, the parent needs to
+			 * take this PTE.
+			 */
+			if (cow_pte_owner_is_same(src_pmd, NULL)) {
+				set_cow_pte_owner(src_pmd, src_pmd);
+				/*
+				 * XXX: The process may do a COW PTE fork more
+				 * than once. In some situations the owner has
+				 * already been cleared: the previous child (now
+				 * this fork's parent) did a COW PTE fork, and
+				 * the previous parent (the owner) broke COW, so
+				 * the RSS state and pgtable bytes must be re-added.
+				 */
+				if (!pmd_write(*src_pmd)) {
+					cow_pte_rss(src_mm, src_vma, src_pmd,
+						    get_pmd_start_edge(src_vma,
+									addr),
+						    get_pmd_end_edge(src_vma,
+									addr),
+						    true /* inc */);
+					/* Do we need pt lock here? */
+					mm_inc_nr_ptes(src_mm);
+					/* See the comments in pmd_install(). */
+					smp_wmb();
+					pmd_populate(src_mm, src_pmd,
+						     pmd_page(*src_pmd));
+				}
+			}
+
+			pmdp_set_wrprotect(src_mm, addr, src_pmd);
+
+			/* Child reference count */
+			pmd_get_pte(src_pmd);
+
+			/* COW for PTE table */
+			set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
+		} else {
+			if (pmd_none_or_clear_bad(src_pmd))
+				continue;
+			if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
+					   addr, next))
+				return -ENOMEM;
+		}
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -1594,6 +1655,10 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			spin_unlock(ptl);
 		}
 
+		/* TODO: Does the TLB need to flush page info in the COWed table? */
+		if (is_pte_table_cowing(vma, pmd))
+			handle_cow_pte(vma, pmd, addr, false);
+
 		/*
 		 * Here there can be other concurrent MADV_DONTNEED or
 		 * trans huge page faults running, and if the pmd is
@@ -5321,6 +5386,16 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				return 0;
 			}
 		}
+
+		/*
+		 * When the PMD entry is write-protected, handle the PTE table
+		 * on demand: allocate a new PTE table and copy the old one,
+		 * then make this entry writable and decrease the reference
+		 * count of the COW PTE table.
+		 */
+		if (handle_cow_pte(vmf.vma, vmf.pmd, vmf.real_address,
+				   cow_pte_count(&vmf.orig_pmd) > 1) < 0)
+			return VM_FAULT_OOM;
 	}
 
 	return handle_pte_fault(&vmf);
diff --git a/mm/mmap.c b/mm/mmap.c
index 9d780f415be3c..463359292f8a9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2685,6 +2685,9 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 			return err;
 	}
 
+	if (handle_cow_pte(vma, NULL, addr, true) < 0)
+		return -ENOMEM;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
diff --git a/mm/mremap.c b/mm/mremap.c
index b522cd0259a0f..14f6ad250289c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,6 +532,9 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
+
+		handle_cow_pte(vma, old_pmd, old_addr, true);
+
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table
  2022-09-27 16:29 ` [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table Chih-En Lin
@ 2022-09-27 17:23   ` Nadav Amit
  2022-09-27 17:36     ` Chih-En Lin
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2022-09-27 17:23 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, LKML, Linux MM, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> Add MMF_COW_PTE{, _READY} flags to prepare the subsequent
> implementation of Copy-On-Write for the page table.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
> include/linux/sched/coredump.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 4d0a5be28b70f..f03ff69c90c8c 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -84,7 +84,13 @@ static inline int get_dumpable(struct mm_struct *mm)
> #define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
> #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
> 
> +#define MMF_COW_PTE_READY	29
> +#define MMF_COW_PTE_READY_MASK	(1 << MMF_COW_PTE_READY)
> +
> +#define MMF_COW_PTE		30
> +#define MMF_COW_PTE_MASK	(1 << MMF_COW_PTE)

I am not sure how much sense it makes to put it in a separate patch, and it
is rather hard to understand the new flags without proper documentation and
comments.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE
  2022-09-27 16:29 ` [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE Chih-En Lin
@ 2022-09-27 17:27   ` Nadav Amit
  2022-09-27 18:05     ` Chih-En Lin
  2022-09-27 21:22   ` John Hubbard
  1 sibling, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2022-09-27 17:27 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> Add a new sysctl vm.cow_pte to set MMF_COW_PTE_READY flag for enabling
> copy-on-write (COW) to the PTE page table during the next time of fork.
> 
> Since it has a time gap between using the sysctl to enable the COW PTE
> and doing the fork, we use two states to determine the task that wants
> to do COW PTE or already doing it.

I don’t get why it is needed in general and certainly why sysctl controls
this behavior.

IIUC, it sounds that you want prctl and not sysctl for such control. But
clearly you think that this control is needed because there is a tradeoff.
Please explain the tradeoff and how users are expected to make a decision
whether to turn the flag or not.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table
  2022-09-27 16:29 ` [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table Chih-En Lin
@ 2022-09-27 17:30   ` Nadav Amit
  2022-09-27 18:23     ` Chih-En Lin
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2022-09-27 17:30 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> Introduce the ownership to the PTE table. It uses the address of PMD
> index to track the ownership to identify which process can update
> their page table state from the COWed PTE table.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
> include/linux/mm.h       |  1 +
> include/linux/mm_types.h |  5 ++++-
> include/linux/pgtable.h  | 10 ++++++++++
> 3 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21f8b27bd9fd3..965523dcca3b8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2289,6 +2289,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
> 		return false;
> 	__SetPageTable(page);
> 	inc_lruvec_page_state(page, NR_PAGETABLE);
> +	page->cow_pte_owner = NULL;
> 	return true;
> }
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index cf97f3884fda2..42798b59cec4e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -152,7 +152,10 @@ struct page {
> 			struct list_head deferred_list;
> 		};
> 		struct {	/* Page table pages */
> -			unsigned long _pt_pad_1;	/* compound_head */
> +			union {
> +				unsigned long _pt_pad_1; /* compound_head */
> +				pmd_t *cow_pte_owner; /* cow pte: pmd */
> +			};
> 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
> 			unsigned long _pt_pad_2;	/* mapping */
> 			union {
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index d03d01aefe989..9dca787a3f4dd 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -615,6 +615,16 @@ static inline int pte_unused(pte_t pte)
> }
> #endif
> 
> +static inline void set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
> +{
> +	smp_store_release(&pmd_page(*pmd)->cow_pte_owner, owner);
> +}
> +
> +static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
> +{
> +	return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;
> +}

Hiding synchronization primitives in such manner, and especially without
proper comments, makes it hard to understand what the ordering is supposed
to achieve.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table
  2022-09-27 17:23   ` Nadav Amit
@ 2022-09-27 17:36     ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 17:36 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, LKML, Linux MM, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 05:23:58PM +0000, Nadav Amit wrote:
> On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:
> 
> > Add MMF_COW_PTE{, _READY} flags to prepare the subsequent
> > implementation of Copy-On-Write for the page table.
> > 
> > Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> > ---
> > include/linux/sched/coredump.h | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> > index 4d0a5be28b70f..f03ff69c90c8c 100644
> > --- a/include/linux/sched/coredump.h
> > +++ b/include/linux/sched/coredump.h
> > @@ -84,7 +84,13 @@ static inline int get_dumpable(struct mm_struct *mm)
> > #define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
> > #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
> > 
> > +#define MMF_COW_PTE_READY	29
> > +#define MMF_COW_PTE_READY_MASK	(1 << MMF_COW_PTE_READY)
> > +
> > +#define MMF_COW_PTE		30
> > +#define MMF_COW_PTE_MASK	(1 << MMF_COW_PTE)
> 
> I am not sure how much sense it makes to put it in a separate patch, and it
> is rather hard to understand the new flags without proper documentation and
> comments.
> 

I had considered putting it with the sysctl patch, but since these two
flags are not specific to the sysctl, I put them in a separate patch.
Maybe I can put the two (sysctl and this patch) together with a suitable
commit message; I will consider it again.
I will also add proper documentation/comments in the next version.
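
Something like this, as a rough sketch of the comments I have in mind
(exact wording still to be worked out), on top of the definitions above:

	/*
	 * MMF_COW_PTE_READY: the user asked for COW PTE (via the sysctl);
	 * it takes effect at the next fork().
	 * MMF_COW_PTE: this mm has forked with COW PTE at least once, so
	 * some of its PTE tables may be shared (COWed).
	 */
	#define MMF_COW_PTE_READY	29
	#define MMF_COW_PTE_READY_MASK	(1 << MMF_COW_PTE_READY)

	#define MMF_COW_PTE		30
	#define MMF_COW_PTE_MASK	(1 << MMF_COW_PTE)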

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions
  2022-09-27 16:29 ` [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions Chih-En Lin
@ 2022-09-27 17:51   ` Nadav Amit
  2022-09-27 19:00     ` Chih-En Lin
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2022-09-27 17:51 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> The lifetime of COWed PTE table will handle by a reference count.
> When the process wants to write the COWed PTE table, which refcount
> is 1, it will reuse the shared PTE.
> 
> Since only the owner will update their page table state. the fallback
> function also needs to handle the situation of non-owner COWed PTE table
> fallback to normal PTE.
> 
> This commit prepares for the following implementation of the reference
> count for COW PTE.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
> include/linux/pgtable.h |  3 ++
> mm/memory.c             | 93 +++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 96 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 9dca787a3f4dd..25c1e5c42fdf3 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -615,6 +615,9 @@ static inline int pte_unused(pte_t pte)
> }
> #endif
> 
> +void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
> +		      unsigned long addr);
> +
> static inline void set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
> {
> 	smp_store_release(&pmd_page(*pmd)->cow_pte_owner, owner);
> diff --git a/mm/memory.c b/mm/memory.c
> index 4ba73f5aa8bb7..d29f84801f3cd 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -509,6 +509,37 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> 			add_mm_counter(mm, i, rss[i]);
> }
> 
> +static void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
> +			       pmd_t *pmdp, unsigned long addr,
> +			       unsigned long end, bool inc_dec)
> +{
> +	int rss[NR_MM_COUNTERS];
> +	spinlock_t *ptl;
> +	pte_t *orig_ptep, *ptep;
> +	struct page *page;
> +
> +	init_rss_vec(rss);
> +
> +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +	orig_ptep = ptep;
> +	arch_enter_lazy_mmu_mode();
> +	do {
> +		if (pte_none(*ptep))
> +			continue;
> +
> +		page = vm_normal_page(vma, addr, *ptep);
> +		if (page) {
> +			if (inc_dec)
> +				rss[mm_counter(page)]++;
> +			else
> +				rss[mm_counter(page)]--;
> +		}
> +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> +	arch_leave_lazy_mmu_mode();
> +	pte_unmap_unlock(orig_ptep, ptl);

It seems to me very fragile to separate the accounting from the actual
operation. I do not see copying of the pages here, so why is the RSS
updated?

> +	add_mm_rss_vec(mm, rss);
> +}
> +
> /*
>  * This function is called to print an error when a bad pte
>  * is found. For example, we might have a PFN-mapped pte in
> @@ -2817,6 +2848,68 @@ int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
> }
> EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
> 
> +/**
> + * cow_pte_fallback - reuse the shared PTE table
> + * @vma: vma that coever the shared PTE table
> + * @pmd: pmd index maps to the shared PTE table
> + * @addr: the address trigger the break COW,
> + *
> + * Reuse the shared (COW) PTE table when the refcount is equal to one.
> + * @addr needs to be in the range of the shared PTE table that @vma and
> + * @pmd mapped to it.
> + *
> + * COW PTE fallback to normal PTE:
> + * - two state here
> + *   - After break child :   [parent, rss=1, ref=1, write=NO , owner=parent]
> + *                        to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
> + *   - After break parent:   [child , rss=0, ref=1, write=NO , owner=NULL  ]
> + *                        to [child , rss=1, ref=1, write=YES, owner=NULL  ]
> + */
> +void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
> +		      unsigned long addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct vm_area_struct *prev = vma->vm_prev;
> +	struct vm_area_struct *next = vma->vm_next;
> +	unsigned long start, end;
> +	pmd_t new;
> +
> +	VM_WARN_ON(pmd_write(*pmd));
> +
> +	start = addr & PMD_MASK;
> +	end = (addr + PMD_SIZE) & PMD_MASK;
> +
> +	/*
> +	 * If pmd is not owner, it needs to increase the rss.
> +	 * Since only the owner has the RSS state for the COW PTE.
> +	 */
> +	if (!cow_pte_owner_is_same(pmd, pmd)) {
> +		/* The part of address range is covered by previous. */
> +		if (start < vma->vm_start && prev && start < prev->vm_end) {
> +			cow_pte_rss(mm, prev, pmd,
> +				    start, prev->vm_end, true /* inc */);
> +			start = vma->vm_start;
> +		}
> +		/* The part of address range is covered by next. */
> +		if (end > vma->vm_end && next && end > next->vm_start) {
> +			cow_pte_rss(mm, next, pmd,
> +				    next->vm_start, end, true /* inc */);
> +			end = vma->vm_end;
> +		}
> +		cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
> +
> +		mm_inc_nr_ptes(mm);
> +		/* Memory barrier here is the same as pmd_install(). */
> +		smp_wmb();
> +		pmd_populate(mm, pmd, pmd_page(*pmd));
> +	}
> +
> +	/* Reuse the pte page */
> +	set_cow_pte_owner(pmd, NULL);
> +	new = pmd_mkwrite(*pmd);
> +	set_pmd_at(mm, addr, pmd, new);
> +}

Again, kind of hard to understand such a function without a context
(caller). For instance, is there any lock that prevents
cow_pte_owner_is_same() from racing with change of the owner?

I would expect to see first patches that always copy the PTEs without
reusing the PTEs and only then a PTE reuse logic as an optimization.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table
  2022-09-27 16:29 ` [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table Chih-En Lin
@ 2022-09-27 17:59   ` Nadav Amit
  2022-09-27 19:07     ` Chih-En Lin
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2022-09-27 17:59 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, LKML, Linux MM, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> Reuse the _refcount in struct page for the page table to maintain the
> number of process references to COWed PTE table. Before decreasing the
> refcount, it will check whether refcount is one or not for reusing
> shared PTE table.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
> include/linux/mm.h      |  1 +
> include/linux/pgtable.h | 28 ++++++++++++++++++++++++++++
> mm/memory.c             |  1 +
> 3 files changed, 30 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 965523dcca3b8..bfe6a8c7ab9ed 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2290,6 +2290,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
> 	__SetPageTable(page);
> 	inc_lruvec_page_state(page, NR_PAGETABLE);
> 	page->cow_pte_owner = NULL;
> +	set_page_count(page, 1);
> 	return true;
> }
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 25c1e5c42fdf3..8b497d7d800ed 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -9,6 +9,7 @@
> #ifdef CONFIG_MMU
> 
> #include <linux/mm_types.h>
> +#include <linux/page_ref.h>
> #include <linux/bug.h>
> #include <linux/errno.h>
> #include <asm-generic/pgtable_uffd.h>
> @@ -628,6 +629,33 @@ static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
> 	return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;
> }
> 
> +static inline int pmd_get_pte(pmd_t *pmd)
> +{
> +	return page_ref_inc_return(pmd_page(*pmd));
> +}
> +
> +/*
> + * If the COW PTE refcount is 1, instead of decreasing the counter,
> + * clear write protection of the corresponding PMD entry and reset
> + * the COW PTE owner to reuse the table.
> + * But if the reuse parameter is false, do not thing. This help us
> + * to handle the situation that PTE table we already handled.
> + */
> +static inline int pmd_put_pte(struct vm_area_struct *vma, pmd_t *pmd,
> +			      unsigned long addr, bool reuse)
> +{
> +	if (!page_ref_add_unless(pmd_page(*pmd), -1, 1) && reuse) {
> +		cow_pte_fallback(vma, pmd, addr);

Is there some assumption that pmd_get_pte() would not be called between the
page_ref_add_unless() and cow_pte_fallback()?

Hard to know without comments or context.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE
  2022-09-27 17:27   ` Nadav Amit
@ 2022-09-27 18:05     ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 18:05 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 05:27:45PM +0000, Nadav Amit wrote:
> On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:
> 
> > Add a new sysctl vm.cow_pte to set MMF_COW_PTE_READY flag for enabling
> > copy-on-write (COW) to the PTE page table during the next time of fork.
> > 
> > Since it has a time gap between using the sysctl to enable the COW PTE
> > and doing the fork, we use two states to determine the task that wants
> > to do COW PTE or already doing it.
> 
> I don’t get why it is needed in general and certainly why sysctl controls
> this behavior.
> 
> IIUC, it sounds that you want prctl and not sysctl for such control. But
> clearly you think that this control is needed because there is a tradeoff.
> Please explain the tradeoff and how users are expected to make a decision
> whether to turn the flag or not.
> 

Applying COW to the page table is a significant change to the kernel,
which is why I went with a sysctl at first. But prctl might be the
better choice.

As for the tradeoff: in some cases (like running a command in the
terminal), enabling COW for the page table only adds overhead from the
extra page faults (breaking COW) and gains nothing from the COW
mechanism. So we let users decide which process should enable the COW
page table.

The expected user is typically a process that uses a lot of memory and
wants to create new processes as isolated environments (e.g., a fuzzer,
a container, etc.). For such workloads, extending COW to the page table
can improve startup time and memory usage (memory is allocated on
demand).
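
For example, a fuzzer-style parent could opt in before forking, roughly
like this (untested sketch against the vm.cow_pte sysctl from this
series; error handling mostly omitted):

	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		FILE *f = fopen("/proc/sys/vm/cow_pte", "w");

		if (!f)
			return 1;
		/* Request COW PTE for this process; takes effect at the next fork(). */
		fprintf(f, "%d\n", (int)getpid());
		fclose(f);

		if (fork() == 0) {
			/* Child shares the parent's PTE tables until a write fault. */
			_exit(0);
		}
		return 0;
	}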

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 7/9] mm: Add the break COW PTE handler
  2022-09-27 16:29 ` [RFC PATCH v2 7/9] mm: Add the break COW PTE handler Chih-En Lin
@ 2022-09-27 18:15   ` Nadav Amit
  2022-09-27 19:23     ` Chih-En Lin
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2022-09-27 18:15 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> To handle the COW PTE with write fault, introduce the helper function
> handle_cow_pte(). The function provides two behaviors. One is breaking
> COW by decreasing the refcount, pgables_bytes, and RSS. Another is
> copying all the information in the shared PTE table by using
> copy_pte_page() with a wrapper.
> 
> Also, add the wrapper functions to help us find out the COWed or
> COW-available PTE table.
> 

[ snip ]

> +static inline int copy_cow_pte_range(struct vm_area_struct *vma,
> +				     pmd_t *dst_pmd, pmd_t *src_pmd,
> +				     unsigned long start, unsigned long end)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct mmu_notifier_range range;
> +	int ret;
> +	bool is_cow;
> +
> +	is_cow = is_cow_mapping(vma->vm_flags);
> +	if (is_cow) {
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> +					0, vma, mm, start, end);
> +		mmu_notifier_invalidate_range_start(&range);
> +		mmap_assert_write_locked(mm);
> +		raw_write_seqcount_begin(&mm->write_protect_seq);
> +	}
> +
> +	ret = copy_pte_range(vma, vma, dst_pmd, src_pmd, start, end);
> +
> +	if (is_cow) {
> +		raw_write_seqcount_end(&mm->write_protect_seq);
> +		mmu_notifier_invalidate_range_end(&range);

Usually, I would expect mmu-notifiers and TLB flushes to be initiated at the
same point in the code. Presumably you changed protection, so you do need a
TLB flush, right? Is it done elsewhere?

> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Break COW PTE, two state here:
> + *   - After fork :   [parent, rss=1, ref=2, write=NO , owner=parent]
> + *                 to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
> + *                    COW PTE become [ref=1, write=NO , owner=NULL  ]
> + *                    [child , rss=0, ref=2, write=NO , owner=parent]
> + *                 to [child , rss=1, ref=1, write=YES, owner=NULL  ]
> + *                    COW PTE become [ref=1, write=NO , owner=parent]
> + *   NOTE
> + *     - Copy the COW PTE to new PTE.
> + *     - Clear the owner of COW PTE and set PMD entry writable when it is owner.
> + *     - Increase RSS if it is not owner.
> + */
> +static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> +			 unsigned long addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	unsigned long pte_start, pte_end;
> +	unsigned long start, end;
> +	struct vm_area_struct *prev = vma->vm_prev;
> +	struct vm_area_struct *next = vma->vm_next;
> +	pmd_t cowed_entry = *pmd;
> +
> +	if (cow_pte_count(&cowed_entry) == 1) {
> +		cow_pte_fallback(vma, pmd, addr);
> +		return 1;
> +	}
> +
> +	pte_start = start = addr & PMD_MASK;
> +	pte_end = end = (addr + PMD_SIZE) & PMD_MASK;
> +
> +	pmd_clear(pmd);
> +	/*
> +	 * If the vma does not cover the entire address range of the PTE table,
> +	 * it should check the previous and next.
> +	 */
> +	if (start < vma->vm_start && prev) {
> +		/* The part of address range is covered by previous. */
> +		if (start < prev->vm_end)
> +			copy_cow_pte_range(prev, pmd, &cowed_entry,
> +					   start, prev->vm_end);
> +		start = vma->vm_start;
> +	}
> +	if (end > vma->vm_end && next) {
> +		/* The part of address range is covered by next. */
> +		if (end > next->vm_start)
> +			copy_cow_pte_range(next, pmd, &cowed_entry,
> +					   next->vm_start, end);
> +		end = vma->vm_end;
> +	}
> +	if (copy_cow_pte_range(vma, pmd, &cowed_entry, start, end))
> +		return -ENOMEM;
> +
> +	/*
> +	 * Here, it is the owner, so clear the ownership. To keep RSS state and
> +	 * page table bytes correct, it needs to decrease them.
> +	 * Also, handle the address range issue here.
> +	 */
> +	if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
> +		set_cow_pte_owner(&cowed_entry, NULL);

Presumably there is some assumption on atomicity here. Otherwise, two
threads can run the following code, which is wrong, no? Yet, I do not see
anything that provides such atomicity.

> +		if (pte_start < vma->vm_start && prev &&
> +		    pte_start < prev->vm_end)
> +			cow_pte_rss(mm, vma->vm_prev, pmd,
> +				    pte_start, prev->vm_end, false /* dec */);
> +		if (pte_end > vma->vm_end && next &&
> +		    pte_end > next->vm_start)
> +			cow_pte_rss(mm, vma->vm_next, pmd,
> +				    next->vm_start, pte_end, false /* dec */);
> +		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
> +		mm_dec_nr_ptes(mm);
> +	}
> +
> +	/* Already handled it, don't reuse cowed table. */
> +	pmd_put_pte(vma, &cowed_entry, addr, false);
> +
> +	VM_BUG_ON(cow_pte_count(pmd) != 1);

Don’t use VM_BUG_ON().


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table
  2022-09-27 17:30   ` Nadav Amit
@ 2022-09-27 18:23     ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 18:23 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 05:30:39PM +0000, Nadav Amit wrote:
> > +static inline void set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
> > +{
> > +	smp_store_release(&pmd_page(*pmd)->cow_pte_owner, owner);
> > +}
> > +
> > +static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
> > +{
> > +	return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;
> > +}
> 
> Hiding synchronization primitives in such manner, and especially without
> proper comments, makes it hard to understand what the ordering is supposed
> to achieve.
> 

It ensures that every time we store a new owner pointer, a reader will
observe the newest value, along with everything written before the
store.
I will add the comments.
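
Roughly something like this, on top of the helpers above (just a sketch
of the comments, not new logic):

	/*
	 * Pairs with the smp_load_acquire() in cow_pte_owner_is_same():
	 * publish the new owner so that a reader seeing the new owner also
	 * sees everything written before this store.
	 */
	static inline void set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
	{
		smp_store_release(&pmd_page(*pmd)->cow_pte_owner, owner);
	}

	/* Pairs with the smp_store_release() in set_cow_pte_owner(). */
	static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
	{
		return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;
	}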

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-27 16:29 ` [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table Chih-En Lin
@ 2022-09-27 18:38   ` Nadav Amit
  2022-09-27 19:53     ` Chih-En Lin
  0 siblings, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2022-09-27 18:38 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> This patch adds the Copy-On-Write (COW) mechanism to the PTE table.
> To enable the COW page table use the sysctl vm.cow_pte file with the
> corresponding PID. It will set the MMF_COW_PTE_READY flag to the
> process for enabling COW PTE during the next time of fork.
> 
> It uses the MMF_COW_PTE flag to distinguish the normal page table
> and the COW one. Moreover, it is difficult to distinguish whether the
> entire page table is out of COW state. So the MMF_COW_PTE flag won't be
> disabled after its setup.
> 
> Since the memory space of the page table is distinctive for each process
> in kernel space. It uses the address of the PMD index for the PTE table
> ownership to identify which one of the processes needs to update the
> page table state. In other words, only the owner will update shared
> (COWed) PTE table state, like the RSS and pgtable_bytes.
> 
> Some PTE tables (e.g., pinned pages that reside in the table) still need
> to be copied immediately for consistency with the current COW logic. As
> a result, a flag, COW_PTE_OWNER_EXCLUSIVE, indicating whether a PTE
> table is exclusive (i.e., only one task owns it at a time) is added to
> the table’s owner pointer. Every time a PTE table is copied during the
> fork, the owner pointer (and thus the exclusive flag) will be checked to
> determine whether the PTE table can be shared across processes.
> 
> It uses a reference count to track the lifetime of COWed PTE table.
> Doing the fork with COW PTE will increase the refcount. And, when
> someone writes to the COWed PTE table, it will cause the write fault to
> break COW PTE. If the COWed PTE table's refcount is one, the process
> that triggers the fault will reuse the COWed PTE table. Otherwise, the
> process will decrease the refcount, copy the information to a new PTE
> table or dereference all the information and change the owner if they
> have the COWed PTE table.
> 
> If doing the COW to the PTE table once as the time touching the PMD
> entry, it cannot preserves the reference count of the COWed PTE table.
> Since the address range of VMA may overlap the PTE table, the copying
> function will use VMA to travel the page table for copying it. So it may
> increase the reference count of the COWed PTE table multiple times in
> one COW page table forking. Generically it will only increase once time
> as the child reference it. To solve this problem, it needs to check the
> destination of PMD entry does exist. And the reference count of the
> source PTE table is more than one before doing the COW.
> 
> This patch modifies the part of the copy page table to do the basic COW.
> For the break COW, it modifies the part of a page fault, zaps page table
> , unmapping, and remapping.

I only skimmed the patches that you sent. The last couple of patches seem a
bit rough and dirty, so I am sorry to say that I skipped them (too many
“TODO” and “XXX” for my taste).

I am sure other will have better feedback than me. I understand there is a
tradeoff and that this mechanism is mostly for high performance
snapshotting/forking. It would be beneficial to see whether this mechanism
can somehow be combined with existing ones (mshare?).

The code itself can be improved. I found the reasoning about synchronization
and TLB flushes and synchronizations to be lacking, and the code to seem
potentially incorrect. Better comments would help, even if the code is
correct.

There are additional general questions. For instance, when sharing a
page-table, do you properly update the refcount/mapcount of the mapped
pages? And are there any possible interactions with THP?

Thanks,
Nadav

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions
  2022-09-27 17:51   ` Nadav Amit
@ 2022-09-27 19:00     ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 19:00 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 05:51:19PM +0000, Nadav Amit wrote:
> > +static void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
> > +			       pmd_t *pmdp, unsigned long addr,
> > +			       unsigned long end, bool inc_dec)
> > +{
> > +	int rss[NR_MM_COUNTERS];
> > +	spinlock_t *ptl;
> > +	pte_t *orig_ptep, *ptep;
> > +	struct page *page;
> > +
> > +	init_rss_vec(rss);
> > +
> > +	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> > +	orig_ptep = ptep;
> > +	arch_enter_lazy_mmu_mode();
> > +	do {
> > +		if (pte_none(*ptep))
> > +			continue;
> > +
> > +		page = vm_normal_page(vma, addr, *ptep);
> > +		if (page) {
> > +			if (inc_dec)
> > +				rss[mm_counter(page)]++;
> > +			else
> > +				rss[mm_counter(page)]--;
> > +		}
> > +	} while (ptep++, addr += PAGE_SIZE, addr != end);
> > +	arch_leave_lazy_mmu_mode();
> > +	pte_unmap_unlock(orig_ptep, ptl);
> 
> It seems to me very fragile to separate the accounting from the actual
> operation. I do not see copying of the pages here, so why is the RSS
> updated?

There is a situation where a process did not do the accounting for the
shared table during the COW fork but later wants to reuse that table, so
we need to restore the RSS state here.
On the other hand, when unmapping the COWed table we want to remove that
state.

> 
> > +	add_mm_rss_vec(mm, rss);
> > +}
> > +
> > /*
> >  * This function is called to print an error when a bad pte
> >  * is found. For example, we might have a PFN-mapped pte in
> > @@ -2817,6 +2848,68 @@ int apply_to_existing_page_range(struct mm_struct *mm, unsigned long addr,
> > }
> > EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
> > 
> > +/**
> > + * cow_pte_fallback - reuse the shared PTE table
> > + * @vma: vma that coever the shared PTE table
> > + * @pmd: pmd index maps to the shared PTE table
> > + * @addr: the address trigger the break COW,
> > + *
> > + * Reuse the shared (COW) PTE table when the refcount is equal to one.
> > + * @addr needs to be in the range of the shared PTE table that @vma and
> > + * @pmd mapped to it.
> > + *
> > + * COW PTE fallback to normal PTE:
> > + * - two state here
> > + *   - After break child :   [parent, rss=1, ref=1, write=NO , owner=parent]
> > + *                        to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
> > + *   - After break parent:   [child , rss=0, ref=1, write=NO , owner=NULL  ]
> > + *                        to [child , rss=1, ref=1, write=YES, owner=NULL  ]
> > + */
> > +void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
> > +		      unsigned long addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct vm_area_struct *prev = vma->vm_prev;
> > +	struct vm_area_struct *next = vma->vm_next;
> > +	unsigned long start, end;
> > +	pmd_t new;
> > +
> > +	VM_WARN_ON(pmd_write(*pmd));
> > +
> > +	start = addr & PMD_MASK;
> > +	end = (addr + PMD_SIZE) & PMD_MASK;
> > +
> > +	/*
> > +	 * If pmd is not owner, it needs to increase the rss.
> > +	 * Since only the owner has the RSS state for the COW PTE.
> > +	 */
> > +	if (!cow_pte_owner_is_same(pmd, pmd)) {
> > +		/* The part of address range is covered by previous. */
> > +		if (start < vma->vm_start && prev && start < prev->vm_end) {
> > +			cow_pte_rss(mm, prev, pmd,
> > +				    start, prev->vm_end, true /* inc */);
> > +			start = vma->vm_start;
> > +		}
> > +		/* The part of address range is covered by next. */
> > +		if (end > vma->vm_end && next && end > next->vm_start) {
> > +			cow_pte_rss(mm, next, pmd,
> > +				    next->vm_start, end, true /* inc */);
> > +			end = vma->vm_end;
> > +		}
> > +		cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
> > +
> > +		mm_inc_nr_ptes(mm);
> > +		/* Memory barrier here is the same as pmd_install(). */
> > +		smp_wmb();
> > +		pmd_populate(mm, pmd, pmd_page(*pmd));
> > +	}
> > +
> > +	/* Reuse the pte page */
> > +	set_cow_pte_owner(pmd, NULL);
> > +	new = pmd_mkwrite(*pmd);
> > +	set_pmd_at(mm, addr, pmd, new);
> > +}
> 
> Again, kind of hard to understand such a function without a context
> (caller). For instance, is there any lock that prevents
> cow_pte_owner_is_same() from racing with change of the owner?
> 

It is called by the refcount operation and the break COW handler
when the refcount is 1.
Also, it uses synchronization primitives (in set_cow_pte_owner() and
cow_pte_owner_is_same()) to prevent the race.

> I would expect to see first patches that always copy the PTEs without
> reusing the PTEs and only then a PTE reuse logic as an optimization.
> 

I will restructure all the commits to make the logic clear. 

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table
  2022-09-27 17:59   ` Nadav Amit
@ 2022-09-27 19:07     ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 19:07 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, LKML, Linux MM, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 05:59:16PM +0000, Nadav Amit wrote:
> Is there some assumption that pmd_get_pte() would not be called between the
> page_ref_add_unless() and cow_pte_fallback()?
> 
> Hard to know without comments or context.

Yes.
It is one of the corner cases that I still need to handle.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 7/9] mm: Add the break COW PTE handler
  2022-09-27 18:15   ` Nadav Amit
@ 2022-09-27 19:23     ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 19:23 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 06:15:34PM +0000, Nadav Amit wrote:
> On Sep 27, 2022, at 9:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:
> 
> > To handle the COW PTE with write fault, introduce the helper function
> > handle_cow_pte(). The function provides two behaviors. One is breaking
> > COW by decreasing the refcount, pgables_bytes, and RSS. Another is
> > copying all the information in the shared PTE table by using
> > copy_pte_page() with a wrapper.
> > 
> > Also, add the wrapper functions to help us find out the COWed or
> > COW-available PTE table.
> > 
> 
> [ snip ]
> 
> > +static inline int copy_cow_pte_range(struct vm_area_struct *vma,
> > +				     pmd_t *dst_pmd, pmd_t *src_pmd,
> > +				     unsigned long start, unsigned long end)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	struct mmu_notifier_range range;
> > +	int ret;
> > +	bool is_cow;
> > +
> > +	is_cow = is_cow_mapping(vma->vm_flags);
> > +	if (is_cow) {
> > +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
> > +					0, vma, mm, start, end);
> > +		mmu_notifier_invalidate_range_start(&range);
> > +		mmap_assert_write_locked(mm);
> > +		raw_write_seqcount_begin(&mm->write_protect_seq);
> > +	}
> > +
> > +	ret = copy_pte_range(vma, vma, dst_pmd, src_pmd, start, end);
> > +
> > +	if (is_cow) {
> > +		raw_write_seqcount_end(&mm->write_protect_seq);
> > +		mmu_notifier_invalidate_range_end(&range);
> 
> Usually, I would expect mmu-notifiers and TLB flushes to be initiated at the
> same point in the code. Presumably you changed protection, so you do need a
> TLB flush, right? Is it done elsewhere?

You're right.
I will add TLB flushes here.
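
Something like the following is what I have in mind (untested, on top of
copy_cow_pte_range() above):

	ret = copy_pte_range(vma, vma, dst_pmd, src_pmd, start, end);

	if (is_cow) {
		raw_write_seqcount_end(&mm->write_protect_seq);
		/*
		 * The entries in this range were write-protected above, so
		 * flush any stale writable TLB entries before ending the
		 * notifier range.
		 */
		flush_tlb_range(vma, start, end);
		mmu_notifier_invalidate_range_end(&range);
	}
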
Thanks.

> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +/*
> > + * Break COW PTE, two state here:
> > + *   - After fork :   [parent, rss=1, ref=2, write=NO , owner=parent]
> > + *                 to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
> > + *                    COW PTE become [ref=1, write=NO , owner=NULL  ]
> > + *                    [child , rss=0, ref=2, write=NO , owner=parent]
> > + *                 to [child , rss=1, ref=1, write=YES, owner=NULL  ]
> > + *                    COW PTE become [ref=1, write=NO , owner=parent]
> > + *   NOTE
> > + *     - Copy the COW PTE to new PTE.
> > + *     - Clear the owner of COW PTE and set PMD entry writable when it is owner.
> > + *     - Increase RSS if it is not owner.
> > + */
> > +static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +			 unsigned long addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	unsigned long pte_start, pte_end;
> > +	unsigned long start, end;
> > +	struct vm_area_struct *prev = vma->vm_prev;
> > +	struct vm_area_struct *next = vma->vm_next;
> > +	pmd_t cowed_entry = *pmd;
> > +
> > +	if (cow_pte_count(&cowed_entry) == 1) {
> > +		cow_pte_fallback(vma, pmd, addr);
> > +		return 1;
> > +	}
> > +
> > +	pte_start = start = addr & PMD_MASK;
> > +	pte_end = end = (addr + PMD_SIZE) & PMD_MASK;
> > +
> > +	pmd_clear(pmd);
> > +	/*
> > +	 * If the vma does not cover the entire address range of the PTE table,
> > +	 * it should check the previous and next.
> > +	 */
> > +	if (start < vma->vm_start && prev) {
> > +		/* The part of address range is covered by previous. */
> > +		if (start < prev->vm_end)
> > +			copy_cow_pte_range(prev, pmd, &cowed_entry,
> > +					   start, prev->vm_end);
> > +		start = vma->vm_start;
> > +	}
> > +	if (end > vma->vm_end && next) {
> > +		/* The part of address range is covered by next. */
> > +		if (end > next->vm_start)
> > +			copy_cow_pte_range(next, pmd, &cowed_entry,
> > +					   next->vm_start, end);
> > +		end = vma->vm_end;
> > +	}
> > +	if (copy_cow_pte_range(vma, pmd, &cowed_entry, start, end))
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * Here, it is the owner, so clear the ownership. To keep RSS state and
> > +	 * page table bytes correct, it needs to decrease them.
> > +	 * Also, handle the address range issue here.
> > +	 */
> > +	if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
> > +		set_cow_pte_owner(&cowed_entry, NULL);
> 
> Presumably there is some assumption on atomicity here. Otherwise, two
> threads can run the following code, which is wrong, no? Yet, I do not see
> anything that provides such atomicity.

Multiple processes may access this, but for threads within the same
process I assume they need to hold the mmap_lock. Maybe I need to add
the assert here too.

> 
> > +		if (pte_start < vma->vm_start && prev &&
> > +		    pte_start < prev->vm_end)
> > +			cow_pte_rss(mm, vma->vm_prev, pmd,
> > +				    pte_start, prev->vm_end, false /* dec */);
> > +		if (pte_end > vma->vm_end && next &&
> > +		    pte_end > next->vm_start)
> > +			cow_pte_rss(mm, vma->vm_next, pmd,
> > +				    next->vm_start, pte_end, false /* dec */);
> > +		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
> > +		mm_dec_nr_ptes(mm);
> > +	}
> > +
> > +	/* Already handled it, don't reuse cowed table. */
> > +	pmd_put_pte(vma, &cowed_entry, addr, false);
> > +
> > +	VM_BUG_ON(cow_pte_count(pmd) != 1);
> 
> Don’t use VM_BUG_ON().

Sure. I will change it to VM_WARN_ON().

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-27 18:38   ` Nadav Amit
@ 2022-09-27 19:53     ` Chih-En Lin
  2022-09-27 21:26       ` John Hubbard
  2022-09-28 14:03       ` David Hildenbrand
  0 siblings, 2 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-27 19:53 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
> I only skimmed the patches that you sent. The last couple of patches seem a
> bit rough and dirty, so I am sorry to say that I skipped them (too many
> “TODO” and “XXX” for my taste).
> 
> I am sure other will have better feedback than me. I understand there is a
> tradeoff and that this mechanism is mostly for high performance
> snapshotting/forking. It would be beneficial to see whether this mechanism
> can somehow be combined with existing ones (mshare?).

Still thanks for your feedback. :)
I'm looking at the PTE refcount and mshare patches, and maybe this can
be combined with them in the future.

> The code itself can be improved. I found the reasoning about synchronization
> and TLB flushes and synchronizations to be lacking, and the code to seem
> potentially incorrect. Better comments would help, even if the code is
> correct.
> 
> There are additional general questions. For instance, when sharing a
> page-table, do you properly update the refcount/mapcount of the mapped
> pages? And are there any possible interactions with THP?

Since touching all of those mapped pages would take a lot of time and
make fork() even more expensive, it does not update the
refcount/mapcount of the mapped pages.

I'm not familiar with THP right now, but we plan to look into how it
interacts with COW PTE. Currently, I can only say that I would prefer to
keep the huge-page/THP behavior out of this mechanism. If there are any
ideas here, please tell us.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE
  2022-09-27 16:29 ` [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE Chih-En Lin
  2022-09-27 17:27   ` Nadav Amit
@ 2022-09-27 21:22   ` John Hubbard
  2022-09-28  8:36     ` Chih-En Lin
  1 sibling, 1 reply; 38+ messages in thread
From: John Hubbard @ 2022-09-27 21:22 UTC (permalink / raw)
  To: Chih-En Lin, Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox, Christophe Leroy
  Cc: linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On 9/27/22 09:29, Chih-En Lin wrote:
> Add a new sysctl vm.cow_pte to set MMF_COW_PTE_READY flag for enabling
> copy-on-write (COW) to the PTE page table during the next time of fork.
> 
> Since it has a time gap between using the sysctl to enable the COW PTE
> and doing the fork, we use two states to determine the task that wants
> to do COW PTE or already doing it.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
>  include/linux/pgtable.h |  6 ++++++
>  kernel/fork.c           |  5 +++++
>  kernel/sysctl.c         |  8 ++++++++
>  mm/Makefile             |  2 +-
>  mm/cow_pte.c            | 39 +++++++++++++++++++++++++++++++++++++++
>  5 files changed, 59 insertions(+), 1 deletion(-)
>  create mode 100644 mm/cow_pte.c
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 014ee8f0fbaab..d03d01aefe989 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -937,6 +937,12 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
>  	__ptep_modify_prot_commit(vma, addr, ptep, pte);
>  }
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> +
> +int cow_pte_handler(struct ctl_table *table, int write, void *buffer,
> +		    size_t *lenp, loff_t *ppos);
> +
> +extern int sysctl_cow_pte_pid;

So you are setting a global value, to a single pid?? Only one pid at a
time can be set up?

I think that tells you already that there is a huge API problem here.

As the other thread with Nadav said, this is not a sysctl. It wants to
be a prctl(), at least the way things look so far.


> +
>  #endif /* CONFIG_MMU */
>  
>  /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 8a9e92068b150..6981944a7c6ec 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2671,6 +2671,11 @@ pid_t kernel_clone(struct kernel_clone_args *args)
>  			trace = 0;
>  	}
>  
> +	if (current->mm && test_bit(MMF_COW_PTE_READY, &current->mm->flags)) {
> +		clear_bit(MMF_COW_PTE_READY, &current->mm->flags);
> +		set_bit(MMF_COW_PTE, &current->mm->flags);
> +	}
> +
>  	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
>  	add_latent_entropy();
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 205d605cacc5b..c4f54412ae3a9 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2360,6 +2360,14 @@ static struct ctl_table vm_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= mmap_min_addr_handler,
>  	},
> +	{
> +		.procname	= "cow_pte",
> +		.data		= &sysctl_cow_pte_pid,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= cow_pte_handler,
> +		.extra1		= SYSCTL_ZERO,
> +	},
>  #endif
>  #ifdef CONFIG_NUMA
>  	{
> diff --git a/mm/Makefile b/mm/Makefile
> index 9a564f8364035..7a568d5066ee6 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -40,7 +40,7 @@ mmu-y			:= nommu.o
>  mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
>  			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
>  			   msync.o page_vma_mapped.o pagewalk.o \
> -			   pgtable-generic.o rmap.o vmalloc.o
> +			   pgtable-generic.o rmap.o vmalloc.o cow_pte.o
>  
>  
>  ifdef CONFIG_CROSS_MEMORY_ATTACH
> diff --git a/mm/cow_pte.c b/mm/cow_pte.c
> new file mode 100644
> index 0000000000000..4e50aa4294ce7
> --- /dev/null
> +++ b/mm/cow_pte.c
> @@ -0,0 +1,39 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/sysctl.h>
> +#include <linux/pgtable.h>
> +#include <linux/sched.h>
> +#include <linux/sched/coredump.h>
> +#include <linux/pid.h>
> +
> +/* sysctl will write to this variable */
> +int sysctl_cow_pte_pid = -1;
> +
> +static void set_cow_pte_task(void)
> +{
> +	struct pid *pid;
> +	struct task_struct *task;
> +
> +	pid = find_get_pid(sysctl_cow_pte_pid);

This seems to be missing a corresponding call to put_pid().

thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-27 19:53     ` Chih-En Lin
@ 2022-09-27 21:26       ` John Hubbard
  2022-09-28  8:52         ` Chih-En Lin
  2022-09-28 14:03       ` David Hildenbrand
  1 sibling, 1 reply; 38+ messages in thread
From: John Hubbard @ 2022-09-27 21:26 UTC (permalink / raw)
  To: Chih-En Lin, Nadav Amit
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On 9/27/22 12:53, Chih-En Lin wrote:
> I'm not familiar with THP right now. But we have a plan for looking
> at it to see what will happen with COW PTE.
> Currently, I can only say that I prefer to avoid involving the behavior
> of huge-page/THP. If there are any ideas here please tell us.
> 
In order to be considered at all, this would have to at least behave 
correctly in the presence of THP and hugetlbfs, IMHO. Those are no
longer niche features.


thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE
  2022-09-27 21:22   ` John Hubbard
@ 2022-09-28  8:36     ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-28  8:36 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Nadav Amit,
	Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu, Miaohe Lin,
	Thomas Gleixner, Sebastian Andrzej Siewior, Andy Lutomirski,
	Fenghua Yu, Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 02:22:27PM -0700, John Hubbard wrote:
> > +extern int sysctl_cow_pte_pid;
>
> So are setting a global value, to a single pid?? Only one pid at a time
> can be set up?
>
> I think that tells you already that there is a huge API problem here.
>
> As the other thread with Nadav said, this is not a sysctl. It wants to
> be a prctl(), at least the way things look so far.

I will change it to use the prctl().
Probably it will go under PR_SET_MM, or I will create a new option.
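
For illustration only, usage could then look roughly like this
(PR_SET_COW_PTE is a made-up placeholder name and value, not an existing
prctl option):

	#include <sys/prctl.h>
	#include <unistd.h>

	#ifndef PR_SET_COW_PTE
	#define PR_SET_COW_PTE	0x434f57	/* placeholder value */
	#endif

	int main(void)
	{
		/* Opt in to COW PTE; takes effect at the next fork(). */
		prctl(PR_SET_COW_PTE, 1, 0, 0, 0);
		if (fork() == 0) {
			/* child */
			_exit(0);
		}
		return 0;
	}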

> > +static void set_cow_pte_task(void)
> > +{
> > +   struct pid *pid;
> > +   struct task_struct *task;
> > +
> > +   pid = find_get_pid(sysctl_cow_pte_pid);
>
> This seems to be missing a corresponding call to put_pid().

Thanks.

> thanks,
>
> --
> John Hubbard
> NVIDIA
>

Best regards,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-27 21:26       ` John Hubbard
@ 2022-09-28  8:52         ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-28  8:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Nadav Amit, Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox, Christophe Leroy, linux-kernel, linux-mm,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Vlastimil Babka,
	William Kucharski, Kirill A . Shutemov, Peter Xu,
	Suren Baghdasaryan, Arnd Bergmann, Tong Tiangen, Pasha Tatashin,
	Li kunyu, Anshuman Khandual, Minchan Kim, Yang Shi, Song Liu,
	Miaohe Lin, Thomas Gleixner, Sebastian Andrzej Siewior,
	Andy Lutomirski, Fenghua Yu, Dinglan Peng, Pedro Fonseca,
	Jim Huang, Huichun Feng

On Tue, Sep 27, 2022 at 02:26:19PM -0700, John Hubbard wrote:
> On 9/27/22 12:53, Chih-En Lin wrote:
> > I'm not familiar with THP right now. But we have a plan for looking
> > at it to see what will happen with COW PTE.
> > Currently, I can only say that I prefer to avoid involving the behavior
> > of huge-page/THP. If there are any ideas here please tell us.
> > 
> In order to be considered at all, this would have to at least behave 
> correctly in the presence of THP and hugetlbfs, IMHO. Those are no
> longer niche features.
> 

To make sure they work well together: during fork() and the page fault
path, we run the COW PTE mechanism after the huge-page/THP handling so
that it doesn't interfere with them. There may be corner cases that I
haven't handled yet; I will keep looking into it.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-27 19:53     ` Chih-En Lin
  2022-09-27 21:26       ` John Hubbard
@ 2022-09-28 14:03       ` David Hildenbrand
  2022-09-29 13:38         ` Chih-En Lin
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2022-09-28 14:03 UTC (permalink / raw)
  To: Chih-En Lin, Nadav Amit
  Cc: Andrew Morton, Qi Zheng, Matthew Wilcox, Christophe Leroy,
	linux-kernel, linux-mm, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On 27.09.22 21:53, Chih-En Lin wrote:
> On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
>> I only skimmed the patches that you sent. The last couple of patches seem a
>> bit rough and dirty, so I am sorry to say that I skipped them (too many
>> “TODO” and “XXX” for my taste).
>>
>> I am sure other will have better feedback than me. I understand there is a
>> tradeoff and that this mechanism is mostly for high performance
>> snapshotting/forking. It would be beneficial to see whether this mechanism
>> can somehow be combined with existing ones (mshare?).
> 
> Still thanks for your feedback. :)
> I'm looking at the PTE refcount and mshare patches. And, maybe it can
> combine with them in the future.
> 
>> The code itself can be improved. I found the reasoning about synchronization
>> and TLB flushes and synchronizations to be lacking, and the code to seem
>> potentially incorrect. Better comments would help, even if the code is
>> correct.
>>
>> There are additional general questions. For instance, when sharing a
>> page-table, do you properly update the refcount/mapcount of the mapped
>> pages? And are there any possible interactions with THP?
> 
> Since access to those mapped pages will cost a lot of time, and this
> will make fork() even have more overhead. It will not update the
> refcount/mapcount of the mapped pages.

Oh no.

So we'd have pages logically mapped into two processes (two page table 
structures), but the refcount/mapcount/PageAnonExclusive would not 
reflect that?

Honestly, I don't think it is upstream material in that hacky form. No, 
we don't need more COW CVEs or more COW over-complications that 
destabilize the whole system.

IMHO, a relaxed form that focuses on only the memory consumption 
reduction could *possibly* be accepted upstream if it's not too invasive 
or complex. During fork(), we'd do exactly what we used to do to PTEs 
(increment mapcount, refcount, trying to clear PageAnonExclusive, map 
the page R/O, duplicate swap entries; all while holding the page table 
lock), however, sharing the prepared page table with the child process 
using COW after we prepared it.

Any (most once we want to *optimize* rmap handling) modification 
attempts require breaking COW -- copying the page table for the faulting 
process. But at that point, the PTEs are already write-protected and 
properly accounted (refcount/mapcount/PageAnonExclusive).

Doing it that way might not require any questionable GUP hacks and 
swapping, MMU notifiers etc. "might just work as expected" because the 
accounting remains unchanged" -- we simply de-duplicate the page table 
itself we'd have after fork and any modification attempts simply replace 
the mapped copy.
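
As a very rough sketch of what I mean (all names below are invented for
illustration; this is not a proposal for the actual code):

/*
 * fork() prepares the parent's PTE table exactly as today (wrprotect,
 * refcount/mapcount, PageAnonExclusive handling), then lets the child's
 * PMD entry reference the very same table instead of filling a new one.
 * cow_pte_ref is an invented per-table refcount.
 */
static int cow_share_pte_table(struct mm_struct *dst_mm, unsigned long addr,
			       pmd_t *dst_pmd, pmd_t *src_pmd)
{
	struct page *table = pmd_page(*src_pmd);

	/*
	 * The PTEs in the source table were already write-protected and
	 * accounted by the normal copy path, so sharing the table does
	 * not change any per-page accounting.
	 */
	atomic_inc(&table->cow_pte_ref);	/* invented field */
	set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
	mm_inc_nr_ptes(dst_mm);
	return 0;
}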

But devil is in the detail (page table lock, TLB flushing).

"will make fork() even have more overhead" is not a good excuse for such 
complexity/hacks -- sure, it will make your benchmark results look 
better in comparison ;)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-28 14:03       ` David Hildenbrand
@ 2022-09-29 13:38         ` Chih-En Lin
  2022-09-29 13:49           ` Chih-En Lin
  2022-09-29 17:24           ` David Hildenbrand
  0 siblings, 2 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-29 13:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

Sorry for replying late.

On Wed, Sep 28, 2022 at 04:03:19PM +0200, David Hildenbrand wrote:
> On 27.09.22 21:53, Chih-En Lin wrote:
> > On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
> > > I only skimmed the patches that you sent. The last couple of patches seem a
> > > bit rough and dirty, so I am sorry to say that I skipped them (too many
> > > “TODO” and “XXX” for my taste).
> > > 
> > > I am sure other will have better feedback than me. I understand there is a
> > > tradeoff and that this mechanism is mostly for high performance
> > > snapshotting/forking. It would be beneficial to see whether this mechanism
> > > can somehow be combined with existing ones (mshare?).
> > 
> > Still thanks for your feedback. :)
> > I'm looking at the PTE refcount and mshare patches. And, maybe it can
> > combine with them in the future.
> > 
> > > The code itself can be improved. I found the reasoning about synchronization
> > > and TLB flushes and synchronizations to be lacking, and the code to seem
> > > potentially incorrect. Better comments would help, even if the code is
> > > correct.
> > > 
> > > There are additional general questions. For instance, when sharing a
> > > page-table, do you properly update the refcount/mapcount of the mapped
> > > pages? And are there any possible interactions with THP?
> > 
> > Since access to those mapped pages will cost a lot of time, and this
> > will make fork() even have more overhead. It will not update the
> > refcount/mapcount of the mapped pages.
> 
> Oh no.
> 
> So we'd have pages logically mapped into two processes (two page table
> structures), but the refcount/mapcount/PageAnonExclusive would not reflect
> that?
> 
> Honestly, I don't think it is upstream material in that hacky form. No, we
> don't need more COW CVEs or more COW over-complications that destabilize the
> whole system.
>

I know that setting write protection alone is not enough to prove the
approach is secure, since the previous COW CVEs are related to exactly
that. And if skipping the accounting to reduce the overhead of fork() is
not suitable for upstream, we can change it. But I think COW for the PTE
table can still be upstream material.

Recent patches, like the refcount for empty user PTE page table pages and
mshare for pages shared between processes, require handling more and more
PTE entries, showing that modern systems use a lot of memory for page
tables (especially the PTE tables). So I think this method, COW for the
PTE table, might reduce the memory usage for the side of the multiple
users PTE page table.

> IMHO, a relaxed form that focuses on only the memory consumption reduction
> could *possibly* be accepted upstream if it's not too invasive or complex.
> During fork(), we'd do exactly what we used to do to PTEs (increment
> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> duplicate swap entries; all while holding the page table lock), however,
> sharing the prepared page table with the child process using COW after we
> prepared it.
> 
> Any (most once we want to *optimize* rmap handling) modification attempts
> require breaking COW -- copying the page table for the faulting process. But
> at that point, the PTEs are already write-protected and properly accounted
> (refcount/mapcount/PageAnonExclusive).
> 
> Doing it that way might not require any questionable GUP hacks and swapping,
> MMU notifiers etc. "might just work as expected" because the accounting
> remains unchanged" -- we simply de-duplicate the page table itself we'd have
> after fork and any modification attempts simply replace the mapped copy.

Agree.
However, for the GUP hacks: if we want to do COW for the page table, we
still need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag
to check whether the PTE table is allowed to be shared before we COW the
table). Otherwise it gets more complicated, since we might need to handle
situations where, while preparing the COW work, we figure out that we
need to duplicate the whole table and roll back (recover the state and
copy it to a new table). Hopefully I'm not wrong here.

> But devil is in the detail (page table lock, TLB flushing).

Sure, it might add overhead to the page fault path and needs to be
handled carefully. ;)

> "will make fork() even have more overhead" is not a good excuse for such
> complexity/hacks -- sure, it will make your benchmark results look better in
> comparison ;)

;);)
I think that, even if we do the accounting with the COW page table, there
is still a little bit of improvement.

> -- 
> Thanks,
> 
> David / dhildenb
>

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-29 13:38         ` Chih-En Lin
@ 2022-09-29 13:49           ` Chih-En Lin
  2022-09-29 17:24           ` David Hildenbrand
  1 sibling, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-29 13:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Thu, Sep 29, 2022 at 09:38:53PM +0800, Chih-En Lin wrote:
> Sorry for replying late.
> 
> On Wed, Sep 28, 2022 at 04:03:19PM +0200, David Hildenbrand wrote:
> > On 27.09.22 21:53, Chih-En Lin wrote:
> > > On Tue, Sep 27, 2022 at 06:38:05PM +0000, Nadav Amit wrote:
> > > > I only skimmed the patches that you sent. The last couple of patches seem a
> > > > bit rough and dirty, so I am sorry to say that I skipped them (too many
> > > > “TODO” and “XXX” for my taste).
> > > > 
> > > > I am sure other will have better feedback than me. I understand there is a
> > > > tradeoff and that this mechanism is mostly for high performance
> > > > snapshotting/forking. It would be beneficial to see whether this mechanism
> > > > can somehow be combined with existing ones (mshare?).
> > > 
> > > Still thanks for your feedback. :)
> > > I'm looking at the PTE refcount and mshare patches. And, maybe it can
> > > combine with them in the future.
> > > 
> > > > The code itself can be improved. I found the reasoning about synchronization
> > > > and TLB flushes and synchronizations to be lacking, and the code to seem
> > > > potentially incorrect. Better comments would help, even if the code is
> > > > correct.
> > > > 
> > > > There are additional general questions. For instance, when sharing a
> > > > page-table, do you properly update the refcount/mapcount of the mapped
> > > > pages? And are there any possible interactions with THP?
> > > 
> > > Since access to those mapped pages will cost a lot of time, and this
> > > will make fork() even have more overhead. It will not update the
> > > refcount/mapcount of the mapped pages.
> > 
> > Oh no.
> > 
> > So we'd have pages logically mapped into two processes (two page table
> > structures), but the refcount/mapcount/PageAnonExclusive would not reflect
> > that?
> > 
> > Honestly, I don't think it is upstream material in that hacky form. No, we
> > don't need more COW CVEs or more COW over-complications that destabilize the
> > whole system.
> >
> 
> I know setting the write protection is not enough to prove the security
> safe since the previous COW CVEs are related to it. And, if skipping the
> accounting to reduce the overhead of fork() is not suitable for upstream
> , we can change it. But, I think COW to the table can still be an
> upstream material.
> 
> Recently the patches, like refcount for the empty user PTE page table
> pages and mshare for the pages shared between the processes require more
> PTE entries, showing that the modern system uses a lot of memory for the
> page table (especially the PTE table). So, I think the method, COW to
> the table, might reduce the memory usage for the side of the multiple
> users PTE page table.

Sorry, I think I need to explain more about "the multiple users PTE page
table". It means that more than one user holds the page table, and the
mapped pages still have the same context. So, we can use COW to reduce
the memory usage at first.

>
> > IMHO, a relaxed form that focuses on only the memory consumption reduction
> > could *possibly* be accepted upstream if it's not too invasive or complex.
> > During fork(), we'd do exactly what we used to do to PTEs (increment
> > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> > duplicate swap entries; all while holding the page table lock), however,
> > sharing the prepared page table with the child process using COW after we
> > prepared it.
> > 
> > Any (most once we want to *optimize* rmap handling) modification attempts
> > require breaking COW -- copying the page table for the faulting process. But
> > at that point, the PTEs are already write-protected and properly accounted
> > (refcount/mapcount/PageAnonExclusive).
> > 
> > Doing it that way might not require any questionable GUP hacks and swapping,
> > MMU notifiers etc. "might just work as expected" because the accounting
> > remains unchanged" -- we simply de-duplicate the page table itself we'd have
> > after fork and any modification attempts simply replace the mapped copy.
> 
> Agree.
> However for GUP hacks, if we want to do the COW to page table, we still
> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> check whether the PTE table is available or not before we do the COW to
> the table). Otherwise, it will be more complicated since it might need
> to handle situations like while preparing the COW work, it just figuring
> out that it needs to duplicate the whole table and roll back (recover
> the state and copy it to new table). Hopefully, I'm not wrong here.
> 
> > But devil is in the detail (page table lock, TLB flushing).
> 
> Sure, it might be an overhead in the page fault and needs to be handled
> carefully. ;)
> 
> > "will make fork() even have more overhead" is not a good excuse for such
> > complexity/hacks -- sure, it will make your benchmark results look better in
> > comparison ;)
> 
> ;);)
> I think that, even if we do the accounting with the COW page table, it
> still has a little bit improve.
> 
> > -- 
> > Thanks,
> > 
> > David / dhildenb
> >
> 
> Thanks,
> Chih-En Lin

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-29 13:38         ` Chih-En Lin
  2022-09-29 13:49           ` Chih-En Lin
@ 2022-09-29 17:24           ` David Hildenbrand
  2022-09-29 18:29             ` Chih-En Lin
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2022-09-29 17:24 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

>> IMHO, a relaxed form that focuses on only the memory consumption reduction
>> could *possibly* be accepted upstream if it's not too invasive or complex.
>> During fork(), we'd do exactly what we used to do to PTEs (increment
>> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
>> duplicate swap entries; all while holding the page table lock), however,
>> sharing the prepared page table with the child process using COW after we
>> prepared it.
>>
>> Any (most once we want to *optimize* rmap handling) modification attempts
>> require breaking COW -- copying the page table for the faulting process. But
>> at that point, the PTEs are already write-protected and properly accounted
>> (refcount/mapcount/PageAnonExclusive).
>>
>> Doing it that way might not require any questionable GUP hacks and swapping,
>> MMU notifiers etc. "might just work as expected" because the accounting
>> remains unchanged" -- we simply de-duplicate the page table itself we'd have
>> after fork and any modification attempts simply replace the mapped copy.
> 
> Agree.
> However for GUP hacks, if we want to do the COW to page table, we still
> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> check whether the PTE table is available or not before we do the COW to
> the table). Otherwise, it will be more complicated since it might need
> to handle situations like while preparing the COW work, it just figuring
> out that it needs to duplicate the whole table and roll back (recover
> the state and copy it to new table). Hopefully, I'm not wrong here.

The nice thing is that GUP itself *usually* doesn't modify page tables. 
One corner case is follow_pfn_pte(). All other modifications should 
happen in the actual fault handler that has to deal with such kind of 
unsharing either way when modifying the PTE.

If the pages are already in a COW-ed pagetable in the desired "shared" 
state (e.g., PageAnonExclusive cleared on an anonymous page), R/O 
pinning of such pages will just work as expected and we shouldn't be 
surprised by another set of GUP+COW CVEs.

We'd really only deduplicate the page table and not play other tricks 
with the actual page table content that differ from the existing way of 
handling fork().

I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code 
when not modifying the page table. I think we only need "we have to 
unshare this page table now" in follow_pfn_pte() and inside the fault 
handling when GUP triggers a fault.

I hope my assumption is correct, or am I missing something?

> 
>> But devil is in the detail (page table lock, TLB flushing).
> 
> Sure, it might be an overhead in the page fault and needs to be handled
> carefully. ;)
> 
>> "will make fork() even have more overhead" is not a good excuse for such
>> complexity/hacks -- sure, it will make your benchmark results look better in
>> comparison ;)
> 
> ;);)
> I think that, even if we do the accounting with the COW page table, it
> still has a little bit improve.

:)

My gut feeling is that this is true. While we have to do a pass over the 
parent page table during fork and wrprotect all PTEs etc., we don't have 
to duplicate the page table content and allocate/free memory for that.

One interesting case is when we cannot share an anon page with the child 
process because it may be pinned -- and we have to copy it via
copy_present_page(). In that case, the page table between the parent and 
the child would differ and we'd not be able to share the page table.

That case could be caught in copy_pte_range(): in case we'd have to 
allocate a page via page_copy_prealloc(), we'd have to fall back to the 
ordinary "separate page table for the child" way of doing things.

But that looks doable to me.
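
Roughly along these lines (just a sketch against today's copy_pte_range()
structure; the cow_pte_* names are made up):

	/*
	 * Needing page_copy_prealloc() means an anon page may be pinned
	 * and must be duplicated for the child, so the PTE table cannot
	 * stay shared -- drop back to the ordinary "private table for
	 * the child" path.
	 */
	if (unlikely(ret == -EAGAIN)) {
		prealloc = page_copy_prealloc(src_mm, src_vma, addr);
		if (!prealloc)
			return -ENOMEM;
		if (cow_pte_table_shared(dst_pmd))	/* made-up check */
			undo_cow_pte_share(dst_vma, dst_pmd, addr);
		/* ... then retry copying this PTE as usual ... */
	}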

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-29 17:24           ` David Hildenbrand
@ 2022-09-29 18:29             ` Chih-En Lin
  2022-09-29 18:38               ` David Hildenbrand
  2022-09-29 18:40               ` Nadav Amit
  0 siblings, 2 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-29 18:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
> > > IMHO, a relaxed form that focuses on only the memory consumption reduction
> > > could *possibly* be accepted upstream if it's not too invasive or complex.
> > > During fork(), we'd do exactly what we used to do to PTEs (increment
> > > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> > > duplicate swap entries; all while holding the page table lock), however,
> > > sharing the prepared page table with the child process using COW after we
> > > prepared it.
> > > 
> > > Any (most once we want to *optimize* rmap handling) modification attempts
> > > require breaking COW -- copying the page table for the faulting process. But
> > > at that point, the PTEs are already write-protected and properly accounted
> > > (refcount/mapcount/PageAnonExclusive).
> > > 
> > > Doing it that way might not require any questionable GUP hacks and swapping,
> > > MMU notifiers etc. "might just work as expected" because the accounting
> > > remains unchanged" -- we simply de-duplicate the page table itself we'd have
> > > after fork and any modification attempts simply replace the mapped copy.
> > 
> > Agree.
> > However for GUP hacks, if we want to do the COW to page table, we still
> > need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> > check whether the PTE table is available or not before we do the COW to
> > the table). Otherwise, it will be more complicated since it might need
> > to handle situations like while preparing the COW work, it just figuring
> > out that it needs to duplicate the whole table and roll back (recover
> > the state and copy it to new table). Hopefully, I'm not wrong here.
> 
> The nice thing is that GUP itself *usually* doesn't modify page tables. One
> corner case is follow_pfn_pte(). All other modifications should happen in
> the actual fault handler that has to deal with such kind of unsharing either
> way when modifying the PTE.
> 
> If the pages are already in a COW-ed pagetable in the desired "shared" state
> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
> pages will just work as expected and we shouldn't be surprised by another
> set of GUP+COW CVEs.
> 
> We'd really only deduplicate the page table and not play other tricks with
> the actual page table content that differ from the existing way of handling
> fork().
> 
> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
> not modifying the page table. I think we only need "we have to unshare this
> page table now" in follow_pfn_pte() and inside the fault handling when GUP
> triggers a fault.
> 
> I hope my assumption is correct, or am I missing something?
> 

My concern is the case where we pinned the page and did the COW to make
the page table shared. It might not be allowed to map the pinned page R/O
into both processes.

So, if fork() is working on the shared state, it needs to recover the
table and copy it to a new one, since that pinned page needs to be copied
immediately. We can hold the shared state after such a situation occurs.
So we still need some trick to let fork() know which page table already
has the pinned page (or a page that won't let us share) before going to
duplicate.

Am I wrong here?

After that, since we handled the accounting in fork(), we don't need the
ownership (pmd_t pointer) anymore. We have to find another way to mark
the table as exclusive. (Right now, the COW_PTE_OWNER_EXCLUSIVE flag is
stored in that space.)

> > 
> > > But devil is in the detail (page table lock, TLB flushing).
> > 
> > Sure, it might be an overhead in the page fault and needs to be handled
> > carefully. ;)
> > 
> > > "will make fork() even have more overhead" is not a good excuse for such
> > > complexity/hacks -- sure, it will make your benchmark results look better in
> > > comparison ;)
> > 
> > ;);)
> > I think that, even if we do the accounting with the COW page table, it
> > still has a little bit improve.
> 
> :)
> 
> My gut feeling is that this is true. While we have to do a pass over the
> parent page table during fork and wrprotect all PTEs etc., we don't have to
> duplicate the page table content and allocate/free memory for that.
> 
> One interesting case is when we cannot share an anon page with the child
> process because it maybe pinned -- and we have to copy it via
> copy_present_page(). In that case, the page table between the parent and the
> child would differ and we'd not be able to share the page table.

That is what I wanted to say above.
The case might happen in the middle of the page table sharing progress,
and it might cost more overhead to recover from it. Therefore, if GUP
wants to pin the mapped page, we can mark the PTE table first, so fork()
won't waste time doing the sharing work.

> That case could be caught in copy_pte_range(): in case we'd have to allocate
> a page via page_copy_prealloc(), we'd have to fall back to the ordinary
> "separate page table for the child" way of doing things.
> 
> But that looks doable to me.

Sounds good. :)

> -- 
> Thanks,
> 
> David / dhildenb
> 

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-29 18:29             ` Chih-En Lin
@ 2022-09-29 18:38               ` David Hildenbrand
  2022-09-29 18:57                 ` Chih-En Lin
  2022-09-29 18:40               ` Nadav Amit
  1 sibling, 1 reply; 38+ messages in thread
From: David Hildenbrand @ 2022-09-29 18:38 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On 29.09.22 20:29, Chih-En Lin wrote:
> On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
>>>> IMHO, a relaxed form that focuses on only the memory consumption reduction
>>>> could *possibly* be accepted upstream if it's not too invasive or complex.
>>>> During fork(), we'd do exactly what we used to do to PTEs (increment
>>>> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
>>>> duplicate swap entries; all while holding the page table lock), however,
>>>> sharing the prepared page table with the child process using COW after we
>>>> prepared it.
>>>>
>>>> Any (most once we want to *optimize* rmap handling) modification attempts
>>>> require breaking COW -- copying the page table for the faulting process. But
>>>> at that point, the PTEs are already write-protected and properly accounted
>>>> (refcount/mapcount/PageAnonExclusive).
>>>>
>>>> Doing it that way might not require any questionable GUP hacks and swapping,
>>>> MMU notifiers etc. "might just work as expected" because the accounting
>>>> remains unchanged" -- we simply de-duplicate the page table itself we'd have
>>>> after fork and any modification attempts simply replace the mapped copy.
>>>
>>> Agree.
>>> However for GUP hacks, if we want to do the COW to page table, we still
>>> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
>>> check whether the PTE table is available or not before we do the COW to
>>> the table). Otherwise, it will be more complicated since it might need
>>> to handle situations like while preparing the COW work, it just figuring
>>> out that it needs to duplicate the whole table and roll back (recover
>>> the state and copy it to new table). Hopefully, I'm not wrong here.
>>
>> The nice thing is that GUP itself *usually* doesn't modify page tables. One
>> corner case is follow_pfn_pte(). All other modifications should happen in
>> the actual fault handler that has to deal with such kind of unsharing either
>> way when modifying the PTE.
>>
>> If the pages are already in a COW-ed pagetable in the desired "shared" state
>> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
>> pages will just work as expected and we shouldn't be surprised by another
>> set of GUP+COW CVEs.
>>
>> We'd really only deduplicate the page table and not play other tricks with
>> the actual page table content that differ from the existing way of handling
>> fork().
>>
>> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
>> not modifying the page table. I think we only need "we have to unshare this
>> page table now" in follow_pfn_pte() and inside the fault handling when GUP
>> triggers a fault.
>>
>> I hope my assumption is correct, or am I missing something?
>>
> 
> My consideration is when we pinned the page and did the COW to make the
> page table be shared. It might not allow mapping the pinned page to R/O)
> into both processes.
> 
> So, if the fork is working on the shared state, it needs to recover the
> table and copy to a new one since that pinned page will need to copy
> immediately. We can hold the shared state after occurring such a
> situation. So we still need some trick to let the fork() know which page
> table already has the pinned page (or such page won't let us share)
> before going to duplicate.
> 
> Am I wrong here?

I think you might be overthinking this. Let's keep it simple:

1) Handle pinned anon pages just as I described below, falling back to 
the "slow" path of page table copying.

2) Once we passed that stage, you can be sure that the COW-ed page table 
cannot have actually pinned anon pages. All anon pages in such a page 
table have PageAnonExclusive cleared and are "maybe shared". GUP cannot 
succeed in pinning these pages anymore, because it will only pin 
exclusive anon pages!

3) If anybody wants to take a R/O pin on a shared anon page that is 
mapped into a COW-ed page table, we trigger a fault with 
FAULT_FLAG_UNSHARE instead of pinning the page. This has to break COW on 
the page table and properly map an exclusive anon page into it, breaking 
COW.

Do you see a problem with that?
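
For step 3, the shape would be something like this sketch (helper names
are invented; not actual code):

/*
 * Before the fault handler unshares an anon page (FAULT_FLAG_UNSHARE) or
 * resolves a write fault, the faulting process first gets its own copy
 * of a COW-shared PTE table, so the PTE update only affects this process.
 */
static vm_fault_t cow_pte_fault(struct vm_fault *vmf)
{
	if (cow_pte_table_shared(vmf->pmd)) {		/* invented check */
		if (break_cow_pte_table(vmf->vma, vmf->pmd, vmf->address))
			return VM_FAULT_OOM;		/* invented helper */
	}
	return handle_pte_fault(vmf);	/* then the usual PTE fault path */
}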

> 
> After that, since we handled the accounting in fork(), we don't need
> ownership (pmd_t pointer) anymore. We have to find another way to mark
> the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
> stored at that space.)
> 
>>>
>>>> But devil is in the detail (page table lock, TLB flushing).
>>>
>>> Sure, it might be an overhead in the page fault and needs to be handled
>>> carefully. ;)
>>>
>>>> "will make fork() even have more overhead" is not a good excuse for such
>>>> complexity/hacks -- sure, it will make your benchmark results look better in
>>>> comparison ;)
>>>
>>> ;);)
>>> I think that, even if we do the accounting with the COW page table, it
>>> still has a little bit improve.
>>
>> :)
>>
>> My gut feeling is that this is true. While we have to do a pass over the
>> parent page table during fork and wrprotect all PTEs etc., we don't have to
>> duplicate the page table content and allocate/free memory for that.
>>
>> One interesting case is when we cannot share an anon page with the child
>> process because it maybe pinned -- and we have to copy it via
>> copy_present_page(). In that case, the page table between the parent and the
>> child would differ and we'd not be able to share the page table.
> 
> That is what I want to say above.
> The case might happen in the middle of the shared page table progress.
> It might cost more overhead to recover it. Therefore, if GUP wants to
> pin the mapped page we can mark the PTE table first, so fork() won't
> waste time doing the work for sharing.

Having pinned pages is a corner case for most apps. No need to worry 
about optimizing this corner case for now.

I see what you are trying to optimize, but I don't think this is needed 
in a first version, and probably never is needed.


Any attempt to mark page tables in a certain way from GUP
(COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast 
(get_user_pages_fast) can race with pretty much anything, even with 
concurrent fork. I suspect your current code might be really racy in 
that regard.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-29 18:29             ` Chih-En Lin
  2022-09-29 18:38               ` David Hildenbrand
@ 2022-09-29 18:40               ` Nadav Amit
  2022-09-29 19:02                 ` Chih-En Lin
  1 sibling, 1 reply; 38+ messages in thread
From: Nadav Amit @ 2022-09-29 18:40 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: David Hildenbrand, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Sep 29, 2022, at 11:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:

> That case could be caught in copy_pte_range(): in case we'd have to allocate
>> a page via page_copy_prealloc(), we'd have to fall back to the ordinary
>> "separate page table for the child" way of doing things.
>> 
>> But that looks doable to me.
> 
> Sounds good. :)

Chih-En, I admit I did not fully read the entire correspondence or get
deep into all the details.

I would note, however, that there are several additional components that I
did not see (and perhaps missed) in your patches. Basically, there are many
page-table manipulations that are done not through the page-fault handler or
reclamation mechanisms. I did not see any of them being addressed.

So if/when you send a new version, please have a look at mprotect(),
madvise(), soft-dirty, userfaultfd and THP. In these cases, I presume, you
would have to COW-break (aka COW-unshare) the page-tables.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-29 18:38               ` David Hildenbrand
@ 2022-09-29 18:57                 ` Chih-En Lin
  2022-09-29 19:00                   ` David Hildenbrand
  0 siblings, 1 reply; 38+ messages in thread
From: Chih-En Lin @ 2022-09-29 18:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote:
> On 29.09.22 20:29, Chih-En Lin wrote:
> > On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
> > > > > IMHO, a relaxed form that focuses on only the memory consumption reduction
> > > > > could *possibly* be accepted upstream if it's not too invasive or complex.
> > > > > During fork(), we'd do exactly what we used to do to PTEs (increment
> > > > > mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
> > > > > duplicate swap entries; all while holding the page table lock), however,
> > > > > sharing the prepared page table with the child process using COW after we
> > > > > prepared it.
> > > > > 
> > > > > Any (most once we want to *optimize* rmap handling) modification attempts
> > > > > require breaking COW -- copying the page table for the faulting process. But
> > > > > at that point, the PTEs are already write-protected and properly accounted
> > > > > (refcount/mapcount/PageAnonExclusive).
> > > > > 
> > > > > Doing it that way might not require any questionable GUP hacks and swapping,
> > > > > MMU notifiers etc. "might just work as expected" because the accounting
> > > > > remains unchanged" -- we simply de-duplicate the page table itself we'd have
> > > > > after fork and any modification attempts simply replace the mapped copy.
> > > > 
> > > > Agree.
> > > > However for GUP hacks, if we want to do the COW to page table, we still
> > > > need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
> > > > check whether the PTE table is available or not before we do the COW to
> > > > the table). Otherwise, it will be more complicated since it might need
> > > > to handle situations like while preparing the COW work, it just figuring
> > > > out that it needs to duplicate the whole table and roll back (recover
> > > > the state and copy it to new table). Hopefully, I'm not wrong here.
> > > 
> > > The nice thing is that GUP itself *usually* doesn't modify page tables. One
> > > corner case is follow_pfn_pte(). All other modifications should happen in
> > > the actual fault handler that has to deal with such kind of unsharing either
> > > way when modifying the PTE.
> > > 
> > > If the pages are already in a COW-ed pagetable in the desired "shared" state
> > > (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
> > > pages will just work as expected and we shouldn't be surprised by another
> > > set of GUP+COW CVEs.
> > > 
> > > We'd really only deduplicate the page table and not play other tricks with
> > > the actual page table content that differ from the existing way of handling
> > > fork().
> > > 
> > > I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
> > > not modifying the page table. I think we only need "we have to unshare this
> > > page table now" in follow_pfn_pte() and inside the fault handling when GUP
> > > triggers a fault.
> > > 
> > > I hope my assumption is correct, or am I missing something?
> > > 
> > 
> > My consideration is when we pinned the page and did the COW to make the
> > page table be shared. It might not allow mapping the pinned page to R/O)
> > into both processes.
> > 
> > So, if the fork is working on the shared state, it needs to recover the
> > table and copy to a new one since that pinned page will need to copy
> > immediately. We can hold the shared state after occurring such a
> > situation. So we still need some trick to let the fork() know which page
> > table already has the pinned page (or such page won't let us share)
> > before going to duplicate.
> > 
> > Am I wrong here?
> 
> I think you might be overthinking this. Let's keep it simple:
> 
> 1) Handle pinned anon pages just as I described below, falling back to the
> "slow" path of page table copying.
> 
> 2) Once we passed that stage, you can be sure that the COW-ed page table
> cannot have actually pinned anon pages. All anon pages in such a page table
> have PageAnonExclusive cleared and are "maybe shared". GUP cannot succeed in
> pinning these pages anymore, because it will only pin exclusive anon pages!
> 
> 3) If anybody wants to take a R/O pin on a shared anon page that is mapped
> into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead
> of pinning the page. This has to break COW on the page table and properly
> map an exclusive anon page into it, breaking COW.
> 
> Do you see a problem with that?
> 
> > 
> > After that, since we handled the accounting in fork(), we don't need
> > ownership (pmd_t pointer) anymore. We have to find another way to mark
> > the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
> > stored at that space.)
> > 
> > > > 
> > > > > But devil is in the detail (page table lock, TLB flushing).
> > > > 
> > > > Sure, it might be an overhead in the page fault and needs to be handled
> > > > carefully. ;)
> > > > 
> > > > > "will make fork() even have more overhead" is not a good excuse for such
> > > > > complexity/hacks -- sure, it will make your benchmark results look better in
> > > > > comparison ;)
> > > > 
> > > > ;);)
> > > > I think that, even if we do the accounting with the COW page table, it
> > > > still has a little bit improve.
> > > 
> > > :)
> > > 
> > > My gut feeling is that this is true. While we have to do a pass over the
> > > parent page table during fork and wrprotect all PTEs etc., we don't have to
> > > duplicate the page table content and allocate/free memory for that.
> > > 
> > > One interesting case is when we cannot share an anon page with the child
> > > process because it maybe pinned -- and we have to copy it via
> > > copy_present_page(). In that case, the page table between the parent and the
> > > child would differ and we'd not be able to share the page table.
> > 
> > That is what I want to say above.
> > The case might happen in the middle of the shared page table progress.
> > It might cost more overhead to recover it. Therefore, if GUP wants to
> > pin the mapped page we can mark the PTE table first, so fork() won't
> > waste time doing the work for sharing.
> 
> Having pinned pages is a corner case for most apps. No need to worry about
> optimizing this corner case for now.
> 
> I see what you are trying to optimize, but I don't think this is needed in a
> first version, and probably never is needed.
> 
> 
> Any attempts to mark page tables in a certain way from GUP
> (COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast
> (get_user_pages_fast) can race with pretty much anything, even with
> concurrent fork. I suspect your current code might be really racy in that
> regard.

I see.
Now, I know why optimizing that corner case is not worth it.
Thank you for explaining that.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-29 18:57                 ` Chih-En Lin
@ 2022-09-29 19:00                   ` David Hildenbrand
  0 siblings, 0 replies; 38+ messages in thread
From: David Hildenbrand @ 2022-09-29 19:00 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Nadav Amit, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On 29.09.22 20:57, Chih-En Lin wrote:
> On Thu, Sep 29, 2022 at 08:38:52PM +0200, David Hildenbrand wrote:
>> On 29.09.22 20:29, Chih-En Lin wrote:
>>> On Thu, Sep 29, 2022 at 07:24:31PM +0200, David Hildenbrand wrote:
>>>>>> IMHO, a relaxed form that focuses on only the memory consumption reduction
>>>>>> could *possibly* be accepted upstream if it's not too invasive or complex.
>>>>>> During fork(), we'd do exactly what we used to do to PTEs (increment
>>>>>> mapcount, refcount, trying to clear PageAnonExclusive, map the page R/O,
>>>>>> duplicate swap entries; all while holding the page table lock), however,
>>>>>> sharing the prepared page table with the child process using COW after we
>>>>>> prepared it.
>>>>>>
>>>>>> Any (most once we want to *optimize* rmap handling) modification attempts
>>>>>> require breaking COW -- copying the page table for the faulting process. But
>>>>>> at that point, the PTEs are already write-protected and properly accounted
>>>>>> (refcount/mapcount/PageAnonExclusive).
>>>>>>
>>>>>> Doing it that way might not require any questionable GUP hacks and swapping,
>>>>>> MMU notifiers etc. "might just work as expected" because the accounting
>>>>>> remains unchanged" -- we simply de-duplicate the page table itself we'd have
>>>>>> after fork and any modification attempts simply replace the mapped copy.
>>>>>
>>>>> Agree.
>>>>> However for GUP hacks, if we want to do the COW to page table, we still
>>>>> need the hacks in this patch (using the COW_PTE_OWN_EXCLUSIVE flag to
>>>>> check whether the PTE table is available or not before we do the COW to
>>>>> the table). Otherwise, it will be more complicated since it might need
>>>>> to handle situations like while preparing the COW work, it just figuring
>>>>> out that it needs to duplicate the whole table and roll back (recover
>>>>> the state and copy it to new table). Hopefully, I'm not wrong here.
>>>>
>>>> The nice thing is that GUP itself *usually* doesn't modify page tables. One
>>>> corner case is follow_pfn_pte(). All other modifications should happen in
>>>> the actual fault handler that has to deal with such kind of unsharing either
>>>> way when modifying the PTE.
>>>>
>>>> If the pages are already in a COW-ed pagetable in the desired "shared" state
>>>> (e.g., PageAnonExclusive cleared on an anonymous page), R/O pinning of such
>>>> pages will just work as expected and we shouldn't be surprised by another
>>>> set of GUP+COW CVEs.
>>>>
>>>> We'd really only deduplicate the page table and not play other tricks with
>>>> the actual page table content that differ from the existing way of handling
>>>> fork().
>>>>
>>>> I don't immediately see why we need COW_PTE_OWN_EXCLUSIVE in GUP code when
>>>> not modifying the page table. I think we only need "we have to unshare this
>>>> page table now" in follow_pfn_pte() and inside the fault handling when GUP
>>>> triggers a fault.
>>>>
>>>> I hope my assumption is correct, or am I missing something?
>>>>
>>>
>>> My consideration is when we pinned the page and did the COW to make the
>>> page table be shared. It might not allow mapping the pinned page to R/O)
>>> into both processes.
>>>
>>> So, if the fork is working on the shared state, it needs to recover the
>>> table and copy to a new one since that pinned page will need to copy
>>> immediately. We can hold the shared state after occurring such a
>>> situation. So we still need some trick to let the fork() know which page
>>> table already has the pinned page (or such page won't let us share)
>>> before going to duplicate.
>>>
>>> Am I wrong here?
>>
>> I think you might be overthinking this. Let's keep it simple:
>>
>> 1) Handle pinned anon pages just as I described below, falling back to the
>> "slow" path of page table copying.
>>
>> 2) Once we passed that stage, you can be sure that the COW-ed page table
>> cannot have actually pinned anon pages. All anon pages in such a page table
>> have PageAnonExclusive cleared and are "maybe shared". GUP cannot succeed in
>> pinning these pages anymore, because it will only pin exclusive anon pages!
>>
>> 3) If anybody wants to take a R/O pin on a shared anon page that is mapped
>> into a COW-ed page table, we trigger a fault with FAULT_FLAG_UNSHARE instead
>> of pinning the page. This has to break COW on the page table and properly
>> map an exclusive anon page into it, breaking COW.
>>
>> Do you see a problem with that?
>>
>>>
>>> After that, since we handled the accounting in fork(), we don't need
>>> ownership (pmd_t pointer) anymore. We have to find another way to mark
>>> the table to be exclusive. (Right now, COW_PTE_OWNER_EXCLUSIVE flag is
>>> stored at that space.)
>>>
>>>>>
>>>>>> But devil is in the detail (page table lock, TLB flushing).
>>>>>
>>>>> Sure, it might be an overhead in the page fault and needs to be handled
>>>>> carefully. ;)
>>>>>
>>>>>> "will make fork() even have more overhead" is not a good excuse for such
>>>>>> complexity/hacks -- sure, it will make your benchmark results look better in
>>>>>> comparison ;)
>>>>>
>>>>> ;);)
>>>>> I think that, even if we do the accounting with the COW page table, it
>>>>> still has a little bit improve.
>>>>
>>>> :)
>>>>
>>>> My gut feeling is that this is true. While we have to do a pass over the
>>>> parent page table during fork and wrprotect all PTEs etc., we don't have to
>>>> duplicate the page table content and allocate/free memory for that.
>>>>
>>>> One interesting case is when we cannot share an anon page with the child
>>>> process because it maybe pinned -- and we have to copy it via
>>>> copy_present_page(). In that case, the page table between the parent and the
>>>> child would differ and we'd not be able to share the page table.
>>>
>>> That is what I want to say above.
>>> The case might happen in the middle of the shared page table progress.
>>> It might cost more overhead to recover it. Therefore, if GUP wants to
>>> pin the mapped page we can mark the PTE table first, so fork() won't
>>> waste time doing the work for sharing.
>>
>> Having pinned pages is a corner case for most apps. No need to worry about
>> optimizing this corner case for now.
>>
>> I see what you are trying to optimize, but I don't think this is needed in a
>> first version, and probably never is needed.
>>
>>
>> Any attempts to mark page tables in a certain way from GUP
>> (COW_PTE_OWNER_EXCLUSIVE) is problematic either way: GUP-fast
>> (get_user_pages_fast) can race with pretty much anything, even with
>> concurrent fork. I suspect your current code might be really racy in that
>> regard.
> 
> I see.
> Now, I know why optimizing that corner case is not worth it.
> Thank you for explaining that.

Falling back after already processing some PTEs requires some care, 
though. I guess it's not too hard to get it right -- it might be harder 
to get it "clean". But we can talk about that detail later.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table
  2022-09-29 18:40               ` Nadav Amit
@ 2022-09-29 19:02                 ` Chih-En Lin
  0 siblings, 0 replies; 38+ messages in thread
From: Chih-En Lin @ 2022-09-29 19:02 UTC (permalink / raw)
  To: Nadav Amit
  Cc: David Hildenbrand, Andrew Morton, Qi Zheng, Matthew Wilcox,
	Christophe Leroy, linux-kernel, linux-mm, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Vlastimil Babka, William Kucharski,
	Kirill A . Shutemov, Peter Xu, Suren Baghdasaryan, Arnd Bergmann,
	Tong Tiangen, Pasha Tatashin, Li kunyu, Anshuman Khandual,
	Minchan Kim, Yang Shi, Song Liu, Miaohe Lin, Thomas Gleixner,
	Sebastian Andrzej Siewior, Andy Lutomirski, Fenghua Yu,
	Dinglan Peng, Pedro Fonseca, Jim Huang, Huichun Feng

On Thu, Sep 29, 2022 at 06:40:36PM +0000, Nadav Amit wrote:
> On Sep 29, 2022, at 11:29 AM, Chih-En Lin <shiyn.lin@gmail.com> wrote:
> 
> > That case could be caught in copy_pte_range(): in case we'd have to allocate
> >> a page via page_copy_prealloc(), we'd have to fall back to the ordinary
> >> "separate page table for the child" way of doing things.
> >> 
> >> But that looks doable to me.
> > 
> > Sounds good. :)
> 
> Chih-En, I admit I did not fully read the entire correspondence and got deep
> into all the details.
> 
> I would note, however, that there are several additional components that I
> did not see (and perhaps missed) in your patches. Basically, there are many
> page-table manipulations that are done not through the page-fault handler or
> reclamation mechanisms. I did not see any of them being addressed.
> 
> So if/when you send a new version, please have a look at mprotect(),
> madvise(), soft-dirty, userfaultfd and THP. In these cases, I presume, you
> would have to COW-break (aka COW-unshare) the page-tables.
> 

Sure. Before I send the new version I will try to handle all of them.
Thank you for the note.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2022-09-29 19:02 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-27 16:29 [RFC PATCH v2 0/9] Introduce Copy-On-Write to Page Table Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 1/9] mm: Add new mm flags for Copy-On-Write PTE table Chih-En Lin
2022-09-27 17:23   ` Nadav Amit
2022-09-27 17:36     ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 2/9] mm: pgtable: Add sysctl to enable COW PTE Chih-En Lin
2022-09-27 17:27   ` Nadav Amit
2022-09-27 18:05     ` Chih-En Lin
2022-09-27 21:22   ` John Hubbard
2022-09-28  8:36     ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 3/9] mm, pgtable: Add ownership to PTE table Chih-En Lin
2022-09-27 17:30   ` Nadav Amit
2022-09-27 18:23     ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 4/9] mm: Add COW PTE fallback functions Chih-En Lin
2022-09-27 17:51   ` Nadav Amit
2022-09-27 19:00     ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 5/9] mm, pgtable: Add a refcount to PTE table Chih-En Lin
2022-09-27 17:59   ` Nadav Amit
2022-09-27 19:07     ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 6/9] mm, pgtable: Add COW_PTE_OWNER_EXCLUSIVE flag Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 7/9] mm: Add the break COW PTE handler Chih-En Lin
2022-09-27 18:15   ` Nadav Amit
2022-09-27 19:23     ` Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 8/9] mm: Handle COW PTE with reclaim algorithm Chih-En Lin
2022-09-27 16:29 ` [RFC PATCH v2 9/9] mm: Introduce Copy-On-Write PTE table Chih-En Lin
2022-09-27 18:38   ` Nadav Amit
2022-09-27 19:53     ` Chih-En Lin
2022-09-27 21:26       ` John Hubbard
2022-09-28  8:52         ` Chih-En Lin
2022-09-28 14:03       ` David Hildenbrand
2022-09-29 13:38         ` Chih-En Lin
2022-09-29 13:49           ` Chih-En Lin
2022-09-29 17:24           ` David Hildenbrand
2022-09-29 18:29             ` Chih-En Lin
2022-09-29 18:38               ` David Hildenbrand
2022-09-29 18:57                 ` Chih-En Lin
2022-09-29 19:00                   ` David Hildenbrand
2022-09-29 18:40               ` Nadav Amit
2022-09-29 19:02                 ` Chih-En Lin
