* [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
@ 2023-02-07  3:51 Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 01/14] mm: Allow user to control COW PTE via prctl Chih-En Lin
                   ` (14 more replies)
  0 siblings, 15 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

v3 -> v4
- Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
  s390 and powerpc32, don't support the PMD entry and PTE table
  operations.
- Fix the mismatched type of break_cow_pte_range() in
  migrate_vma_collect_pmd().
- Don't break COW PTE in folio_referenced_one().
- Fix the wrong VMA range checking in break_cow_pte_range().
- Only break COW when we modify the soft-dirty bit in
  clear_refs_pte_range().
- Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
- Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
  tlb_flush_pmd_range().
- Handle VM_DONTCOPY with COW PTE fork.
- Fix the wrong address and invalid vma in recover_pte_range().
- Fix the infinite page fault loop in GUP routine.
  In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
  handler, we return -EMLINK to let GUP handle the page fault (i.e.,
  call faultin_page() in __get_user_pages()).
- Return not_found(pvmw) if breaking COW PTE failed in
  page_vma_mapped_walk().
- COW PTE produces the same results on the COW selftests as the normal
  kernel, so it probably passes them. The remaining failures are the
  anon COW hugetlb tests:

	# [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
	not ok 33 No leak from parent into child
	# [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
	not ok 44 No leak from parent into child
	# [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
	not ok 55 No leak from child into parent
	# [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
	not ok 66 No leak from child into parent

	Bail out! 4 out of 147 tests failed
	# Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
  See more information about the anon COW hugetlb tests:
    https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/


v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/

RFC v2 -> v3
- Change the sysctl with PID to prctl(PR_SET_COW_PTE).
- Account all the COW PTE mapped pages in fork() instead of deferring
  it to the page fault (break COW PTE).
- If there is an unshareable mapped page (e.g., pinned or device
  private), recover all the entries that were already handled by the
  COW PTE fork, then copy them to the new PTE table.
- Remove the COW_PTE_OWNER_EXCLUSIVE flag and handle the only remaining
  GUP case, follow_pfn_pte().
- Remove the PTE ownership since we don't need it.
- Use the pte lock to protect breaking COW PTE and freeing COW-ed PTEs.
- Do TLB flushing in break COW PTE handler.
- Handle THP, KSM, madvise, mprotect, uffd and migrate device.
- Handle the replacement page of uprobe.
- Handle the clear_refs_write() of fs/proc.
- All the benchmark numbers dropped because of the accounting and the
  pte lock. The v3 benchmarks are worse than RFC v2; most cases are
  similar to the normal fork, but one use case (TriforceAFL) still
  beats the normal fork version.

RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/

RFC v1 -> RFC v2
- Change the clone flag method to sysctl with PID.
- Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
  MMF_COW_PTE_READY, for the sysctl.
- Change the owner pointer to use the folio padding.
- Handle all the VMAs that cover the PTE table when doing the break COW PTE.
- Remove the self-defined refcount to use the _refcount for the page
  table page.
- Add the exclusive flag to let the page table be owned by only one
  task in some situations.
- Invalidate the address range with the MMU notifier and start the
  write_seqcount when breaking COW PTE.
- Handle the swap cache and swapoff.

RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/

---

Currently, copy-on-write is only applied to the mapped memory; the
child process still needs to copy the entire page table from the
parent during fork. Copying the page table can cost a lot of time and
memory when the parent has a large page table allocated. For example,
the memory usage of a process after forking with 1 GB of mapped memory
is as follows:

              DEFAULT FORK
          parent         child
VmRSS:   1049688 kB    1048688 kB
VmPTE:      2096 kB       2096 kB

This patch set introduces copy-on-write (COW) for PTE-level page
tables. COW PTE improves performance in situations where the user
needs many copies of a program running in isolated environments.
Feedback-based fuzzers (e.g., AFL) and serverless/microservice
frameworks are two major examples. For instance, COW PTE achieves a
1.03x throughput increase when running TriforceAFL.

After applying COW to PTE, the memory usage after forking is as follows:

                 COW PTE
          parent         child
VmRSS:	 1049968 kB       2576 kB
VmPTE:	    2096 kB         44 kB

The results show that this patch significantly decreases memory usage.
The latency numbers are discussed below.
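
For reference, here is a minimal opt-in sketch (our illustration, not
part of the series; it assumes the PR_SET_COW_PTE prctl added in patch
01 is visible in the uapi headers):

	#include <stdio.h>
	#include <string.h>
	#include <sys/prctl.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#ifndef PR_SET_COW_PTE
	#define PR_SET_COW_PTE 65	/* value added by patch 01 */
	#endif

	int main(void)
	{
		/* Mark this mm to COW-share PTE tables at the next fork(). */
		if (prctl(PR_SET_COW_PTE, 0, 0, 0, 0))
			perror("prctl(PR_SET_COW_PTE)");

		if (fork() == 0) {
			/* Child: print VmPTE; compare with a normal fork. */
			char line[128];
			FILE *f = fopen("/proc/self/status", "r");

			while (f && fgets(line, sizeof(line), f))
				if (!strncmp(line, "VmPTE", 5))
					fputs(line, stdout);
			_exit(0);
		}
		wait(NULL);
		return 0;
	}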

Real-world application benchmarks
=================================

We ran fuzzing and VM cloning benchmarks. The experiments were done
with either the normal fork or the fork with COW PTE.

With AFL (LLVM mode) and SQLite, COW PTE (52.15 execs/sec) is slightly
worse than the normal fork version (53.50 execs/sec).

                   fork
       execs_per_sec     unix_time        time
count    28.000000  2.800000e+01   28.000000
mean     53.496786  1.671270e+09   96.107143
std       3.625060  7.194717e+01   71.947172
min      35.350000  1.671270e+09    0.000000
25%      53.967500  1.671270e+09   33.750000
50%      54.235000  1.671270e+09   92.000000
75%      54.525000  1.671270e+09  149.250000
max      55.100000  1.671270e+09  275.000000

                 COW PTE
       execs_per_sec     unix_time        time
count    34.000000  3.400000e+01   34.000000
mean     52.150000  1.671268e+09  103.323529
std       3.218271  7.507682e+01   75.076817
min      34.250000  1.671268e+09    0.000000
25%      52.500000  1.671268e+09   42.250000
50%      52.750000  1.671268e+09   94.500000
75%      52.952500  1.671268e+09  150.750000
max      53.680000  1.671268e+09  285.000000


With TriforceAFL, which does kernel fuzzing with QEMU, COW PTE
(105.54 execs/sec) achieves a 1.03x throughput increase over the
normal fork version (102.30 execs/sec).

                   fork
     execs_per_sec     unix_time        time
count    38.000000  3.800000e+01   38.000000
mean    102.299737  1.671269e+09  156.289474
std      20.139268  8.717113e+01   87.171130
min       6.600000  1.671269e+09    0.000000
25%      95.657500  1.671269e+09   82.250000
50%     109.950000  1.671269e+09  176.500000
75%     113.972500  1.671269e+09  223.750000
max     118.790000  1.671269e+09  281.000000

                 COW PTE
     execs_per_sec     unix_time        time
count    42.000000  4.200000e+01   42.000000
mean    105.540714  1.671269e+09  163.476190
std      19.443517  8.858845e+01   88.588453
min       6.200000  1.671269e+09    0.000000
25%      96.585000  1.671269e+09  123.500000
50%     113.925000  1.671269e+09  180.500000
75%     116.940000  1.671269e+09  233.500000
max     121.090000  1.671269e+09  286.000000

Microbenchmark - syscall latency
================================

We ran microbenchmarks to measure the latency of a fork syscall with
mapped memory sizes ranging from 0 to 512 MB. The results show that
the latency of a normal fork reaches 10 ms at the top of that range;
the latency of a fork with COW PTE is also around 10 ms.

Microbenchmark - page fault latency
===================================

We conducted microbenchmarks to measure page fault latency with
different patterns of access to a 512 MB memory buffer after forking.

In the first experiment, the program accesses the entire 512 MB of
memory by writing to all the pages consecutively. The experiment is
done with both the normal fork and the fork with COW PTE, and it
calculates the average latency of a single access. The COW PTE page
fault latency (0.000795 ms) is close to the normal fork fault latency
(0.000770 ms). Here are the raw numbers:

Page fault - Access to the entire 512 MB memory

fork mean: 0.000770 ms
fork median: 0.000769 ms
fork std: 0.000010 ms

COW PTE mean: 0.000795 ms
COW PTE median: 0.000795 ms
COW PTE std: 0.000009 ms

The second experiment simulates real-world applications with sparse
accesses. The program randomly accesses the memory by writing to one
random page 1 million times and calculates the average access time;
we then run both versions 100 times to get the averages. The result
shows that COW PTE (0.000029 ms) is similar to the normal fork
(0.000026 ms).

Page fault - Random access

fork mean: 0.000026 ms
fork median: 0.000025 ms
fork std: 0.000002 ms

COW PTE mean: 0.000029 ms
COW PTE median: 0.000026 ms
COW PTE std: 0.000004 ms
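
For clarity, here is a sketch of the two access patterns (our
reconstruction; the actual benchmark harness and its timing code are
not part of this posting):

	#include <stdlib.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define BUF_SZ	(512UL << 20)	/* 512 MB, mapped before fork() */

	/* Pattern 1: write every page once, consecutively. */
	static void sequential(volatile char *buf)
	{
		for (unsigned long off = 0; off < BUF_SZ; off += 4096)
			buf[off] = 1;
	}

	/* Pattern 2: write one random page, 1 million times. */
	static void random_access(volatile char *buf)
	{
		for (int i = 0; i < 1000000; i++)
			buf[(random() % (BUF_SZ >> 12)) << 12] = 1;
	}

	int main(void)
	{
		volatile char *buf = mmap(NULL, BUF_SZ, PROT_READ | PROT_WRITE,
					  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		sequential(buf);	/* populate in the parent */
		if (fork() == 0) {	/* measured in the child, after fork */
			sequential(buf);
			random_access(buf);
		}
		return 0;
	}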

All the tests were run with QEMU and the kernel was built with
the x86_64 default config (v3 patch set).

Summary
=======

In summary, COW PTE reduces the memory footprint of processes and
improves the performance for some use cases.

This patch set is based on the paper "On-demand-fork: a microsecond
fork for memory-intensive and latency-sensitive applications" [1] from
Purdue University.

Any comments and suggestions are welcome.

Thanks,
Chih-En Lin

---

[1] https://dl.acm.org/doi/10.1145/3447786.3456258

This patch set is based on v6.2-rc7.

---

Chih-En Lin (14):
  mm: Allow user to control COW PTE via prctl
  mm: Add Copy-On-Write PTE to fork()
  mm: Add break COW PTE fault and helper functions
  mm/rmap: Break COW PTE in rmap walking
  mm/khugepaged: Break COW PTE before scanning pte
  mm/ksm: Break COW PTE before modify shared PTE
  mm/madvise: Handle COW-ed PTE with madvise()
  mm/gup: Trigger break COW PTE before calling follow_pfn_pte()
  mm/mprotect: Break COW PTE before changing protection
  mm/userfaultfd: Support COW PTE
  mm/migrate_device: Support COW PTE
  fs/proc: Support COW PTE with clear_refs_write
  events/uprobes: Break COW PTE before replacing page
  mm: fork: Enable COW PTE to fork system call

 fs/proc/task_mmu.c                 |   5 +
 include/linux/mm.h                 |  37 ++
 include/linux/pgtable.h            |   6 +
 include/linux/rmap.h               |   2 +
 include/linux/sched/coredump.h     |  12 +-
 include/trace/events/huge_memory.h |   1 +
 include/uapi/linux/prctl.h         |   6 +
 kernel/events/uprobes.c            |   2 +-
 kernel/fork.c                      |   7 +
 kernel/sys.c                       |  11 +
 mm/Kconfig                         |   9 +
 mm/gup.c                           |   8 +-
 mm/khugepaged.c                    |  35 +-
 mm/ksm.c                           |   4 +-
 mm/madvise.c                       |  13 +
 mm/memory.c                        | 642 ++++++++++++++++++++++++++++-
 mm/migrate.c                       |   3 +-
 mm/migrate_device.c                |   2 +
 mm/mmap.c                          |   4 +
 mm/mprotect.c                      |   9 +
 mm/mremap.c                        |   2 +
 mm/page_vma_mapped.c               |   4 +
 mm/rmap.c                          |   9 +-
 mm/swapfile.c                      |   2 +
 mm/userfaultfd.c                   |   6 +
 mm/vmscan.c                        |   3 +-
 26 files changed, 826 insertions(+), 18 deletions(-)

-- 
2.34.1



* [PATCH v4 01/14] mm: Allow user to control COW PTE via prctl
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 02/14] mm: Add Copy-On-Write PTE to fork() Chih-En Lin
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Add a new prctl, PR_SET_COW_PTE, to allow the user to enable COW PTE.
Since there is a time gap between using the prctl to enable COW PTE
and doing the fork, we use two states (MMF_COW_PTE_READY and
MMF_COW_PTE) to determine whether a task wants to do COW PTE or is
already doing it.

The MMF_COW_PTE_READY flag marks the task to do COW PTE at the next
fork(). During fork(), if MMF_COW_PTE_READY is set, fork() will clear
it and set the MMF_COW_PTE flag. After that, fork() may share PTE
tables instead of duplicating them.
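
For illustration, the intended flag transition at fork time looks
roughly like this (a sketch only; the actual fork-side hook lands
later in this series, in kernel/fork.c):

	/* Sketch: evaluated once per fork() on the parent's mm. */
	if (test_bit(MMF_COW_PTE_READY, &mm->flags)) {
		clear_bit(MMF_COW_PTE_READY, &mm->flags);
		set_bit(MMF_COW_PTE, &mm->flags);
		/* Subsequent page table copying may now share PTE tables. */
	}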

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/sched/coredump.h | 12 +++++++++++-
 include/uapi/linux/prctl.h     |  6 ++++++
 kernel/sys.c                   | 11 +++++++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 8270ad7ae14c..570d599ebc85 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -83,7 +83,17 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_PINNED		27	/* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
+/*
+ * MMF_COW_PTE_READY: Mark the task to do COW PTE at the next fork().
+ * During fork(), if MMF_COW_PTE_READY is set, fork() will clear it and
+ * set the MMF_COW_PTE flag. After that, fork() may share PTE tables
+ * rather than duplicate them.
+ */
+#define MMF_COW_PTE_READY	29 /* Share PTE tables at the next fork() */
+#define MMF_COW_PTE		30 /* PTE tables are shared between processes */
+#define MMF_COW_PTE_MASK	(1 << MMF_COW_PTE)
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-				 MMF_DISABLE_THP_MASK)
+				 MMF_DISABLE_THP_MASK | MMF_COW_PTE_MASK)
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a5e06dcbba13..664a3c023019 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -284,4 +284,10 @@ struct prctl_mm_map {
 #define PR_SET_VMA		0x53564d41
 # define PR_SET_VMA_ANON_NAME		0
 
+/*
+ * Set the prepare flag, MMF_COW_PTE_READY, to share the page table
+ * (copy-on-write) at the next fork.
+ */
+#define PR_SET_COW_PTE			65
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 88b31f096fb2..eeab3093026f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2350,6 +2350,14 @@ static int prctl_set_vma(unsigned long opt, unsigned long start,
 }
 #endif /* CONFIG_ANON_VMA_NAME */
 
+static int prctl_set_cow_pte(struct mm_struct *mm)
+{
+	if (test_bit(MMF_COW_PTE, &mm->flags))
+		return -EINVAL;
+	set_bit(MMF_COW_PTE_READY, &mm->flags);
+	return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2628,6 +2636,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_VMA:
 		error = prctl_set_vma(arg2, arg3, arg4, arg5);
 		break;
+	case PR_SET_COW_PTE:
+		error = prctl_set_cow_pte(me->mm);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
2.34.1



* [PATCH v4 02/14] mm: Add Copy-On-Write PTE to fork()
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 01/14] mm: Allow user to control COW PTE via prctl Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 03/14] mm: Add break COW PTE fault and helper functions Chih-En Lin
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Add copy_cow_pte_range() and recover_pte_range() for copy-on-write
(COW) PTE in the fork system call. During a COW PTE fork, when
processing the shared PTE table, we traverse all the entries to
determine whether each currently mapped page can be shared between
processes. If the PTE table can be shared, we account those mapped
pages and then share the PTE table. However, once we find an
unshareable mapped page, e.g., a pinned page, we have to copy it via
copy_present_page(), which means we fall back to the default path,
page table copying (copy_pte_range()). And, since we may have already
processed some COW-ed PTE entries, we have to recover those entries
before starting the default path.

All the COW PTE behaviors are protected by the pte lock.
The logic for handling non-present/present pte entries and errors in
copy_cow_pte_range() is the same as in copy_pte_range(). But to keep
the code clean (e.g., to avoid conditional locking), we introduce new
functions instead of modifying copy_pte_range().

To track the lifetime of a COW-ed PTE table, introduce a refcount for
the PTE table. We reuse the _refcount in struct page of the page table
page to maintain the number of processes referencing the COW-ed PTE
table. Forking with COW PTE increases the refcount. And, when someone
writes to a COW-ed PTE, it causes a write fault that breaks COW PTE.
If the refcount of the COW-ed PTE table is one, the process that
triggered the fault reuses the table. Otherwise, the process decreases
the refcount and duplicates the table.
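
To make the life cycle concrete, here is a sketch using the helpers
introduced below (refcounts shown for a single PTE table):

	/*
	 * Refcount life cycle of a COW-ed PTE table (sketch):
	 *
	 *   table allocated:  set_page_count(page, 1)    refcount = 1
	 *   fork #1:          pmd_get_pte(src_pmd)       refcount = 2
	 *   fork #2:          pmd_get_pte(src_pmd)       refcount = 3
	 *   a sharer writes:  cow_pte_count(pmd) > 1, so
	 *                     pmd_put_pte() + duplicate  refcount = 2
	 *   last owner
	 *   writes:           cow_pte_count(pmd) == 1, so
	 *                     reuse: make the pmd entry writable again
	 */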

Since we share the PTE table between the parent and child, the state
of the parent's pte entries differs between COW PTE and the normal
fork. COW PTE handles all the pte entries on the child side, which
means it clears the dirty and accessed bits of the parent's pte
entries.

And, since some architectures, e.g., s390 and powerpc32, don't
support the PMD entry and PTE table operations, add a new Kconfig
option, COW_PTE. COW_PTE depends on (HAVE_ARCH_TRANSPARENT_HUGEPAGE
&& !PREEMPT_RT), the same condition as the TRANSPARENT_HUGEPAGE
config, since most of the operations in COW PTE depend on it.
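
With the option enabled at build time, the feature still stays off at
runtime until a process opts in with the prctl:

	# .config fragment (build-time switch only)
	CONFIG_COW_PTE=y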

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/mm.h |  20 +++
 mm/Kconfig         |   9 ++
 mm/memory.c        | 303 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 332 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f857163ac89..22e1e5804e96 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2482,6 +2482,23 @@ static inline bool ptlock_init(struct page *page) { return true; }
 static inline void ptlock_free(struct page *page) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
+#ifdef CONFIG_COW_PTE
+static inline int pmd_get_pte(pmd_t *pmd)
+{
+	return page_ref_inc_return(pmd_page(*pmd));
+}
+
+static inline bool pmd_put_pte(pmd_t *pmd)
+{
+	return page_ref_add_unless(pmd_page(*pmd), -1, 1);
+}
+
+static inline int cow_pte_count(pmd_t *pmd)
+{
+	return page_count(pmd_page(*pmd));
+}
+#endif
+
 static inline void pgtable_init(void)
 {
 	ptlock_cache_init();
@@ -2494,6 +2511,9 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 		return false;
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
+#ifdef CONFIG_COW_PTE
+	set_page_count(page, 1);
+#endif
 	return true;
 }
 
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..7dcceeb4196b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -822,6 +822,15 @@ config READ_ONLY_THP_FOR_FS
 
 endif # TRANSPARENT_HUGEPAGE
 
+menuconfig COW_PTE
+	bool "Copy-on-write PTE table"
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
+	help
+	  Extend the copy-on-write (COW) mechanism to the PTE table
+	  (the bottom level of the page-table hierarchy). To enable this
+	  feature, a process must set prctl(PR_SET_COW_PTE) before the
+	  fork system call.
+
 #
 # UP and nommu archs use km based percpu allocator
 #
diff --git a/mm/memory.c b/mm/memory.c
index 3e836fecd035..7d2a1d24db56 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -739,11 +739,17 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma,
 		struct vm_area_struct *src_vma, unsigned long addr, int *rss)
 {
+	/* With COW PTE, dst_vma is src_vma. */
 	unsigned long vm_flags = dst_vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
 	swp_entry_t entry = pte_to_swp_entry(pte);
 
+	/*
+	 * If it's COW PTE, the parent shares the PTE table with the child, so
+	 * the following modifications in the child also affect the parent.
+	 */
+
 	if (likely(!non_swap_entry(entry))) {
 		if (swap_duplicate(entry) < 0)
 			return -EIO;
@@ -886,6 +892,8 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 /*
  * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
  * is required to copy this pte.
+ * However, if prealloc is NULL, it is COW PTE. We should return and fall
+ * back to copying the PTE table.
  */
 static inline int
 copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
@@ -909,6 +917,14 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
 			/* Page maybe pinned, we have to copy. */
 			put_page(page);
+			/*
+			 * If prealloc is NULL, we are processing a shared page
+			 * table (COW PTE, in copy_cow_pte_range()). We cannot
+			 * call copy_present_page() right now; instead, we
+			 * should fall back to copy_pte_range().
+			 */
+			if (!prealloc)
+				return -EAGAIN;
 			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
 						 addr, rss, prealloc, page);
 		}
@@ -929,6 +945,11 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	}
 	VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page));
 
+	/*
+	 * If it's COW PTE, the parent shares the PTE table with the child,
+	 * which means the following will also affect the parent.
+	 */
+
 	/*
 	 * If it's a shared mapping, mark it clean in
 	 * the child
@@ -937,6 +958,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		pte = pte_mkclean(pte);
 	pte = pte_mkold(pte);
 
+	/* For COW PTE, dst_vma is still src_vma. */
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_clear_uffd_wp(pte);
 
@@ -963,6 +985,8 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
 	return new_page;
 }
 
+
+/* copy_pte_range() will immediately allocate a new page table. */
 static int
 copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -1087,6 +1111,227 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	return ret;
 }
 
+#ifdef CONFIG_COW_PTE
+/*
+ * copy_cow_pte_range() will try to share the page table with the child.
+ * The logic of non-present, present and error handling is the same as
+ * in copy_pte_range(), but dst_vma and dst_pte are src_vma and src_pte.
+ *
+ * We cannot preserve soft-dirty information, because the PTE will be
+ * shared between multiple processes.
+ */
+static int
+copy_cow_pte_range(struct vm_area_struct *dst_vma,
+		   struct vm_area_struct *src_vma,
+		   pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+		   unsigned long end, unsigned long *recover_end)
+{
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	struct mm_struct *src_mm = src_vma->vm_mm;
+	struct vma_iterator vmi;
+	struct vm_area_struct *curr = src_vma;
+	pte_t *src_pte, *orig_src_pte;
+	spinlock_t *src_ptl;
+	int ret = 0;
+	int rss[NR_MM_COUNTERS];
+	swp_entry_t entry = (swp_entry_t){0};
+	unsigned long vm_end, orig_addr = addr;
+	pgtable_t pte_table = pmd_pgtable(*src_pmd);
+
+	end = (addr + PMD_SIZE) & PMD_MASK;
+	addr = addr & PMD_MASK;
+
+	/*
+	 * Increase the refcount to prevent the parent's PTE table from
+	 * being dropped/reused. Only increase the refcount the first
+	 * time it is attached.
+	 */
+	src_ptl = pte_lockptr(src_mm, src_pmd);
+	spin_lock(src_ptl);
+	pmd_get_pte(src_pmd);
+	pmd_install(dst_mm, dst_pmd, &pte_table);
+	spin_unlock(src_ptl);
+
+	/*
+	 * We should handle all of the entries in this PTE table during this
+	 * traversal, since we cannot promise that the next vma will not do
+	 * the lazy fork. The lazy fork skips the copying, which may leave
+	 * the COW-ed PTE in an incomplete state.
+	 */
+	vma_iter_init(&vmi, src_mm, addr);
+	for_each_vma_range(vmi, curr, end) {
+		vm_end = min(end, curr->vm_end);
+		addr = max(addr, curr->vm_start);
+
+		/* We don't share the PTE with VM_DONTCOPY. */
+		if (curr->vm_flags & VM_DONTCOPY) {
+			*recover_end = addr;
+			return -EAGAIN;
+		}
+again:
+		init_rss_vec(rss);
+		src_pte = pte_offset_map(src_pmd, addr);
+		src_ptl = pte_lockptr(src_mm, src_pmd);
+		orig_src_pte = src_pte;
+		spin_lock(src_ptl);
+		arch_enter_lazy_mmu_mode();
+
+		do {
+			if (pte_none(*src_pte))
+				continue;
+			if (unlikely(!pte_present(*src_pte))) {
+				/*
+				 * Although the parent's PTE is COW-ed, we
+				 * still need to handle all the swap entries.
+				 */
+				ret = copy_nonpresent_pte(dst_mm, src_mm,
+							  src_pte, src_pte,
+							  curr, curr,
+							  addr, rss);
+				if (ret == -EIO) {
+					entry = pte_to_swp_entry(*src_pte);
+					break;
+				} else if (ret == -EBUSY) {
+					break;
+				} else if (!ret)
+					continue;
+				/*
+				 * Device exclusive entry restored, continue by
+				 * copying the now present pte.
+				 */
+				WARN_ON_ONCE(ret != -ENOENT);
+			}
+			/*
+			 * copy_present_pte() will determine whether the mapped
+			 * page should be a COW mapping or not.
+			 */
+			ret = copy_present_pte(curr, curr, src_pte, src_pte,
+					       addr, rss, NULL);
+			/*
+			 * If we need a pre-allocated page for this pte,
+			 * drop the lock, recover all the entries, fall
+			 * back to copy_pte_range(), and try again.
+			 */
+			if (unlikely(ret == -EAGAIN))
+				break;
+		} while (src_pte++, addr += PAGE_SIZE, addr != vm_end);
+
+		arch_leave_lazy_mmu_mode();
+		add_mm_rss_vec(dst_mm, rss);
+		spin_unlock(src_ptl);
+		pte_unmap(orig_src_pte);
+		cond_resched();
+
+		if (ret == -EIO) {
+			VM_WARN_ON_ONCE(!entry.val);
+			if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			entry.val = 0;
+		} else if (ret == -EBUSY) {
+			goto out;
+		} else if (ret == -EAGAIN) {
+			/*
+			 * We have to allocate the page immediately, but first
+			 * we should recover the processed entries and fall
+			 * back to copy_pte_range().
+			 */
+			*recover_end = addr;
+			return -EAGAIN;
+		} else if (ret) {
+			VM_WARN_ON_ONCE(1);
+		}
+
+		/* We've captured and resolved the error. Reset, try again. */
+		ret = 0;
+		if (addr != vm_end)
+			goto again;
+	}
+
+out:
+	/*
+	 * All the pte entries are suitable for COW mapping.
+	 * Now, we can share the table with the child (COW PTE).
+	 */
+	pmdp_set_wrprotect(src_mm, orig_addr, src_pmd);
+	set_pmd_at(dst_mm, orig_addr, dst_pmd, pmd_wrprotect(*src_pmd));
+
+	return ret;
+}
+
+/* When recovering the pte entries, we should hold the locks entirely. */
+static int
+recover_pte_range(struct vm_area_struct *dst_vma,
+		  struct vm_area_struct *src_vma,
+		  pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long end)
+{
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	struct mm_struct *src_mm = src_vma->vm_mm;
+	struct vma_iterator vmi;
+	struct vm_area_struct *curr = src_vma;
+	pte_t *orig_src_pte, *orig_dst_pte;
+	pte_t *src_pte, *dst_pte;
+	spinlock_t *src_ptl, *dst_ptl;
+	unsigned long vm_end, addr = end & PMD_MASK;
+	int ret = 0;
+
+	/* Before we allocate the new PTE, clear the entry. */
+	mm_dec_nr_ptes(dst_mm);
+	pmd_clear(dst_pmd);
+	if (pte_alloc(dst_mm, dst_pmd))
+		return -ENOMEM;
+
+	/*
+	 * Traverse all the vmas that cover this PTE table until the
+	 * recover address (the first unshareable page).
+	 */
+	vma_iter_init(&vmi, src_mm, addr);
+	for_each_vma_range(vmi, curr, end) {
+		vm_end = min(end, curr->vm_end);
+		addr = max(addr, curr->vm_start);
+
+		orig_dst_pte = dst_pte = pte_offset_map(dst_pmd, addr);
+		dst_ptl = pte_lockptr(dst_mm, dst_pmd);
+		spin_lock(dst_ptl);
+
+		orig_src_pte = src_pte = pte_offset_map(src_pmd, addr);
+		src_ptl = pte_lockptr(src_mm, src_pmd);
+		spin_lock(src_ptl);
+		arch_enter_lazy_mmu_mode();
+
+		do {
+			if (pte_none(*src_pte))
+				continue;
+			/*
+			 * COW mapping details (e.g., PageAnonExclusive)
+			 * should already be handled by copy_cow_pte_range().
+			 * We can simply set the entry for the child.
+			 */
+			set_pte_at(dst_mm, addr, dst_pte, *src_pte);
+		} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+
+		arch_leave_lazy_mmu_mode();
+		spin_unlock(src_ptl);
+		pte_unmap(orig_src_pte);
+
+		spin_unlock(dst_ptl);
+		pte_unmap(orig_dst_pte);
+	}
+	/*
+	 * After recovering the entries, drop the child's reference.
+	 * The parent may still share with others, so don't make it writable.
+	 */
+	spin_lock(src_ptl);
+	pmd_put_pte(src_pmd);
+	spin_unlock(src_ptl);
+
+	cond_resched();
+
+	return ret;
+}
+#endif /* CONFIG_COW_PTE */
+
 static inline int
 copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	       pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
@@ -1115,6 +1360,64 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 				continue;
 			/* fall through */
 		}
+
+#ifdef CONFIG_COW_PTE
+		/*
+		 * If MMF_COW_PTE is set, copy_pte_range() will try to share
+		 * the PTE page table first. In other words, it attempts to
+		 * do COW on the PTE (and the mapped pages). However, if there
+		 * is any unshareable page (e.g., a pinned or device private
+		 * page), it falls back to the default path, which copies
+		 * the page table immediately.
+		 * In such a case, it stores the address of the first
+		 * unshareable page in recover_end, then goes back to the
+		 * beginning of the PTE table and recovers the COW-ed pte
+		 * entries until it meets the same unshareable page again.
+		 * During recovery, because the COW-ed pte entries are
+		 * logically the same as a COW mapping, it only needs to set
+		 * them in a newly allocated PTE table (again a COW mapping).
+		 */
+		if (test_bit(MMF_COW_PTE, &src_mm->flags)) {
+			unsigned long recover_end = 0;
+			int ret;
+
+			/*
+			 * Setting wrprotect on a pmd entry that maps a normal
+			 * PTE will trigger pmd_bad(). Skip the bad check here.
+			 */
+			if (pmd_none(*src_pmd))
+				continue;
+			/* Skip if the PTE table already did COW PTE this fork. */
+			if (!pmd_none(*dst_pmd) && !pmd_write(*dst_pmd))
+				continue;
+
+			ret = copy_cow_pte_range(dst_vma, src_vma,
+						 dst_pmd, src_pmd,
+						 addr, next, &recover_end);
+			if (!ret) {
+				/* COW PTE succeeded. */
+				continue;
+			} else if (ret == -EAGAIN) {
+				/* fall back to normal copy method. */
+				if (recover_pte_range(dst_vma, src_vma,
+						      dst_pmd, src_pmd,
+						      recover_end))
+					return -ENOMEM;
+				/*
+				 * Since we processed all the entries of PTE
+				 * table, recover_end may not in the src_vma.
+				 * If we already handled the src_vma, skip it.
+				 */
+				if (!range_in_vma(src_vma, recover_end,
+						  recover_end + PAGE_SIZE))
+					continue;
+				else
+					addr = recover_end;
+				/* fall through */
+			} else if (ret)
+				return -ENOMEM;
+		}
+#endif /* CONFIG_COW_PTE */
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
 		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
-- 
2.34.1



* [PATCH v4 03/14] mm: Add break COW PTE fault and helper functions
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 01/14] mm: Allow user to control COW PTE via prctl Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 02/14] mm: Add Copy-On-Write PTE to fork() Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 04/14] mm/rmap: Break COW PTE in rmap walking Chih-En Lin
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Add the function handle_cow_pte_fault() to break (unshare) a COW-ed
PTE table on a page fault that will modify the PTE table or a mapped
page residing in the COW-ed PTE table (i.e., a write, unshare, or
file read fault).

When breaking COW PTE, it first checks the COW-ed PTE table's refcount
to try to reuse it. If the COW-ed PTE table cannot be reused, it
allocates a new PTE table and duplicates all the pte entries of the
COW-ed one. Moreover, it flushes the TLB when we change the write
protection of the PTE.

In addition, provide the helper functions break_cow_pte{,_range}() for
other features (mremap, THP, migration, swapfile, etc.) to use.
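
The intended caller pattern, mirroring the mm/swapfile.c hunk below,
is to unshare the PTE table before writing through it (a sketch; the
error handling depends on the call site):

	/*
	 * break_cow_pte() is a no-op when MMF_COW_PTE is not set or the
	 * PTE table is not shared, so callers can invoke it
	 * unconditionally before modifying pte entries.
	 */
	if (break_cow_pte(vma, pmd, addr))
		return -ENOMEM;
	/* ... now safe to take the pte lock and write pte entries ... */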

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/mm.h      |  17 ++
 include/linux/pgtable.h |   6 +
 mm/memory.c             | 339 +++++++++++++++++++++++++++++++++++++++-
 mm/mmap.c               |   4 +
 mm/mremap.c             |   2 +
 mm/swapfile.c           |   2 +
 6 files changed, 363 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 22e1e5804e96..369355e13936 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2020,6 +2020,23 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
 
+#ifdef CONFIG_COW_PTE
+int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr);
+int break_cow_pte_range(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end);
+#else
+static inline int break_cow_pte(struct vm_area_struct *vma,
+				pmd_t *pmd, unsigned long addr)
+{
+	return 0;
+}
+static inline int break_cow_pte_range(struct vm_area_struct *vma,
+				      unsigned long start, unsigned long end)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 				  unsigned long address, unsigned int flags,
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1159b25b0542..72ff2a1cee5e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1406,6 +1406,12 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 	if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
 		(IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
 		return 1;
+	/*
+	 * A COW-ed PTE is write-protected, which can trigger pmd_bad().
+	 * To avoid this, return here if the entry is write-protected.
+	 */
+	if (!pmd_write(pmdval))
+		return 0;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
 		return 1;
diff --git a/mm/memory.c b/mm/memory.c
index 7d2a1d24db56..465742c6efa2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -192,6 +192,36 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
+#ifdef CONFIG_COW_PTE
+		/*
+		 * For a COW-ed PTE, the pte entries still map to pages.
+		 * However, we already did the de-accounting for all of
+		 * them. So, even if the refcount differs from the zapping
+		 * case, we can still fall back to a normal PTE and handle
+		 * it without traversing the entries to do the de-accounting.
+		 */
+		if (test_bit(MMF_COW_PTE, &tlb->mm->flags)) {
+			if (!pmd_none(*pmd) && !pmd_write(*pmd)) {
+				spinlock_t *ptl = pte_lockptr(tlb->mm, pmd);
+
+				spin_lock(ptl);
+				if (!pmd_put_pte(pmd)) {
+					pmd_t new = pmd_mkwrite(*pmd);
+
+					set_pmd_at(tlb->mm, addr, pmd, new);
+					spin_unlock(ptl);
+					free_pte_range(tlb, pmd, addr);
+					continue;
+				}
+				spin_unlock(ptl);
+
+				pmd_clear(pmd);
+				mm_dec_nr_ptes(tlb->mm);
+				tlb_flush_pmd_range(tlb, addr, PAGE_SIZE);
+			} else
+				VM_WARN_ON(cow_pte_count(pmd) != 1);
+		}
+#endif
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
 		free_pte_range(tlb, pmd, addr);
@@ -1654,6 +1684,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	bool pte_is_shared = false;
+
+#ifdef CONFIG_COW_PTE
+	if (test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd)) {
+		if (!range_in_vma(vma, addr & PMD_MASK,
+				  (addr + PMD_SIZE) & PMD_MASK)) {
+			/*
+			 * We cannot promise this COW-ed PTE will also be zapped
+			 * with the rest of the VMAs, so break COW PTE here.
+			 */
+			break_cow_pte(vma, pmd, addr);
+		} else {
+			start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			if (cow_pte_count(pmd) == 1) {
+				/* Reuse COW-ed PTE */
+				pmd_t new = pmd_mkwrite(*pmd);
+				set_pmd_at(tlb->mm, addr, pmd, new);
+			} else
+				pte_is_shared = true;
+			pte_unmap_unlock(start_pte, ptl);
+		}
+	}
+#endif
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1678,11 +1731,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
+			if (pte_is_shared)
+				ptent = *pte;
+			else
+				ptent = ptep_get_and_clear_full(mm, addr, pte,
+								tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
-			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
-						      ptent);
+			if (!pte_is_shared)
+				zap_install_uffd_wp_if_needed(vma, addr, pte,
+							      details, ptent);
 			if (unlikely(!page))
 				continue;
 
@@ -1754,8 +1811,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			/* We should have covered all the swap entry types */
 			WARN_ON_ONCE(1);
 		}
-		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-		zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+
+		if (!pte_is_shared) {
+			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			zap_install_uffd_wp_if_needed(vma, addr, pte,
+						      details, ptent);
+		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
@@ -2143,6 +2204,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 	if (retval)
 		goto out;
 	retval = -ENOMEM;
+	if (break_cow_pte(vma, NULL, addr))
+		goto out;
 	pte = get_locked_pte(vma->vm_mm, addr, &ptl);
 	if (!pte)
 		goto out;
@@ -2402,6 +2465,9 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	pte_t *pte, entry;
 	spinlock_t *ptl;
 
+	if (break_cow_pte(vma, NULL, addr))
+		return VM_FAULT_OOM;
+
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
 		return VM_FAULT_OOM;
@@ -2779,6 +2845,10 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
 	BUG_ON(addr >= end);
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
+
+	if (break_cow_pte_range(vma, addr, end))
+		return -ENOMEM;
+
 	flush_cache_range(vma, addr, end);
 	do {
 		next = pgd_addr_end(addr, end);
@@ -5159,6 +5229,233 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 	return VM_FAULT_FALLBACK;
 }
 
+#ifdef CONFIG_COW_PTE
+/* Break (unshare) COW PTE */
+static vm_fault_t handle_cow_pte_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t *pmd = vmf->pmd;
+	unsigned long start, end, addr = vmf->address;
+	struct mmu_notifier_range range;
+	pmd_t cowed_entry;
+	pte_t *orig_dst_pte, *orig_src_pte;
+	pte_t *dst_pte, *src_pte;
+	spinlock_t *dst_ptl, *src_ptl;
+	int ret = 0;
+
+	/*
+	 * Do nothing with a fault that doesn't have a PTE table yet
+	 * (from a lazy fork).
+	 */
+	if (pmd_none(*pmd) || pmd_write(*pmd))
+		return 0;
+	/* COW PTE doesn't handle huge pages. */
+	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+		return 0;
+
+	mmap_assert_write_locked(mm);
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+	addr = start;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
+				0, vma, mm, start, end);
+	/*
+	 * Because the address range covers the whole PTE table, not only
+	 * the faulted vma, there might be some mismatches since the mmu
+	 * notifier only registers the faulted vma.
+	 * Do we really need to care about this kind of mismatch?
+	 */
+	mmu_notifier_invalidate_range_start(&range);
+	raw_write_seqcount_begin(&mm->write_protect_seq);
+
+	/*
+	 * Fast path: check if we are the only task referencing this
+	 * COW-ed PTE; if so, reuse it.
+	 */
+	src_pte = pte_offset_map_lock(mm, pmd, addr, &src_ptl);
+	if (cow_pte_count(pmd) == 1) {
+		pmd_t new = pmd_mkwrite(*pmd);
+		set_pmd_at(mm, addr, pmd, new);
+		pte_unmap_unlock(src_pte, src_ptl);
+		goto flush_tlb;
+	}
+	/* We don't hold the lock when allocating the new PTE. */
+	pte_unmap_unlock(src_pte, src_ptl);
+
+	/*
+	 * Slow path. Since we already did the accounting and are still
+	 * sharing the mapped pages, we can just clone the PTE table.
+	 */
+
+	cowed_entry = READ_ONCE(*pmd);
+	/* Decrease the pgtable_bytes of COW-ed PTE. */
+	mm_dec_nr_ptes(mm);
+	pmd_clear(pmd);
+	orig_dst_pte = dst_pte = pte_alloc_map_lock(mm, pmd, addr, &dst_ptl);
+	if (unlikely(!dst_pte)) {
+		/* If allocation failed, restore COW-ed PTE. */
+		set_pmd_at(mm, addr, pmd, cowed_entry);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * We should hold the lock of the COW-ed PTE until all operations
+	 * are done, including duplicating and decreasing the refcount.
+	 */
+	src_pte = pte_offset_map_lock(mm, &cowed_entry, addr, &src_ptl);
+	orig_src_pte = src_pte;
+	arch_enter_lazy_mmu_mode();
+
+	/*
+	 * All the mapped pages in the COW-ed PTE are COW mappings. We can
+	 * set the entries and leave the rest to handle_pte_fault().
+	 */
+	do {
+		if (pte_none(*src_pte))
+			continue;
+		set_pte_at(mm, addr, dst_pte, *src_pte);
+	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+
+	arch_leave_lazy_mmu_mode();
+	pte_unmap_unlock(orig_dst_pte, dst_ptl);
+
+	/* Decrease the refcount of COW-ed PTE. */
+	if (!pmd_put_pte(&cowed_entry)) {
+		/*
+		 * COW-ed (old) PTE's refcount is 1. Now we have two PTEs
+		 * with the same content. Free the new one and reuse the
+		 * old one.
+		 */
+		pgtable_t token = pmd_pgtable(*pmd);
+		/* Reuse COW-ed PTE. */
+		pmd_t new = pmd_mkwrite(cowed_entry);
+
+		/* Clear all the entries of new PTE. */
+		addr = start;
+		dst_pte = pte_offset_map_lock(mm, pmd, addr, &dst_ptl);
+		orig_dst_pte = dst_pte;
+		do {
+			if (pte_none(*dst_pte))
+				continue;
+			if (pte_present(*dst_pte))
+				page_table_check_pte_clear(mm, addr, *dst_pte);
+			pte_clear(mm, addr, dst_pte);
+		} while (dst_pte++, addr += PAGE_SIZE, addr != end);
+		pte_unmap_unlock(orig_dst_pte, dst_ptl);
+		/* Now, we can safely free new PTE. */
+		pmd_clear(pmd);
+		pte_free(mm, token);
+		/* Reuse COW-ed PTE */
+		set_pmd_at(mm, start, pmd, new);
+	}
+
+	pte_unmap_unlock(orig_src_pte, src_ptl);
+
+flush_tlb:
+	/*
+	 * If we changed the protection, flush the TLB.
+	 * flush_tlb_range() only uses vma to get the mm, so we don't need
+	 * to worry about a mismatch between the address range and the vma.
+	 *
+	 * Should we flush the TLB while holding the pte lock?
+	 */
+	flush_tlb_range(vma, start, end);
+out:
+	raw_write_seqcount_end(&mm->write_protect_seq);
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+static inline int __break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+				  unsigned long addr)
+{
+	struct vm_fault vmf = {
+		.vma = vma,
+		.address = addr & PAGE_MASK,
+		.pmd = pmd,
+	};
+
+	return handle_cow_pte_fault(&vmf);
+}
+
+/**
+ * break_cow_pte - duplicate/reuse a shared, write-protected (COW-ed) PTE
+ * @vma: target vma of the break COW
+ * @pmd: pmd entry that maps the shared PTE table
+ * @addr: the address that triggered the break COW PTE
+ *
+ * Return: zero on success, < 0 otherwise.
+ *
+ * The address needs to be in the range of the shared and write-protected
+ * PTE table that the pmd entry maps. If pmd is NULL, the pmd is looked up
+ * from vma. Duplicate the COW-ed PTE table when others still map to it;
+ * otherwise, reuse it.
+ */
+int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr)
+{
+	struct mm_struct *mm;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+
+	if (!vma)
+		return -EINVAL;
+	mm = vma->vm_mm;
+
+	if (!test_bit(MMF_COW_PTE, &mm->flags))
+		return 0;
+
+	if (!pmd) {
+		pgd = pgd_offset(mm, addr);
+		if (pgd_none_or_clear_bad(pgd))
+			return 0;
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none_or_clear_bad(p4d))
+			return 0;
+		pud = pud_offset(p4d, addr);
+		if (pud_none_or_clear_bad(pud))
+			return 0;
+		pmd = pmd_offset(pud, addr);
+	}
+
+	/* We will check the type of pmd entry later. */
+
+	return __break_cow_pte(vma, pmd, addr);
+}
+
+/**
+ * break_cow_pte_range - duplicate/reuse COW-ed PTEs in a given range
+ * @vma: target vma of the break COW
+ * @start: start address of the range
+ * @end: end address of the range
+ *
+ * Return: zero on success, the number of failed PMD ranges otherwise.
+ */
+int break_cow_pte_range(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end)
+{
+	unsigned long addr, next;
+	int nr_failed = 0;
+
+	if (!range_in_vma(vma, start, end))
+		return -EINVAL;
+
+	addr = start;
+	do {
+		next = pmd_addr_end(addr, end);
+		if (break_cow_pte(vma, NULL, addr))
+			nr_failed++;
+	} while (addr = next, addr != end);
+
+	return nr_failed;
+}
+#endif /* CONFIG_COW_PTE */
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -5234,8 +5531,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			return do_fault(vmf);
 	}
 
-	if (!pte_present(vmf->orig_pte))
+	if (!pte_present(vmf->orig_pte)) {
+#ifdef CONFIG_COW_PTE
+		if (test_bit(MMF_COW_PTE, &vmf->vma->vm_mm->flags))
+			handle_cow_pte_fault(vmf);
+#endif
 		return do_swap_page(vmf);
+	}
 
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
@@ -5371,8 +5673,31 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				return 0;
 			}
 		}
+#ifdef CONFIG_COW_PTE
+		/*
+		 * Duplicate the COW-ed PTE when the page fault will change
+		 * the mapped pages (write or unshare fault) or the COW-ed
+		 * PTE itself (file-mapped read fault, see do_read_fault()).
+		 */
+		if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE) ||
+		      vma->vm_ops) && test_bit(MMF_COW_PTE, &mm->flags)) {
+			ret = handle_cow_pte_fault(&vmf);
+			if (unlikely(ret == -ENOMEM))
+				return VM_FAULT_OOM;
+		}
+#endif
 	}
 
+#ifdef CONFIG_COW_PTE
+	/*
+	 * It will definitely break the kernel if the refcount of the PTE
+	 * table is higher than 1 while the PMD entry is writable. But we
+	 * want to see more information, so just warn here.
+	 */
+	if (likely(!pmd_none(*vmf.pmd)))
+		VM_WARN_ON(cow_pte_count(vmf.pmd) > 1 && pmd_write(*vmf.pmd));
+#endif
+
 	return handle_pte_fault(&vmf);
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index 425a9349e610..ca16d7abcdb6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2208,6 +2208,10 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 			return err;
 	}
 
+	err = break_cow_pte(vma, NULL, addr);
+	if (err)
+		return err;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
diff --git a/mm/mremap.c b/mm/mremap.c
index 930f65c315c0..3fbc45e381cc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -534,6 +534,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
+		/* Do we flush the TLB twice here? */
+		break_cow_pte(vma, old_pmd, old_addr);
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4fa440e87cd6..92e39a722100 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1911,6 +1911,8 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
+		if (break_cow_pte(vma, pmd, addr))
+			return -ENOMEM;
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
 		if (ret)
 			return ret;
-- 
2.34.1



* [PATCH v4 04/14] mm/rmap: Break COW PTE in rmap walking
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (2 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 03/14] mm: Add break COW PTE fault and helper functions Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 05/14] mm/khugepaged: Break COW PTE before scanning pte Chih-En Lin
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Some features (unmap, migrate, device exclusive, mkclean, etc.) might
modify pte entries via rmap walking. Add a new page vma mapped walk
flag, PVMW_BREAK_COW_PTE, to tell the rmap walk to break COW PTE.
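
Callers opt in per walk; for example, with this patch
try_to_unmap_one() effectively does the following (sketch of the
resulting flow):

	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);

	while (page_vma_mapped_walk(&pvmw)) {
		/*
		 * The walk broke COW PTE before mapping the pte, so
		 * pvmw.pte now points into an exclusively owned PTE
		 * table and the entry can be modified safely.
		 */
	}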

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/rmap.h | 2 ++
 mm/migrate.c         | 3 ++-
 mm/page_vma_mapped.c | 4 ++++
 mm/rmap.c            | 9 +++++----
 mm/vmscan.c          | 3 ++-
 5 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..d0f07e551973 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -368,6 +368,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Break COW-ed PTE during walking */
+#define PVMW_BREAK_COW_PTE	(1 << 2)
 
 struct page_vma_mapped_walk {
 	unsigned long pfn;
diff --git a/mm/migrate.c b/mm/migrate.c
index a4d3fc65085f..04376ce05aa8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -183,7 +183,8 @@ void putback_movable_pages(struct list_head *l)
 static bool remove_migration_pte(struct folio *folio,
 		struct vm_area_struct *vma, unsigned long addr, void *old)
 {
-	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+	DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr,
+			      PVMW_SYNC | PVMW_MIGRATION | PVMW_BREAK_COW_PTE);
 
 	while (page_vma_mapped_walk(&pvmw)) {
 		rmap_t rmap_flags = RMAP_NONE;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..7b35e85b9964 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -251,6 +251,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
+		if (pvmw->flags & PVMW_BREAK_COW_PTE) {
+			if (break_cow_pte(vma, pvmw->pmd, pvmw->address))
+				return not_found(pvmw);
+		}
 		if (!map_pte(pvmw))
 			goto next_pte;
 this_pte:
diff --git a/mm/rmap.c b/mm/rmap.c
index b616870a09be..bce97496b1f6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1012,7 +1012,8 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma,
 			     unsigned long address, void *arg)
 {
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_SYNC);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
+			      PVMW_SYNC | PVMW_BREAK_COW_PTE);
 	int *cleaned = arg;
 
 	*cleaned += page_vma_mkclean_one(&pvmw);
@@ -1463,7 +1464,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
 	pte_t pteval;
 	struct page *subpage;
 	bool anon_exclusive, ret = true;
@@ -1834,7 +1835,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
 	pte_t pteval;
 	struct page *subpage;
 	bool anon_exclusive, ret = true;
@@ -2187,7 +2188,7 @@ static bool page_make_device_exclusive_one(struct folio *folio,
 		struct vm_area_struct *vma, unsigned long address, void *priv)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
 	struct make_exclusive_args *args = priv;
 	pte_t pteval;
 	struct page *subpage;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf3eedf0209c..15eda32146fd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1882,7 +1882,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 		/*
 		 * The folio is mapped into the page tables of one or more
-		 * processes. Try to unmap it here.
+		 * processes. Try to unmap it here. Since unmapping writes to
+		 * the page tables, also break COW PTE if they are COW-ed.
 		 */
 		if (folio_mapped(folio)) {
 			enum ttu_flags flags = TTU_BATCH_FLUSH;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 05/14] mm/khugepaged: Break COW PTE before scanning pte
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (3 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 04/14] mm/rmap: Break COW PTE in rmap walking Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 06/14] mm/ksm: Break COW PTE before modify shared PTE Chih-En Lin
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

We should not allow THP to collapse a COW-ed PTE table. So, break COW
PTE before collapse_pte_mapped_thp() collapses it into a THP. Also,
break COW PTE before khugepaged_scan_pmd() scans the PTE table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 35 +++++++++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 3e6fb05852f9..5f2c39f61521 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -13,6 +13,7 @@
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
 	EM( SCAN_PMD_NONE,		"pmd_none")			\
 	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
+	EM( SCAN_COW_PTE,		"cowed_pte")			\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")		\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 90acfea40c13..1cddc20318d5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -31,6 +31,7 @@ enum scan_result {
 	SCAN_PMD_NULL,
 	SCAN_PMD_NONE,
 	SCAN_PMD_MAPPED,
+	SCAN_COW_PTE,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
@@ -875,7 +876,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 		return SCAN_PMD_MAPPED;
 	if (pmd_devmap(pmde))
 		return SCAN_PMD_NULL;
-	if (pmd_bad(pmde))
+	if (pmd_write(pmde) && pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;
 }
@@ -926,6 +927,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 			pte_unmap(vmf.pte);
 			continue;
 		}
+		if (break_cow_pte(vma, pmd, address))
+			return SCAN_COW_PTE;
 		ret = do_swap_page(&vmf);
 
 		/*
@@ -1038,6 +1041,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
+	/* We should have already handled the COW-ed PTE. */
+	VM_WARN_ON(test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd));
+
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
@@ -1148,6 +1154,13 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	/* Break COW PTE before we collapse the pages. */
+	if (break_cow_pte(vma, pmd, address)) {
+		result = SCAN_COW_PTE;
+		goto out;
+	}
+
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -1206,6 +1219,10 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 			goto out_unmap;
 		}
 
+		/*
+		 * If we only triggered the break COW PTE, the page is usually
+		 * still in a COW mapping, so it may still be shared.
+		 */
 		if (page_mapcount(page) > 1) {
 			++shared;
 			if (cc->is_khugepaged &&
@@ -1501,6 +1518,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
+	/* We shouldn't let a COW-ed PTE table collapse. */
+	if (break_cow_pte(vma, pmd, haddr))
+		goto drop_hpage;
+	VM_WARN_ON(test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd));
+
 	/*
 	 * We need to lock the mapping so that from here on, only GUP-fast and
 	 * hardware page walks can access the parts of the page tables that
@@ -1706,6 +1728,11 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
 				result = SCAN_PTE_UFFD_WP;
 				goto unlock_next;
 			}
+			if (test_bit(MMF_COW_PTE, &mm->flags) &&
+			     !pmd_write(*pmd)) {
+				result = SCAN_COW_PTE;
+				goto unlock_next;
+			}
 			collapse_and_free_pmd(mm, vma, addr, pmd);
 			if (!cc->is_khugepaged && is_target)
 				result = set_huge_pmd(vma, addr, pmd, hpage);
@@ -2143,6 +2170,11 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	swap = 0;
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+	if (break_cow_pte(find_vma(mm, addr), NULL, addr)) {
+		result = SCAN_COW_PTE;
+		goto out;
+	}
+
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -2213,6 +2245,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	}
 	rcu_read_unlock();
 
+out:
 	if (result == SCAN_SUCCEED) {
 		if (cc->is_khugepaged &&
 		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 06/14] mm/ksm: Break COW PTE before modify shared PTE
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (4 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 05/14] mm/khugepaged: Break COW PTE before scanning pte Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 07/14] mm/madvise: Handle COW-ed PTE with madvise() Chih-En Lin
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Break COW PTE before merging a page that resides in a COW-ed PTE table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/ksm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index dd02780c387f..ce3887d3b04c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1045,7 +1045,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, 0);
+	DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, PVMW_BREAK_COW_PTE);
 	int swapped;
 	int err = -EFAULT;
 	struct mmu_notifier_range range;
@@ -1163,6 +1163,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	barrier();
 	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
 		goto out;
+	if (break_cow_pte(vma, pmd, addr))
+		goto out;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
 				addr + PAGE_SIZE);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 07/14] mm/madvise: Handle COW-ed PTE with madvise()
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (5 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 06/14] mm/ksm: Break COW PTE before modify shared PTE Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 08/14] mm/gup: Trigger break COW PTE before calling follow_pfn_pte() Chih-En Lin
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Break COW PTE if madvise() modifies a PTE entry of a COW-ed PTE table.
The following flags need to break COW PTE; however, flags like
MADV_HUGEPAGE and MADV_MERGEABLE are handled separately.

- MADV_DONTNEED: It calls zap_page_range(), which is already handled.
- MADV_FREE: It uses walk_page_range() with madvise_free_pte_range() to
	     free the pages by itself, so add break_cow_pte().
- MADV_REMOVE: Same as MADV_FREE, it removes the pages by itself, so
	       add break_cow_pte_range().
- MADV_COLD: Similar to MADV_FREE, break COW PTE before pageout.
- MADV_POPULATE: Let GUP deal with it.

A minimal userspace sketch of the MADV_FREE case follows the patch.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/madvise.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index b6ea204d4e23..8b815942f286 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -428,6 +428,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
+	if (break_cow_pte(vma, pmd, addr))
+		return 0;
+
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	flush_tlb_batched_pending(mm);
@@ -629,6 +632,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
+	/* Break COW PTE first; this should only allocate a new PTE table. */
+	if (break_cow_pte(vma, pmd, addr))
+		goto next;
+
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	flush_tlb_batched_pending(mm);
@@ -989,6 +996,12 @@ static long madvise_remove(struct vm_area_struct *vma,
 	if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE))
 		return -EACCES;
 
+	error = break_cow_pte_range(vma, start, end);
+	if (error < 0)
+		return error;
+	else if (error > 0)
+		return -ENOMEM;
+
 	offset = (loff_t)(start - vma->vm_start)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
-- 
2.34.1
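
As a rough userspace illustration of the MADV_FREE path above (a
sketch under assumptions, not part of the series): PR_SET_COW_PTE is
assumed to be the raw prctl value 65, as used by the test program
later in this thread. After a COW PTE fork, MADV_FREE in the child
reaches madvise_free_pte_range() and therefore must first break COW
on the shared PTE table.

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#define PR_SET_COW_PTE 65	/* assumed raw value from this thread's tests */

int main(void)
{
	size_t len = 64ul << 20;	/* 64 MB */
	size_t i;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	for (i = 0; i < len; i += 4096)
		p[i] = 1;			/* populate the PTE tables */
	prctl(PR_SET_COW_PTE, 0, 0, 0, 0);	/* arm COW PTE for the next fork */
	if (fork() == 0) {
		/* Child: the PTE tables are COW-shared with the parent,
		 * so madvise_free_pte_range() breaks COW PTE here. */
		madvise(p, len, MADV_FREE);
		_exit(0);
	}
	return 0;
}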


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 08/14] mm/gup: Trigger break COW PTE before calling follow_pfn_pte()
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (6 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 07/14] mm/madvise: Handle COW-ed PTE with madvise() Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 09/14] mm/mprotect: Break COW PTE before changing protection Chih-En Lin
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

In most cases, GUP will not modify the page table; follow_pfn_pte()
is the exception. To deal with COW PTE, trigger the break COW PTE
fault before calling follow_pfn_pte(). (A simplified control-flow
sketch follows the patch.)

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/gup.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..e702c0800105 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -545,7 +545,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
 		return ERR_PTR(-EINVAL);
-	if (unlikely(pmd_bad(*pmd)))
+	/* A COW-ed PTE table is write-protected, which can trigger pmd_bad(). */
+	if (unlikely(pmd_write(*pmd) && pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
 
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -588,6 +589,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		if (is_zero_pfn(pte_pfn(pte))) {
 			page = pte_page(pte);
 		} else {
+			if (test_bit(MMF_COW_PTE, &mm->flags) &&
+			    !pmd_write(*pmd)) {
+				page = ERR_PTR(-EMLINK);
+				goto out;
+			}
 			ret = follow_pfn_pte(vma, address, ptep, flags);
 			page = ERR_PTR(ret);
 			goto out;
-- 
2.34.1
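
For reference, a simplified control-flow sketch, illustrative only
and not the actual mm/gup.c code: -EMLINK tells __get_user_pages()
to resolve the access through the normal fault path, which is where
the COW PTE actually gets broken, avoiding the infinite page fault
loop mentioned in the cover letter.

	/* Heavily simplified sketch of the __get_user_pages() loop. */
retry:
	page = follow_page_mask(vma, address, foll_flags, &ctx);
	if (IS_ERR(page) && PTR_ERR(page) == -EMLINK) {
		/* Fault it in; break COW PTE happens in the fault path. */
		ret = faultin_page(vma, address, &foll_flags,
				   true /* unshare */, locked);
		if (!ret)
			goto retry;	/* the PTE table is now exclusive */
	}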


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 09/14] mm/mprotect: Break COW PTE before changing protection
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (7 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 08/14] mm/gup: Trigger break COW PTE before calling follow_pfn_pte() Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 10/14] mm/userfaultfd: Support COW PTE Chih-En Lin
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

If the PTE table is COW-ed, break it before changing the protection.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/mprotect.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 61cf60015a8b..8b18cd0e5c5e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -103,6 +103,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
+	if (break_cow_pte(vma, pmd, addr))
+		return 0;
+
 	/*
 	 * The pmd points to a regular pte so the pmd can't change
 	 * from under us even if the mmap_lock is only hold for
@@ -314,6 +317,12 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 		return 1;
 	if (pmd_trans_huge(pmdval))
 		return 0;
+	/*
+	 * If the entry points to a COW-ed PTE table, its write
+	 * protection bit will cause pmd_bad().
+	 */
+	if (!pmd_write(pmdval))
+		return 0;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
 		return 1;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 10/14] mm/userfaultfd: Support COW PTE
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (8 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 09/14] mm/mprotect: Break COW PTE before changing protection Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 11/14] mm/migrate_device: " Chih-En Lin
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

If uffd fills the zeropage or installs a PTE into a COW-ed PTE table,
break COW PTE first.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/userfaultfd.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0499907b6f1a..3f66aa3eb54f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,6 +70,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	struct inode *inode;
 	pgoff_t offset, max_off;
 
+	if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
+		return -ENOMEM;
+
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
 	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
@@ -229,6 +232,9 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	pgoff_t offset, max_off;
 	struct inode *inode;
 
+	if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
+		return -ENOMEM;
+
 	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
 					 dst_vma->vm_page_prot));
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 11/14] mm/migrate_device: Support COW PTE
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (9 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 10/14] mm/userfaultfd: Support COW PTE Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 12/14] fs/proc: Support COW PTE with clear_refs_write Chih-En Lin
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Break COW PTE before collecting the pages mapped by a COW-ed PTE table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/migrate_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 721b2365dbca..2930e591e8fc 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -106,6 +106,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		}
 	}
 
+	if (break_cow_pte_range(vma, start, end))
+		return migrate_vma_collect_skip(start, end, walk);
 	if (unlikely(pmd_bad(*pmdp)))
 		return migrate_vma_collect_skip(start, end, walk);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 12/14] fs/proc: Support COW PTE with clear_refs_write
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (10 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 11/14] mm/migrate_device: " Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 13/14] events/uprobes: Break COW PTE before replacing page Chih-En Lin
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Before clearing an entry in a COW-ed PTE table, break COW PTE first.
Only the soft-dirty clearing path needs this; a minimal userspace
sketch follows the patch.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 fs/proc/task_mmu.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index af1c49ae11b1..94958422aede 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1196,6 +1196,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
+	/* Only break COW when we modify the soft-dirty bit. */
+	if (cp->type == CLEAR_REFS_SOFT_DIRTY &&
+	    break_cow_pte(vma, pmd, addr))
+		return 0;
+
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
-- 
2.34.1
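
A minimal userspace sketch of the only path that needs this break:
writing "4" to /proc/<pid>/clear_refs selects CLEAR_REFS_SOFT_DIRTY.
This is the standard soft-dirty interface; nothing below is specific
to the series.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/clear_refs", "w");

	if (!f)
		return 1;
	fputs("4", f);	/* CLEAR_REFS_SOFT_DIRTY: COW PTE is broken first */
	fclose(f);
	return 0;
}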


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 13/14] events/uprobes: Break COW PTE before replacing page
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (11 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 12/14] fs/proc: Support COW PTE with clear_refs_write Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-07  3:51 ` [PATCH v4 14/14] mm: fork: Enable COW PTE to fork system call Chih-En Lin
  2023-02-09 18:15 ` [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Pasha Tatashin
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

Break COW PTE if we want to replace a page which resides in a
COW-ed PTE table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d9e357b7e17c..2956a53da01a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -157,7 +157,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	struct folio *old_folio = page_folio(old_page);
 	struct folio *new_folio;
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, PVMW_BREAK_COW_PTE);
 	int err;
 	struct mmu_notifier_range range;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v4 14/14] mm: fork: Enable COW PTE to fork system call
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (12 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 13/14] events/uprobes: Break COW PTE before replacing page Chih-En Lin
@ 2023-02-07  3:51 ` Chih-En Lin
  2023-02-09 18:15 ` [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Pasha Tatashin
  14 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-07  3:51 UTC (permalink / raw)
  To: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song
  Cc: Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Pasha Tatashin, Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng, Chih-En Lin

This patch enables the Copy-On-Write (COW) mechanism for the PTE table
in the fork system call. To let a process do a COW PTE fork, use
prctl(PR_SET_COW_PTE); it sets the MMF_COW_PTE_READY flag on the
process, which enables COW PTE during the next fork. (A minimal usage
sketch follows the patch.)

It uses the MMF_COW_PTE flag to distinguish the normal page table
from the COW one. Moreover, it is difficult to tell whether all the
page tables have left the COW state, so the MMF_COW_PTE flag won't be
cleared after it is set.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 kernel/fork.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..94c35c8b31b1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2678,6 +2678,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 			trace = 0;
 	}
 
+#ifdef CONFIG_COW_PTE
+	if (current->mm && test_bit(MMF_COW_PTE_READY, &current->mm->flags)) {
+		clear_bit(MMF_COW_PTE_READY, &current->mm->flags);
+		set_bit(MMF_COW_PTE, &current->mm->flags);
+	}
+#endif
+
 	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();
 
-- 
2.34.1
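
A minimal usage sketch (assuming PR_SET_COW_PTE is the raw prctl
value 65 used by the test program later in this thread): the prctl
only sets MMF_COW_PTE_READY, and every fork() after that runs with
COW PTE, since MMF_COW_PTE is never cleared.

#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_COW_PTE
#define PR_SET_COW_PTE 65	/* assumed value, matching this thread's tests */
#endif

int main(void)
{
	prctl(PR_SET_COW_PTE, 0, 0, 0, 0);	/* sets MMF_COW_PTE_READY */
	if (fork() == 0)	/* this fork COW-shares the PTE tables */
		_exit(0);
	return 0;
}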


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (13 preceding siblings ...)
  2023-02-07  3:51 ` [PATCH v4 14/14] mm: fork: Enable COW PTE to fork system call Chih-En Lin
@ 2023-02-09 18:15 ` Pasha Tatashin
  2023-02-10  2:17   ` Chih-En Lin
  14 siblings, 1 reply; 37+ messages in thread
From: Pasha Tatashin @ 2023-02-09 18:15 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin <shiyn.lin@gmail.com> wrote:
>
> v3 -> v4
> - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
>   s390 and powerpc32, don't support the PMD entry and PTE table
>   operations.
> - Fix unmatch type of break_cow_pte_range() in
>   migrate_vma_collect_pmd().
> - Don’t break COW PTE in folio_referenced_one().
> - Fix the wrong VMA range checking in break_cow_pte_range().
> - Only break COW when we modify the soft-dirty bit in
>   clear_refs_pte_range().
> - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
> - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
>   tlb_flush_pmd_range().
> - Handle VM_DONTCOPY with COW PTE fork.
> - Fix the wrong address and invalid vma in recover_pte_range().
> - Fix the infinite page fault loop in GUP routine.
>   In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
>   handler, we return -EMLINK to let the GUP handles the page fault
>   (call faultin_page() in __get_user_pages()).
> - return not_found(pvmw) if the break COW PTE failed in
>   page_vma_mapped_walk().
> - Since COW PTE has the same result as the normal COW selftest, it
>   probably passed the COW selftest.
>
>         # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
>         not ok 33 No leak from parent into child
>         # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
>         not ok 44 No leak from parent into child
>         # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
>         not ok 55 No leak from child into parent
>         # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
>         not ok 66 No leak from child into parent
>
>         Bail out! 4 out of 147 tests failed
>         # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
>   See the more information about anon cow hugetlb tests:
>     https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/
>
>
> v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/
>
> RFC v2 -> v3
> - Change the sysctl with PID to prctl(PR_SET_COW_PTE).
> - Account all the COW PTE mapped pages in fork() instead of defer it to
>   page fault (break COW PTE).
> - If there is an unshareable mapped page (maybe pinned or private
>   device), recover all the entries that are already handled by COW PTE
>   fork, then copy to the new one.
> - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP,
>   follow_pfn_pte().
> - Remove the PTE ownership since we don't need it.
> - Use pte lock to protect the break COW PTE and free COW-ed PTE.
> - Do TLB flushing in break COW PTE handler.
> - Handle THP, KSM, madvise, mprotect, uffd and migrate device.
> - Handle the replacement page of uprobe.
> - Handle the clear_refs_write() of fs/proc.
> - All of the benchmarks dropped since the accounting and pte lock.
>   The benchmarks of v3 is worse than RFC v2, most of the cases are
>   similar to the normal fork, but there still have an use case
>   (TriforceAFL) is better than the normal fork version.
>
> RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/
>
> RFC v1 -> RFC v2
> - Change the clone flag method to sysctl with PID.
> - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
>   MMF_COW_PTE_READY, for the sysctl.
> - Change the owner pointer to use the folio padding.
> - Handle all the VMAs that cover the PTE table when doing the break COW PTE.
> - Remove the self-defined refcount to use the _refcount for the page
>   table page.
> - Add the exclusive flag to let the page table only own by one task in
>   some situations.
> - Invalidate address range MMU notifier and start the write_seqcount
>   when doing the break COW PTE.
> - Handle the swap cache and swapoff.
>
> RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/
>
> ---
>
> Currently, copy-on-write is only used for the mapped memory; the child
> process still needs to copy the entire page table from the parent
> process during forking. The parent process might take a lot of time and
> memory to copy the page table when the parent has a big page table
> allocated. For example, the memory usage of a process after forking with
> 1 GB mapped memory is as follows:

For some reason, I was not able to reproduce performance improvements
with a simple fork() performance measurement program. The results that
I saw are the following:

Base:
Fork latency per gigabyte: 0.004416 seconds
Fork latency per gigabyte: 0.004382 seconds
Fork latency per gigabyte: 0.004442 seconds
COW kernel:
Fork latency per gigabyte: 0.004524 seconds
Fork latency per gigabyte: 0.004764 seconds
Fork latency per gigabyte: 0.004547 seconds

AMD EPYC 7B12 64-Core Processor
Base:
Fork latency per gigabyte: 0.003923 seconds
Fork latency per gigabyte: 0.003909 seconds
Fork latency per gigabyte: 0.003955 seconds
COW kernel:
Fork latency per gigabyte: 0.004221 seconds
Fork latency per gigabyte: 0.003882 seconds
Fork latency per gigabyte: 0.003854 seconds

Given that the page table for the child is not copied, I was
expecting the performance to be better with the COW kernel, and also
not to depend on the size of the parent.

Test program:

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>

#define USEC    1000000
#define GIG     (1ul << 30)
#define NGIG    32
#define SIZE    (NGIG * GIG)
#define NPROC   16

void main() {
        int page_size = getpagesize();
        struct timeval start, end;
        long duration, i;
        char *p;

        p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }
        madvise(p, SIZE, MADV_NOHUGEPAGE);

        /* Touch every page */
        for (i = 0; i < SIZE; i += page_size)
                p[i] = 0;

        gettimeofday(&start, NULL);
        for (i = 0; i < NPROC; i++) {
                int pid = fork();

                if (pid == 0) {
                        sleep(30);
                        exit(0);
                }
        }
        gettimeofday(&end, NULL);
        /* Normalize per proc and per gig */
        duration = ((end.tv_sec - start.tv_sec) * USEC
                + (end.tv_usec - start.tv_usec)) / NPROC / NGIG;
        printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
                duration / USEC, duration % USEC);
}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-09 18:15 ` [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Pasha Tatashin
@ 2023-02-10  2:17   ` Chih-En Lin
  2023-02-10 16:21     ` Pasha Tatashin
  0 siblings, 1 reply; 37+ messages in thread
From: Chih-En Lin @ 2023-02-10  2:17 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Fri, Feb 10, 2023 at 2:16 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Mon, Feb 6, 2023 at 10:52 PM Chih-En Lin <shiyn.lin@gmail.com> wrote:
> >
> > v3 -> v4
> > - Add Kconfig, CONFIG_COW_PTE, since some of the architectures, e.g.,
> >   s390 and powerpc32, don't support the PMD entry and PTE table
> >   operations.
> > - Fix unmatch type of break_cow_pte_range() in
> >   migrate_vma_collect_pmd().
> > - Don’t break COW PTE in folio_referenced_one().
> > - Fix the wrong VMA range checking in break_cow_pte_range().
> > - Only break COW when we modify the soft-dirty bit in
> >   clear_refs_pte_range().
> > - Handle do_swap_page() with COW PTE in mm/memory.c and mm/khugepaged.c.
> > - Change the tlb flush from flush_tlb_mm_range() (x86 specific) to
> >   tlb_flush_pmd_range().
> > - Handle VM_DONTCOPY with COW PTE fork.
> > - Fix the wrong address and invalid vma in recover_pte_range().
> > - Fix the infinite page fault loop in GUP routine.
> >   In mm/gup.c:follow_pfn_pte(), instead of calling the break COW PTE
> >   handler, we return -EMLINK to let the GUP handles the page fault
> >   (call faultin_page() in __get_user_pages()).
> > - return not_found(pvmw) if the break COW PTE failed in
> >   page_vma_mapped_walk().
> > - Since COW PTE has the same result as the normal COW selftest, it
> >   probably passed the COW selftest.
> >
> >         # [RUN] vmsplice() + unmap in child ... with hugetlb (2048 kB)
> >         not ok 33 No leak from parent into child
> >         # [RUN] vmsplice() + unmap in child with mprotect() optimization ... with hugetlb (2048 kB)
> >         not ok 44 No leak from parent into child
> >         # [RUN] vmsplice() before fork(), unmap in parent after fork() ... with hugetlb (2048 kB)
> >         not ok 55 No leak from child into parent
> >         # [RUN] vmsplice() + unmap in parent after fork() ... with hugetlb (2048 kB)
> >         not ok 66 No leak from child into parent
> >
> >         Bail out! 4 out of 147 tests failed
> >         # Totals: pass:143 fail:4 xfail:0 xpass:0 skip:0 error:0
> >   See the more information about anon cow hugetlb tests:
> >     https://patchwork.kernel.org/project/linux-mm/patch/20220927110120.106906-5-david@redhat.com/
> >
> >
> > v3: https://lore.kernel.org/linux-mm/20221220072743.3039060-1-shiyn.lin@gmail.com/T/
> >
> > RFC v2 -> v3
> > - Change the sysctl with PID to prctl(PR_SET_COW_PTE).
> > - Account all the COW PTE mapped pages in fork() instead of defer it to
> >   page fault (break COW PTE).
> > - If there is an unshareable mapped page (maybe pinned or private
> >   device), recover all the entries that are already handled by COW PTE
> >   fork, then copy to the new one.
> > - Remove COW_PTE_OWNER_EXCLUSIVE flag and handle the only case of GUP,
> >   follow_pfn_pte().
> > - Remove the PTE ownership since we don't need it.
> > - Use pte lock to protect the break COW PTE and free COW-ed PTE.
> > - Do TLB flushing in break COW PTE handler.
> > - Handle THP, KSM, madvise, mprotect, uffd and migrate device.
> > - Handle the replacement page of uprobe.
> > - Handle the clear_refs_write() of fs/proc.
> > - All of the benchmarks dropped since the accounting and pte lock.
> >   The benchmarks of v3 is worse than RFC v2, most of the cases are
> >   similar to the normal fork, but there still have an use case
> >   (TriforceAFL) is better than the normal fork version.
> >
> > RFC v2: https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/
> >
> > RFC v1 -> RFC v2
> > - Change the clone flag method to sysctl with PID.
> > - Change the MMF_COW_PGTABLE flag to two flags, MMF_COW_PTE and
> >   MMF_COW_PTE_READY, for the sysctl.
> > - Change the owner pointer to use the folio padding.
> > - Handle all the VMAs that cover the PTE table when doing the break COW PTE.
> > - Remove the self-defined refcount to use the _refcount for the page
> >   table page.
> > - Add the exclusive flag to let the page table only own by one task in
> >   some situations.
> > - Invalidate address range MMU notifier and start the write_seqcount
> >   when doing the break COW PTE.
> > - Handle the swap cache and swapoff.
> >
> > RFC v1: https://lore.kernel.org/all/20220519183127.3909598-1-shiyn.lin@gmail.com/
> >
> > ---
> >
> > Currently, copy-on-write is only used for the mapped memory; the child
> > process still needs to copy the entire page table from the parent
> > process during forking. The parent process might take a lot of time and
> > memory to copy the page table when the parent has a big page table
> > allocated. For example, the memory usage of a process after forking with
> > 1 GB mapped memory is as follows:
>
> For some reason, I was not able to reproduce performance improvements
> with a simple fork() performance measurement program. The results that
> I saw are the following:
>
> Base:
> Fork latency per gigabyte: 0.004416 seconds
> Fork latency per gigabyte: 0.004382 seconds
> Fork latency per gigabyte: 0.004442 seconds
> COW kernel:
> Fork latency per gigabyte: 0.004524 seconds
> Fork latency per gigabyte: 0.004764 seconds
> Fork latency per gigabyte: 0.004547 seconds
>
> AMD EPYC 7B12 64-Core Processor
> Base:
> Fork latency per gigabyte: 0.003923 seconds
> Fork latency per gigabyte: 0.003909 seconds
> Fork latency per gigabyte: 0.003955 seconds
> COW kernel:
> Fork latency per gigabyte: 0.004221 seconds
> Fork latency per gigabyte: 0.003882 seconds
> Fork latency per gigabyte: 0.003854 seconds
>
> Given, that page table for child is not copied, I was expecting the
> performance to be better with COW kernel, and also not to depend on
> the size of the parent.

Yes, the child won't duplicate the page table, but fork will still
traverse all the page table entries to do the accounting.
And, since this patch extends COW to the PTE table level, it is no
longer mapped-page (page table entry) grained, so we have to
guarantee that every mapped page in such a table is eligible for COW
mapping.
This kind of checking also costs some time.
As a result, because of the accounting and the checking, the COW PTE
fork still depends on the size of the parent, so the improvement
might not be significant.
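
To make the cost concrete, a rough illustrative sketch of the
per-entry work (the helpers below are hypothetical stand-ins, not
the functions in this series):

	/*
	 * Illustrative only: the PTE table itself is shared, but every
	 * present entry is still visited once, so fork stays O(N) in
	 * the number of mapped pages.
	 */
	for (; addr < end; addr += PAGE_SIZE, pte++) {
		if (!pte_present(*pte))
			continue;
		/* Unshareable pages (e.g. pinned) force recovery + copy. */
		if (!cow_pte_shareable(pte))		/* hypothetical */
			return recover_and_copy_pte();	/* hypothetical */
		account_cow_pte_page(pte);		/* hypothetical */
	}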

Actually, in RFC v1 and v2, we proposed a version that skips this
work, and we got a significant improvement. You can see the numbers
in the RFC v2 cover letter [1]:
"In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
for normal fork"

However, it might break the existing logic of the refcount/mapcount of
the page and destabilize the system.

[1] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/#me2340d963c2758a2561c39cb3baf42c478dfe548
[2] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/#mbc33221f00c7cf3d71839b45fc23862a5dac3014

> Test program:
>
> #include <time.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/time.h>
> #include <sys/mman.h>
> #include <sys/types.h>
>
> #define USEC    1000000
> #define GIG     (1ul << 30)
> #define NGIG    32
> #define SIZE    (NGIG * GIG)
> #define NPROC   16
>
> void main() {
>         int page_size = getpagesize();
>         struct timeval start, end;
>         long duration, i;
>         char *p;
>
>         p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (p == MAP_FAILED) {
>                 perror("mmap");
>                 exit(1);
>         }
>         madvise(p, SIZE, MADV_NOHUGEPAGE);
>
>         /* Touch every page */
>         for (i = 0; i < SIZE; i += page_size)
>                 p[i] = 0;
>
>         gettimeofday(&start, NULL);
>         for (i = 0; i < NPROC; i++) {
>                 int pid = fork();
>
>                 if (pid == 0) {
>                         sleep(30);
>                         exit(0);
>                 }
>         }
>         gettimeofday(&end, NULL);
>         /* Normolize per proc and per gig */
>         duration = ((end.tv_sec - start.tv_sec) * USEC
>                 + (end.tv_usec - start.tv_usec)) / NPROC / NGIG;
>         printf("Fork latency per gigabyte: %ld.%06ld seconds\n",
>                 duration / USEC, duration % USEC);
> }

I'm not sure that taking only a few measurements is enough.
So, I rewrote your test program to run multiple times, measuring a
single fork each time, and took the average:
fork.log: 0.000498
odfork.log: 0.000469

Test program:

#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <sys/prctl.h>

#define USEC 1000000
#define GIG (1ul << 30)
#define NGIG 4
#define SIZE (NGIG * GIG)
#define NPROC 16

int main(void)
{
    unsigned int i = 0;
    unsigned long j = 0;
    int pid, page_size = getpagesize();
    struct timeval start, end;
    long duration;
    char *p;

    prctl(65, 0, 0, 0, 0);

    for (i = 0; i < NPROC; i++) {
        p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }
        madvise(p, SIZE, MADV_NOHUGEPAGE);
        /* Touch every page */
        for (j = 0; j < SIZE; j += page_size)
            p[j] = 0;

        gettimeofday(&start, NULL);
        pid = fork();
        switch (pid) {
        case -1:
            perror("fork");
            exit(1);
        case 0: /* child */
            return 0;
        default: /* parent */
            gettimeofday(&end, NULL);
            duration = ((end.tv_sec - start.tv_sec) * USEC +
                        (end.tv_usec - start.tv_usec)) /
                       NPROC / NGIG;
            // seconds
            printf("%ld.%06ld\n", duration / USEC, duration % USEC);
            waitpid(pid, NULL, 0);
            munmap(p, SIZE);
            p = NULL;
        }
    }
}

Script:

import numpy

def calc_mean(file):
    np_tmp = numpy.loadtxt(file, usecols=range(0,1))
    print("{}: {:6f}".format(file, np_tmp.mean()))

calc_mean("fork.log")
calc_mean("odfork.log")

I didn't make the memory size and process number bigger because it ran
on my laptop, and I can't access my server for some reason.

Thanks,
Chih-En Lin


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-10  2:17   ` Chih-En Lin
@ 2023-02-10 16:21     ` Pasha Tatashin
  2023-02-10 17:20       ` Chih-En Lin
  0 siblings, 1 reply; 37+ messages in thread
From: Pasha Tatashin @ 2023-02-10 16:21 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

> > > Currently, copy-on-write is only used for the mapped memory; the child
> > > process still needs to copy the entire page table from the parent
> > > process during forking. The parent process might take a lot of time and
> > > memory to copy the page table when the parent has a big page table
> > > allocated. For example, the memory usage of a process after forking with
> > > 1 GB mapped memory is as follows:
> >
> > For some reason, I was not able to reproduce performance improvements
> > with a simple fork() performance measurement program. The results that
> > I saw are the following:
> >
> > Base:
> > Fork latency per gigabyte: 0.004416 seconds
> > Fork latency per gigabyte: 0.004382 seconds
> > Fork latency per gigabyte: 0.004442 seconds
> > COW kernel:
> > Fork latency per gigabyte: 0.004524 seconds
> > Fork latency per gigabyte: 0.004764 seconds
> > Fork latency per gigabyte: 0.004547 seconds
> >
> > AMD EPYC 7B12 64-Core Processor
> > Base:
> > Fork latency per gigabyte: 0.003923 seconds
> > Fork latency per gigabyte: 0.003909 seconds
> > Fork latency per gigabyte: 0.003955 seconds
> > COW kernel:
> > Fork latency per gigabyte: 0.004221 seconds
> > Fork latency per gigabyte: 0.003882 seconds
> > Fork latency per gigabyte: 0.003854 seconds
> >
> > Given that the page table for the child is not copied, I was expecting
> > the performance to be better with the COW kernel, and also not to
> > depend on the size of the parent.
>
> Yes, the child won't duplicate the page table, but fork will still
> traverse all the page table entries to do the accounting.
> And, since this patch extends COW to the PTE table level, it is no
> longer grained at the individual mapped page (page table entry), so we
> have to guarantee that all the mapped pages are available to do COW
> mapping in such a page table.
> This kind of checking also costs some time.
> As a result, because of the accounting and the checking, the COW PTE
> fork still depends on the size of the parent, so the improvement might
> not be significant.

The current version of the series does not provide any performance
improvements for fork(). I would recommend removing claims from the
cover letter about better fork() performance, as this may be
misleading for those looking for a way to speed up forking. In my
case, I was looking to speed up Redis OSS, which relies on fork() to
create consistent snapshots for driving replicas/backups. The O(N)
per-page operation causes fork() to be slow, so I was hoping that this
series, which does not duplicate the VA during fork(), would make the
operation much quicker.
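
For a concrete picture, the fork()-based snapshot pattern in question
looks roughly like this -- a minimal sketch, not actual Redis code:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/* The child inherits a COW view of the parent's heap, so it can
 * serialize a consistent state while the parent keeps serving.
 * The cost today is the O(mapped pages) page table copy in fork(). */
static void snapshot(const char *data, size_t len, const char *path)
{
        pid_t pid = fork();

        if (pid == 0) {
                FILE *f = fopen(path, "w");

                if (f) {
                        fwrite(data, 1, len, f);
                        fclose(f);
                }
                _exit(0);
        }
        /* parent returns immediately and keeps mutating its data */
}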

> Actually, at the RFC v1 and v2, we proposed the version of skipping
> those works, and we got a significant improvement. You can see the
> number from RFC v2 cover letter [1]:
> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> for normal fork"

I suspect the 93% improvement (when the mapcount was not updated) was
only for VAs with 4K pages. With 2M mappings this series did not
provide any benefit, is this correct?
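
(For scale: 1 GB of 4K mappings on x86-64 spans 512 PTE tables times
512 entries each, i.e. 262144 PTEs whose copying the sharing could
skip; the same 1 GB in 2M mappings is just 512 PMD entries with no PTE
tables underneath, so there is nothing left to deduplicate.)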

>
> However, it might break the existing logic of the refcount/mapcount of
> the page and destabilize the system.

This makes sense.

> [1] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/#me2340d963c2758a2561c39cb3baf42c478dfe548
> [2] https://lore.kernel.org/linux-mm/20220927162957.270460-1-shiyn.lin@gmail.com/T/#mbc33221f00c7cf3d71839b45fc23862a5dac3014

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-10 16:21     ` Pasha Tatashin
@ 2023-02-10 17:20       ` Chih-En Lin
  2023-02-10 19:02         ` Chih-En Lin
  2023-02-14  9:58         ` David Hildenbrand
  0 siblings, 2 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-10 17:20 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
> > > > Currently, copy-on-write is only used for the mapped memory; the child
> > > > process still needs to copy the entire page table from the parent
> > > > process during forking. The parent process might take a lot of time and
> > > > memory to copy the page table when the parent has a big page table
> > > > allocated. For example, the memory usage of a process after forking with
> > > > 1 GB mapped memory is as follows:
> > >
> > > For some reason, I was not able to reproduce performance improvements
> > > with a simple fork() performance measurement program. The results that
> > > I saw are the following:
> > >
> > > Base:
> > > Fork latency per gigabyte: 0.004416 seconds
> > > Fork latency per gigabyte: 0.004382 seconds
> > > Fork latency per gigabyte: 0.004442 seconds
> > > COW kernel:
> > > Fork latency per gigabyte: 0.004524 seconds
> > > Fork latency per gigabyte: 0.004764 seconds
> > > Fork latency per gigabyte: 0.004547 seconds
> > >
> > > AMD EPYC 7B12 64-Core Processor
> > > Base:
> > > Fork latency per gigabyte: 0.003923 seconds
> > > Fork latency per gigabyte: 0.003909 seconds
> > > Fork latency per gigabyte: 0.003955 seconds
> > > COW kernel:
> > > Fork latency per gigabyte: 0.004221 seconds
> > > Fork latency per gigabyte: 0.003882 seconds
> > > Fork latency per gigabyte: 0.003854 seconds
> > >
> > > Given that the page table for the child is not copied, I was expecting
> > > the performance to be better with the COW kernel, and also not to
> > > depend on the size of the parent.
> >
> > Yes, the child won't duplicate the page table, but fork will still
> > traverse all the page table entries to do the accounting.
> > And, since this patch extends COW to the PTE table level, it is no
> > longer grained at the individual mapped page (page table entry), so we
> > have to guarantee that all the mapped pages are available to do COW
> > mapping in such a page table.
> > This kind of checking also costs some time.
> > As a result, because of the accounting and the checking, the COW PTE
> > fork still depends on the size of the parent, so the improvement might
> > not be significant.
> 
> The current version of the series does not provide any performance
> improvements for fork(). I would recommend removing claims from the
> cover letter about better fork() performance, as this may be
> misleading for those looking for a way to speed up forking. In my

From v3 to v4, I changed the implementation of the COW fork() part to do
the accounting and checking. At the time, I also removed most of the
descriptions about the better fork() performance. Maybe that's not
enough and it's still somewhat misleading. I will fix this in the next
version.
Thanks.

> case, I was looking to speed up Redis OSS, which relies on fork() to
> create consistent snapshots for driving replicas/backups. The O(N)
> per-page operation causes fork() to be slow, so I was hoping that this
> series, which does not duplicate the VA during fork(), would make the
> operation much quicker.

Indeed, at first, I tried to avoid the O(N) per-page operation by
deferring the accounting and the swap stuff to the page fault. But,
as I mentioned, it's not suitable for the mainline.

Honestly, for improving the fork(), I have an idea to skip the per-page
operation without breaking the logic. However, this would introduce a
complicated mechanism and may add overhead for other features. It
might not be worth it. It's hard to strike a balance between an
over-complicated mechanism with (probably) better performance and
keeping the data consistent with the page status. So, I would focus
on the safe and stable approach at first.

> > Actually, at the RFC v1 and v2, we proposed the version of skipping
> > those works, and we got a significant improvement. You can see the
> > number from RFC v2 cover letter [1]:
> > "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> > for normal fork"
> 
> I suspect the 93% improvement (when the mapcount was not updated) was
> only for VAs with 4K pages. With 2M mappings this series did not
> provide any benefit, is this correct?

Yes. In this case, the COW PTE performance is similar to the normal
fork().

> >
> > However, it might break the existing logic of the refcount/mapcount of
> > the page and destabilize the system.
> 
> This makes sense.

;)

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-10 17:20       ` Chih-En Lin
@ 2023-02-10 19:02         ` Chih-En Lin
  2023-02-14  9:58         ` David Hildenbrand
  1 sibling, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-10 19:02 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Andrew Morton, Qi Zheng, David Hildenbrand,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Sat, Feb 11, 2023 at 01:20:10AM +0800, Chih-En Lin wrote:
> On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
> > > > > Currently, copy-on-write is only used for the mapped memory; the child
> > > > > process still needs to copy the entire page table from the parent
> > > > > process during forking. The parent process might take a lot of time and
> > > > > memory to copy the page table when the parent has a big page table
> > > > > allocated. For example, the memory usage of a process after forking with
> > > > > 1 GB mapped memory is as follows:
> > > >
> > > > For some reason, I was not able to reproduce performance improvements
> > > > with a simple fork() performance measurement program. The results that
> > > > I saw are the following:
> > > >
> > > > Base:
> > > > Fork latency per gigabyte: 0.004416 seconds
> > > > Fork latency per gigabyte: 0.004382 seconds
> > > > Fork latency per gigabyte: 0.004442 seconds
> > > > COW kernel:
> > > > Fork latency per gigabyte: 0.004524 seconds
> > > > Fork latency per gigabyte: 0.004764 seconds
> > > > Fork latency per gigabyte: 0.004547 seconds
> > > >
> > > > AMD EPYC 7B12 64-Core Processor
> > > > Base:
> > > > Fork latency per gigabyte: 0.003923 seconds
> > > > Fork latency per gigabyte: 0.003909 seconds
> > > > Fork latency per gigabyte: 0.003955 seconds
> > > > COW kernel:
> > > > Fork latency per gigabyte: 0.004221 seconds
> > > > Fork latency per gigabyte: 0.003882 seconds
> > > > Fork latency per gigabyte: 0.003854 seconds
> > > >
> > > > Given that the page table for the child is not copied, I was expecting
> > > > the performance to be better with the COW kernel, and also not to
> > > > depend on the size of the parent.
> > >
> > > Yes, the child won't duplicate the page table, but fork will still
> > > traverse all the page table entries to do the accounting.
> > > And, since this patch extends COW to the PTE table level, it is no
> > > longer grained at the individual mapped page (page table entry), so we
> > > have to guarantee that all the mapped pages are available to do COW
> > > mapping in such a page table.
> > > This kind of checking also costs some time.
> > > As a result, because of the accounting and the checking, the COW PTE
> > > fork still depends on the size of the parent, so the improvement might
> > > not be significant.
> > 
> > The current version of the series does not provide any performance
> > improvements for fork(). I would recommend removing claims from the
> > cover letter about better fork() performance, as this may be
> > misleading for those looking for a way to speed up forking. In my
> 
> From v3 to v4, I changed the implementation of the COW fork() part to do

Sorry, it's "RFC v2 to v3".

> the accounting and checking. At the time, I also removed most of the
> descriptions about the better fork() performance. Maybe that's not
> enough and it's still somewhat misleading. I will fix this in the next
> version.
> Thanks.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-10 17:20       ` Chih-En Lin
  2023-02-10 19:02         ` Chih-En Lin
@ 2023-02-14  9:58         ` David Hildenbrand
  2023-02-14 13:07           ` Pasha Tatashin
                             ` (2 more replies)
  1 sibling, 3 replies; 37+ messages in thread
From: David Hildenbrand @ 2023-02-14  9:58 UTC (permalink / raw)
  To: Chih-En Lin, Pasha Tatashin
  Cc: Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On 10.02.23 18:20, Chih-En Lin wrote:
> On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
>>>>> Currently, copy-on-write is only used for the mapped memory; the child
>>>>> process still needs to copy the entire page table from the parent
>>>>> process during forking. The parent process might take a lot of time and
>>>>> memory to copy the page table when the parent has a big page table
>>>>> allocated. For example, the memory usage of a process after forking with
>>>>> 1 GB mapped memory is as follows:
>>>>
>>>> For some reason, I was not able to reproduce performance improvements
>>>> with a simple fork() performance measurement program. The results that
>>>> I saw are the following:
>>>>
>>>> Base:
>>>> Fork latency per gigabyte: 0.004416 seconds
>>>> Fork latency per gigabyte: 0.004382 seconds
>>>> Fork latency per gigabyte: 0.004442 seconds
>>>> COW kernel:
>>>> Fork latency per gigabyte: 0.004524 seconds
>>>> Fork latency per gigabyte: 0.004764 seconds
>>>> Fork latency per gigabyte: 0.004547 seconds
>>>>
>>>> AMD EPYC 7B12 64-Core Processor
>>>> Base:
>>>> Fork latency per gigabyte: 0.003923 seconds
>>>> Fork latency per gigabyte: 0.003909 seconds
>>>> Fork latency per gigabyte: 0.003955 seconds
>>>> COW kernel:
>>>> Fork latency per gigabyte: 0.004221 seconds
>>>> Fork latency per gigabyte: 0.003882 seconds
>>>> Fork latency per gigabyte: 0.003854 seconds
>>>>
>>>> Given that the page table for the child is not copied, I was expecting
>>>> the performance to be better with the COW kernel, and also not to
>>>> depend on the size of the parent.
>>>
>>> Yes, the child won't duplicate the page table, but fork will still
>>> traverse all the page table entries to do the accounting.
>>> And, since this patch extends COW to the PTE table level, it is no
>>> longer grained at the individual mapped page (page table entry), so we
>>> have to guarantee that all the mapped pages are available to do COW
>>> mapping in such a page table.
>>> This kind of checking also costs some time.
>>> As a result, because of the accounting and the checking, the COW PTE
>>> fork still depends on the size of the parent, so the improvement might
>>> not be significant.
>>
>> The current version of the series does not provide any performance
>> improvements for fork(). I would recommend removing claims from the
>> cover letter about better fork() performance, as this may be
>> misleading for those looking for a way to speed up forking. In my
> 
> From v3 to v4, I changed the implementation of the COW fork() part to do
> the accounting and checking. At the time, I also removed most of the
> descriptions about the better fork() performance. Maybe that's not
> enough and it's still somewhat misleading. I will fix this in the next
> version.
> Thanks.
> 
>> case, I was looking to speed up Redis OSS, which relies on fork() to
>> create consistent snapshots for driving replicates/backups. The O(N)
>> per-page operation causes fork() to be slow, so I was hoping that this
>> series, which does not duplicate the VA during fork(), would make the
>> operation much quicker.
> 
> Indeed, at first, I tried to avoid the O(N) per-page operation by
> deferring the accounting and the swap stuff to the page fault. But,
> as I mentioned, it's not suitable for the mainline.
> 
> Honestly, for improving the fork(), I have an idea to skip the per-page
> operation without breaking the logic. However, this would introduce a
> complicated mechanism and may add overhead for other features. It
> might not be worth it. It's hard to strike a balance between an
> over-complicated mechanism with (probably) better performance and
> keeping the data consistent with the page status. So, I would focus
> on the safe and stable approach at first.

Yes, it is most probably possible, but complexity, robustness and 
maintainability have to be considered as well.

Thanks for implementing this approach (only deduplication without other 
optimizations) and evaluating it accordingly. It's certainly "cleaner", 
such that we only have to mess with unsharing and not with other 
accounting/pinning/mapcount thingies. But it also highlights how 
intrusive even this basic deduplication approach already is -- and that 
most benefits of the original approach require even more complexity on top.

I am not quite sure if the benefit is worth the price (I am not the one
to decide, and I would like to hear other opinions).

My quick thoughts after skimming over the core parts of this series

(1) forgetting to break COW on a PTE in some pgtable walker feels quite
     likely (meaning that it might be fairly error-prone), as does
     forgetting to break COW on a PTE table and accidentally modifying
     the shared table.
(2) break_cow_pte() can fail, which means that we can fail some
     operations (possibly silently halfway through) now. For example,
     looking at your change_pte_range() change, I suspect it's wrong
     (see the sketch after this list).
(3) handle_cow_pte_fault() looks quite complicated and needs quite some
     double-checking: we temporarily clear the PMD, to reset it
     afterwards. I am not sure if that is correct. For example, what
     stops another page fault stumbling over that pmd_none() and
     allocating an empty page table? Maybe there are some locking
     details missing, or they are very subtle, such that we'd better
     document them. I recall that THP played quite some tricks to make
     such cases work ...
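
For illustration, the caller-side pattern that (2) is about -- a
minimal sketch only, assuming the break_cow_pte() interface from this
series (walk_and_modify() is made up, and the exact signature of
break_cow_pte() is approximated):

/* Walker prologue: unshare the PTE table before touching entries.
 * If this fails (e.g. with -ENOMEM), the walker has to back out
 * cleanly; silently skipping the range halfway through is wrong. */
static long walk_and_modify(struct vm_area_struct *vma, pmd_t *pmd,
                            unsigned long addr)
{
        int ret = break_cow_pte(vma, pmd, addr);

        if (ret)
                return ret;     /* propagate, don't continue halfway */

        /* ... now safe to modify the exclusive PTE table ... */
        return 0;
}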

> 
>>> Actually, at the RFC v1 and v2, we proposed the version of skipping
>>> those works, and we got a significant improvement. You can see the
>>> number from RFC v2 cover letter [1]:
>>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
>>> for normal fork"
>>
>> I suspect the 93% improvement (when the mapcount was not updated) was
>> only for VAs with 4K pages. With 2M mappings this series did not
>> provide any benefit, is this correct?
> 
> Yes. In this case, the COW PTE performance is similar to the normal
> fork().


The thing with THP is that, during fork(), we always allocate a backup
PTE table, to be able to PTE-map the THP whenever we have to. Otherwise 
we'd have to eventually fail some operations we don't want to fail -- 
similar to the case where break_cow_pte() could fail now due to -ENOMEM 
although we really don't want to fail (e.g., change_pte_range() ).

I always considered that wasteful, because in many scenarios, we'll 
never ever split a THP and possibly waste memory.

Optimizing that for THP (e.g., don't always allocate a backup PTE table, have
some global allocation backup pool for splits + refill when 
close-to-empty) might provide similar fork() improvements, both in speed 
and memory consumption when it comes to anonymous memory.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14  9:58         ` David Hildenbrand
@ 2023-02-14 13:07           ` Pasha Tatashin
  2023-02-14 13:17             ` David Hildenbrand
  2023-02-14 15:59           ` Chih-En Lin
  2023-02-14 17:23           ` Yang Shi
  2 siblings, 1 reply; 37+ messages in thread
From: Pasha Tatashin @ 2023-02-14 13:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chih-En Lin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 4:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.02.23 18:20, Chih-En Lin wrote:
> > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
> >>>>> Currently, copy-on-write is only used for the mapped memory; the child
> >>>>> process still needs to copy the entire page table from the parent
> >>>>> process during forking. The parent process might take a lot of time and
> >>>>> memory to copy the page table when the parent has a big page table
> >>>>> allocated. For example, the memory usage of a process after forking with
> >>>>> 1 GB mapped memory is as follows:
> >>>>
> >>>> For some reason, I was not able to reproduce performance improvements
> >>>> with a simple fork() performance measurement program. The results that
> >>>> I saw are the following:
> >>>>
> >>>> Base:
> >>>> Fork latency per gigabyte: 0.004416 seconds
> >>>> Fork latency per gigabyte: 0.004382 seconds
> >>>> Fork latency per gigabyte: 0.004442 seconds
> >>>> COW kernel:
> >>>> Fork latency per gigabyte: 0.004524 seconds
> >>>> Fork latency per gigabyte: 0.004764 seconds
> >>>> Fork latency per gigabyte: 0.004547 seconds
> >>>>
> >>>> AMD EPYC 7B12 64-Core Processor
> >>>> Base:
> >>>> Fork latency per gigabyte: 0.003923 seconds
> >>>> Fork latency per gigabyte: 0.003909 seconds
> >>>> Fork latency per gigabyte: 0.003955 seconds
> >>>> COW kernel:
> >>>> Fork latency per gigabyte: 0.004221 seconds
> >>>> Fork latency per gigabyte: 0.003882 seconds
> >>>> Fork latency per gigabyte: 0.003854 seconds
> >>>>
> >>>> Given that the page table for the child is not copied, I was expecting
> >>>> the performance to be better with the COW kernel, and also not to
> >>>> depend on the size of the parent.
> >>>
> >>> Yes, the child won't duplicate the page table, but fork will still
> >>> traverse all the page table entries to do the accounting.
> >>> And, since this patch extends COW to the PTE table level, it is no
> >>> longer grained at the individual mapped page (page table entry), so we
> >>> have to guarantee that all the mapped pages are available to do COW
> >>> mapping in such a page table.
> >>> This kind of checking also costs some time.
> >>> As a result, because of the accounting and the checking, the COW PTE
> >>> fork still depends on the size of the parent, so the improvement might
> >>> not be significant.
> >>
> >> The current version of the series does not provide any performance
> >> improvements for fork(). I would recommend removing claims from the
> >> cover letter about better fork() performance, as this may be
> >> misleading for those looking for a way to speed up forking. In my
> >
> > From v3 to v4, I changed the implementation of the COW fork() part to do
> > the accounting and checking. At the time, I also removed most of the
> > descriptions about the better fork() performance. Maybe that's not
> > enough and it's still somewhat misleading. I will fix this in the next
> > version.
> > Thanks.
> >
> >> case, I was looking to speed up Redis OSS, which relies on fork() to
> >> create consistent snapshots for driving replicates/backups. The O(N)
> >> per-page operation causes fork() to be slow, so I was hoping that this
> >> series, which does not duplicate the VA during fork(), would make the
> >> operation much quicker.
> >
> > Indeed, at first, I tried to avoid the O(N) per-page operation by
> > deferring the accounting and the swap stuff to the page fault. But,
> > as I mentioned, it's not suitable for the mainline.
> >
> > Honestly, for improving the fork(), I have an idea to skip the per-page
> > operation without breaking the logic. However, this would introduce a
> > complicated mechanism and may add overhead for other features. It
> > might not be worth it. It's hard to strike a balance between an
> > over-complicated mechanism with (probably) better performance and
> > keeping the data consistent with the page status. So, I would focus
> > on the safe and stable approach at first.
>
> Yes, it is most probably possible, but complexity, robustness and
> maintainability have to be considered as well.
>
> Thanks for implementing this approach (only deduplication without other
> optimizations) and evaluating it accordingly. It's certainly "cleaner",
> such that we only have to mess with unsharing and not with other
> accounting/pinning/mapcount thingies. But it also highlights how
> intrusive even this basic deduplication approach already is -- and that
> most benefits of the original approach require even more complexity on top.
>
> I am not quite sure if the benefit is worth the price (I am not the one
> to decide, and I would like to hear other opinions).
>
> My quick thoughts after skimming over the core parts of this series
>
> (1) forgetting to break COW on a PTE in some pgtable walker feels quite
>      likely (meaning that it might be fairly error-prone), as does
>      forgetting to break COW on a PTE table and accidentally modifying
>      the shared table.
> (2) break_cow_pte() can fail, which means that we can fail some
>      operations (possibly silently halfway through) now. For example,
>      looking at your change_pte_range() change, I suspect it's wrong.
> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>      double-checking: we temporarily clear the PMD, to reset it
>      afterwards. I am not sure if that is correct. For example, what
>      stops another page fault stumbling over that pmd_none() and
>      allocating an empty page table? Maybe there are some locking details
>      missing or they are very subtle such that we better document them. I
>     recall that THP played quite some tricks to make such cases work ...
>
> >
> >>> Actually, at the RFC v1 and v2, we proposed the version of skipping
> >>> those works, and we got a significant improvement. You can see the
> >>> number from RFC v2 cover letter [1]:
> >>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> >>> for normal fork"
> >>
> >> I suspect the 93% improvement (when the mapcount was not updated) was
> >> only for VAs with 4K pages. With 2M mappings this series did not
> >> provide any benefit, is this correct?
> >
> > Yes. In this case, the COW PTE performance is similar to the normal
> > fork().
>
>
> The thing with THP is that, during fork(), we always allocate a backup
> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise
> we'd have to eventually fail some operations we don't want to fail --
> similar to the case where break_cow_pte() could fail now due to -ENOMEM
> although we really don't want to fail (e.g., change_pte_range() ).
>
> I always considered that wasteful, because in many scenarios, we'll
> never ever split a THP and possibly waste memory.

Yes, it does sound wasteful for a pretty rare corner case that
combines splitting THP in a process, and not having enough memory to
allocate PTE page tables.

> Optimizing that for THP (e.g., don't always allocate a backup PTE table, have
> some global allocation backup pool for splits + refill when
> close-to-empty) might provide similar fork() improvements, both in speed
> and memory consumption when it comes to anonymous memory.

This sounds like a reasonable way to optimize the fork performance for
processes with large RSS, which in most cases would have 2M THP
mappings. When you say global pool, do you mean per machine, per
cgroup, or per process?

Pasha

>
> --
> Thanks,
>
> David / dhildenb
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 13:07           ` Pasha Tatashin
@ 2023-02-14 13:17             ` David Hildenbrand
  0 siblings, 0 replies; 37+ messages in thread
From: David Hildenbrand @ 2023-02-14 13:17 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Chih-En Lin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On 14.02.23 14:07, Pasha Tatashin wrote:
> On Tue, Feb 14, 2023 at 4:58 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 10.02.23 18:20, Chih-En Lin wrote:
>>> On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
>>>>>>> Currently, copy-on-write is only used for the mapped memory; the child
>>>>>>> process still needs to copy the entire page table from the parent
>>>>>>> process during forking. The parent process might take a lot of time and
>>>>>>> memory to copy the page table when the parent has a big page table
>>>>>>> allocated. For example, the memory usage of a process after forking with
>>>>>>> 1 GB mapped memory is as follows:
>>>>>>
>>>>>> For some reason, I was not able to reproduce performance improvements
>>>>>> with a simple fork() performance measurement program. The results that
>>>>>> I saw are the following:
>>>>>>
>>>>>> Base:
>>>>>> Fork latency per gigabyte: 0.004416 seconds
>>>>>> Fork latency per gigabyte: 0.004382 seconds
>>>>>> Fork latency per gigabyte: 0.004442 seconds
>>>>>> COW kernel:
>>>>>> Fork latency per gigabyte: 0.004524 seconds
>>>>>> Fork latency per gigabyte: 0.004764 seconds
>>>>>> Fork latency per gigabyte: 0.004547 seconds
>>>>>>
>>>>>> AMD EPYC 7B12 64-Core Processor
>>>>>> Base:
>>>>>> Fork latency per gigabyte: 0.003923 seconds
>>>>>> Fork latency per gigabyte: 0.003909 seconds
>>>>>> Fork latency per gigabyte: 0.003955 seconds
>>>>>> COW kernel:
>>>>>> Fork latency per gigabyte: 0.004221 seconds
>>>>>> Fork latency per gigabyte: 0.003882 seconds
>>>>>> Fork latency per gigabyte: 0.003854 seconds
>>>>>>
>>>>>> Given that the page table for the child is not copied, I was expecting
>>>>>> the performance to be better with the COW kernel, and also not to
>>>>>> depend on the size of the parent.
>>>>>
>>>>> Yes, the child won't duplicate the page table, but fork will still
>>>>> traverse all the page table entries to do the accounting.
>>>>> And, since this patch extends COW to the PTE table level, it is no
>>>>> longer grained at the individual mapped page (page table entry), so we
>>>>> have to guarantee that all the mapped pages are available to do COW
>>>>> mapping in such a page table.
>>>>> This kind of checking also costs some time.
>>>>> As a result, because of the accounting and the checking, the COW PTE
>>>>> fork still depends on the size of the parent, so the improvement might
>>>>> not be significant.
>>>>
>>>> The current version of the series does not provide any performance
>>>> improvements for fork(). I would recommend removing claims from the
>>>> cover letter about better fork() performance, as this may be
>>>> misleading for those looking for a way to speed up forking. In my
>>>
>>> From v3 to v4, I changed the implementation of the COW fork() part to do
>>> the accounting and checking. At the time, I also removed most of the
>>> descriptions about the better fork() performance. Maybe that's not
>>> enough and it's still somewhat misleading. I will fix this in the next
>>> version.
>>> Thanks.
>>>
>>>> case, I was looking to speed up Redis OSS, which relies on fork() to
>>>> create consistent snapshots for driving replicates/backups. The O(N)
>>>> per-page operation causes fork() to be slow, so I was hoping that this
>>>> series, which does not duplicate the VA during fork(), would make the
>>>> operation much quicker.
>>>
>>> Indeed, at first, I tried to avoid the O(N) per-page operation by
>>> deferring the accounting and the swap stuff to the page fault. But,
>>> as I mentioned, it's not suitable for the mainline.
>>>
>>> Honestly, for improving the fork(), I have an idea to skip the per-page
>>> operation without breaking the logic. However, this would introduce a
>>> complicated mechanism and may add overhead for other features. It
>>> might not be worth it. It's hard to strike a balance between an
>>> over-complicated mechanism with (probably) better performance and
>>> keeping the data consistent with the page status. So, I would focus
>>> on the safe and stable approach at first.
>>
>> Yes, it is most probably possible, but complexity, robustness and
>> maintainability have to be considered as well.
>>
>> Thanks for implementing this approach (only deduplication without other
>> optimizations) and evaluating it accordingly. It's certainly "cleaner",
>> such that we only have to mess with unsharing and not with other
>> accounting/pinning/mapcount thingies. But it also highlights how
>> intrusive even this basic deduplication approach already is -- and that
>> most benefits of the original approach require even more complexity on top.
>>
>> I am not quite sure if the benefit is worth the price (I am not the one
>> to decide, and I would like to hear other opinions).
>>
>> My quick thoughts after skimming over the core parts of this series
>>
>> (1) forgetting to break COW on a PTE in some pgtable walker feels quite
>>       likely (meaning that it might be fairly error-prone), as does
>>       forgetting to break COW on a PTE table and accidentally modifying
>>       the shared table.
>> (2) break_cow_pte() can fail, which means that we can fail some
>>       operations (possibly silently halfway through) now. For example,
>>       looking at your change_pte_range() change, I suspect it's wrong.
>> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>>       double-checking: we temporarily clear the PMD, to reset it
>>       afterwards. I am not sure if that is correct. For example, what
>>       stops another page fault stumbling over that pmd_none() and
>>       allocating an empty page table? Maybe there are some locking details
>>       missing or they are very subtle such that we better document them. I
>>      recall that THP played quite some tricks to make such cases work ...
>>
>>>
>>>>> Actually, at the RFC v1 and v2, we proposed the version of skipping
>>>>> those works, and we got a significant improvement. You can see the
>>>>> number from RFC v2 cover letter [1]:
>>>>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
>>>>> for normal fork"
>>>>
>>>> I suspect the 93% improvement (when the mapcount was not updated) was
>>>> only for VAs with 4K pages. With 2M mappings this series did not
>>>> provide any benefit, is this correct?
>>>
>>> Yes. In this case, the COW PTE performance is similar to the normal
>>> fork().
>>
>>
>> The thing with THP is that, during fork(), we always allocate a backup
>> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise
>> we'd have to eventually fail some operations we don't want to fail --
>> similar to the case where break_cow_pte() could fail now due to -ENOMEM
>> although we really don't want to fail (e.g., change_pte_range() ).
>>
>> I always considered that wasteful, because in many scenarios, we'll
>> never ever split a THP and possibly waste memory.
> 
> Yes, it does sound wasteful for a pretty rare corner case that
> combines splitting THP in a process, and not having enough memory to
> allocate PTE page tables.
> 
>> Optimizing that for THP (e.g., don't always allocate a backup PTE table, have
>> some global allocation backup pool for splits + refill when
>> close-to-empty) might provide similar fork() improvements, both in speed
>> and memory consumption when it comes to anonymous memory.
> 
> This sounds like a reasonable way to optimize the fork performance for
> processes with large RSS, which in most cases would have 2M THP
> mappings. When you say global pool, do you mean per machine, per
> cgroup, or per process?

Good question. I recall that the problem is that we sometimes need a new 
pgtable when splitting a THP, but

(a) we might be under spinlock and cannot sleep. We need an atomic
     allocation that might fail. For this, a pool might be helpful.

(b) we might actually be out of memory.

My gut feeling is that a global pool would be sufficient, only to be
used when we run into (a) or (b) to be able to make progress in these 
rare cases.

Something that would be interesting to evaluate is which THP split 
operations might require some way to recover when really OOM.

For example, __split_huge_pmd() is currently not able to report a 
failure. I assume that we could sleep in there. And if we're not able to 
allocate any memory in there (with sleeping), maybe the process should 
be zapped either way by the OOM killer.
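
A very rough sketch of what such a pool could look like (everything
below, names included, is hypothetical):

#define PGTABLE_POOL_LOW        8       /* refill watermark */

static struct page *pgtable_pool[PGTABLE_POOL_LOW * 2];
static int pgtable_pool_nr;
static DEFINE_SPINLOCK(pgtable_pool_lock);
static struct work_struct pgtable_pool_refill_work; /* INIT_WORK()ed at boot */

/* Take a pre-allocated PTE table: usable in atomic context (a), and a
 * last resort when the normal allocation already failed (b). */
static struct page *pgtable_pool_take(void)
{
        struct page *page = NULL;
        bool low;

        spin_lock(&pgtable_pool_lock);
        if (pgtable_pool_nr)
                page = pgtable_pool[--pgtable_pool_nr];
        low = pgtable_pool_nr < PGTABLE_POOL_LOW;
        spin_unlock(&pgtable_pool_lock);
        if (low)
                schedule_work(&pgtable_pool_refill_work);
        return page;
}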


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14  9:58         ` David Hildenbrand
  2023-02-14 13:07           ` Pasha Tatashin
@ 2023-02-14 15:59           ` Chih-En Lin
  2023-02-14 16:30             ` Pasha Tatashin
  2023-02-14 16:58             ` David Hildenbrand
  2023-02-14 17:23           ` Yang Shi
  2 siblings, 2 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-14 15:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pasha Tatashin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 10:58:30AM +0100, David Hildenbrand wrote:
> On 10.02.23 18:20, Chih-En Lin wrote:
> > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
> > > > > > Currently, copy-on-write is only used for the mapped memory; the child
> > > > > > process still needs to copy the entire page table from the parent
> > > > > > process during forking. The parent process might take a lot of time and
> > > > > > memory to copy the page table when the parent has a big page table
> > > > > > allocated. For example, the memory usage of a process after forking with
> > > > > > 1 GB mapped memory is as follows:
> > > > > 
> > > > > For some reason, I was not able to reproduce performance improvements
> > > > > with a simple fork() performance measurement program. The results that
> > > > > I saw are the following:
> > > > > 
> > > > > Base:
> > > > > Fork latency per gigabyte: 0.004416 seconds
> > > > > Fork latency per gigabyte: 0.004382 seconds
> > > > > Fork latency per gigabyte: 0.004442 seconds
> > > > > COW kernel:
> > > > > Fork latency per gigabyte: 0.004524 seconds
> > > > > Fork latency per gigabyte: 0.004764 seconds
> > > > > Fork latency per gigabyte: 0.004547 seconds
> > > > > 
> > > > > AMD EPYC 7B12 64-Core Processor
> > > > > Base:
> > > > > Fork latency per gigabyte: 0.003923 seconds
> > > > > Fork latency per gigabyte: 0.003909 seconds
> > > > > Fork latency per gigabyte: 0.003955 seconds
> > > > > COW kernel:
> > > > > Fork latency per gigabyte: 0.004221 seconds
> > > > > Fork latency per gigabyte: 0.003882 seconds
> > > > > Fork latency per gigabyte: 0.003854 seconds
> > > > > 
> > > > > Given that the page table for the child is not copied, I was expecting
> > > > > the performance to be better with the COW kernel, and also not to
> > > > > depend on the size of the parent.
> > > > 
> > > > Yes, the child won't duplicate the page table, but fork will still
> > > > traverse all the page table entries to do the accounting.
> > > > And, since this patch extends COW to the PTE table level, it is no
> > > > longer grained at the individual mapped page (page table entry), so we
> > > > have to guarantee that all the mapped pages are available to do COW
> > > > mapping in such a page table.
> > > > This kind of checking also costs some time.
> > > > As a result, because of the accounting and the checking, the COW PTE
> > > > fork still depends on the size of the parent, so the improvement might
> > > > not be significant.
> > > 
> > > The current version of the series does not provide any performance
> > > improvements for fork(). I would recommend removing claims from the
> > > cover letter about better fork() performance, as this may be
> > > misleading for those looking for a way to speed up forking. In my
> > 
> > From v3 to v4, I changed the implementation of the COW fork() part to do
> > the accounting and checking. At the time, I also removed most of the
> > descriptions about the better fork() performance. Maybe that's not
> > enough and it's still somewhat misleading. I will fix this in the next
> > version.
> > Thanks.
> > 
> > > case, I was looking to speed up Redis OSS, which relies on fork() to
> > > create consistent snapshots for driving replicates/backups. The O(N)
> > > per-page operation causes fork() to be slow, so I was hoping that this
> > > series, which does not duplicate the VA during fork(), would make the
> > > operation much quicker.
> > 
> > Indeed, at first, I tried to avoid the O(N) per-page operation by
> > deferring the accounting and the swap stuff to the page fault. But,
> > as I mentioned, it's not suitable for the mainline.
> > 
> > Honestly, for improving the fork(), I have an idea to skip the per-page
> > operation without breaking the logic. However, this would introduce a
> > complicated mechanism and may add overhead for other features. It
> > might not be worth it. It's hard to strike a balance between an
> > over-complicated mechanism with (probably) better performance and
> > keeping the data consistent with the page status. So, I would focus
> > on the safe and stable approach at first.
> 
> Yes, it is most probably possible, but complexity, robustness and
> maintainability have to be considered as well.
> 
> Thanks for implementing this approach (only deduplication without other
> optimizations) and evaluating it accordingly. It's certainly "cleaner", such
> that we only have to mess with unsharing and not with other
> accounting/pinning/mapcount thingies. But it also highlights how intrusive
> even this basic deduplication approach already is -- and that most benefits
> of the original approach require even more complexity on top.
> 
> I am not quite sure if the benefit is worth the price (I am not the one
> to decide, and I would like to hear other opinions).

I'm looking at the discussion of page table sharing in 2002 [1]. 
It looks like in 2002 ~ 2006, there were also some patches trying to
improve fork().

After that, I also saw one thread which is about another shared page
table patch's benchmark. I can't find the original patch though [2].
But, I found the probably same patch in 2005 [3], it also mentioned
the previous benchmark discussion:

"
For those familiar with the shared page table patch I did a couple of years
ago, this patch does not implement copy-on-write page tables for private
mappings.  Analysis showed the cost and complexity far outweighed any
potential benefit.
"

However, it might be different right now. For example, the
implementation: we have the split page table lock now, so we don't have
to consider the page_table_share_lock thing. Also, presently, we have
different use cases (shells [2] vs. VM cloning and fuzzing) to consider.

Nonetheless, I still think the discussion can give us some insight.

BTW, it seems like the 2002 patch [1] is different from the 2002 [2]
and 2005 [3] ones.

[1] https://lkml.iu.edu/hypermail/linux/kernel/0202.2/0102.html
[2] https://lore.kernel.org/linux-mm/3E02FACD.5B300794@digeo.com/
[3] https://lore.kernel.org/linux-mm/7C49DFF721CB4E671DB260F9@%5B10.1.1.4%5D/T/#u

> My quick thoughts after skimming over the core parts of this series
> 
> (1) forgetting to break COW on a PTE in some pgtable walker feels quite
>     likely (meaning that it might be fairly error-prone), as does
>     forgetting to break COW on a PTE table and accidentally modifying
>     the shared table.

Maybe I should also handle arch/ and other parts.
I will keep looking for the places I missed.

> (2) break_cow_pte() can fail, which means that we can fail some
>     operations (possibly silently halfway through) now. For example,
>     looking at your change_pte_range() change, I suspect it's wrong.

Maybe I should add WARN_ON() and skip the failed COW PTE.

> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>     double-checking: we temporarily clear the PMD, to reset it
>     afterwards. I am not sure if that is correct. For example, what
>     stops another page fault stumbling over that pmd_none() and
>     allocating an empty page table? Maybe there are some locking details
>     missing or they are very subtle such that we better document them. I
>    recall that THP played quite some tricks to make such cases work ...

I think that holding mmap_write_lock may be enough (I added
mmap_assert_write_locked() in the fault function btw). But, I might
be wrong. I will look at the THP stuff to see how they work. Thanks.

Thanks for the review.

> > 
> > > > Actually, at the RFC v1 and v2, we proposed the version of skipping
> > > > those works, and we got a significant improvement. You can see the
> > > > number from RFC v2 cover letter [1]:
> > > > "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> > > > for normal fork"
> > > 
> > > I suspect the 93% improvement (when the mapcount was not updated) was
> > > only for VAs with 4K pages. With 2M mappings this series did not
> > > provide any benefit, is this correct?
> > 
> > Yes. In this case, the COW PTE performance is similar to the normal
> > fork().
> 
> 
> The thing with THP is that, during fork(), we always allocate a backup PTE
> table, to be able to PTE-map the THP whenever we have to. Otherwise we'd
> have to eventually fail some operations we don't want to fail -- similar to
> the case where break_cow_pte() could fail now due to -ENOMEM although we
> really don't want to fail (e.g., change_pte_range() ).
> 
> I always considered that wasteful, because in many scenarios, we'll never
> ever split a THP and possibly waste memory.
> 
> Optimizing that for THP (e.g., don't always allocate a backup PTE table, have some
> global allocation backup pool for splits + refill when close-to-empty) might
> provide similar fork() improvements, both in speed and memory consumption
> when it comes to anonymous memory.

When collapsing huge pages, could we reuse those PTE tables as the
backup? Then we wouldn't have to allocate the PTE table or maintain
the pool.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 15:59           ` Chih-En Lin
@ 2023-02-14 16:30             ` Pasha Tatashin
  2023-02-14 18:41               ` Chih-En Lin
  2023-02-14 16:58             ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Pasha Tatashin @ 2023-02-14 16:30 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: David Hildenbrand, Andrew Morton, Qi Zheng,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

> > The thing with THP is that, during fork(), we always allocate a backup PTE
> > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd
> > have to eventually fail some operations we don't want to fail -- similar to
> > the case where break_cow_pte() could fail now due to -ENOMEM although we
> > really don't want to fail (e.g., change_pte_range() ).
> >
> > I always considered that wasteful, because in many scenarios, we'll never
> > ever split a THP and possibly waste memory.
> >
> > Optimizing that for THP (e.g., don't always allocate a backup PTE table, have some
> > global allocation backup pool for splits + refill when close-to-empty) might
> > provide similar fork() improvements, both in speed and memory consumption
> > when it comes to anonymous memory.
>
> When collapsing huge pages, could we reuse those PTE tables as the
> backup? Then we wouldn't have to allocate the PTE table or maintain
> the pool.

It might not work in all cases, as the collapsed range might have had
holes in the user page table where there were no PTE tables.
Pasha

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 15:59           ` Chih-En Lin
  2023-02-14 16:30             ` Pasha Tatashin
@ 2023-02-14 16:58             ` David Hildenbrand
  2023-02-14 17:03               ` David Hildenbrand
  2023-02-14 17:54               ` Chih-En Lin
  1 sibling, 2 replies; 37+ messages in thread
From: David Hildenbrand @ 2023-02-14 16:58 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Pasha Tatashin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng


>>>
>>> Honestly, for improving the fork(), I have an idea to skip the per-page
>>> operation without breaking the logic. However, this would introduce a
>>> complicated mechanism and may add overhead for other features. It
>>> might not be worth it. It's hard to strike a balance between an
>>> over-complicated mechanism with (probably) better performance and
>>> keeping the data consistent with the page status. So, I would focus
>>> on the safe and stable approach at first.
>>
>> Yes, it is most probably possible, but complexity, robustness and
>> maintainability have to be considered as well.
>>
>> Thanks for implementing this approach (only deduplication without other
>> optimizations) and evaluating it accordingly. It's certainly "cleaner", such
>> that we only have to mess with unsharing and not with other
>> accounting/pinning/mapcount thingies. But it also highlights how intrusive
>> even this basic deduplication approach already is -- and that most benefits
>> of the original approach require even more complexity on top.
>>
>> I am not quite sure if the benefit is worth the price (I am not the one
>> to decide, and I would like to hear other opinions).
> 
> I'm looking at the discussion of page table sharing in 2002 [1].
> It looks like in 2002 ~ 2006, there were also some patches trying to
> improve fork().
> 
> After that, I also saw one thread which is about another shared page
> table patch's benchmark. I can't find the original patch though [2].
> But, I found the probably same patch in 2005 [3], it also mentioned
> the previous benchmark discussion:
> 
> "
> For those familiar with the shared page table patch I did a couple of years
> ago, this patch does not implement copy-on-write page tables for private
> mappings.  Analysis showed the cost and complexity far outweighed any
> potential benefit.
> "

Thanks for the pointer, interesting read. And my personal opinion is 
that part of that statement still holds true :)

> 
> However, it might be different right now. For example, the
> implementation: we have the split page table lock now, so we don't
> have to consider the page_table_share_lock thing. Also, presently, we
> have different use cases (shells [2] vs. VM cloning and fuzzing) to
> consider.
> 
> Nonetheless, I still think the discussion can give us some useful
> perspective.
> 
> BTW, it seems like the 2002 patch [1] is different from the 2002 [2]
> and 2005 [3] ones.
> 
> [1] https://lkml.iu.edu/hypermail/linux/kernel/0202.2/0102.html
> [2] https://lore.kernel.org/linux-mm/3E02FACD.5B300794@digeo.com/
> [3] https://lore.kernel.org/linux-mm/7C49DFF721CB4E671DB260F9@%5B10.1.1.4%5D/T/#u
> 
>> My quick thoughts after skimming over the core parts of this series
>>
>> (1) forgetting to break COW on a PTE in some pgtable walker feels quite
>>      likely (meaning that it might be fairly error-prone) and forgetting
>>      to break COW on a PTE table, accidentally modifying the shared
>>      table.
> 
> Maybe I should also handle arch/ and other parts.
> I will keep looking for what I missed.

One could add sanity checks when modifying a PTE while the PTE table is 
still marked shared ... but I guess there are some valid reasons where 
we might want to modify shared PTE tables (rmap).
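
As a sketch of what such a sanity check could look like
(cow_pte_table_shared() is a hypothetical helper, not from the series):

static inline void assert_pte_table_exclusive(pmd_t *pmd)
{
	/* A COW-shared table must be unshared before any PTE write. */
	VM_WARN_ON_ONCE(cow_pte_table_shared(pmd));
}

An rmap walker that legitimately touches a shared table (e.g. to update
the referenced bit) would simply not call it.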

> 
>> (2) break_cow_pte() can fail, which means that we can fail some
>>      operations (possibly silently halfway through) now. For example,
>>      looking at your change_pte_range() change, I suspect it's wrong.
> 
> Maybe I should add WARN_ON() and skip the failed COW PTE.

One way or the other we'll have to handle it. WARN_ON() sounds wrong for 
handling OOM situations (e.g., if only that cgroup is OOM).
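
A sketch of the alternative: propagate the failure to the caller so the
operation fails cleanly instead of warning (the break_cow_pte()
signature here is assumed):

	ret = break_cow_pte(vma, pmd, addr);
	if (ret)
		/* e.g. -ENOMEM when only this cgroup is OOM */
		return ret;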

> 
>> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>>      double-checking: we temporarily clear the PMD, to reset it
>>      afterwards. I am not sure if that is correct. For example, what
>>      stops another page fault stumbling over that pmd_none() and
>>      allocating an empty page table? Maybe there are some locking details
>>      missing or they are very subtle such that we better document them. I
>>     recall that THP played quite some tricks to make such cases work ...
> 
> I think that holding mmap_write_lock may be enough (I added
> mmap_assert_write_locked() in the fault function btw). But, I might
> be wrong. I will look at the THP stuff to see how they work. Thanks.
> 

Ehm, but page faults don't hold the mmap lock writable? And neither do
other callers, like MADV_DONTNEED or MADV_FREE.

handle_pte_fault()->handle_cow_pte_fault()->mmap_assert_write_locked()
should bail out.

Either I am missing something or you didn't test with lockdep enabled :)
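
For reference, a fault handler can only assert the read side of the
lock; a minimal sketch using the upstream mmap_lock helpers (the
function name is illustrative):

static vm_fault_t cow_pte_fault_sketch(struct vm_fault *vmf)
{
	/*
	 * Faults hold the mmap lock for read, not write, so
	 * mmap_assert_write_locked() here would trip lockdep.
	 */
	mmap_assert_locked(vmf->vma->vm_mm);
	return 0;
}

Testing with CONFIG_PROVE_LOCKING=y makes the wrong assertion fire on
the first fault.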

Note that there are upstream efforts to use only a VMA lock (and some 
people even want to perform some page faults only protected by RCU).

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 16:58             ` David Hildenbrand
@ 2023-02-14 17:03               ` David Hildenbrand
  2023-02-14 17:56                 ` Chih-En Lin
  2023-02-14 17:54               ` Chih-En Lin
  1 sibling, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2023-02-14 17:03 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Pasha Tatashin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On 14.02.23 17:58, David Hildenbrand wrote:
> 
>>>>
>>>> Honestly, for improving the fork(), I have an idea to skip the per-page
>>>> operation without breaking the logic. However, this will introduce the
>>>> complicated mechanism and may have overhead for other features. It
>>>> might not be worth it. It's hard to strike a balance between an
>>>> over-complicated mechanism with (probably) better performance and data
>>>> consistency with the page status. So, I would focus on a safe and
>>>> stable approach first.
>>>
>>> Yes, it is most probably possible, but complexity, robustness and
>>> maintainability have to be considered as well.
>>>
>>> Thanks for implementing this approach (only deduplication without other
>>> optimizations) and evaluating it accordingly. It's certainly "cleaner", such
>>> that we only have to mess with unsharing and not with other
>>> accounting/pinning/mapcount thingies. But it also highlights how intrusive
>>> even this basic deduplication approach already is -- and that most benefits
>>> of the original approach require even more complexity on top.
>>>
>>> I am not quite sure if the benefit is worth the price (I am not to decide
>>> and I would like to hear other options).
>>
>> I'm looking at the discussion of page table sharing in 2002 [1].
>> It looks like in 2002 ~ 2006 there were also some patches trying to
>> improve fork().
>>
>> After that, I also saw one thread which is about another shared page
>> table patch's benchmark. I can't find the original patch though [2].
>> But I found what is probably the same patch in 2005 [3]; it also
>> mentioned the previous benchmark discussion:
>>
>> "
>> For those familiar with the shared page table patch I did a couple of years
>> ago, this patch does not implement copy-on-write page tables for private
>> mappings.  Analysis showed the cost and complexity far outweighed any
>> potential benefit.
>> "
> 
> Thanks for the pointer, interesting read. And my personal opinion is
> that part of that statement still holds true :)
> 
>>
>> However, it might be different right now. For example, the
>> implementation: we have the split page table lock now, so we don't
>> have to consider the page_table_share_lock thing. Also, presently, we
>> have different use cases (shells [2] vs. VM cloning and fuzzing) to
>> consider.


Oh, and because I stumbled over it, just as an interesting pointer on 
QEMU devel:

"[PATCH 00/10] Retire Fork-Based Fuzzing" [1]

[1] https://lore.kernel.org/all/20230205042951.3570008-1-alxndr@bu.edu/T/#u

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14  9:58         ` David Hildenbrand
  2023-02-14 13:07           ` Pasha Tatashin
  2023-02-14 15:59           ` Chih-En Lin
@ 2023-02-14 17:23           ` Yang Shi
  2023-02-14 17:39             ` David Hildenbrand
  2 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2023-02-14 17:23 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chih-En Lin, Pasha Tatashin, Andrew Morton, Qi Zheng,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 1:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 10.02.23 18:20, Chih-En Lin wrote:
> > On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
> >>>>> Currently, copy-on-write is only used for the mapped memory; the child
> >>>>> process still needs to copy the entire page table from the parent
> >>>>> process during forking. The parent process might take a lot of time and
> >>>>> memory to copy the page table when the parent has a big page table
> >>>>> allocated. For example, the memory usage of a process after forking with
> >>>>> 1 GB mapped memory is as follows:
> >>>>
> >>>> For some reason, I was not able to reproduce performance improvements
> >>>> with a simple fork() performance measurement program. The results that
> >>>> I saw are the following:
> >>>>
> >>>> Base:
> >>>> Fork latency per gigabyte: 0.004416 seconds
> >>>> Fork latency per gigabyte: 0.004382 seconds
> >>>> Fork latency per gigabyte: 0.004442 seconds
> >>>> COW kernel:
> >>>> Fork latency per gigabyte: 0.004524 seconds
> >>>> Fork latency per gigabyte: 0.004764 seconds
> >>>> Fork latency per gigabyte: 0.004547 seconds
> >>>>
> >>>> AMD EPYC 7B12 64-Core Processor
> >>>> Base:
> >>>> Fork latency per gigabyte: 0.003923 seconds
> >>>> Fork latency per gigabyte: 0.003909 seconds
> >>>> Fork latency per gigabyte: 0.003955 seconds
> >>>> COW kernel:
> >>>> Fork latency per gigabyte: 0.004221 seconds
> >>>> Fork latency per gigabyte: 0.003882 seconds
> >>>> Fork latency per gigabyte: 0.003854 seconds
> >>>>
> >>>> Given that the page table for the child is not copied, I was expecting the
> >>>> performance to be better with the COW kernel, and also not to depend on
> >>>> the size of the parent.
> >>>
> >>> Yes, the child won't duplicate the page table, but fork will still
> >>> traverse all the page table entries to do the accounting.
> >>> And, since this patch extends the COW to the PTE table level, it's not
> >>> mapped-page (page table entry) grained anymore, so we have to
> >>> guarantee that all the mapped pages are available to do COW mapping in
> >>> such a page table.
> >>> This kind of checking also costs some time.
> >>> As a result, because of the accounting and the checking, the COW PTE fork
> >>> still depends on the size of the parent so the improvement might not
> >>> be significant.
> >>
> >> The current version of the series does not provide any performance
> >> improvements for fork(). I would recommend removing claims from the
> >> cover letter about better fork() performance, as this may be
> >> misleading for those looking for a way to speed up forking. In my
> >
> >  From v3 to v4, I changed the implementation of the COW fork() part to do
> > the accounting and checking. At the time, I also removed most of the
> > descriptions about the better fork() performance. Maybe it's not enough
> > and is still somewhat misleading. I will fix this in the next version.
> > Thanks.
> >
> >> case, I was looking to speed up Redis OSS, which relies on fork() to
> >> create consistent snapshots for driving replicas/backups. The O(N)
> >> per-page operation causes fork() to be slow, so I was hoping that this
> >> series, which does not duplicate the VA during fork(), would make the
> >> operation much quicker.
> >
> > Indeed, at first, I tried to avoid the O(N) per-page operation by
> > deferring the accounting and the swap stuff to the page fault. But,
> > as I mentioned, it's not suitable for the mainline.
> >
> > Honestly, for improving the fork(), I have an idea to skip the per-page
> > operation without breaking the logic. However, this will introduce the
> > complicated mechanism and may have overhead for other features. It
> > might not be worth it. It's hard to strike a balance between an
> > over-complicated mechanism with (probably) better performance and data
> > consistency with the page status. So, I would focus on a safe and
> > stable approach first.
>
> Yes, it is most probably possible, but complexity, robustness and
> maintainability have to be considered as well.
>
> Thanks for implementing this approach (only deduplication without other
> optimizations) and evaluating it accordingly. It's certainly "cleaner",
> such that we only have to mess with unsharing and not with other
> accounting/pinning/mapcount thingies. But it also highlights how
> intrusive even this basic deduplication approach already is -- and that
> most benefits of the original approach require even more complexity on top.
>
> I am not quite sure if the benefit is worth the price (I am not to
> decide and I would like to hear other options).
>
> My quick thoughts after skimming over the core parts of this series
>
> (1) forgetting to break COW on a PTE in some pgtable walker feels quite
>      likely (meaning that it might be fairly error-prone) and forgetting
>      to break COW on a PTE table, accidentally modifying the shared
>      table.
> (2) break_cow_pte() can fail, which means that we can fail some
>      operations (possibly silently halfway through) now. For example,
>      looking at your change_pte_range() change, I suspect it's wrong.
> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>      double-checking: we temporarily clear the PMD, to reset it
>      afterwards. I am not sure if that is correct. For example, what
>      stops another page fault stumbling over that pmd_none() and
>      allocating an empty page table? Maybe there are some locking details
>      missing or they are very subtle such that we better document them. I
>     recall that THP played quite some tricks to make such cases work ...
>
> >
> >>> Actually, in RFC v1 and v2, we proposed a version that skipped
> >>> that work, and we got a significant improvement. You can see the
> >>> number from RFC v2 cover letter [1]:
> >>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> >>> for normal fork"
> >>
> >> I suspect the 93% improvement (when the mapcount was not updated) was
> >> only for VAs with 4K pages. With 2M mappings this series did not
> >>>> provide any benefit, is this correct?
> >
> > Yes. In this case, the COW PTE performance is similar to the normal
> > fork().
>
>
> The thing with THP is that during fork(), we always allocate a backup
> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise
> we'd have to eventually fail some operations we don't want to fail --
> similar to the case where break_cow_pte() could fail now due to -ENOMEM
> although we really don't want to fail (e.g., change_pte_range() ).
>
> I always considered that wasteful, because in many scenarios, we'll
> never ever split a THP and possibly waste memory.

When you say "split THP", do you mean splitting the compound page into
base pages? IIUC the backup PTE table page is used to guarantee that the
PMD split (just converting a PMD-mapped THP to PTE-mapped, without
splitting the compound page) succeeds. You may have already noticed
there is no return value for PMD split.

The PMD split may be called quite often, for example by MADV_DONTNEED,
mbind, mlock, and even in the memory reclamation context (THP swap).
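
As a userspace illustration of how easily a PMD split is triggered (a
sketch; it assumes THP is enabled and that the mapping actually got a
huge page):

#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t thp = 2UL << 20;	/* assumes a 2M PMD size */
	char *p = mmap(NULL, thp, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	madvise(p, thp, MADV_HUGEPAGE);	/* ask for a THP */
	memset(p, 1, thp);		/* fault the range in */
	/* Zapping one 4K page forces the kernel to PTE-map the THP
	 * first: that is the PMD split discussed above. */
	madvise(p, 4096, MADV_DONTNEED);
	return 0;
}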

>
> Optimizing that for THP (e.g., don't always allocate backup THP, have
> some global allocation backup pool for splits + refill when
> close-to-empty) might provide similar fork() improvements, both in speed
> and memory consumption when it comes to anonymous memory.

It might work. But it may be much more complicated than you think when
handling multiple parallel PMD splits.

>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 17:23           ` Yang Shi
@ 2023-02-14 17:39             ` David Hildenbrand
  2023-02-14 18:25               ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2023-02-14 17:39 UTC (permalink / raw)
  To: Yang Shi
  Cc: Chih-En Lin, Pasha Tatashin, Andrew Morton, Qi Zheng,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On 14.02.23 18:23, Yang Shi wrote:
> On Tue, Feb 14, 2023 at 1:58 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 10.02.23 18:20, Chih-En Lin wrote:
>>> On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
>>>>>>> Currently, copy-on-write is only used for the mapped memory; the child
>>>>>>> process still needs to copy the entire page table from the parent
>>>>>>> process during forking. The parent process might take a lot of time and
>>>>>>> memory to copy the page table when the parent has a big page table
>>>>>>> allocated. For example, the memory usage of a process after forking with
>>>>>>> 1 GB mapped memory is as follows:
>>>>>>
>>>>>> For some reason, I was not able to reproduce performance improvements
>>>>>> with a simple fork() performance measurement program. The results that
>>>>>> I saw are the following:
>>>>>>
>>>>>> Base:
>>>>>> Fork latency per gigabyte: 0.004416 seconds
>>>>>> Fork latency per gigabyte: 0.004382 seconds
>>>>>> Fork latency per gigabyte: 0.004442 seconds
>>>>>> COW kernel:
>>>>>> Fork latency per gigabyte: 0.004524 seconds
>>>>>> Fork latency per gigabyte: 0.004764 seconds
>>>>>> Fork latency per gigabyte: 0.004547 seconds
>>>>>>
>>>>>> AMD EPYC 7B12 64-Core Processor
>>>>>> Base:
>>>>>> Fork latency per gigabyte: 0.003923 seconds
>>>>>> Fork latency per gigabyte: 0.003909 seconds
>>>>>> Fork latency per gigabyte: 0.003955 seconds
>>>>>> COW kernel:
>>>>>> Fork latency per gigabyte: 0.004221 seconds
>>>>>> Fork latency per gigabyte: 0.003882 seconds
>>>>>> Fork latency per gigabyte: 0.003854 seconds
>>>>>>
>>>>>> Given that the page table for the child is not copied, I was expecting the
>>>>>> performance to be better with the COW kernel, and also not to depend on
>>>>>> the size of the parent.
>>>>>
>>>>> Yes, the child won't duplicate the page table, but fork will still
>>>>> traverse all the page table entries to do the accounting.
>>>>> And, since this patch extends the COW to the PTE table level, it's not
>>>>> mapped-page (page table entry) grained anymore, so we have to
>>>>> guarantee that all the mapped pages are available to do COW mapping in
>>>>> such a page table.
>>>>> This kind of checking also costs some time.
>>>>> As a result, because of the accounting and the checking, the COW PTE fork
>>>>> still depends on the size of the parent so the improvement might not
>>>>> be significant.
>>>>
>>>> The current version of the series does not provide any performance
>>>> improvements for fork(). I would recommend removing claims from the
>>>> cover letter about better fork() performance, as this may be
>>>> misleading for those looking for a way to speed up forking. In my
>>>
>>>   From v3 to v4, I changed the implementation of the COW fork() part to do
>>> the accounting and checking. At the time, I also removed most of the
>>> descriptions about the better fork() performance. Maybe it's not enough
>>> and is still somewhat misleading. I will fix this in the next version.
>>> Thanks.
>>>
>>>> case, I was looking to speed up Redis OSS, which relies on fork() to
>>>> create consistent snapshots for driving replicas/backups. The O(N)
>>>> per-page operation causes fork() to be slow, so I was hoping that this
>>>> series, which does not duplicate the VA during fork(), would make the
>>>> operation much quicker.
>>>
>>> Indeed, at first, I tried to avoid the O(N) per-page operation by
>>> deferring the accounting and the swap stuff to the page fault. But,
>>> as I mentioned, it's not suitable for the mainline.
>>>
>>> Honestly, for improving the fork(), I have an idea to skip the per-page
>>> operation without breaking the logic. However, this will introduce the
>>> complicated mechanism and may have overhead for other features. It
>>> might not be worth it. It's hard to strike a balance between an
>>> over-complicated mechanism with (probably) better performance and data
>>> consistency with the page status. So, I would focus on a safe and
>>> stable approach first.
>>
>> Yes, it is most probably possible, but complexity, robustness and
>> maintainability have to be considered as well.
>>
>> Thanks for implementing this approach (only deduplication without other
>> optimizations) and evaluating it accordingly. It's certainly "cleaner",
>> such that we only have to mess with unsharing and not with other
>> accounting/pinning/mapcount thingies. But it also highlights how
>> intrusive even this basic deduplication approach already is -- and that
>> most benefits of the original approach require even more complexity on top.
>>
>> I am not quite sure if the benefit is worth the price (I am not to
>> decide and I would like to hear other options).
>>
>> My quick thoughts after skimming over the core parts of this series
>>
>> (1) forgetting to break COW on a PTE in some pgtable walker feels quite
>>       likely (meaning that it might be fairly error-prone) and forgetting
>>       to break COW on a PTE table, accidentally modifying the shared
>>       table.
>> (2) break_cow_pte() can fail, which means that we can fail some
>>       operations (possibly silently halfway through) now. For example,
>>       looking at your change_pte_range() change, I suspect it's wrong.
>> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>>       double-checking: we temporarily clear the PMD, to reset it
>>       afterwards. I am not sure if that is correct. For example, what
>>       stops another page fault stumbling over that pmd_none() and
>>       allocating an empty page table? Maybe there are some locking details
>>       missing or they are very subtle such that we better document them. I
>>      recall that THP played quite some tricks to make such cases work ...
>>
>>>
>>>>> Actually, in RFC v1 and v2, we proposed a version that skipped
>>>>> that work, and we got a significant improvement. You can see the
>>>>> number from RFC v2 cover letter [1]:
>>>>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
>>>>> for normal fork"
>>>>
>>>> I suspect the 93% improvement (when the mapcount was not updated) was
>>>> only for VAs with 4K pages. With 2M mappings this series did not
>>>> provide any benefit, is this correct?
>>>
>>> Yes. In this case, the COW PTE performance is similar to the normal
>>> fork().
>>
>>
>> The thing with THP is that during fork(), we always allocate a backup
>> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise
>> we'd have to eventually fail some operations we don't want to fail --
>> similar to the case where break_cow_pte() could fail now due to -ENOMEM
>> although we really don't want to fail (e.g., change_pte_range() ).
>>
>> I always considered that wasteful, because in many scenarios, we'll
>> never ever split a THP and possibly waste memory.
> 
> When you say "split THP", do you mean splitting the compound page into
> base pages? IIUC the backup PTE table page is used to guarantee that the
> PMD split (just converting a PMD-mapped THP to PTE-mapped, without
> splitting the compound page) succeeds. You may have already noticed
> there is no return value for PMD split.

Yes, as I raised in my other reply.

> 
> The PMD split may be called quite often, for example by MADV_DONTNEED,
> mbind, mlock, and even in the memory reclamation context (THP swap).

Yes, but with a single MADV_DONTNEED call you cannot PTE-map more than 2 
THP (all other overlapped THP will get zapped). Same with most other 
operations.

There are corner cases, though. I recall that s390x/kvm wants to break 
all THP in a given VMA range. But that operation could safely fail if we 
can't do that.

Certainly needs some investigation, that's most probably why it hasn't 
been done yet.

> 
>>
>> Optimizing that for THP (e.g., don't always allocate backup THP, have
>> some global allocation backup pool for splits + refill when
>> close-to-empty) might provide similar fork() improvements, both in speed
>> and memory consumption when it comes to anonymous memory.
> 
> > It might work. But it may be much more complicated than you think when
> > handling multiple parallel PMD splits.


I consider the whole PTE-table linking to THPs complicated enough to 
eventually replace it by something differently complicated that wastes 
less memory ;)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 16:58             ` David Hildenbrand
  2023-02-14 17:03               ` David Hildenbrand
@ 2023-02-14 17:54               ` Chih-En Lin
  2023-02-14 17:59                 ` David Hildenbrand
  1 sibling, 1 reply; 37+ messages in thread
From: Chih-En Lin @ 2023-02-14 17:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pasha Tatashin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 05:58:45PM +0100, David Hildenbrand wrote:
> 
> > > > 
> > > > Honestly, for improving the fork(), I have an idea to skip the per-page
> > > > operation without breaking the logic. However, this will introduce the
> > > > complicated mechanism and may have overhead for other features. It
> > > > might not be worth it. It's hard to strike a balance between an
> > > > over-complicated mechanism with (probably) better performance and data
> > > > consistency with the page status. So, I would focus on a safe and
> > > > stable approach first.
> > > 
> > > Yes, it is most probably possible, but complexity, robustness and
> > > maintainability have to be considered as well.
> > > 
> > > Thanks for implementing this approach (only deduplication without other
> > > optimizations) and evaluating it accordingly. It's certainly "cleaner", such
> > > that we only have to mess with unsharing and not with other
> > > accounting/pinning/mapcount thingies. But it also highlights how intrusive
> > > even this basic deduplication approach already is -- and that most benefits
> > > of the original approach require even more complexity on top.
> > > 
> > > I am not quite sure if the benefit is worth the price (I am not to decide
> > > and I would like to hear other options).
> > 
> > I'm looking at the discussion of page table sharing in 2002 [1].
> > It looks like in 2002 ~ 2006 there were also some patches trying to
> > improve fork().
> > 
> > After that, I also saw one thread which is about another shared page
> > table patch's benchmark. I can't find the original patch though [2].
> > But I found what is probably the same patch in 2005 [3]; it also
> > mentioned the previous benchmark discussion:
> > 
> > "
> > For those familiar with the shared page table patch I did a couple of years
> > ago, this patch does not implement copy-on-write page tables for private
> > mappings.  Analysis showed the cost and complexity far outweighed any
> > potential benefit.
> > "
> 
> Thanks for the pointer, interesting read. And my personal opinion is that
> part of that statement still holds true :)

;)

> > 
> > However, it might be different right now. For example, the
> > implementation: we have the split page table lock now, so we don't
> > have to consider the page_table_share_lock thing. Also, presently, we
> > have different use cases (shells [2] vs. VM cloning and fuzzing) to
> > consider.
> > 
> > Nonetheless, I still think the discussion can give us some useful
> > perspective.
> > 
> > BTW, it seems like the 2002 patch [1] is different from the 2002 [2]
> > and 2005 [3] ones.
> > 
> > [1] https://lkml.iu.edu/hypermail/linux/kernel/0202.2/0102.html
> > [2] https://lore.kernel.org/linux-mm/3E02FACD.5B300794@digeo.com/
> > [3] https://lore.kernel.org/linux-mm/7C49DFF721CB4E671DB260F9@%5B10.1.1.4%5D/T/#u
> > 
> > > My quick thoughts after skimming over the core parts of this series
> > > 
> > > (1) forgetting to break COW on a PTE in some pgtable walker feels quite
> > >      likely (meaning that it might be fairly error-prone) and forgetting
> > >      to break COW on a PTE table, accidentally modifying the shared
> > >      table.
> > 
> > Maybe I should also handle arch/ and other parts.
> > I will keep looking for what I missed.
> 
> One could add sanity checks when modifying a PTE while the PTE table is
> still marked shared ... but I guess there are some valid reasons where we
> might want to modify shared PTE tables (rmap).

Adding sanity checks sounds good; I will look at this.
One valid reason that comes to mind might be the
referenced bit (rmap).

> > 
> > > (2) break_cow_pte() can fail, which means that we can fail some
> > >      operations (possibly silently halfway through) now. For example,
> > >      looking at your change_pte_range() change, I suspect it's wrong.
> > 
> > Maybe I should add WARN_ON() and skip the failed COW PTE.
> 
> One way or the other we'll have to handle it. WARN_ON() sounds wrong for
> handling OOM situations (e.g., if only that cgroup is OOM).

Or we should do the same thing you mentioned:
"
For example, __split_huge_pmd() is currently not able to report a 
failure. I assume that we could sleep in there. And if we're not able to 
allocate any memory in there (with sleeping), maybe the process should 
be zapped either way by the OOM killer.
"

But instead of zapping the process, we just skip the failed COW PTE.
I don't think the user will expect their process to be killed by
changing the protection.
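
A sketch of that skip inside a change_pte_range()-style walk (loop and
locking context omitted; the break_cow_pte() signature is assumed):

	if (break_cow_pte(vma, pmd, addr) < 0)
		/* leave this PMD's protection unchanged
		 * rather than killing the task */
		continue;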

> > 
> > > (3) handle_cow_pte_fault() looks quite complicated and needs quite some
> > >      double-checking: we temporarily clear the PMD, to reset it
> > >      afterwards. I am not sure if that is correct. For example, what
> > >      stops another page fault stumbling over that pmd_none() and
> > >      allocating an empty page table? Maybe there are some locking details
> > >      missing or they are very subtle such that we better document them. I
> > >     recall that THP played quite some tricks to make such cases work ...
> > 
> > I think that holding mmap_write_lock may be enough (I added
> > mmap_assert_write_locked() in the fault function btw). But, I might
> > be wrong. I will look at the THP stuff to see how they work. Thanks.
> > 
> 
> Ehm, but page faults don't hold the mmap lock writable? And neither do
> other callers, like MADV_DONTNEED or MADV_FREE.
> 
> handle_pte_fault()->handle_cow_pte_fault()->mmap_assert_write_locked()
> should bail out.
> 
> Either I am missing something or you didn't test with lockdep enabled :)

You're right. I thought I had enabled lockdep.
And somehow I had it in my mind that the page fault path takes the mmap
lock writable. The page fault holds the mmap lock readable, not writable.
;-)

I should check/test all the locks again.
Thanks.

> 
> Note that there are upstream efforts to use only a VMA lock (and some people
> even want to perform some page faults only protected by RCU).

I saw the discussion (https://lwn.net/Articles/906852/) before.
If the page fault handler only uses a VMA lock, handle_cow_pte_fault() might not
be affected since it only takes one VMA at a time. handle_cow_pte_fault() just
allocates the PTE table and copies the COW mapping entries to the new one.
The checking and accounting are already handled in copy_cow_pte_range().
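
As a rough sketch (not the series code) of what that unsharing amounts
to, with copy_pte_entries() and cow_pte_table_put() as hypothetical
helpers:

static int break_cow_pte_sketch(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr)
{
	pgtable_t new = pte_alloc_one(mm);	/* private PTE table */

	if (!new)
		return -ENOMEM;
	copy_pte_entries(new, pmd);	/* hypothetical: duplicate entries */
	cow_pte_table_put(pmd);		/* hypothetical: drop the shared ref */
	pmd_populate(mm, pmd, new);	/* install the private copy */
	return 0;
}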

But if we decide to skip the per-page operation during fork(), we
should handle the VMA lock (or RCU) for the accounting and other
stuff. It might be more complicated than before...

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 17:03               ` David Hildenbrand
@ 2023-02-14 17:56                 ` Chih-En Lin
  0 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-14 17:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pasha Tatashin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 06:03:58PM +0100, David Hildenbrand wrote:
> On 14.02.23 17:58, David Hildenbrand wrote:
> > 
> > > > > 
> > > > > Honestly, for improving the fork(), I have an idea to skip the per-page
> > > > > operation without breaking the logic. However, this will introduce the
> > > > > complicated mechanism and may have overhead for other features. It
> > > > > might not be worth it. It's hard to strike a balance between an
> > > > > over-complicated mechanism with (probably) better performance and data
> > > > > consistency with the page status. So, I would focus on a safe and
> > > > > stable approach first.
> > > > 
> > > > Yes, it is most probably possible, but complexity, robustness and
> > > > maintainability have to be considered as well.
> > > > 
> > > > Thanks for implementing this approach (only deduplication without other
> > > > optimizations) and evaluating it accordingly. It's certainly "cleaner", such
> > > > that we only have to mess with unsharing and not with other
> > > > accounting/pinning/mapcount thingies. But it also highlights how intrusive
> > > > even this basic deduplication approach already is -- and that most benefits
> > > > of the original approach require even more complexity on top.
> > > > 
> > > > I am not quite sure if the benefit is worth the price (I am not to decide
> > > > and I would like to hear other options).
> > > 
> > > I'm looking at the discussion of page table sharing in 2002 [1].
> > > It looks like in 2002 ~ 2006 there were also some patches trying to
> > > improve fork().
> > > 
> > > After that, I also saw one thread which is about another shared page
> > > table patch's benchmark. I can't find the original patch though [2].
> > > But I found what is probably the same patch in 2005 [3]; it also
> > > mentioned the previous benchmark discussion:
> > > 
> > > "
> > > For those familiar with the shared page table patch I did a couple of years
> > > ago, this patch does not implement copy-on-write page tables for private
> > > mappings.  Analysis showed the cost and complexity far outweighed any
> > > potential benefit.
> > > "
> > 
> > Thanks for the pointer, interesting read. And my personal opinion is
> > that part of that statement still holds true :)
> > 
> > > 
> > > However, it might be different right now. For example, the
> > > implementation: we have the split page table lock now, so we don't
> > > have to consider the page_table_share_lock thing. Also, presently, we
> > > have different use cases (shells [2] vs. VM cloning and fuzzing) to
> > > consider.
> 
> 
> Oh, and because I stumbled over it, just as an interesting pointer on QEMU
> devel:
> 
> "[PATCH 00/10] Retire Fork-Based Fuzzing" [1]
> 
> [1] https://lore.kernel.org/all/20230205042951.3570008-1-alxndr@bu.edu/T/#u

Thanks for the information.
It's interesting.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 17:54               ` Chih-En Lin
@ 2023-02-14 17:59                 ` David Hildenbrand
  2023-02-14 19:06                   ` Chih-En Lin
  0 siblings, 1 reply; 37+ messages in thread
From: David Hildenbrand @ 2023-02-14 17:59 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Pasha Tatashin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On 14.02.23 18:54, Chih-En Lin wrote:
>>>
>>>> (2) break_cow_pte() can fail, which means that we can fail some
>>>>       operations (possibly silently halfway through) now. For example,
>>>>       looking at your change_pte_range() change, I suspect it's wrong.
>>>
>>> Maybe I should add WARN_ON() and skip the failed COW PTE.
>>
>> One way or the other we'll have to handle it. WARN_ON() sounds wrong for
>> handling OOM situations (e.g., if only that cgroup is OOM).
> 
> Or we should do the same thing you mentioned:
> "
> For example, __split_huge_pmd() is currently not able to report a
> failure. I assume that we could sleep in there. And if we're not able to
> allocate any memory in there (with sleeping), maybe the process should
> be zapped either way by the OOM killer.
> "
> 
> But instead of zapping the process, we just skip the failed COW PTE.
> I don't think the user will expect their process to be killed by
> changing the protection.

The process is consuming more memory than it is allowed to consume.
The process most probably would have died earlier without the PTE 
optimization.

But yeah, it all gets tricky ...

> 
>>>
>>>> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
>>>>       double-checking: we temporarily clear the PMD, to reset it
>>>>       afterwards. I am not sure if that is correct. For example, what
>>>>       stops another page fault stumbling over that pmd_none() and
>>>>       allocating an empty page table? Maybe there are some locking details
>>>>       missing or they are very subtle such that we better document them. I
>>>>      recall that THP played quite some tricks to make such cases work ...
>>>
>>> I think that holding mmap_write_lock may be enough (I added
>>> mmap_assert_write_locked() in the fault function btw). But, I might
>>> be wrong. I will look at the THP stuff to see how they work. Thanks.
>>>
>>
>> Ehm, but page faults don't hold the mmap lock writable? And neither do
>> other callers, like MADV_DONTNEED or MADV_FREE.
>>
>> handle_pte_fault()->handle_cow_pte_fault()->mmap_assert_write_locked()
>> should bail out.
>>
>> Either I am missing something or you didn't test with lockdep enabled :)
> 
> You're right. I thought I had enabled lockdep.
> And somehow I had it in my mind that the page fault path takes the mmap
> lock writable. The page fault holds the mmap lock readable, not writable.
> ;-)
> 
> I should check/test all the locks again.
> Thanks.

Note that we have other ways of traversing page tables, especially, 
using the rmap which does not hold the mmap lock. Not sure if there are 
similar issues when suddenly finding no page table where there logically 
should be one. Or when a page table gets replaced and modified, while 
rmap code still walks the shared copy. Hm.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 17:39             ` David Hildenbrand
@ 2023-02-14 18:25               ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2023-02-14 18:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chih-En Lin, Pasha Tatashin, Andrew Morton, Qi Zheng,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 9:39 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.02.23 18:23, Yang Shi wrote:
> > On Tue, Feb 14, 2023 at 1:58 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 10.02.23 18:20, Chih-En Lin wrote:
> >>> On Fri, Feb 10, 2023 at 11:21:16AM -0500, Pasha Tatashin wrote:
> >>>>>>> Currently, copy-on-write is only used for the mapped memory; the child
> >>>>>>> process still needs to copy the entire page table from the parent
> >>>>>>> process during forking. The parent process might take a lot of time and
> >>>>>>> memory to copy the page table when the parent has a big page table
> >>>>>>> allocated. For example, the memory usage of a process after forking with
> >>>>>>> 1 GB mapped memory is as follows:
> >>>>>>
> >>>>>> For some reason, I was not able to reproduce performance improvements
> >>>>>> with a simple fork() performance measurement program. The results that
> >>>>>> I saw are the following:
> >>>>>>
> >>>>>> Base:
> >>>>>> Fork latency per gigabyte: 0.004416 seconds
> >>>>>> Fork latency per gigabyte: 0.004382 seconds
> >>>>>> Fork latency per gigabyte: 0.004442 seconds
> >>>>>> COW kernel:
> >>>>>> Fork latency per gigabyte: 0.004524 seconds
> >>>>>> Fork latency per gigabyte: 0.004764 seconds
> >>>>>> Fork latency per gigabyte: 0.004547 seconds
> >>>>>>
> >>>>>> AMD EPYC 7B12 64-Core Processor
> >>>>>> Base:
> >>>>>> Fork latency per gigabyte: 0.003923 seconds
> >>>>>> Fork latency per gigabyte: 0.003909 seconds
> >>>>>> Fork latency per gigabyte: 0.003955 seconds
> >>>>>> COW kernel:
> >>>>>> Fork latency per gigabyte: 0.004221 seconds
> >>>>>> Fork latency per gigabyte: 0.003882 seconds
> >>>>>> Fork latency per gigabyte: 0.003854 seconds
> >>>>>>
> >>>>>> Given that the page table for the child is not copied, I was expecting the
> >>>>>> performance to be better with the COW kernel, and also not to depend on
> >>>>>> the size of the parent.
> >>>>>
> >>>>> Yes, the child won't duplicate the page table, but fork will still
> >>>>> traverse all the page table entries to do the accounting.
> >>>>> And, since this patch extends the COW to the PTE table level, it's not
> >>>>> mapped-page (page table entry) grained anymore, so we have to
> >>>>> guarantee that all the mapped pages are available to do COW mapping in
> >>>>> such a page table.
> >>>>> This kind of checking also costs some time.
> >>>>> As a result, because of the accounting and the checking, the COW PTE fork
> >>>>> still depends on the size of the parent so the improvement might not
> >>>>> be significant.
> >>>>
> >>>> The current version of the series does not provide any performance
> >>>> improvements for fork(). I would recommend removing claims from the
> >>>> cover letter about better fork() performance, as this may be
> >>>> misleading for those looking for a way to speed up forking. In my
> >>>
> >>>   From v3 to v4, I changed the implementation of the COW fork() part to do
> >>> the accounting and checking. At the time, I also removed most of the
> >>> descriptions about the better fork() performance. Maybe it's not enough
> >>> and is still somewhat misleading. I will fix this in the next version.
> >>> Thanks.
> >>>
> >>>> case, I was looking to speed up Redis OSS, which relies on fork() to
> >>>> create consistent snapshots for driving replicas/backups. The O(N)
> >>>> per-page operation causes fork() to be slow, so I was hoping that this
> >>>> series, which does not duplicate the VA during fork(), would make the
> >>>> operation much quicker.
> >>>
> >>> Indeed, at first, I tried to avoid the O(N) per-page operation by
> >>> deferring the accounting and the swap stuff to the page fault. But,
> >>> as I mentioned, it's not suitable for the mainline.
> >>>
> >>> Honestly, for improving the fork(), I have an idea to skip the per-page
> >>> operation without breaking the logic. However, this will introduce the
> >>> complicated mechanism and may have overhead for other features. It
> >>> might not be worth it. It's hard to strike a balance between an
> >>> over-complicated mechanism with (probably) better performance and data
> >>> consistency with the page status. So, I would focus on a safe and
> >>> stable approach first.
> >>
> >> Yes, it is most probably possible, but complexity, robustness and
> >> maintainability have to be considered as well.
> >>
> >> Thanks for implementing this approach (only deduplication without other
> >> optimizations) and evaluating it accordingly. It's certainly "cleaner",
> >> such that we only have to mess with unsharing and not with other
> >> accounting/pinning/mapcount thingies. But it also highlights how
> >> intrusive even this basic deduplication approach already is -- and that
> >> most benefits of the original approach require even more complexity on top.
> >>
> >> I am not quite sure if the benefit is worth the price (I am not to
> >> decide and I would like to hear other options).
> >>
> >> My quick thoughts after skimming over the core parts of this series
> >>
> >> (1) forgetting to break COW on a PTE in some pgtable walker feels quite
> >>       likely (meaning that it might be fairly error-prone) and forgetting
> >>       to break COW on a PTE table, accidentally modifying the shared
> >>       table.
> >> (2) break_cow_pte() can fail, which means that we can fail some
> >>       operations (possibly silently halfway through) now. For example,
> >>       looking at your change_pte_range() change, I suspect it's wrong.
> >> (3) handle_cow_pte_fault() looks quite complicated and needs quite some
> >>       double-checking: we temporarily clear the PMD, to reset it
> >>       afterwards. I am not sure if that is correct. For example, what
> >>       stops another page fault stumbling over that pmd_none() and
> >>       allocating an empty page table? Maybe there are some locking details
> >>       missing or they are very subtle such that we better document them. I
> >>      recall that THP played quite some tricks to make such cases work ...
> >>
> >>>
> >>>>> Actually, in RFC v1 and v2, we proposed a version that skipped
> >>>>> that work, and we got a significant improvement. You can see the
> >>>>> number from RFC v2 cover letter [1]:
> >>>>> "In short, with 512 MB mapped memory, COW PTE decreases latency by 93%
> >>>>> for normal fork"
> >>>>
> >>>> I suspect the 93% improvement (when the mapcount was not updated) was
> >>>> only for VAs with 4K pages. With 2M mappings this series did not
> >>>> provide any benefit, is this correct?
> >>>
> >>> Yes. In this case, the COW PTE performance is similar to the normal
> >>> fork().
> >>
> >>
> >> The thing with THP is that during fork(), we always allocate a backup
> >> PTE table, to be able to PTE-map the THP whenever we have to. Otherwise
> >> we'd have to eventually fail some operations we don't want to fail --
> >> similar to the case where break_cow_pte() could fail now due to -ENOMEM
> >> although we really don't want to fail (e.g., change_pte_range() ).
> >>
> >> I always considered that wasteful, because in many scenarios, we'll
> >> never ever split a THP and possibly waste memory.
> >
> > When you say "split THP", do you mean splitting the compound page into
> > base pages? IIUC the backup PTE table page is used to guarantee that the
> > PMD split (just converting a PMD-mapped THP to PTE-mapped, without
> > splitting the compound page) succeeds. You may have already noticed
> > there is no return value for PMD split.
>
> Yes, as I raised in my other reply.
>
> >
> > The PMD split may be called quite often, for example by MADV_DONTNEED,
> > mbind, mlock, and even in the memory reclamation context (THP swap).
>
> Yes, but with a single MADV_DONTNEED call you cannot PTE-map more than 2
> THP (all other overlapped THP will get zapped). Same with most other
> operations.

My point is that there may be multiple processes calling PMD split on
different THPs at the same time.

>
> There are corner cases, though. I recall that s390x/kvm wants to break
> all THP in a given VMA range. But that operation could safely fail if we
> can't do that.

I suppose that is the THP split (splitting the compound page); it may fail.

>
> Certainly needs some investigation, that's most probably why it hasn't
> been done yet.
>
> >
> >>
> >> Optimizing that for THP (e.g., don't always allocate backup THP, have
> >> some global allocation backup pool for splits + refill when
> >> close-to-empty) might provide similar fork() improvements, both in speed
> >> and memory consumption when it comes to anonymous memory.
> >
> > It might work. But it may be much more complicated than you think when
> > handling multiple parallel PMD splits.
>
>
> I consider the whole PTE-table linking to THPs complicated enough to
> eventually replace it by something differently complicated that wastes
> less memory ;)

Maybe...

>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 16:30             ` Pasha Tatashin
@ 2023-02-14 18:41               ` Chih-En Lin
  2023-02-14 18:52                 ` Pasha Tatashin
  0 siblings, 1 reply; 37+ messages in thread
From: Chih-En Lin @ 2023-02-14 18:41 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: David Hildenbrand, Andrew Morton, Qi Zheng,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 11:30:26AM -0500, Pasha Tatashin wrote:
> > > The thing with THP is that during fork(), we always allocate a backup PTE
> > > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd
> > > have to eventually fail some operations we don't want to fail -- similar to
> > > the case where break_cow_pte() could fail now due to -ENOMEM although we
> > > really don't want to fail (e.g., change_pte_range() ).
> > >
> > > I always considered that wasteful, because in many scenarios, we'll never
> > > ever split a THP and possibly waste memory.
> > >
> > > Optimizing that for THP (e.g., don't always allocate backup THP, have some
> > > global allocation backup pool for splits + refill when close-to-empty) might
> > > provide similar fork() improvements, both in speed and memory consumption
> > > when it comes to anonymous memory.
> >
> > When collapsing huge pages, do/can they reuse those PTEs for backup?
> > So, we don't have to allocate the PTE or maintain the pool.
> 
> It might not work for all pages: the range being collapsed might have
> had holes in the user page table, where no PTE tables existed.

So if there are holes in the user page table, will those holes be
filled after we do the collapsing and then the splitting? Assuming they
are, I think that's the reason it doesn't work for all the pages.

But after those operations, will the user get additional, unexpected
memory (from filling in the huge page)?

I'm a little bit confused now.

Thanks,
Chih-En Lin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 18:41               ` Chih-En Lin
@ 2023-02-14 18:52                 ` Pasha Tatashin
  2023-02-14 19:17                   ` Chih-En Lin
  0 siblings, 1 reply; 37+ messages in thread
From: Pasha Tatashin @ 2023-02-14 18:52 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: David Hildenbrand, Andrew Morton, Qi Zheng,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 1:42 PM Chih-En Lin <shiyn.lin@gmail.com> wrote:
>
> On Tue, Feb 14, 2023 at 11:30:26AM -0500, Pasha Tatashin wrote:
> > > > The thing with THP is that during fork(), we always allocate a backup PTE
> > > > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd
> > > > have to eventually fail some operations we don't want to fail -- similar to
> > > > the case where break_cow_pte() could fail now due to -ENOMEM although we
> > > > really don't want to fail (e.g., change_pte_range() ).
> > > >
> > > > I always considered that wasteful, because in many scenarios, we'll never
> > > > ever split a THP and possibly waste memory.
> > > >
> > > > Optimizing that for THP (e.g., don't always allocate backup THP, have some
> > > > global allocation backup pool for splits + refill when close-to-empty) might
> > > > provide similar fork() improvements, both in speed and memory consumption
> > > > when it comes to anonymous memory.
> > >
> > > When collapsing huge pages, do/can they reuse those PTEs for backup?
> > > So, we don't have to allocate the PTE or maintain the pool.
> >
> > It might not work for all pages: the range being collapsed might have
> > had holes in the user page table, where no PTE tables existed.
>
> So if there are holes in the user page table, and we then do the
> collapsing and later the splitting, do those holes get filled? Assuming
> they do, I think that's why it can't work for all the pages.
>
> But after those operations, will the user get additional, unexpected
> memory (from filling in the huge page)?

Yes, more memory is going to be allocated for a process in such a THP
collapse case. This is similar to madvised huge pages, where touching
the first byte may allocate 2M.
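
As a rough userspace illustration of that effect (a hedged sketch, not
part of this series; note that mmap() does not guarantee 2M alignment,
a real test would over-allocate and align), writing a single byte into
a madvised region can back the whole 2M:

	#include <sys/mman.h>

	#define SZ_2M (2UL << 20)

	int main(void)
	{
		/* Anonymous private mapping; we rely on transparent
		 * huge pages, not MAP_HUGETLB. */
		char *buf = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		madvise(buf, SZ_2M, MADV_HUGEPAGE);
		buf[0] = 1;	/* may fault in a full 2M THP */
		/* RSS can now grow by 2M although one byte was written. */
		munmap(buf, SZ_2M);
		return 0;
	}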

Pasha

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 17:59                 ` David Hildenbrand
@ 2023-02-14 19:06                   ` Chih-En Lin
  0 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-14 19:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pasha Tatashin, Andrew Morton, Qi Zheng, Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 06:59:50PM +0100, David Hildenbrand wrote:
> On 14.02.23 18:54, Chih-En Lin wrote:
> > > > 
> > > > > (2) break_cow_pte() can fail, which means that we can fail some
> > > > >       operations (possibly silently halfway through) now. For example,
> > > > >       looking at your change_pte_range() change, I suspect it's wrong.
> > > > 
> > > > Maybe I should add WARN_ON() and skip the failed COW PTE.
> > > 
> > > One way or the other we'll have to handle it. WARN_ON() sounds wrong for
> > > handling OOM situations (e.g., if only that cgroup is OOM).
> > 
> > Or we could do the same thing you mentioned:
> > "
> > For example, __split_huge_pmd() is currently not able to report a
> > failure. I assume that we could sleep in there. And if we're not able to
> > allocate any memory in there (with sleeping), maybe the process should
> > be zapped either way by the OOM killer.
> > "
> > 
> > But instead of zapping the process, we would just skip the failed COW
> > PTE. I don't think users expect their process to be killed just for
> > changing a mapping's protection.
> 
> The process is consuming more memory than it is capable of consuming. The
> process most probably would have died earlier without the PTE optimization.
> 
> But yeah, it all gets tricky ...
> 
> > 
> > > > 
> > > > > (3) handle_cow_pte_fault() looks quite complicated and needs quite some
> > > > >       double-checking: we temporarily clear the PMD, to reset it
> > > > >       afterwards. I am not sure if that is correct. For example, what
> > > > >       stops another page fault stumbling over that pmd_none() and
> > > > >       allocating an empty page table? Maybe there are some locking details
> > > > >       missing or they are very subtle such that we better document them. I
> > > > >      recall that THP played quite some tricks to make such cases work ...
> > > > 
> > > > I think that holding mmap_write_lock may be enough (I added
> > > > mmap_assert_write_locked() in the fault function, btw). But I might
> > > > be wrong. I will look at the THP stuff to see how it works. Thanks.
> > > > 
> > > 
> > > Ehm, but page faults don't hold the mmap lock writable? And neither do
> > > other callers, like MADV_DONTNEED or MADV_FREE.
> > > 
> > > handle_pte_fault()->handle_cow_pte_fault()->mmap_assert_write_locked()
> > > should bail out.
> > > 
> > > Either I am missing something or you didn't test with lockdep enabled :)
> > 
> > You're right. I thought I had lockdep enabled. I don't know why I had
> > it in my mind that the page fault path takes the mmap lock for
> > writing; it actually holds it for reading. ;-)
> > 
> > I should check/test all the locks again.
> > Thanks.
> 
> Note that we have other ways of traversing page tables, especially using
> the rmap, which does not hold the mmap lock. Not sure if there are similar
> issues when suddenly finding no page table where there logically should be
> one. Or when a page table gets replaced and modified, while rmap code still
> walks the shared copy. Hm.

It seems I should handle the page table entries more carefully when a
page fault races with an rmap walk. ;)
While the rmap code walks the page table, it holds the pt lock.
So maybe I should hold the old (shared) PTE table's lock in
handle_cow_pte_fault() the whole time.
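
Roughly like this (just an untested sketch against this series, with
the entry copying left as comments; none of this is final code):

	/*
	 * Sketch only: keep the shared PTE table's split ptl held
	 * across the whole copy, so a concurrent rmap walk never
	 * observes pmd_none() or a half-filled table.
	 */
	static vm_fault_t handle_cow_pte_fault(struct vm_fault *vmf)
	{
		struct mm_struct *mm = vmf->vma->vm_mm;
		pmd_t *pmd = vmf->pmd;
		spinlock_t *ptl = pte_lockptr(mm, pmd);
		pgtable_t new;

		new = pte_alloc_one(mm);
		if (!new)
			return VM_FAULT_OOM;

		spin_lock(ptl);
		/* ... copy entries from the shared table into 'new',
		 * then atomically install 'new' into the PMD without
		 * ever clearing it to pmd_none() in between ... */
		spin_unlock(ptl);
		return 0;
	}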

Thanks,
Chih-En Lin

* Re: [PATCH v4 00/14] Introduce Copy-On-Write to Page Table
  2023-02-14 18:52                 ` Pasha Tatashin
@ 2023-02-14 19:17                   ` Chih-En Lin
  0 siblings, 0 replies; 37+ messages in thread
From: Chih-En Lin @ 2023-02-14 19:17 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: David Hildenbrand, Andrew Morton, Qi Zheng,
	Matthew Wilcox (Oracle),
	Christophe Leroy, John Hubbard, Nadav Amit, Barry Song,
	Steven Rostedt, Masami Hiramatsu, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Yang Shi, Peter Xu, Vlastimil Babka,
	Zach O'Keefe, Yun Zhou, Hugh Dickins, Suren Baghdasaryan,
	Yu Zhao, Juergen Gross, Tong Tiangen, Liu Shixin,
	Anshuman Khandual, Li kunyu, Minchan Kim, Miaohe Lin,
	Gautam Menghani, Catalin Marinas, Mark Brown, Will Deacon,
	Vincenzo Frascino, Thomas Gleixner, Eric W. Biederman,
	Andy Lutomirski, Sebastian Andrzej Siewior, Liam R. Howlett,
	Fenghua Yu, Andrei Vagin, Barret Rhoden, Michal Hocko,
	Jason A. Donenfeld, Alexey Gladkov, linux-kernel, linux-fsdevel,
	linux-mm, linux-trace-kernel, linux-perf-users, Dinglan Peng,
	Pedro Fonseca, Jim Huang, Huichun Feng

On Tue, Feb 14, 2023 at 01:52:16PM -0500, Pasha Tatashin wrote:
> On Tue, Feb 14, 2023 at 1:42 PM Chih-En Lin <shiyn.lin@gmail.com> wrote:
> >
> > On Tue, Feb 14, 2023 at 11:30:26AM -0500, Pasha Tatashin wrote:
> > > > > The thing with THP is, that during fork(), we always allocate a backup PTE
> > > > > table, to be able to PTE-map the THP whenever we have to. Otherwise we'd
> > > > > have to eventually fail some operations we don't want to fail -- similar to
> > > > > the case where break_cow_pte() could fail now due to -ENOMEM although we
> > > > > really don't want to fail (e.g., change_pte_range() ).
> > > > >
> > > > > I always considered that wasteful, because in many scenarios, we'll never
> > > > > ever split a THP and possibly waste memory.
> > > > >
> > > > > Optimizing that for THP (e.g., don't always allocate backup THP, have some
> > > > > global allocation backup pool for splits + refill when close-to-empty) might
> > > > > provide similar fork() improvements, both in speed and memory consumption
> > > > > when it comes to anonymous memory.
> > > >
> > > > When collapsing huge pages, could they reuse those PTE tables as
> > > > the backup? Then we wouldn't have to allocate the PTE table or
> > > > maintain the pool.
> > >
> > > It might not work for all pages, as collapsing pages might have had
> > > holes in the user page table, and there were no PTE tables.
> >
> > So if there are holes in the user page table, and we then do the
> > collapsing and later the splitting, do those holes get filled? Assuming
> > they do, I think that's why it can't work for all the pages.
> >
> > But after those operations, will the user get additional, unexpected
> > memory (from filling in the huge page)?
> 
> Yes, more memory is going to be allocated for a process in such a THP
> collapse case. This is similar to madvised huge pages, where touching
> the first byte may allocate 2M.

Thanks for the explanation.
Yeah, it seems the reuse approach can't work for all the pages.

Thanks,
Chih-En Lin

end of thread

Thread overview: 37+ messages
2023-02-07  3:51 [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 01/14] mm: Allow user to control COW PTE via prctl Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 02/14] mm: Add Copy-On-Write PTE to fork() Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 03/14] mm: Add break COW PTE fault and helper functions Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 04/14] mm/rmap: Break COW PTE in rmap walking Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 05/14] mm/khugepaged: Break COW PTE before scanning pte Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 06/14] mm/ksm: Break COW PTE before modify shared PTE Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 07/14] mm/madvise: Handle COW-ed PTE with madvise() Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 08/14] mm/gup: Trigger break COW PTE before calling follow_pfn_pte() Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 09/14] mm/mprotect: Break COW PTE before changing protection Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 10/14] mm/userfaultfd: Support COW PTE Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 11/14] mm/migrate_device: " Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 12/14] fs/proc: Support COW PTE with clear_refs_write Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 13/14] events/uprobes: Break COW PTE before replacing page Chih-En Lin
2023-02-07  3:51 ` [PATCH v4 14/14] mm: fork: Enable COW PTE to fork system call Chih-En Lin
2023-02-09 18:15 ` [PATCH v4 00/14] Introduce Copy-On-Write to Page Table Pasha Tatashin
2023-02-10  2:17   ` Chih-En Lin
2023-02-10 16:21     ` Pasha Tatashin
2023-02-10 17:20       ` Chih-En Lin
2023-02-10 19:02         ` Chih-En Lin
2023-02-14  9:58         ` David Hildenbrand
2023-02-14 13:07           ` Pasha Tatashin
2023-02-14 13:17             ` David Hildenbrand
2023-02-14 15:59           ` Chih-En Lin
2023-02-14 16:30             ` Pasha Tatashin
2023-02-14 18:41               ` Chih-En Lin
2023-02-14 18:52                 ` Pasha Tatashin
2023-02-14 19:17                   ` Chih-En Lin
2023-02-14 16:58             ` David Hildenbrand
2023-02-14 17:03               ` David Hildenbrand
2023-02-14 17:56                 ` Chih-En Lin
2023-02-14 17:54               ` Chih-En Lin
2023-02-14 17:59                 ` David Hildenbrand
2023-02-14 19:06                   ` Chih-En Lin
2023-02-14 17:23           ` Yang Shi
2023-02-14 17:39             ` David Hildenbrand
2023-02-14 18:25               ` Yang Shi
