* [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
@ 2022-05-19 18:31 Chih-En Lin
  2022-05-19 18:31 ` [RFC PATCH 1/6] mm: Add a new mm flag for Copy-On-Write PTE table Chih-En Lin
                   ` (7 more replies)
  0 siblings, 8 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Chih-En Lin, Colin Cross,
	Feng Tang, Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

When creating a user process, the kernel usually uses the Copy-On-Write
(COW) mechanism to save memory and the time spent on copying. COW defers
the copying of private memory and shares it across processes as
read-only. If either process then wants to write to this memory, it
takes a page fault and copies the shared memory, so the process gets its
own private copy at that point; this is called breaking COW.

Presently this technique is only applied to the mapped memory itself;
fork() still copies the entire page table from the parent. Copying every
page table can cost a lot of time and memory when the parent already has
many page tables allocated. For example, here are the memory statistics
for forking a process that maps 1 GB of memory.

	    mmap before fork         mmap after fork
MemTotal:       32746776 kB             32746776 kB
MemFree:        31468152 kB             31463244 kB
AnonPages:       1073836 kB              1073628 kB
Mapped:            39520 kB                39992 kB
PageTables:         3356 kB                 5432 kB

This series introduces Copy-On-Write for the page table and only
implements COW at the PTE level. It is based on the paper On-Demand
Fork [1]. Summary of the paper's implementation:

- COW is applied only to anonymous mappings.
- COW is applied only to PTE tables whose range is entirely covered by a
  single VMA.
- A reference count controls the lifetime of the COW PTE table. The
  counter is decreased when breaking COW or when dereferencing the COW
  PTE table; when it drops to zero, the PTE table is freed.

The paper is based on v5.6, while this series targets v5.18-rc6, and it
differs from the paper's version in a few ways. To reduce the work of
duplicating page tables, I relaxed the restriction on which tables may
be COWed: excluding brk and shared memory, COW is applied to all the PTE
tables. When the reference count is one, we reuse the table instead of
copying it while breaking COW. To track the page table state of each
process, ownership is added to the COW PTE table: the address of the PMD
entry is used as the owner of the PTE table, so that the RSS and
pgtable_bytes accounting for the COW PTE table stays correct.
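
When a write later hits a write-protected PMD entry, the COW is broken.
A condensed sketch of that path, drawn from break_cow_pte() in patch 6
of this series (locking, error handling, and the ownership bookkeeping
are omitted for brevity):

	/* Break COW on a COW PTE table (illustrative sketch only). */
	if (cow_pte_refcount_read(pmd) == 1) {
		/* Last user: reuse the table and make the PMD writable. */
		cow_pte_fallback(vma, pmd, addr);
	} else {
		pmd_t cowed_entry = *pmd;

		/* Copy the shared PTE table into a fresh, private one. */
		pmd_clear(pmd);
		copy_pte_range(vma, vma, pmd, &cowed_entry,
			       addr & PMD_MASK, (addr + PMD_SIZE) & PMD_MASK,
			       true);
		/* Drop this process's reference on the shared table. */
		pmd_put_pte(vma, &cowed_entry, addr);
	}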

If we simply applied COW to a PTE table every time we touch its PMD
entry, the reference count of the COW PTE table could not be kept
correct. Since the address ranges of VMAs may overlap within one PTE
table, the copying function walks the page table per VMA, so it could
increase the reference count of the same COW PTE table multiple times
during a single COW page table fork, whereas it should only be increased
once, when the child takes its reference. To solve this, before doing
the COW it checks whether the destination PMD entry already exists and
whether the reference count of the source PTE table is already greater
than one.
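
A condensed sketch of that per-PMD check, drawn from the
copy_pmd_range() changes in patch 6 (the ownership bookkeeping is
omitted for brevity):

	/* Inside the PMD walk of copy_pmd_range() (illustrative only). */
	if (test_bit(MMF_COW_PGTABLE, &src_mm->flags)) {
		if (pmd_none(*src_pmd))
			continue;
		/*
		 * Another VMA already made this PTE table COW during this
		 * fork; the child holds its reference, do not take it again.
		 */
		if (!pmd_none(*dst_pmd) && cow_pte_refcount_read(src_pmd) > 1)
			continue;
		/* Share the table: write-protect it, take the child's
		 * reference, and point the child's PMD at the same table.
		 */
		pmdp_set_wrprotect(src_mm, addr, src_pmd);
		pmd_get_pte(src_pmd);
		set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
	}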

Here are the same statistics with this series applied, again forking a
process that maps 1 GB of memory.

            mmap before fork         mmap after fork
MemTotal:       32746776 kB             32746776 kB
MemFree:        31471324 kB             31468888 kB
AnonPages:       1073628 kB              1073660 kB
Mapped:            39264 kB                39504 kB
PageTables:         3304 kB                 3396 kB

TODO list:
- Handle swap entries.
- Rewrite the TLB flush for zapping the COW PTE table.
- Experiment with COW for the entire page table (currently PTE level only).
- Fix a bug in some cases where copy_pte_range() triggers
  vm_normal_page()'s print_bad_pte().
- Fix the "Bad RSS counter" bug when the COW PTE table is forked
  multiple times.

[1] https://dl.acm.org/doi/10.1145/3447786.3456258

This patch is based on v5.18-rc6.
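
For reference, here is a minimal user-space sketch of how a process
could opt in to the COW page table through clone3() with the
CLONE_COW_PGTABLE flag added in patch 2. This is illustrative only and
assumes headers new enough to provide SYS_clone3 and struct clone_args;
the flag value is taken from this series:

	#define _GNU_SOURCE
	#include <linux/sched.h>	/* struct clone_args */
	#include <sys/syscall.h>
	#include <signal.h>
	#include <string.h>
	#include <stdio.h>
	#include <unistd.h>

	#ifndef CLONE_COW_PGTABLE
	#define CLONE_COW_PGTABLE 0x400000000ULL	/* from patch 2 */
	#endif

	int main(void)
	{
		struct clone_args args;
		long pid;

		memset(&args, 0, sizeof(args));
		args.flags = CLONE_COW_PGTABLE;	/* request COW page tables */
		args.exit_signal = SIGCHLD;

		pid = syscall(SYS_clone3, &args, sizeof(args));
		if (pid < 0) {
			perror("clone3");
			return 1;
		}
		if (pid == 0) {
			/* Child: PTE tables stay shared until a write breaks COW. */
			_exit(0);
		}
		printf("forked child %ld with COW page tables\n", pid);
		return 0;
	}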

---

Chih-En Lin (6):
  mm: Add a new mm flag for Copy-On-Write PTE table
  mm: clone3: Add CLONE_COW_PGTABLE flag
  mm, pgtable: Add ownership for the PTE table
  mm: Add COW PTE fallback function
  mm, pgtable: Add the reference counter for COW PTE
  mm: Expand Copy-On-Write to PTE table

 include/linux/mm.h             |   2 +
 include/linux/mm_types.h       |   2 +
 include/linux/pgtable.h        |  44 +++++
 include/linux/sched/coredump.h |   5 +-
 include/uapi/linux/sched.h     |   1 +
 kernel/fork.c                  |   6 +-
 mm/memory.c                    | 329 ++++++++++++++++++++++++++++++---
 mm/mmap.c                      |   4 +
 mm/mremap.c                    |   5 +
 9 files changed, 373 insertions(+), 25 deletions(-)

-- 
2.36.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 1/6] mm: Add a new mm flag for Copy-On-Write PTE table
  2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
@ 2022-05-19 18:31 ` Chih-En Lin
  2022-05-19 18:31 ` [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag Chih-En Lin
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Chih-En Lin, Colin Cross,
	Feng Tang, Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

Add the MMF_COW_PGTABLE flag to prepare for the subsequent
implementation of copy-on-write for the page table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/sched/coredump.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 4d9e3a656875..19e9f2b71398 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -83,7 +83,10 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_PINNED		28	/* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK	(1 << MMF_DISABLE_THP)
 
+#define MMF_COW_PGTABLE		29
+#define MMF_COW_PGTABLE_MASK	(1 << MMF_COW_PGTABLE)
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-				 MMF_DISABLE_THP_MASK)
+				 MMF_DISABLE_THP_MASK | MMF_COW_PGTABLE_MASK)
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag
  2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
  2022-05-19 18:31 ` [RFC PATCH 1/6] mm: Add a new mm flag for Copy-On-Write PTE table Chih-En Lin
@ 2022-05-19 18:31 ` Chih-En Lin
  2022-05-20 14:13   ` Christophe Leroy
  2022-05-19 18:31 ` [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table Chih-En Lin
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Chih-En Lin @ 2022-05-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Chih-En Lin, Colin Cross,
	Feng Tang, Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

Add the CLONE_COW_PGTABLE flag to the clone3() system call so that
callers can enable the Copy-On-Write (COW) mechanism for the page table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/uapi/linux/sched.h | 1 +
 kernel/fork.c              | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..3b92ff589e0f 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -36,6 +36,7 @@
 /* Flags for the clone3() syscall. */
 #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
 #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
+#define CLONE_COW_PGTABLE 0x400000000ULL /* Copy-On-Write for page table */
 
 /*
  * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
diff --git a/kernel/fork.c b/kernel/fork.c
index 35a3beff140b..08cf95201333 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2636,6 +2636,9 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 			trace = 0;
 	}
 
+	if (clone_flags & CLONE_COW_PGTABLE)
+		set_bit(MMF_COW_PGTABLE, &current->mm->flags);
+
 	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();
 
@@ -2860,7 +2863,8 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
 	if (kargs->flags &
-	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP))
+	    ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
+		    CLONE_COW_PGTABLE))
 		return false;
 
 	/*
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table
  2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
  2022-05-19 18:31 ` [RFC PATCH 1/6] mm: Add a new mm flag for Copy-On-Write PTE table Chih-En Lin
  2022-05-19 18:31 ` [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag Chih-En Lin
@ 2022-05-19 18:31 ` Chih-En Lin
  2022-05-19 23:07   ` kernel test robot
                     ` (3 more replies)
  2022-05-19 18:31 ` [RFC PATCH 4/6] mm: Add COW PTE fallback function Chih-En Lin
                   ` (4 subsequent siblings)
  7 siblings, 4 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Chih-En Lin, Colin Cross,
	Feng Tang, Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

Introduce ownership for the PTE table to prepare for the following
Copy-On-Write (COW) page table patches. The address of the PMD entry is
used as the owner, identifying which process is allowed to update its
page table state for the COW page table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h |  1 +
 include/linux/pgtable.h  | 14 ++++++++++++++
 3 files changed, 16 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9f44254af8ce..221926a3d818 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2328,6 +2328,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 		return false;
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
+	page->cow_pte_owner = NULL;
 	return true;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..5dcbd7f6c361 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -221,6 +221,7 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+	pmd_t *cow_pte_owner; /* cow pte: pmd */
 } _struct_page_alignment;
 
 /**
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..faca57af332e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -590,6 +590,20 @@ static inline int pte_unused(pte_t pte)
 }
 #endif
 
+static inline bool set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
+{
+	struct page *page = pmd_page(*pmd);
+
+	smp_store_release(&page->cow_pte_owner, owner);
+	return true;
+}
+
+static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
+{
+	return (smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner) ?
+		true : false;
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH 4/6] mm: Add COW PTE fallback function
  2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (2 preceding siblings ...)
  2022-05-19 18:31 ` [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table Chih-En Lin
@ 2022-05-19 18:31 ` Chih-En Lin
  2022-05-20  0:20   ` kernel test robot
  2022-05-20 14:21   ` Christophe Leroy
  2022-05-19 18:31 ` [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE Chih-En Lin
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Chih-En Lin, Colin Cross,
	Feng Tang, Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

The lifetime of a COW PTE table is handled by ownership and a reference
count. When a process wants to write to a COW PTE table whose reference
count is 1, it reuses the COW PTE table instead of copying it and then
freeing it.

Only the owner updates its RSS state and the page table bytes
accounting, so we need to handle the case where a non-owner process ends
up with the fallback COW PTE table.

This commit prepares for the following implementation of the reference
count for the COW PTE table.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 mm/memory.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 76e3af9639d9..dcb678cbb051 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1000,6 +1000,34 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
 	return new_page;
 }
 
+static inline void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
+	pmd_t *pmdp, unsigned long addr, unsigned long end, bool inc_dec)
+{
+	int rss[NR_MM_COUNTERS];
+	pte_t *orig_ptep, *ptep;
+	struct page *page;
+
+	init_rss_vec(rss);
+
+	ptep = pte_offset_map(pmdp, addr);
+	orig_ptep = ptep;
+	arch_enter_lazy_mmu_mode();
+	do {
+		if (pte_none(*ptep) || pte_special(*ptep))
+			continue;
+
+		page = vm_normal_page(vma, addr, *ptep);
+		if (page) {
+			if (inc_dec)
+				rss[mm_counter(page)]++;
+			else
+				rss[mm_counter(page)]--;
+		}
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
+	arch_leave_lazy_mmu_mode();
+	add_mm_rss_vec(mm, rss);
+}
+
 static int
 copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -4554,6 +4582,44 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 	return VM_FAULT_FALLBACK;
 }
 
+/* COW PTE fallback to normal PTE:
+ * - two state here
+ *   - After break child :   [parent, rss=1, ref=1, write=NO , owner=parent]
+ *                        to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
+ *   - After break parent:   [child , rss=0, ref=1, write=NO , owner=NULL  ]
+ *                        to [child , rss=1, ref=1, write=YES, owner=NULL  ]
+ */
+void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long start, end;
+	pmd_t new;
+
+	BUG_ON(pmd_write(*pmd));
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	/* If pmd is not owner, it needs to increase the rss.
+	 * Since only the owner has the RSS state for the COW PTE.
+	 */
+	if (!cow_pte_owner_is_same(pmd, pmd)) {
+		cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
+		mm_inc_nr_ptes(mm);
+		smp_wmb();
+		pmd_populate(mm, pmd, pmd_page(*pmd));
+	}
+
+	/* Reuse the pte page */
+	set_cow_pte_owner(pmd, NULL);
+	new = pmd_mkwrite(*pmd);
+	set_pmd_at(mm, addr, pmd, new);
+
+	BUG_ON(!pmd_write(*pmd));
+	BUG_ON(pmd_page(*pmd)->cow_pte_owner);
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE
  2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (3 preceding siblings ...)
  2022-05-19 18:31 ` [RFC PATCH 4/6] mm: Add COW PTE fallback function Chih-En Lin
@ 2022-05-19 18:31 ` Chih-En Lin
  2022-05-20 14:30   ` Christophe Leroy
  2022-05-21  4:08   ` Matthew Wilcox
  2022-05-19 18:31 ` [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table Chih-En Lin
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Chih-En Lin, Colin Cross,
	Feng Tang, Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

Add the reference counter cow_pgtable_refcount to track the number of
processes referencing a COW PTE table. Before decreasing the reference
count, it checks whether the counter is one; if so, the COW PTE table is
reused instead of being freed.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/mm.h       |  1 +
 include/linux/mm_types.h |  1 +
 include/linux/pgtable.h  | 27 +++++++++++++++++++++++++++
 mm/memory.c              |  1 +
 4 files changed, 30 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 221926a3d818..e48bb3fbc33c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2329,6 +2329,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
 	page->cow_pte_owner = NULL;
+	atomic_set(&page->cow_pgtable_refcount, 1);
 	return true;
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5dcbd7f6c361..984d81e47d53 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -221,6 +221,7 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
 	int _last_cpupid;
 #endif
+	atomic_t cow_pgtable_refcount; /* COW page table */
 	pmd_t *cow_pte_owner; /* cow pte: pmd */
 } _struct_page_alignment;
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index faca57af332e..33c01fec7b92 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -604,6 +604,33 @@ static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
 		true : false;
 }
 
+extern void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr);
+
+static inline int pmd_get_pte(pmd_t *pmd)
+{
+	return atomic_inc_return(&pmd_page(*pmd)->cow_pgtable_refcount);
+}
+
+/* If the COW PTE page->cow_pgtable_refcount is 1, instead of decreasing the
+ * counter, clear write protection of the corresponding PMD entry and reset
+ * the COW PTE owner to reuse the table.
+ */
+static inline int pmd_put_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	if (!atomic_add_unless(&pmd_page(*pmd)->cow_pgtable_refcount, -1, 1)) {
+		cow_pte_fallback(vma, pmd, addr);
+		return 1;
+	}
+	return 0;
+}
+
+static inline int cow_pte_refcount_read(pmd_t *pmd)
+{
+	return atomic_read(&pmd_page(*pmd)->cow_pgtable_refcount);
+}
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
diff --git a/mm/memory.c b/mm/memory.c
index dcb678cbb051..aa66af76e214 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4597,6 +4597,7 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
 	pmd_t new;
 
 	BUG_ON(pmd_write(*pmd));
+	BUG_ON(cow_pte_refcount_read(pmd) != 1);
 
 	start = addr & PMD_MASK;
 	end = (addr + PMD_SIZE) & PMD_MASK;
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table
  2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (4 preceding siblings ...)
  2022-05-19 18:31 ` [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE Chih-En Lin
@ 2022-05-19 18:31 ` Chih-En Lin
  2022-05-20 14:49   ` Christophe Leroy
  2022-05-21  8:59 ` [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Qi Zheng
  2022-05-21 16:07 ` David Hildenbrand
  7 siblings, 1 reply; 35+ messages in thread
From: Chih-En Lin @ 2022-05-19 18:31 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Chih-En Lin, Colin Cross,
	Feng Tang, Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

This patch adds the Copy-On-Write (COW) mechanism to the PTE table. To
enable the COW page table, use the clone3() system call with the
CLONE_COW_PGTABLE flag, which sets the MMF_COW_PGTABLE flag on the
process.

The MMF_COW_PGTABLE flag distinguishes the default page table from the
COW one. Moreover, it is difficult to determine whether the entire page
table has left the COW state, so the MMF_COW_PGTABLE flag is never
cleared once it is set.

Since each process has its own page tables in kernel space, it uses the
address of the PMD entry as the owner of the PTE table to identify which
process needs to update the page table state. In other words, only the
owner updates the COW PTE state, such as the RSS and pgtable_bytes.

A reference count controls the lifetime of the COW PTE table. When
someone breaks COW, the COW PTE table is copied and the reference count
is decreased; but if the reference count equals one before breaking COW,
the COW PTE table is reused instead.

This patch modifies the page table copying path to do the basic COW, and
modifies the page fault, page table zapping, unmapping, and remapping
paths to break COW.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/pgtable.h |   3 +
 mm/memory.c             | 262 ++++++++++++++++++++++++++++++++++++----
 mm/mmap.c               |   4 +
 mm/mremap.c             |   5 +
 4 files changed, 251 insertions(+), 23 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 33c01fec7b92..357ce3722ee8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -631,6 +631,9 @@ static inline int cow_pte_refcount_read(pmd_t *pmd)
 	return atomic_read(&pmd_page(*pmd)->cow_pgtable_refcount);
 }
 
+extern int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, bool alloc);
+
 #ifndef pte_access_permitted
 #define pte_access_permitted(pte, write) \
 	(pte_present(pte) && (!(write) || pte_write(pte)))
diff --git a/mm/memory.c b/mm/memory.c
index aa66af76e214..ff3fcbe4dfb5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -247,6 +247,8 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
+		BUG_ON(cow_pte_refcount_read(pmd) != 1);
+		BUG_ON(!cow_pte_owner_is_same(pmd, NULL));
 		free_pte_range(tlb, pmd, addr);
 	} while (pmd++, addr = next, addr != end);
 
@@ -1031,7 +1033,7 @@ static inline void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
 static int
 copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
-	       unsigned long end)
+	       unsigned long end, bool is_src_pte_locked)
 {
 	struct mm_struct *dst_mm = dst_vma->vm_mm;
 	struct mm_struct *src_mm = src_vma->vm_mm;
@@ -1053,8 +1055,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		goto out;
 	}
 	src_pte = pte_offset_map(src_pmd, addr);
-	src_ptl = pte_lockptr(src_mm, src_pmd);
-	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+	if (!is_src_pte_locked) {
+		src_ptl = pte_lockptr(src_mm, src_pmd);
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+	}
 	orig_src_pte = src_pte;
 	orig_dst_pte = dst_pte;
 	arch_enter_lazy_mmu_mode();
@@ -1067,7 +1071,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		if (progress >= 32) {
 			progress = 0;
 			if (need_resched() ||
-			    spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
+			    (!is_src_pte_locked && spin_needbreak(src_ptl)) ||
+			    spin_needbreak(dst_ptl))
 				break;
 		}
 		if (pte_none(*src_pte)) {
@@ -1118,7 +1123,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
 
 	arch_leave_lazy_mmu_mode();
-	spin_unlock(src_ptl);
+	if (!is_src_pte_locked)
+		spin_unlock(src_ptl);
 	pte_unmap(orig_src_pte);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
@@ -1180,11 +1186,55 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 				continue;
 			/* fall through */
 		}
-		if (pmd_none_or_clear_bad(src_pmd))
-			continue;
-		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
-				   addr, next))
+
+		if (test_bit(MMF_COW_PGTABLE, &src_mm->flags)) {
+
+			 if (pmd_none(*src_pmd))
+				continue;
+
+			/* XXX: Skip if the PTE already COW this time. */
+			if (!pmd_none(*dst_pmd) &&
+			    cow_pte_refcount_read(src_pmd) > 1)
+				continue;
+
+			/* If PTE doesn't have an owner, the parent needs to
+			 * take this PTE.
+			 */
+			if (cow_pte_owner_is_same(src_pmd, NULL)) {
+				set_cow_pte_owner(src_pmd, src_pmd);
+				/* XXX: The process may COW PTE fork two times.
+				 * But in some situations, owner has cleared.
+				 * Previously Child (This time is the parent)
+				 * COW PTE forking, but previously parent, owner
+				 * , break COW. So it needs to add back the RSS
+				 * state and pgtable bytes.
+				 */
+				if (!pmd_write(*src_pmd)) {
+					unsigned long pte_start =
+						addr & PMD_MASK;
+					unsigned long pte_end =
+						(addr + PMD_SIZE) & PMD_MASK;
+					cow_pte_rss(src_mm, src_vma, src_pmd,
+					    pte_start, pte_end, true /* inc */);
+					mm_inc_nr_ptes(src_mm);
+					smp_wmb();
+					pmd_populate(src_mm, src_pmd,
+							pmd_page(*src_pmd));
+				}
+			}
+
+			pmdp_set_wrprotect(src_mm, addr, src_pmd);
+
+			/* Child reference count */
+			pmd_get_pte(src_pmd);
+
+			/* COW for PTE table */
+			set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
+		} else if (!pmd_none_or_clear_bad(src_pmd) &&
+			    copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
+				    addr, next, false)) {
 			return -ENOMEM;
+		}
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -1336,6 +1386,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 	bool even_cows;			/* Zap COWed private pages too? */
+	bool cow_pte;			/* Do not free COW PTE */
 };
 
 /* Whether we should zap all COWed (private) pages too */
@@ -1398,8 +1449,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page = vm_normal_page(vma, addr, ptent);
 			if (unlikely(!should_zap_page(details, page)))
 				continue;
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
+			if (!details || !details->cow_pte)
+				ptent = ptep_get_and_clear_full(mm, addr, pte,
+								tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1413,8 +1465,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				    likely(!(vma->vm_flags & VM_SEQ_READ)))
 					mark_page_accessed(page);
 			}
-			rss[mm_counter(page)]--;
-			page_remove_rmap(page, vma, false);
+			if (!details || !details->cow_pte) {
+				rss[mm_counter(page)]--;
+				page_remove_rmap(page, vma, false);
+			} else
+				continue;
 			if (unlikely(page_mapcount(page) < 0))
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1425,6 +1480,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			continue;
 		}
 
+		// TODO: Deal COW PTE with swap
+
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry) ||
 		    is_device_exclusive_entry(entry)) {
@@ -1513,16 +1570,34 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 			spin_unlock(ptl);
 		}
 
-		/*
-		 * Here there can be other concurrent MADV_DONTNEED or
-		 * trans huge page faults running, and if the pmd is
-		 * none or trans huge it can change under us. This is
-		 * because MADV_DONTNEED holds the mmap_lock in read
-		 * mode.
-		 */
-		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
-			goto next;
-		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+
+		if (test_bit(MMF_COW_PGTABLE, &tlb->mm->flags) &&
+		    !pmd_none(*pmd) && !pmd_write(*pmd)) {
+			struct zap_details cow_pte_details = {0};
+			if (details)
+				cow_pte_details = *details;
+			cow_pte_details.cow_pte = true;
+			/* Flush the TLB but do not free the COW PTE */
+			next = zap_pte_range(tlb, vma, pmd, addr,
+						next, &cow_pte_details);
+			if (details)
+				*details = cow_pte_details;
+			handle_cow_pte(vma, pmd, addr, false);
+		} else {
+			if (details)
+				details->cow_pte = false;
+			/*
+			 * Here there can be other concurrent MADV_DONTNEED or
+			 * trans huge page faults running, and if the pmd is
+			 * none or trans huge it can change under us. This is
+			 * because MADV_DONTNEED holds the mmap_lock in read
+			 * mode.
+			 */
+			if (pmd_none_or_trans_huge_or_clear_bad(pmd))
+				goto next;
+			next = zap_pte_range(tlb, vma, pmd, addr, next,
+					details);
+		}
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
@@ -4621,6 +4696,134 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
 	BUG_ON(pmd_page(*pmd)->cow_pte_owner);
 }
 
+/* Break COW PTE:
+ * - two state here
+ *   - After fork :   [parent, rss=1, ref=2, write=NO , owner=parent]
+ *                 to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
+ *                    COW PTE become [ref=1, write=NO , owner=NULL  ]
+ *                    [child , rss=0, ref=2, write=NO , owner=parent]
+ *                 to [child , rss=1, ref=1, write=YES, owner=NULL  ]
+ *                    COW PTE become [ref=1, write=NO , owner=parent]
+ *   NOTE
+ *     - Copy the COW PTE to new PTE.
+ *     - Clear the owner of COW PTE and set PMD entry writable when it is owner.
+ *     - Increase RSS if it is not owner.
+ */
+static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long start, end;
+	pmd_t cowed_entry = *pmd;
+
+	if (cow_pte_refcount_read(&cowed_entry) == 1) {
+		cow_pte_fallback(vma, pmd, addr);
+		return 1;
+	}
+
+	BUG_ON(pmd_write(cowed_entry));
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	pmd_clear(pmd);
+	if (copy_pte_range(vma, vma, pmd, &cowed_entry,
+				start, end, true))
+		return -ENOMEM;
+
+	/* Here, it is the owner, so clear the ownership. To keep RSS state and
+	 * page table bytes correct, it needs to decrease them.
+	 */
+	if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
+		set_cow_pte_owner(&cowed_entry, NULL);
+		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
+		mm_dec_nr_ptes(mm);
+	}
+
+	pmd_put_pte(vma, &cowed_entry, addr);
+
+	BUG_ON(!pmd_write(*pmd));
+	BUG_ON(cow_pte_refcount_read(pmd) != 1);
+
+	return 0;
+}
+
+static int zap_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long start, end;
+
+	if (pmd_put_pte(vma, pmd, addr)) {
+		// fallback
+		return 1;
+	}
+
+	start = addr & PMD_MASK;
+	end = (addr + PMD_SIZE) & PMD_MASK;
+
+	/* If PMD entry is owner, clear the ownership, and decrease RSS state
+	 * and pgtable_bytes.
+	 */
+	if (cow_pte_owner_is_same(pmd, pmd)) {
+		set_cow_pte_owner(pmd, NULL);
+		cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
+		mm_dec_nr_ptes(mm);
+	}
+
+	pmd_clear(pmd);
+	return 0;
+}
+
+/* If alloc set means it won't break COW. For this case, it will just decrease
+ * the reference count. The address needs to be at the beginning of the PTE page
+ * since COW PTE is copy-on-write the entire PTE.
+ * If pmd is NULL, it will get the pmd from vma and check it is cowing.
+ */
+int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+		unsigned long addr, bool alloc)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	struct mm_struct *mm = vma->vm_mm;
+	int ret = 0;
+	spinlock_t *ptl = NULL;
+
+	if (!pmd) {
+		pgd = pgd_offset(mm, addr);
+		if (pgd_none_or_clear_bad(pgd))
+			return 0;
+		p4d = p4d_offset(pgd, addr);
+		if (p4d_none_or_clear_bad(p4d))
+			return 0;
+		pud = pud_offset(p4d, addr);
+		if (pud_none_or_clear_bad(pud))
+			return 0;
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd) || pmd_write(*pmd))
+			return 0;
+	}
+
+	// TODO: handle COW PTE with swap
+	BUG_ON(is_swap_pmd(*pmd));
+	BUG_ON(pmd_trans_huge(*pmd));
+	BUG_ON(pmd_devmap(*pmd));
+
+	BUG_ON(pmd_none(*pmd));
+	BUG_ON(pmd_write(*pmd));
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+	if (!alloc)
+		ret = zap_cow_pte(vma, pmd, addr);
+	else
+		ret = break_cow_pte(vma, pmd, addr);
+	spin_unlock(ptl);
+
+	return ret;
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -4825,6 +5028,19 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 				return 0;
 			}
 		}
+
+		/* When the PMD entry is set with write protection, it needs to
+		 * handle the on-demand PTE. It will allocate a new PTE and copy
+		 * the old one, then set this entry writeable and decrease the
+		 * reference count at COW PTE.
+		 */
+		if (test_bit(MMF_COW_PGTABLE, &mm->flags) &&
+		    !pmd_none(vmf.orig_pmd) && !pmd_write(vmf.orig_pmd)) {
+			if (handle_cow_pte(vmf.vma, vmf.pmd, vmf.real_address,
+			   (cow_pte_refcount_read(&vmf.orig_pmd) > 1) ?
+			   true : false) < 0)
+				return VM_FAULT_OOM;
+		}
 	}
 
 	return handle_pte_fault(&vmf);
diff --git a/mm/mmap.c b/mm/mmap.c
index 313b57d55a63..e3a9c38e87e8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2709,6 +2709,10 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 			return err;
 	}
 
+	if (test_bit(MMF_COW_PGTABLE, &vma->vm_mm->flags) &&
+	    handle_cow_pte(vma, NULL, addr, true) < 0)
+		return -ENOMEM;
+
 	new = vm_area_dup(vma);
 	if (!new)
 		return -ENOMEM;
diff --git a/mm/mremap.c b/mm/mremap.c
index 303d3290b938..01aefdfc61b7 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,6 +532,11 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
+
+		if (test_bit(MMF_COW_PGTABLE, &vma->vm_mm->flags) &&
+		    !pmd_none(*old_pmd) && !pmd_write(*old_pmd))
+			handle_cow_pte(vma, old_pmd, old_addr, true);
+
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table
  2022-05-19 18:31 ` [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table Chih-En Lin
@ 2022-05-19 23:07   ` kernel test robot
  2022-05-20  0:08   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2022-05-19 23:07 UTC (permalink / raw)
  To: Chih-En Lin; +Cc: llvm, kbuild-all

Hi Chih-En,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on tip/sched/core]
[also build test ERROR on soc/for-next linus/master v5.18-rc7 next-20220519]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Chih-En-Lin/Introduce-Copy-On-Write-to-Page-Table/20220520-023243
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 734387ec2f9d77b00276042b1fa7c95f48ee879d
config: powerpc-microwatt_defconfig (https://download.01.org/0day-ci/archive/20220520/202205200722.h5TDQrQZ-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project e00cbbec06c08dc616a0d52a20f678b8fbd4e304)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install powerpc cross compiling tool for clang build
        # apt-get install binutils-powerpc-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/aa5b69eef6a0be734cd331cb3ab4172d854fb93c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chih-En-Lin/Introduce-Copy-On-Write-to-Page-Table/20220520-023243
        git checkout aa5b69eef6a0be734cd331cb3ab4172d854fb93c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=powerpc prepare

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   include/linux/signal.h:162:1: warning: array index 3 is past the end of the array (which contains 2 elements) [-Warray-bounds]
   _SIG_SET_BINOP(sigandnsets, _sig_andn)
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:139:3: note: expanded from macro '_SIG_SET_BINOP'
                   r->sig[3] = op(a3, b3);                                 \
                   ^      ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:162:1: warning: array index 2 is past the end of the array (which contains 2 elements) [-Warray-bounds]
   _SIG_SET_BINOP(sigandnsets, _sig_andn)
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:140:3: note: expanded from macro '_SIG_SET_BINOP'
                   r->sig[2] = op(a2, b2);                                 \
                   ^      ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:186:1: warning: array index 3 is past the end of the array (which contains 2 elements) [-Warray-bounds]
   _SIG_SET_OP(signotset, _sig_not)
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:173:27: note: expanded from macro '_SIG_SET_OP'
           case 4: set->sig[3] = op(set->sig[3]);                          \
                                    ^        ~
   include/linux/signal.h:185:24: note: expanded from macro '_sig_not'
   #define _sig_not(x)     (~(x))
                              ^
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:186:1: warning: array index 3 is past the end of the array (which contains 2 elements) [-Warray-bounds]
   _SIG_SET_OP(signotset, _sig_not)
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:173:10: note: expanded from macro '_SIG_SET_OP'
           case 4: set->sig[3] = op(set->sig[3]);                          \
                   ^        ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:186:1: warning: array index 2 is past the end of the array (which contains 2 elements) [-Warray-bounds]
   _SIG_SET_OP(signotset, _sig_not)
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:174:20: note: expanded from macro '_SIG_SET_OP'
                   set->sig[2] = op(set->sig[2]);                          \
                                    ^        ~
   include/linux/signal.h:185:24: note: expanded from macro '_sig_not'
   #define _sig_not(x)     (~(x))
                              ^
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:186:1: warning: array index 2 is past the end of the array (which contains 2 elements) [-Warray-bounds]
   _SIG_SET_OP(signotset, _sig_not)
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:174:3: note: expanded from macro '_SIG_SET_OP'
                   set->sig[2] = op(set->sig[2]);                          \
                   ^        ~
   arch/powerpc/include/uapi/asm/signal.h:18:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/powerpc/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:9:
   In file included from include/linux/sched/task.h:11:
   In file included from include/linux/uaccess.h:11:
   In file included from arch/powerpc/include/asm/uaccess.h:9:
   In file included from arch/powerpc/include/asm/kup.h:37:
>> include/linux/pgtable.h:603:59: error: invalid operands to binary expression ('void' and 'pmd_t *')
           return (smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner) ?
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^  ~~~~~
   In file included from arch/powerpc/kernel/asm-offsets.c:19:
   In file included from include/linux/mman.h:5:
   In file included from include/linux/mm.h:25:
>> include/linux/page_ref.h:89:32: error: no member named 'page' in 'struct folio'
           return page_ref_count(&folio->page);
                                  ~~~~~  ^
   include/linux/page_ref.h:106:25: error: no member named 'page' in 'struct folio'
           set_page_count(&folio->page, v);
                           ~~~~~  ^
   fatal error: too many errors emitted, stopping now [-ferror-limit=]
   28 warnings and 20 errors generated.
   make[2]: *** [scripts/Makefile.build:120: arch/powerpc/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [Makefile:1194: prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:219: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +603 include/linux/pgtable.h

   600	
   601	static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
   602	{
 > 603		return (smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner) ?
   604			true : false;
   605	}
   606	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table
  2022-05-19 18:31 ` [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table Chih-En Lin
  2022-05-19 23:07   ` kernel test robot
@ 2022-05-20  0:08   ` kernel test robot
  2022-05-20 14:15   ` Christophe Leroy
  2022-05-21  4:02   ` Matthew Wilcox
  3 siblings, 0 replies; 35+ messages in thread
From: kernel test robot @ 2022-05-20  0:08 UTC (permalink / raw)
  To: Chih-En Lin; +Cc: llvm, kbuild-all

Hi Chih-En,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on tip/sched/core]
[also build test ERROR on soc/for-next linus/master v5.18-rc7 next-20220519]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Chih-En-Lin/Introduce-Copy-On-Write-to-Page-Table/20220520-023243
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 734387ec2f9d77b00276042b1fa7c95f48ee879d
config: hexagon-randconfig-r041-20220519 (https://download.01.org/0day-ci/archive/20220520/202205200836.FzJQhYQg-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project e00cbbec06c08dc616a0d52a20f678b8fbd4e304)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/aa5b69eef6a0be734cd331cb3ab4172d854fb93c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chih-En-Lin/Introduce-Copy-On-Write-to-Page-Table/20220520-023243
        git checkout aa5b69eef6a0be734cd331cb3ab4172d854fb93c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon prepare

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:224:2: error: unknown type name 'pmd_t'
           pmd_t *cow_pte_owner; /* cow pte: pmd */
           ^
>> include/linux/mm_types.h:278:1: error: static_assert failed due to requirement 'sizeof(struct page) == sizeof(struct folio)' "sizeof(struct page) == sizeof(struct folio)"
   static_assert(sizeof(struct page) == sizeof(struct folio));
   ^             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:77:34: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                    ^               ~~~~
   include/linux/build_bug.h:78:41: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                           ^              ~~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:281:1: error: no member named 'flags' in 'folio'
   FOLIO_MATCH(flags, flags);
   ^                  ~~~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:282:1: error: no member named 'lru' in 'folio'
   FOLIO_MATCH(lru, lru);
   ^                ~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:283:1: error: no member named 'mapping' in 'folio'
   FOLIO_MATCH(mapping, mapping);
   ^                    ~~~~~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
   include/linux/mm_types.h:284:1: error: no member named 'lru' in 'folio'
   FOLIO_MATCH(compound_head, lru);
   ^                          ~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:285:1: error: no member named 'index' in 'folio'
   FOLIO_MATCH(index, index);
   ^                  ~~~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:286:1: error: no member named 'private' in 'folio'
   FOLIO_MATCH(private, private);
   ^                    ~~~~~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:287:1: error: no member named '_mapcount' in 'folio'
   FOLIO_MATCH(_mapcount, _mapcount);
   ^                      ~~~~~~~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:288:1: error: no member named '_refcount' in 'folio'
   FOLIO_MATCH(_refcount, _refcount);
   ^                      ~~~~~~~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:290:1: error: no member named 'memcg_data' in 'folio'
   FOLIO_MATCH(memcg_data, memcg_data);
   ^                       ~~~~~~~~~~
   include/linux/mm_types.h:280:45: note: expanded from macro 'FOLIO_MATCH'
           static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
                                                      ^                      ~~
   include/linux/stddef.h:16:32: note: expanded from macro 'offsetof'
   #define offsetof(TYPE, MEMBER)  __builtin_offsetof(TYPE, MEMBER)
                                   ^                        ~~~~~~
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
   #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
                                                    ^~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
   #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                                          ^~~~
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
   In file included from include/linux/uio.h:10:
>> include/linux/mm_types.h:296:30: error: no member named 'page' in 'struct folio'
           struct page *tail = &folio->page + 1;
                                ~~~~~  ^
>> include/linux/mm_types.h:333:16: error: no member named 'private' in 'struct folio'
           return folio->private;
                  ~~~~~  ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:15:
   In file included from include/linux/socket.h:8:
>> include/linux/uio.h:153:35: error: no member named 'page' in 'struct folio'
           return copy_page_to_iter(&folio->page, offset, bytes, i);
                                     ~~~~~  ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:13:
   In file included from include/linux/list_lru.h:14:
   In file included from include/linux/xarray.h:15:
   In file included from include/linux/gfp.h:6:
   In file included from include/linux/mmzone.h:22:
>> include/linux/page-flags.h:327:30: error: no member named 'page' in 'struct folio'
           struct page *page = &folio->page;
                                ~~~~~  ^
>> include/linux/page-flags.h:651:32: error: no member named 'mapping' in 'struct folio'
           return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
                                  ~~~~~  ^
   include/linux/page-flags.h:1049:34: error: no member named 'page' in 'struct folio'
           return page_has_private(&folio->page);
                                    ~~~~~  ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:97:11: warning: array index 3 is past the end of the array (which contains 2 elements) [-Warray-bounds]
                   return (set->sig[3] | set->sig[2] |
                           ^        ~
   include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:97:25: warning: array index 2 is past the end of the array (which contains 2 elements) [-Warray-bounds]
                   return (set->sig[3] | set->sig[2] |
                                         ^        ~
   include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:113:11: warning: array index 3 is past the end of the array (which contains 2 elements) [-Warray-bounds]
                   return  (set1->sig[3] == set2->sig[3]) &&
                            ^         ~
   include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:113:27: warning: array index 3 is past the end of the array (which contains 2 elements) [-Warray-bounds]
                   return  (set1->sig[3] == set2->sig[3]) &&
                                            ^         ~
   include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:114:5: warning: array index 2 is past the end of the array (which contains 2 elements) [-Warray-bounds]
                           (set1->sig[2] == set2->sig[2]) &&
                            ^         ~
   include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:114:21: warning: array index 2 is past the end of the array (which contains 2 elements) [-Warray-bounds]
                           (set1->sig[2] == set2->sig[2]) &&
                                            ^         ~
   include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:156:1: warning: array index 3 is past the end of the array (which contains 2 elements) [-Warray-bounds]
   _SIG_SET_BINOP(sigorsets, _sig_or)
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/signal.h:137:8: note: expanded from macro '_SIG_SET_BINOP'
                   a3 = a->sig[3]; a2 = a->sig[2];                         \
                        ^      ~
   include/uapi/asm-generic/signal.h:62:2: note: array 'sig' declared here
           unsigned long sig[_NSIG_WORDS];
           ^
   In file included from arch/hexagon/kernel/asm-offsets.c:12:
   In file included from include/linux/compat.h:17:
   In file included from include/linux/fs.h:33:
   In file included from include/linux/percpu-rwsem.h:7:
   In file included from include/linux/rcuwait.h:6:
   In file included from include/linux/sched/signal.h:6:
   include/linux/signal.h:156:1: warning: array index 2 is past the end of the array (which contains 2 elements) [-Warray-bounds]
   _SIG_SET_BINOP(sigorsets, _sig_or)


vim +/pmd_t +224 include/linux/mm_types.h

   220	
   221	#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
   222		int _last_cpupid;
   223	#endif
 > 224		pmd_t *cow_pte_owner; /* cow pte: pmd */
   225	} _struct_page_alignment;
   226	
   227	/**
   228	 * struct folio - Represents a contiguous set of bytes.
   229	 * @flags: Identical to the page flags.
   230	 * @lru: Least Recently Used list; tracks how recently this folio was used.
   231	 * @mapping: The file this page belongs to, or refers to the anon_vma for
   232	 *    anonymous memory.
   233	 * @index: Offset within the file, in units of pages.  For anonymous memory,
   234	 *    this is the index from the beginning of the mmap.
   235	 * @private: Filesystem per-folio data (see folio_attach_private()).
   236	 *    Used for swp_entry_t if folio_test_swapcache().
   237	 * @_mapcount: Do not access this member directly.  Use folio_mapcount() to
   238	 *    find out how many times this folio is mapped by userspace.
   239	 * @_refcount: Do not access this member directly.  Use folio_ref_count()
   240	 *    to find how many references there are to this folio.
   241	 * @memcg_data: Memory Control Group data.
   242	 *
   243	 * A folio is a physically, virtually and logically contiguous set
   244	 * of bytes.  It is a power-of-two in size, and it is aligned to that
   245	 * same power-of-two.  It is at least as large as %PAGE_SIZE.  If it is
   246	 * in the page cache, it is at a file offset which is a multiple of that
   247	 * power-of-two.  It may be mapped into userspace at an address which is
   248	 * at an arbitrary page offset, but its kernel virtual address is aligned
   249	 * to its size.
   250	 */
   251	struct folio {
   252		/* private: don't document the anon union */
   253		union {
   254			struct {
   255		/* public: */
   256				unsigned long flags;
   257				union {
   258					struct list_head lru;
   259					struct {
   260						void *__filler;
   261						unsigned int mlock_count;
   262					};
   263				};
   264				struct address_space *mapping;
   265				pgoff_t index;
   266				void *private;
   267				atomic_t _mapcount;
   268				atomic_t _refcount;
   269	#ifdef CONFIG_MEMCG
   270				unsigned long memcg_data;
   271	#endif
   272		/* private: the union with struct page is transitional */
   273			};
   274			struct page page;
   275		};
   276	};
   277	
 > 278	static_assert(sizeof(struct page) == sizeof(struct folio));
   279	#define FOLIO_MATCH(pg, fl)						\
   280		static_assert(offsetof(struct page, pg) == offsetof(struct folio, fl))
 > 281	FOLIO_MATCH(flags, flags);
 > 282	FOLIO_MATCH(lru, lru);
 > 283	FOLIO_MATCH(mapping, mapping);
   284	FOLIO_MATCH(compound_head, lru);
 > 285	FOLIO_MATCH(index, index);
 > 286	FOLIO_MATCH(private, private);
 > 287	FOLIO_MATCH(_mapcount, _mapcount);
 > 288	FOLIO_MATCH(_refcount, _refcount);
   289	#ifdef CONFIG_MEMCG
 > 290	FOLIO_MATCH(memcg_data, memcg_data);
   291	#endif
   292	#undef FOLIO_MATCH
   293	
   294	static inline atomic_t *folio_mapcount_ptr(struct folio *folio)
   295	{
 > 296		struct page *tail = &folio->page + 1;
   297		return &tail->compound_mapcount;
   298	}
   299	
   300	static inline atomic_t *compound_mapcount_ptr(struct page *page)
   301	{
   302		return &page[1].compound_mapcount;
   303	}
   304	
   305	static inline atomic_t *compound_pincount_ptr(struct page *page)
   306	{
   307		return &page[1].compound_pincount;
   308	}
   309	
   310	/*
   311	 * Used for sizing the vmemmap region on some architectures
   312	 */
   313	#define STRUCT_PAGE_MAX_SHIFT	(order_base_2(sizeof(struct page)))
   314	
   315	#define PAGE_FRAG_CACHE_MAX_SIZE	__ALIGN_MASK(32768, ~PAGE_MASK)
   316	#define PAGE_FRAG_CACHE_MAX_ORDER	get_order(PAGE_FRAG_CACHE_MAX_SIZE)
   317	
   318	/*
   319	 * page_private can be used on tail pages.  However, PagePrivate is only
   320	 * checked by the VM on the head page.  So page_private on the tail pages
   321	 * should be used for data that's ancillary to the head page (eg attaching
   322	 * buffer heads to tail pages after attaching buffer heads to the head page)
   323	 */
   324	#define page_private(page)		((page)->private)
   325	
   326	static inline void set_page_private(struct page *page, unsigned long private)
   327	{
   328		page->private = private;
   329	}
   330	
   331	static inline void *folio_get_private(struct folio *folio)
   332	{
 > 333		return folio->private;
   334	}
   335	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 4/6] mm: Add COW PTE fallback function
  2022-05-19 18:31 ` [RFC PATCH 4/6] mm: Add COW PTE fallback function Chih-En Lin
@ 2022-05-20  0:20   ` kernel test robot
  2022-05-20 14:21   ` Christophe Leroy
  1 sibling, 0 replies; 35+ messages in thread
From: kernel test robot @ 2022-05-20  0:20 UTC (permalink / raw)
  To: Chih-En Lin; +Cc: llvm, kbuild-all

Hi Chih-En,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on tip/sched/core]
[also build test ERROR on soc/for-next linus/master v5.18-rc7 next-20220519]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Chih-En-Lin/Introduce-Copy-On-Write-to-Page-Table/20220520-023243
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 734387ec2f9d77b00276042b1fa7c95f48ee879d
config: arm-shannon_defconfig (https://download.01.org/0day-ci/archive/20220520/202205200816.ZhOASVVs-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project e00cbbec06c08dc616a0d52a20f678b8fbd4e304)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/e4e2e178c5a43b37925972bc0eab9976d41d35c7
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chih-En-Lin/Introduce-Copy-On-Write-to-Page-Table/20220520-023243
        git checkout e4e2e178c5a43b37925972bc0eab9976d41d35c7
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/memory.c:1007:9: warning: variable 'orig_ptep' set but not used [-Wunused-but-set-variable]
           pte_t *orig_ptep, *ptep;
                  ^
>> mm/memory.c:4616:8: error: call to undeclared function 'pmd_mkwrite'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           new = pmd_mkwrite(*pmd);
                 ^
>> mm/memory.c:4617:2: error: call to undeclared function 'set_pmd_at'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
           set_pmd_at(mm, addr, pmd, new);
           ^
   mm/memory.c:4617:2: note: did you mean 'set_pte_at'?
   arch/arm/include/asm/pgtable.h:225:6: note: 'set_pte_at' declared here
   void set_pte_at(struct mm_struct *mm, unsigned long addr,
        ^
   mm/memory.c:4592:6: warning: no previous prototype for function 'cow_pte_fallback' [-Wmissing-prototypes]
   void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
        ^
   mm/memory.c:4592:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
   ^
   static 
   2 warnings and 2 errors generated.


vim +/pmd_mkwrite +4616 mm/memory.c

  4584	
  4585	/* COW PTE fallback to normal PTE:
  4586	 * - two state here
  4587	 *   - After break child :   [parent, rss=1, ref=1, write=NO , owner=parent]
  4588	 *                        to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
  4589	 *   - After break parent:   [child , rss=0, ref=1, write=NO , owner=NULL  ]
  4590	 *                        to [child , rss=1, ref=1, write=YES, owner=NULL  ]
  4591	 */
  4592	void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
  4593			unsigned long addr)
  4594	{
  4595		struct mm_struct *mm = vma->vm_mm;
  4596		unsigned long start, end;
  4597		pmd_t new;
  4598	
  4599		BUG_ON(pmd_write(*pmd));
  4600	
  4601		start = addr & PMD_MASK;
  4602		end = (addr + PMD_SIZE) & PMD_MASK;
  4603	
  4604		/* If pmd is not owner, it needs to increase the rss.
  4605		 * Since only the owner has the RSS state for the COW PTE.
  4606		 */
  4607		if (!cow_pte_owner_is_same(pmd, pmd)) {
  4608			cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
  4609			mm_inc_nr_ptes(mm);
  4610			smp_wmb();
  4611			pmd_populate(mm, pmd, pmd_page(*pmd));
  4612		}
  4613	
  4614		/* Reuse the pte page */
  4615		set_cow_pte_owner(pmd, NULL);
> 4616		new = pmd_mkwrite(*pmd);
> 4617		set_pmd_at(mm, addr, pmd, new);
  4618	
  4619		BUG_ON(!pmd_write(*pmd));
  4620		BUG_ON(pmd_page(*pmd)->cow_pte_owner);
  4621	}
  4622	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag
  2022-05-19 18:31 ` [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag Chih-En Lin
@ 2022-05-20 14:13   ` Christophe Leroy
  2022-05-21  3:50     ` Chih-En Lin
  0 siblings, 1 reply; 35+ messages in thread
From: Christophe Leroy @ 2022-05-20 14:13 UTC (permalink / raw)
  To: Chih-En Lin, Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang



On 19/05/2022 at 20:31, Chih-En Lin wrote:
> Add the CLONE_COW_PGTABLE flag to the clone3() system call to enable
> the Copy-On-Write (COW) mechanism on the page table.

Is that really something we want the user to decide? Isn't it internal 
stuff that should be transparent to users?

As far as I know, there is no way today to decide whether you want COW 
or not for main memory. Why should there be a choice for the COW of page 
tables?


> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
>   include/uapi/linux/sched.h | 1 +
>   kernel/fork.c              | 6 +++++-
>   2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 3bac0a8ceab2..3b92ff589e0f 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -36,6 +36,7 @@
>   /* Flags for the clone3() syscall. */
>   #define CLONE_CLEAR_SIGHAND 0x100000000ULL /* Clear any signal handler and reset to SIG_DFL. */
>   #define CLONE_INTO_CGROUP 0x200000000ULL /* Clone into a specific cgroup given the right permissions. */
> +#define CLONE_COW_PGTABLE 0x400000000ULL /* Copy-On-Write for page table */
> 
>   /*
>    * cloning flags intersect with CSIGNAL so can be used with unshare and clone3
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 35a3beff140b..08cf95201333 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2636,6 +2636,9 @@ pid_t kernel_clone(struct kernel_clone_args *args)
>                          trace = 0;
>          }
> 
> +       if (clone_flags & CLONE_COW_PGTABLE)
> +               set_bit(MMF_COW_PGTABLE, &current->mm->flags);
> +
>          p = copy_process(NULL, trace, NUMA_NO_NODE, args);
>          add_latent_entropy();
> 
> @@ -2860,7 +2863,8 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
>   {
>          /* Verify that no unknown flags are passed along. */
>          if (kargs->flags &
> -           ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP))
> +           ~(CLONE_LEGACY_FLAGS | CLONE_CLEAR_SIGHAND | CLONE_INTO_CGROUP |
> +                   CLONE_COW_PGTABLE))
>                  return false;
> 
>          /*
> --
> 2.36.1
> 
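
For illustration, here is a minimal userspace sketch of how such a flag
could be exercised through clone3(). The flag value is taken from the patch
above; everything else is an assumption based on the standard clone3()
calling convention (struct clone_args from the uapi headers, SYS_clone3
from a reasonably recent libc), not something defined by this series:

	#define _GNU_SOURCE
	#include <linux/sched.h>     /* struct clone_args */
	#include <sys/syscall.h>
	#include <sys/wait.h>
	#include <signal.h>
	#include <string.h>
	#include <stdio.h>
	#include <unistd.h>

	#ifndef CLONE_COW_PGTABLE
	#define CLONE_COW_PGTABLE 0x400000000ULL  /* value from the proposed patch */
	#endif

	int main(void)
	{
		struct clone_args args;
		long pid;

		memset(&args, 0, sizeof(args));
		args.flags = CLONE_COW_PGTABLE;   /* ask for COW page tables */
		args.exit_signal = SIGCHLD;

		pid = syscall(SYS_clone3, &args, sizeof(args));
		if (pid < 0) {
			perror("clone3");
			return 1;
		}
		if (pid == 0)           /* child: page tables shared copy-on-write */
			_exit(0);
		waitpid(pid, NULL, 0);  /* parent */
		return 0;
	}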

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table
  2022-05-19 18:31 ` [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table Chih-En Lin
  2022-05-19 23:07   ` kernel test robot
  2022-05-20  0:08   ` kernel test robot
@ 2022-05-20 14:15   ` Christophe Leroy
  2022-05-21  4:03     ` Chih-En Lin
  2022-05-21  4:02   ` Matthew Wilcox
  3 siblings, 1 reply; 35+ messages in thread
From: Christophe Leroy @ 2022-05-20 14:15 UTC (permalink / raw)
  To: Chih-En Lin, Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang



On 19/05/2022 at 20:31, Chih-En Lin wrote:
> Introduce ownership for the PTE table to prepare for the following
> Copy-On-Write (COW) page table patch. It uses the address of the PMD
> entry as the owner to identify which process can update its page table
> state from the COW page table.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
>   include/linux/mm.h       |  1 +
>   include/linux/mm_types.h |  1 +
>   include/linux/pgtable.h  | 14 ++++++++++++++
>   3 files changed, 16 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9f44254af8ce..221926a3d818 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2328,6 +2328,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
>                  return false;
>          __SetPageTable(page);
>          inc_lruvec_page_state(page, NR_PAGETABLE);
> +       page->cow_pte_owner = NULL;
>          return true;
>   }
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8834e38c06a4..5dcbd7f6c361 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -221,6 +221,7 @@ struct page {
>   #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>          int _last_cpupid;
>   #endif
> +       pmd_t *cow_pte_owner; /* cow pte: pmd */
>   } _struct_page_alignment;
> 
>   /**
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f4f4077b97aa..faca57af332e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -590,6 +590,20 @@ static inline int pte_unused(pte_t pte)
>   }
>   #endif
> 
> +static inline bool set_cow_pte_owner(pmd_t *pmd, pmd_t *owner)
> +{
> +       struct page *page = pmd_page(*pmd);
> +
> +       smp_store_release(&page->cow_pte_owner, owner);
> +       return true;
> +}
> +
> +static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
> +{
> +       return (smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner) ?
> +               true : false;

The above seems ugly; the following should be equivalent:

	return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;

> +}
> +
>   #ifndef pte_access_permitted
>   #define pte_access_permitted(pte, write) \
>          (pte_present(pte) && (!(write) || pte_write(pte)))
> --
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 4/6] mm: Add COW PTE fallback function
  2022-05-19 18:31 ` [RFC PATCH 4/6] mm: Add COW PTE fallback function Chih-En Lin
  2022-05-20  0:20   ` kernel test robot
@ 2022-05-20 14:21   ` Christophe Leroy
  2022-05-21  4:15     ` Chih-En Lin
  1 sibling, 1 reply; 35+ messages in thread
From: Christophe Leroy @ 2022-05-20 14:21 UTC (permalink / raw)
  To: Chih-En Lin, Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang



On 19/05/2022 at 20:31, Chih-En Lin wrote:
> The lifetime of the COW PTE is handled by ownership and a reference
> count. When a process wants to write to a COW PTE whose reference count
> is 1, it will reuse the COW PTE instead of copying and then freeing it.
> 
> Only the owner will update its RSS state and the page table bytes
> accounting. So we need to handle the case where a non-owner process
> gets the fallback COW PTE.
> 
> This commit prepares for the following implementation of the reference
> count for COW PTE.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
>   mm/memory.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 66 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 76e3af9639d9..dcb678cbb051 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1000,6 +1000,34 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
>          return new_page;
>   }
> 
> +static inline void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
> +       pmd_t *pmdp, unsigned long addr, unsigned long end, bool inc_dec)

Parenthesis alignment is not correct.

You should run 'scripts/checkpatch.pl --strict' on your patch.

> +{
> +       int rss[NR_MM_COUNTERS];
> +       pte_t *orig_ptep, *ptep;
> +       struct page *page;
> +
> +       init_rss_vec(rss);
> +
> +       ptep = pte_offset_map(pmdp, addr);
> +       orig_ptep = ptep;
> +       arch_enter_lazy_mmu_mode();
> +       do {
> +               if (pte_none(*ptep) || pte_special(*ptep))
> +                       continue;
> +
> +               page = vm_normal_page(vma, addr, *ptep);
> +               if (page) {
> +                       if (inc_dec)
> +                               rss[mm_counter(page)]++;
> +                       else
> +                               rss[mm_counter(page)]--;
> +               }
> +       } while (ptep++, addr += PAGE_SIZE, addr != end);
> +       arch_leave_lazy_mmu_mode();
> +       add_mm_rss_vec(mm, rss);
> +}
> +
>   static int
>   copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>                 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> @@ -4554,6 +4582,44 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
>          return VM_FAULT_FALLBACK;
>   }
> 
> +/* COW PTE fallback to normal PTE:
> + * - two state here
> + *   - After break child :   [parent, rss=1, ref=1, write=NO , owner=parent]
> + *                        to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
> + *   - After break parent:   [child , rss=0, ref=1, write=NO , owner=NULL  ]
> + *                        to [child , rss=1, ref=1, write=YES, owner=NULL  ]
> + */
> +void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
> +               unsigned long addr)

There should be a prototype in a header somewhere for a non-static function.

You are encouraged to run 'make mm/memory.o C=2' to check sparse reports.
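
In other words, a declaration visible to mm/memory.c would also silence
the -Wmissing-prototypes warning in the robot report above; a sketch of
what a later patch in this series effectively adds to
include/linux/pgtable.h:

	void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr);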

> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       unsigned long start, end;
> +       pmd_t new;
> +
> +       BUG_ON(pmd_write(*pmd));

You seem to add a lot of BUG_ON()s. Are they really necessary? See 
https://docs.kernel.org/process/deprecated.html?highlight=bug_on#bug-and-bug-on

You may also use VM_BUG_ON().
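
For illustration, the debug-only helpers from include/linux/mmdebug.h that
this points at; whether they fit each check here is of course the author's
call:

	/* Checked only when CONFIG_DEBUG_VM is enabled, free otherwise: */
	VM_BUG_ON(pmd_write(*pmd));

	/* Or warn once and keep running instead of killing the machine: */
	VM_WARN_ON_ONCE(pmd_write(*pmd));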

> +
> +       start = addr & PMD_MASK;
> +       end = (addr + PMD_SIZE) & PMD_MASK;
> +
> +       /* If pmd is not owner, it needs to increase the rss.
> +        * Since only the owner has the RSS state for the COW PTE.
> +        */
> +       if (!cow_pte_owner_is_same(pmd, pmd)) {
> +               cow_pte_rss(mm, vma, pmd, start, end, true /* inc */);
> +               mm_inc_nr_ptes(mm);
> +               smp_wmb();
> +               pmd_populate(mm, pmd, pmd_page(*pmd));
> +       }
> +
> +       /* Reuse the pte page */
> +       set_cow_pte_owner(pmd, NULL);
> +       new = pmd_mkwrite(*pmd);
> +       set_pmd_at(mm, addr, pmd, new);
> +
> +       BUG_ON(!pmd_write(*pmd));
> +       BUG_ON(pmd_page(*pmd)->cow_pte_owner);
> +}
> +
>   /*
>    * These routines also need to handle stuff like marking pages dirty
>    * and/or accessed for architectures that don't do it in hardware (most
> --
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE
  2022-05-19 18:31 ` [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE Chih-En Lin
@ 2022-05-20 14:30   ` Christophe Leroy
  2022-05-21  4:22     ` Chih-En Lin
  2022-05-21  4:08   ` Matthew Wilcox
  1 sibling, 1 reply; 35+ messages in thread
From: Christophe Leroy @ 2022-05-20 14:30 UTC (permalink / raw)
  To: Chih-En Lin, Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang



On 19/05/2022 at 20:31, Chih-En Lin wrote:
> Add the reference counter cow_pgtable_refcount to track the number of
> processes referencing a COW PTE table. Before decreasing the reference
> count, it will check whether the counter is one; if so, it reuses the
> COW PTE table instead.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
>   include/linux/mm.h       |  1 +
>   include/linux/mm_types.h |  1 +
>   include/linux/pgtable.h  | 27 +++++++++++++++++++++++++++
>   mm/memory.c              |  1 +
>   4 files changed, 30 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 221926a3d818..e48bb3fbc33c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2329,6 +2329,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
>          __SetPageTable(page);
>          inc_lruvec_page_state(page, NR_PAGETABLE);
>          page->cow_pte_owner = NULL;
> +       atomic_set(&page->cow_pgtable_refcount, 1);
>          return true;
>   }
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5dcbd7f6c361..984d81e47d53 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -221,6 +221,7 @@ struct page {
>   #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>          int _last_cpupid;
>   #endif
> +       atomic_t cow_pgtable_refcount; /* COW page table */
>          pmd_t *cow_pte_owner; /* cow pte: pmd */
>   } _struct_page_alignment;
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index faca57af332e..33c01fec7b92 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -604,6 +604,33 @@ static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
>                  true : false;
>   }
> 
> +extern void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
> +               unsigned long addr);

The 'extern' keyword is pointless for function prototypes. No new ones 
should be added.

> +
> +static inline int pmd_get_pte(pmd_t *pmd)
> +{
> +       return atomic_inc_return(&pmd_page(*pmd)->cow_pgtable_refcount);
> +}
> +
> +/* If the COW PTE page->cow_pgtable_refcount is 1, instead of decreasing the
> + * counter, clear write protection of the corresponding PMD entry and reset
> + * the COW PTE owner to reuse the table.
> + */
> +static inline int pmd_put_pte(struct vm_area_struct *vma, pmd_t *pmd,
> +               unsigned long addr)
> +{
> +       if (!atomic_add_unless(&pmd_page(*pmd)->cow_pgtable_refcount, -1, 1)) {
> +               cow_pte_fallback(vma, pmd, addr);
> +               return 1;
> +       }
> +       return 0;

I would do something more flat by reverting the test:

{
	if (atomic_add_unless(&pmd_page(*pmd)->cow_pgtable_refcount, -1, 1))
		return 0;

	cow_pte_fallback(vma, pmd, addr);
	return 1;
}

> +}
> +
> +static inline int cow_pte_refcount_read(pmd_t *pmd)
> +{
> +       return atomic_read(&pmd_page(*pmd)->cow_pgtable_refcount);
> +}
> +
>   #ifndef pte_access_permitted
>   #define pte_access_permitted(pte, write) \
>          (pte_present(pte) && (!(write) || pte_write(pte)))
> diff --git a/mm/memory.c b/mm/memory.c
> index dcb678cbb051..aa66af76e214 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4597,6 +4597,7 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
>          pmd_t new;
> 
>          BUG_ON(pmd_write(*pmd));
> +       BUG_ON(cow_pte_refcount_read(pmd) != 1);
> 
>          start = addr & PMD_MASK;
>          end = (addr + PMD_SIZE) & PMD_MASK;
> --
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table
  2022-05-19 18:31 ` [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table Chih-En Lin
@ 2022-05-20 14:49   ` Christophe Leroy
  2022-05-21  4:38     ` Chih-En Lin
  0 siblings, 1 reply; 35+ messages in thread
From: Christophe Leroy @ 2022-05-20 14:49 UTC (permalink / raw)
  To: Chih-En Lin, Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang



On 19/05/2022 at 20:31, Chih-En Lin wrote:
> This patch adds the Copy-On-Write (COW) mechanism to the PTE table.
> To enable the COW page table, use the clone3() system call with the
> CLONE_COW_PGTABLE flag. It will set the MMF_COW_PGTABLE flag on the
> process.
> 
> It uses the MMF_COW_PGTABLE flag to distinguish the default page table
> from the COW one. Moreover, it is difficult to tell whether the entire
> page table has left the COW state, so the MMF_COW_PGTABLE flag won't
> be cleared once it is set.
> 
> Since each process has its own page table memory in kernel space, it
> uses the address of the PMD entry as the ownership of the PTE table to
> identify which one of the processes needs to update the page table
> state. In other words, only the owner will update the COW PTE state,
> like the RSS and pgtable_bytes.
> 
> It uses the reference count to control the lifetime of the COW PTE
> table. When someone breaks COW, it will copy the COW PTE table and
> decrease the reference count. But if the reference count is equal to
> one before breaking COW, it will reuse the COW PTE table.
> 
> This patch modifies the page table copying part to do the basic COW.
> For breaking COW, it modifies the page fault, page table zapping,
> unmapping, and remapping parts.
> 
> Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> ---
>   include/linux/pgtable.h |   3 +
>   mm/memory.c             | 262 ++++++++++++++++++++++++++++++++++++----
>   mm/mmap.c               |   4 +
>   mm/mremap.c             |   5 +
>   4 files changed, 251 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 33c01fec7b92..357ce3722ee8 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -631,6 +631,9 @@ static inline int cow_pte_refcount_read(pmd_t *pmd)
>          return atomic_read(&pmd_page(*pmd)->cow_pgtable_refcount);
>   }
> 
> +extern int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> +               unsigned long addr, bool alloc);
> +
>   #ifndef pte_access_permitted
>   #define pte_access_permitted(pte, write) \
>          (pte_present(pte) && (!(write) || pte_write(pte)))
> diff --git a/mm/memory.c b/mm/memory.c
> index aa66af76e214..ff3fcbe4dfb5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -247,6 +247,8 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
>                  next = pmd_addr_end(addr, end);
>                  if (pmd_none_or_clear_bad(pmd))
>                          continue;
> +               BUG_ON(cow_pte_refcount_read(pmd) != 1);
> +               BUG_ON(!cow_pte_owner_is_same(pmd, NULL));

See the comment on a previous patch of this series; there seems to be a 
huge number of new BUG_ON()s.

>                  free_pte_range(tlb, pmd, addr);
>          } while (pmd++, addr = next, addr != end);
> 
> @@ -1031,7 +1033,7 @@ static inline void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
>   static int
>   copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>                 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> -              unsigned long end)
> +              unsigned long end, bool is_src_pte_locked)
>   {
>          struct mm_struct *dst_mm = dst_vma->vm_mm;
>          struct mm_struct *src_mm = src_vma->vm_mm;
> @@ -1053,8 +1055,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>                  goto out;
>          }
>          src_pte = pte_offset_map(src_pmd, addr);
> -       src_ptl = pte_lockptr(src_mm, src_pmd);
> -       spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> +       if (!is_src_pte_locked) {
> +               src_ptl = pte_lockptr(src_mm, src_pmd);
> +               spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> +       }

Odd construct; that kind of construct often leads to messy errors.

Could you structure things differently by refactoring the code?

>          orig_src_pte = src_pte;
>          orig_dst_pte = dst_pte;
>          arch_enter_lazy_mmu_mode();
> @@ -1067,7 +1071,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>                  if (progress >= 32) {
>                          progress = 0;
>                          if (need_resched() ||
> -                           spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
> +                           (!is_src_pte_locked && spin_needbreak(src_ptl)) ||
> +                           spin_needbreak(dst_ptl))
>                                  break;
>                  }
>                  if (pte_none(*src_pte)) {
> @@ -1118,7 +1123,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>          } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
> 
>          arch_leave_lazy_mmu_mode();
> -       spin_unlock(src_ptl);
> +       if (!is_src_pte_locked)
> +               spin_unlock(src_ptl);
>          pte_unmap(orig_src_pte);
>          add_mm_rss_vec(dst_mm, rss);
>          pte_unmap_unlock(orig_dst_pte, dst_ptl);
> @@ -1180,11 +1186,55 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>                                  continue;
>                          /* fall through */
>                  }
> -               if (pmd_none_or_clear_bad(src_pmd))
> -                       continue;
> -               if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
> -                                  addr, next))
> +
> +               if (test_bit(MMF_COW_PGTABLE, &src_mm->flags)) {
> +
> +                        if (pmd_none(*src_pmd))
> +                               continue;

Why not keep the pmd_none_or_clear_bad(src_pmd) instead?

> +
> +                       /* XXX: Skip if the PTE already COW this time. */
> +                       if (!pmd_none(*dst_pmd) &&

Shouldn't it be pmd_none_or_clear_bad()?

> +                           cow_pte_refcount_read(src_pmd) > 1)
> +                               continue;
> +
> +                       /* If PTE doesn't have an owner, the parent needs to
> +                        * take this PTE.
> +                        */
> +                       if (cow_pte_owner_is_same(src_pmd, NULL)) {
> +                               set_cow_pte_owner(src_pmd, src_pmd);
> +                               /* XXX: The process may COW PTE fork two times.
> +                                * But in some situations, owner has cleared.
> +                                * Previously Child (This time is the parent)
> +                                * COW PTE forking, but previously parent, owner
> +                                * , break COW. So it needs to add back the RSS
> +                                * state and pgtable bytes.
> +                                */
> +                               if (!pmd_write(*src_pmd)) {
> +                                       unsigned long pte_start =
> +                                               addr & PMD_MASK;
> +                                       unsigned long pte_end =
> +                                               (addr + PMD_SIZE) & PMD_MASK;
> +                                       cow_pte_rss(src_mm, src_vma, src_pmd,
> +                                           pte_start, pte_end, true /* inc */);
> +                                       mm_inc_nr_ptes(src_mm);
> +                                       smp_wmb();
> +                                       pmd_populate(src_mm, src_pmd,
> +                                                       pmd_page(*src_pmd));
> +                               }
> +                       }
> +
> +                       pmdp_set_wrprotect(src_mm, addr, src_pmd);
> +
> +                       /* Child reference count */
> +                       pmd_get_pte(src_pmd);
> +
> +                       /* COW for PTE table */
> +                       set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
> +               } else if (!pmd_none_or_clear_bad(src_pmd) &&

Can't we keep pmd_none_or_clear_bad(src_pmd) common to both cases?


> +                           copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
> +                                   addr, next, false)) {
>                          return -ENOMEM;
> +               }
>          } while (dst_pmd++, src_pmd++, addr = next, addr != end);
>          return 0;
>   }
> @@ -1336,6 +1386,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
>   struct zap_details {
>          struct folio *single_folio;     /* Locked folio to be unmapped */
>          bool even_cows;                 /* Zap COWed private pages too? */
> +       bool cow_pte;                   /* Do not free COW PTE */
>   };
> 
>   /* Whether we should zap all COWed (private) pages too */
> @@ -1398,8 +1449,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                          page = vm_normal_page(vma, addr, ptent);
>                          if (unlikely(!should_zap_page(details, page)))
>                                  continue;
> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> -                                                       tlb->fullmm);
> +                       if (!details || !details->cow_pte)
> +                               ptent = ptep_get_and_clear_full(mm, addr, pte,
> +                                                               tlb->fullmm);
>                          tlb_remove_tlb_entry(tlb, pte, addr);
>                          if (unlikely(!page))
>                                  continue;
> @@ -1413,8 +1465,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                                      likely(!(vma->vm_flags & VM_SEQ_READ)))
>                                          mark_page_accessed(page);
>                          }
> -                       rss[mm_counter(page)]--;
> -                       page_remove_rmap(page, vma, false);
> +                       if (!details || !details->cow_pte) {
> +                               rss[mm_counter(page)]--;
> +                               page_remove_rmap(page, vma, false);
> +                       } else
> +                               continue;

Can you do the reverse:

			if (details && details->cow_pte)
				continue;

			rss[mm_counter(page)]--;
			page_remove_rmap(page, vma, false);


>                          if (unlikely(page_mapcount(page) < 0))
>                                  print_bad_pte(vma, addr, ptent, page);
>                          if (unlikely(__tlb_remove_page(tlb, page))) {
> @@ -1425,6 +1480,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                          continue;
>                  }
> 
> +               // TODO: Deal COW PTE with swap
> +
>                  entry = pte_to_swp_entry(ptent);
>                  if (is_device_private_entry(entry) ||
>                      is_device_exclusive_entry(entry)) {
> @@ -1513,16 +1570,34 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
>                          spin_unlock(ptl);
>                  }
> 
> -               /*
> -                * Here there can be other concurrent MADV_DONTNEED or
> -                * trans huge page faults running, and if the pmd is
> -                * none or trans huge it can change under us. This is
> -                * because MADV_DONTNEED holds the mmap_lock in read
> -                * mode.
> -                */
> -               if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> -                       goto next;
> -               next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> +
> +               if (test_bit(MMF_COW_PGTABLE, &tlb->mm->flags) &&
> +                   !pmd_none(*pmd) && !pmd_write(*pmd)) {

Can't you use pmd_none_or_trans_huge_or_clear_bad() and keep it common? ...

> +                       struct zap_details cow_pte_details = {0};
> +                       if (details)
> +                               cow_pte_details = *details;
> +                       cow_pte_details.cow_pte = true;
> +                       /* Flush the TLB but do not free the COW PTE */
> +                       next = zap_pte_range(tlb, vma, pmd, addr,
> +                                               next, &cow_pte_details);
> +                       if (details)
> +                               *details = cow_pte_details;
> +                       handle_cow_pte(vma, pmd, addr, false);

Or add a continue; here and avoid the else below

> +               } else {
> +                       if (details)
> +                               details->cow_pte = false;
> +                       /*
> +                        * Here there can be other concurrent MADV_DONTNEED or
> +                        * trans huge page faults running, and if the pmd is
> +                        * none or trans huge it can change under us. This is
> +                        * because MADV_DONTNEED holds the mmap_lock in read
> +                        * mode.
> +                        */
> +                       if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> +                               goto next;
> +                       next = zap_pte_range(tlb, vma, pmd, addr, next,
> +                                       details);
> +               }
>   next:
>                  cond_resched();
>          } while (pmd++, addr = next, addr != end);
> @@ -4621,6 +4696,134 @@ void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
>          BUG_ON(pmd_page(*pmd)->cow_pte_owner);
>   }
> 
> +/* Break COW PTE:
> + * - two state here
> + *   - After fork :   [parent, rss=1, ref=2, write=NO , owner=parent]
> + *                 to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
> + *                    COW PTE become [ref=1, write=NO , owner=NULL  ]
> + *                    [child , rss=0, ref=2, write=NO , owner=parent]
> + *                 to [child , rss=1, ref=1, write=YES, owner=NULL  ]
> + *                    COW PTE become [ref=1, write=NO , owner=parent]
> + *   NOTE
> + *     - Copy the COW PTE to new PTE.
> + *     - Clear the owner of COW PTE and set PMD entry writable when it is owner.
> + *     - Increase RSS if it is not owner.
> + */
> +static int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> +               unsigned long addr)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       unsigned long start, end;
> +       pmd_t cowed_entry = *pmd;
> +
> +       if (cow_pte_refcount_read(&cowed_entry) == 1) {
> +               cow_pte_fallback(vma, pmd, addr);
> +               return 1;
> +       }
> +
> +       BUG_ON(pmd_write(cowed_entry));
> +
> +       start = addr & PMD_MASK;
> +       end = (addr + PMD_SIZE) & PMD_MASK;
> +
> +       pmd_clear(pmd);
> +       if (copy_pte_range(vma, vma, pmd, &cowed_entry,
> +                               start, end, true))
> +               return -ENOMEM;
> +
> +       /* Here, it is the owner, so clear the ownership. To keep RSS state and
> +        * page table bytes correct, it needs to decrease them.
> +        */
> +       if (cow_pte_owner_is_same(&cowed_entry, pmd)) {
> +               set_cow_pte_owner(&cowed_entry, NULL);
> +               cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
> +               mm_dec_nr_ptes(mm);
> +       }
> +
> +       pmd_put_pte(vma, &cowed_entry, addr);
> +
> +       BUG_ON(!pmd_write(*pmd));
> +       BUG_ON(cow_pte_refcount_read(pmd) != 1);
> +
> +       return 0;
> +}
> +
> +static int zap_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> +               unsigned long addr)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       unsigned long start, end;
> +
> +       if (pmd_put_pte(vma, pmd, addr)) {
> +               // fallback
> +               return 1;
> +       }

No { } for a single-line if. The comment could go just before the if.
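
That is, the quoted lines could read (a sketch with the braces dropped and
the comment moved up):

	/* fallback */
	if (pmd_put_pte(vma, pmd, addr))
		return 1;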

> +
> +       start = addr & PMD_MASK;
> +       end = (addr + PMD_SIZE) & PMD_MASK;
> +
> +       /* If PMD entry is owner, clear the ownership, and decrease RSS state
> +        * and pgtable_bytes.
> +        */

Please follow the standard comment style:

/*
  * Some text
  * More text
  */

> +       if (cow_pte_owner_is_same(pmd, pmd)) {
> +               set_cow_pte_owner(pmd, NULL);
> +               cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
> +               mm_dec_nr_ptes(mm);
> +       }
> +
> +       pmd_clear(pmd);
> +       return 0;
> +}
> +
> +/* If alloc set means it won't break COW. For this case, it will just decrease
> + * the reference count. The address needs to be at the beginning of the PTE page
> + * since COW PTE is copy-on-write the entire PTE.
> + * If pmd is NULL, it will get the pmd from vma and check it is cowing.
> + */
> +int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> +               unsigned long addr, bool alloc)
> +{
> +       pgd_t *pgd;
> +       p4d_t *p4d;
> +       pud_t *pud;
> +       struct mm_struct *mm = vma->vm_mm;
> +       int ret = 0;
> +       spinlock_t *ptl = NULL;
> +
> +       if (!pmd) {
> +               pgd = pgd_offset(mm, addr);
> +               if (pgd_none_or_clear_bad(pgd))
> +                       return 0;
> +               p4d = p4d_offset(pgd, addr);
> +               if (p4d_none_or_clear_bad(p4d))
> +                       return 0;
> +               pud = pud_offset(p4d, addr);
> +               if (pud_none_or_clear_bad(pud))
> +                       return 0;
> +               pmd = pmd_offset(pud, addr);
> +               if (pmd_none(*pmd) || pmd_write(*pmd))
> +                       return 0;
> +       }
> +
> +       // TODO: handle COW PTE with swap
> +       BUG_ON(is_swap_pmd(*pmd));
> +       BUG_ON(pmd_trans_huge(*pmd));
> +       BUG_ON(pmd_devmap(*pmd));
> +
> +       BUG_ON(pmd_none(*pmd));
> +       BUG_ON(pmd_write(*pmd));

So many BUG_ON()s? All this has a cost during execution.

> +
> +       ptl = pte_lockptr(mm, pmd);
> +       spin_lock(ptl);
> +       if (!alloc)
> +               ret = zap_cow_pte(vma, pmd, addr);
> +       else
> +               ret = break_cow_pte(vma, pmd, addr);

Better as

	if (alloc)
		break_cow_pte()
	else
		zap_cow_pte()

> +       spin_unlock(ptl);
> +
> +       return ret;
> +}
> +
>   /*
>    * These routines also need to handle stuff like marking pages dirty
>    * and/or accessed for architectures that don't do it in hardware (most
> @@ -4825,6 +5028,19 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>                                  return 0;
>                          }
>                  }
> +
> +               /* When the PMD entry is set with write protection, it needs to
> +                * handle the on-demand PTE. It will allocate a new PTE and copy
> +                * the old one, then set this entry writeable and decrease the
> +                * reference count at COW PTE.
> +                */
> +               if (test_bit(MMF_COW_PGTABLE, &mm->flags) &&
> +                   !pmd_none(vmf.orig_pmd) && !pmd_write(vmf.orig_pmd)) {
> +                       if (handle_cow_pte(vmf.vma, vmf.pmd, vmf.real_address,
> +                          (cow_pte_refcount_read(&vmf.orig_pmd) > 1) ?
> +                          true : false) < 0)

(condition ? true : false) is exactly the same as (condition)


> +                               return VM_FAULT_OOM;
> +               }
>          }
> 
>          return handle_pte_fault(&vmf);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 313b57d55a63..e3a9c38e87e8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2709,6 +2709,10 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
>                          return err;
>          }
> 
> +       if (test_bit(MMF_COW_PGTABLE, &vma->vm_mm->flags) &&
> +           handle_cow_pte(vma, NULL, addr, true) < 0)
> +               return -ENOMEM;
> +
>          new = vm_area_dup(vma);
>          if (!new)
>                  return -ENOMEM;
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 303d3290b938..01aefdfc61b7 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -532,6 +532,11 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>                  old_pmd = get_old_pmd(vma->vm_mm, old_addr);
>                  if (!old_pmd)
>                          continue;
> +
> +               if (test_bit(MMF_COW_PGTABLE, &vma->vm_mm->flags) &&
> +                   !pmd_none(*old_pmd) && !pmd_write(*old_pmd))
> +                       handle_cow_pte(vma, old_pmd, old_addr, true);
> +
>                  new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
>                  if (!new_pmd)
>                          break;
> --
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag
  2022-05-20 14:13   ` Christophe Leroy
@ 2022-05-21  3:50     ` Chih-En Lin
  0 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21  3:50 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang

On Fri, May 20, 2022 at 02:13:25PM +0000, Christophe Leroy wrote:
> 
> 
> Le 19/05/2022 à 20:31, Chih-En Lin a écrit :
> > Add CLONE_COW_PGTABLE flag to support clone3() system call to enable the
> > Copy-On-Write (COW) mechanism on the page table.
> 
> Is that really something we want the user to decide? Isn't it
> internal stuff that should be transparent to users?
> 
> As far as I know, there is no way today to decide whether you want COW 
> or not for main memory. Why should there be a choice for the COW of page 
> tables?

Agreed.
It should not be exposed to the user.
COW of the page table should become a configuration option instead.
Or, if the change turns out to be fine, it could even be the default
behavior.
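
As a rough sketch (CONFIG_COW_PGTABLE and the helper below are made-up
names, just to illustrate the idea, not something in this series), the
decision could come from a build-time option instead of a clone3()
flag:

	/* Hypothetical: enable COW page tables from Kconfig rather
	 * than from userspace. */
	static inline bool mm_use_cow_pgtable(void)
	{
		return IS_ENABLED(CONFIG_COW_PGTABLE);
	}

Then the fork path could test this helper instead of looking at a
per-mm MMF_COW_PGTABLE flag set from clone3().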

Thanks.

---

Sorry, I did not reply to the whole group the first time.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table
  2022-05-19 18:31 ` [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table Chih-En Lin
                     ` (2 preceding siblings ...)
  2022-05-20 14:15   ` Christophe Leroy
@ 2022-05-21  4:02   ` Matthew Wilcox
  2022-05-21  5:01     ` Chih-En Lin
  3 siblings, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2022-05-21  4:02 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

On Fri, May 20, 2022 at 02:31:24AM +0800, Chih-En Lin wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 8834e38c06a4..5dcbd7f6c361 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -221,6 +221,7 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  	int _last_cpupid;
>  #endif
> +	pmd_t *cow_pte_owner; /* cow pte: pmd */

This is definitely the wrong place.  I think it could replace _pt_pad_1,
since it's a pointer to a PMD and so the bottom bit will definitely
be clear.
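
Roughly like this, as a sketch only (field order written from memory,
so please double-check against mm_types.h):

	struct {	/* Page table pages */
		union {
			unsigned long _pt_pad_1;	/* compound_head */
			pmd_t *cow_pte_owner;		/* bottom bit stays clear */
		};
		pgtable_t pmd_huge_pte;		/* protected by page->ptl */
		unsigned long _pt_pad_2;	/* mapping */
		/* pt_mm / pt_frag_refcount, ptl, ... unchanged */
	};

Keeping it overlaid on compound_head is fine as long as the bottom bit
never gets set, so the page is never mistaken for a tail page.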


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table
  2022-05-20 14:15   ` Christophe Leroy
@ 2022-05-21  4:03     ` Chih-En Lin
  0 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21  4:03 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang

On Fri, May 20, 2022 at 02:15:12PM +0000, Christophe Leroy wrote:
> > +static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
> > +{
> > +       return (smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner) ?
> > +               true : false;

Why did I write it like this? ;-)

> 
> The above seems ugly; the following should be equivalent:
> 
> 	return smp_load_acquire(&pmd_page(*pmd)->cow_pte_owner) == owner;
> 

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE
  2022-05-19 18:31 ` [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE Chih-En Lin
  2022-05-20 14:30   ` Christophe Leroy
@ 2022-05-21  4:08   ` Matthew Wilcox
  2022-05-21  5:10     ` Chih-En Lin
  1 sibling, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2022-05-21  4:08 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

On Fri, May 20, 2022 at 02:31:26AM +0800, Chih-En Lin wrote:
> +++ b/include/linux/mm_types.h
> @@ -221,6 +221,7 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>  	int _last_cpupid;
>  #endif
> +	atomic_t cow_pgtable_refcount; /* COW page table */
>  	pmd_t *cow_pte_owner; /* cow pte: pmd */
>  } _struct_page_alignment;

Oh.  You need another 4 bytes.  Hmm.

Can you share _refcount?

Using _pt_pad_2 should be possible, but some care will be needed to make
sure it's (a) in a union with an unsigned long to keep the alignment
as expected, and (b) is definitely zero before the page is freed (or
the page allocator will squawk at you).
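
For (a), something along these lines (only a sketch, not tested):

	union {
		unsigned long _pt_pad_2;	/* mapping */
		atomic_t cow_pgtable_refcount;	/* COW PTE table users */
	};

and for (b), the PTE table destructor path would have to reset the
count to zero before the page goes back to the allocator.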

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 4/6] mm: Add COW PTE fallback function
  2022-05-20 14:21   ` Christophe Leroy
@ 2022-05-21  4:15     ` Chih-En Lin
  0 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21  4:15 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang

On Fri, May 20, 2022 at 02:21:54PM +0000, Christophe Leroy wrote:
> > +/* COW PTE fallback to normal PTE:
> > + * - two state here
> > + *   - After break child :   [parent, rss=1, ref=1, write=NO , owner=parent]
> > + *                        to [parent, rss=1, ref=1, write=YES, owner=NULL  ]
> > + *   - After break parent:   [child , rss=0, ref=1, write=NO , owner=NULL  ]
> > + *                        to [child , rss=1, ref=1, write=YES, owner=NULL  ]
> > + */
> > +void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr)
> 
> There should be a prototype in a header somewhere for a non static function.
> 
> You are encouraged to run 'make mm/memory.o C=2' to check sparse reports.
> 

I will do all of the above checks before sending the next version.

> > +{
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       unsigned long start, end;
> > +       pmd_t new;
> > +
> > +       BUG_ON(pmd_write(*pmd));
> 
> You seem to add a lot of BUG_ONs(). Are they really necessary ? See 
> https://docs.kernel.org/process/deprecated.html?highlight=bug_on#bug-and-bug-on
> 
> You may also use VM_BUG_ON().
> 

Sure.
I added the BUG_ON()s while debugging.
I will reconsider which ones are really necessary
and change them to VM_BUG_ON().
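
For example, something like

	VM_BUG_ON(pmd_write(*pmd));

compiles away when CONFIG_DEBUG_VM is not set, so it should have no
cost in production builds.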

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE
  2022-05-20 14:30   ` Christophe Leroy
@ 2022-05-21  4:22     ` Chih-En Lin
  0 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21  4:22 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang

On Fri, May 20, 2022 at 02:30:29PM +0000, Christophe Leroy wrote:
> 
> 
> Le 19/05/2022 à 20:31, Chih-En Lin a écrit :
> > Add the reference counter cow_pgtable_refcount to maintain the number
> > of process references to COW PTE. Before decreasing the reference
> > count, it will check whether the counter is one or not for reusing
> > COW PTE when the counter is one.
> > 
> > Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> > ---
> >   include/linux/mm.h       |  1 +
> >   include/linux/mm_types.h |  1 +
> >   include/linux/pgtable.h  | 27 +++++++++++++++++++++++++++
> >   mm/memory.c              |  1 +
> >   4 files changed, 30 insertions(+)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 221926a3d818..e48bb3fbc33c 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2329,6 +2329,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
> >          __SetPageTable(page);
> >          inc_lruvec_page_state(page, NR_PAGETABLE);
> >          page->cow_pte_owner = NULL;
> > +       atomic_set(&page->cow_pgtable_refcount, 1);
> >          return true;
> >   }
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5dcbd7f6c361..984d81e47d53 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -221,6 +221,7 @@ struct page {
> >   #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >          int _last_cpupid;
> >   #endif
> > +       atomic_t cow_pgtable_refcount; /* COW page table */
> >          pmd_t *cow_pte_owner; /* cow pte: pmd */
> >   } _struct_page_alignment;
> > 
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index faca57af332e..33c01fec7b92 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -604,6 +604,33 @@ static inline bool cow_pte_owner_is_same(pmd_t *pmd, pmd_t *owner)
> >                  true : false;
> >   }
> > 
> > +extern void cow_pte_fallback(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr);
> 
> The 'extern' keyword is pointless for function prototypes. No new ones 
> should be added.

I see.
The extern keyword is indeed unnecessary for these non-static function
prototypes.

> > +
> > +static inline int pmd_get_pte(pmd_t *pmd)
> > +{
> > +       return atomic_inc_return(&pmd_page(*pmd)->cow_pgtable_refcount);
> > +}
> > +
> > +/* If the COW PTE page->cow_pgtable_refcount is 1, instead of decreasing the
> > + * counter, clear write protection of the corresponding PMD entry and reset
> > + * the COW PTE owner to reuse the table.
> > + */
> > +static inline int pmd_put_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr)
> > +{
> > +       if (!atomic_add_unless(&pmd_page(*pmd)->cow_pgtable_refcount, -1, 1)) {
> > +               cow_pte_fallback(vma, pmd, addr);
> > +               return 1;
> > +       }
> > +       return 0;
> 
> I would make it flatter by inverting the test:
> 
> {
> 	if (atomic_add_unless(&pmd_page(*pmd)->cow_pgtable_refcount, -1, 1))
> 		return 0;
> 
> 	cow_pte_fallback(vma, pmd, addr);
> 	return 1;
> }
> 

Thanks!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table
  2022-05-20 14:49   ` Christophe Leroy
@ 2022-05-21  4:38     ` Chih-En Lin
  0 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21  4:38 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, David Hildenbrand, linux-kernel, Kaiyang Zhao,
	Huichun Feng, Jim Huang

On Fri, May 20, 2022 at 02:49:31PM +0000, Christophe Leroy wrote:
> 
> 
> Le 19/05/2022 à 20:31, Chih-En Lin a écrit :
> > This patch adds the Copy-On-Write (COW) mechanism to the PTE table.
> > To enable the COW page table use the clone3() system call with the
> > CLONE_COW_PGTABLE flag. It will set the MMF_COW_PGTABLE flag to the
> > processes.
> > 
> > It uses the MMF_COW_PGTABLE flag to distinguish the default page table
> > and the COW one. Moreover, it is difficult to distinguish whether the
> > entire page table is out of COW state. So the MMF_COW_PGTABLE flag won't
> > be disabled after its setup.
> > 
> > Since the memory space of the page table is distinctive for each process
> > in kernel space. It uses the address of the PMD index for the ownership
> > of the PTE table to identify which one of the processes needs to update
> > the page table state. In other words, only the owner will update COW PTE
> > state, like the RSS and pgtable_bytes.
> > 
> > It uses the reference count to control the lifetime of COW PTE table.
> > When someone breaks COW, it will copy the COW PTE table and decrease the
> > reference count. But if the reference count is equal to one before the
> > break COW, it will reuse the COW PTE table.
> > 
> > This patch modifies the part of the copy page table to do the basic COW.
> > For the break COW, it modifies the part of a page fault, zaps page table
> > , unmapping, and remapping.
> > 
> > Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
> > ---
> >   include/linux/pgtable.h |   3 +
> >   mm/memory.c             | 262 ++++++++++++++++++++++++++++++++++++----
> >   mm/mmap.c               |   4 +
> >   mm/mremap.c             |   5 +
> >   4 files changed, 251 insertions(+), 23 deletions(-)
> > 
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 33c01fec7b92..357ce3722ee8 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -631,6 +631,9 @@ static inline int cow_pte_refcount_read(pmd_t *pmd)
> >          return atomic_read(&pmd_page(*pmd)->cow_pgtable_refcount);
> >   }
> > 
> > +extern int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr, bool alloc);
> > +
> >   #ifndef pte_access_permitted
> >   #define pte_access_permitted(pte, write) \
> >          (pte_present(pte) && (!(write) || pte_write(pte)))
> > diff --git a/mm/memory.c b/mm/memory.c
> > index aa66af76e214..ff3fcbe4dfb5 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -247,6 +247,8 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> >                  next = pmd_addr_end(addr, end);
> >                  if (pmd_none_or_clear_bad(pmd))
> >                          continue;
> > +               BUG_ON(cow_pte_refcount_read(pmd) != 1);
> > +               BUG_ON(!cow_pte_owner_is_same(pmd, NULL));
> 
> See comment on a previous patch of this series, there seem to be a huge 
> number of new BUG_ONs.

Got it.

> >                  free_pte_range(tlb, pmd, addr);
> >          } while (pmd++, addr = next, addr != end);
> > 
> > @@ -1031,7 +1033,7 @@ static inline void cow_pte_rss(struct mm_struct *mm, struct vm_area_struct *vma,
> >   static int
> >   copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >                 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
> > -              unsigned long end)
> > +              unsigned long end, bool is_src_pte_locked)
> >   {
> >          struct mm_struct *dst_mm = dst_vma->vm_mm;
> >          struct mm_struct *src_mm = src_vma->vm_mm;
> > @@ -1053,8 +1055,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >                  goto out;
> >          }
> >          src_pte = pte_offset_map(src_pmd, addr);
> > -       src_ptl = pte_lockptr(src_mm, src_pmd);
> > -       spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > +       if (!is_src_pte_locked) {
> > +               src_ptl = pte_lockptr(src_mm, src_pmd);
> > +               spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > +       }
> 
> Odd construct, that kind of construct often leads to messy errors.
> 
> Could you construct things differently by refactoring the code ?

Sure, I will try my best.
This construct is probably where the bug from the stress testing comes
from.

> > @@ -1180,11 +1186,55 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> >                                  continue;
> >                          /* fall through */
> >                  }
> > -               if (pmd_none_or_clear_bad(src_pmd))
> > -                       continue;
> > -               if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
> > -                                  addr, next))
> > +
> > +               if (test_bit(MMF_COW_PGTABLE, &src_mm->flags)) {
> > +
> > +                        if (pmd_none(*src_pmd))
> > +                               continue;
> 
> Why not keep the pmd_none_or_clear_bad(src_pmd) instead ?
> 
> > +
> > +                       /* XXX: Skip if the PTE already COW this time. */
> > +                       if (!pmd_none(*dst_pmd) &&
> 
> Shouldn't it be a pmd_none_or_clear_bad() ?
> 
> > +                           cow_pte_refcount_read(src_pmd) > 1)
> > +                               continue;
> > +
> > +                       /* If PTE doesn't have an owner, the parent needs to
> > +                        * take this PTE.
> > +                        */
> > +                       if (cow_pte_owner_is_same(src_pmd, NULL)) {
> > +                               set_cow_pte_owner(src_pmd, src_pmd);
> > +                               /* XXX: The process may COW PTE fork two times.
> > +                                * But in some situations, owner has cleared.
> > +                                * Previously Child (This time is the parent)
> > +                                * COW PTE forking, but previously parent, owner
> > +                                * , break COW. So it needs to add back the RSS
> > +                                * state and pgtable bytes.
> > +                                */
> > +                               if (!pmd_write(*src_pmd)) {
> > +                                       unsigned long pte_start =
> > +                                               addr & PMD_MASK;
> > +                                       unsigned long pte_end =
> > +                                               (addr + PMD_SIZE) & PMD_MASK;
> > +                                       cow_pte_rss(src_mm, src_vma, src_pmd,
> > +                                           pte_start, pte_end, true /* inc */);
> > +                                       mm_inc_nr_ptes(src_mm);
> > +                                       smp_wmb();
> > +                                       pmd_populate(src_mm, src_pmd,
> > +                                                       pmd_page(*src_pmd));
> > +                               }
> > +                       }
> > +
> > +                       pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > +
> > +                       /* Child reference count */
> > +                       pmd_get_pte(src_pmd);
> > +
> > +                       /* COW for PTE table */
> > +                       set_pmd_at(dst_mm, addr, dst_pmd, *src_pmd);
> > +               } else if (!pmd_none_or_clear_bad(src_pmd) &&
> 
> Can't we keep pmd_none_or_clear_bad(src_pmd) common to both cases ?
> 

You are right.
I will change it to pmd_none_or_clear_bad().

> > +                           copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
> > +                                   addr, next, false)) {
> >                          return -ENOMEM;
> > +               }
> >          } while (dst_pmd++, src_pmd++, addr = next, addr != end);
> >          return 0;
> >   }
> > @@ -1336,6 +1386,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
> >   struct zap_details {
> >          struct folio *single_folio;     /* Locked folio to be unmapped */
> >          bool even_cows;                 /* Zap COWed private pages too? */
> > +       bool cow_pte;                   /* Do not free COW PTE */
> >   };
> > 
> >   /* Whether we should zap all COWed (private) pages too */
> > @@ -1398,8 +1449,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                          page = vm_normal_page(vma, addr, ptent);
> >                          if (unlikely(!should_zap_page(details, page)))
> >                                  continue;
> > -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> > -                                                       tlb->fullmm);
> > +                       if (!details || !details->cow_pte)
> > +                               ptent = ptep_get_and_clear_full(mm, addr, pte,
> > +                                                               tlb->fullmm);
> >                          tlb_remove_tlb_entry(tlb, pte, addr);
> >                          if (unlikely(!page))
> >                                  continue;
> > @@ -1413,8 +1465,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                                      likely(!(vma->vm_flags & VM_SEQ_READ)))
> >                                          mark_page_accessed(page);
> >                          }
> > -                       rss[mm_counter(page)]--;
> > -                       page_remove_rmap(page, vma, false);
> > +                       if (!details || !details->cow_pte) {
> > +                               rss[mm_counter(page)]--;
> > +                               page_remove_rmap(page, vma, false);
> > +                       } else
> > +                               continue;
> 
> Can you do the reverse:
> 
> 			if (details && details->cow_pte)
> 				continue;
> 
> 			rss[mm_counter(page)]--;
> 			page_remove_rmap(page, vma, false);

That is better than what I wrote.
Thanks.

> 
> >                          if (unlikely(page_mapcount(page) < 0))
> >                                  print_bad_pte(vma, addr, ptent, page);
> >                          if (unlikely(__tlb_remove_page(tlb, page))) {
> > @@ -1425,6 +1480,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> >                          continue;
> >                  }
> > 
> > +               // TODO: Deal COW PTE with swap
> > +
> >                  entry = pte_to_swp_entry(ptent);
> >                  if (is_device_private_entry(entry) ||
> >                      is_device_exclusive_entry(entry)) {
> > @@ -1513,16 +1570,34 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
> >                          spin_unlock(ptl);
> >                  }
> > 
> > -               /*
> > -                * Here there can be other concurrent MADV_DONTNEED or
> > -                * trans huge page faults running, and if the pmd is
> > -                * none or trans huge it can change under us. This is
> > -                * because MADV_DONTNEED holds the mmap_lock in read
> > -                * mode.
> > -                */
> > -               if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> > -                       goto next;
> > -               next = zap_pte_range(tlb, vma, pmd, addr, next, details);
> > +
> > +               if (test_bit(MMF_COW_PGTABLE, &tlb->mm->flags) &&
> > +                   !pmd_none(*pmd) && !pmd_write(*pmd)) {
> 
> Can't you use pmd_none_or_trans_huge_or_clear_bad() and keep it common ? ...

Sure.

> > +static int zap_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr)
> > +{
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       unsigned long start, end;
> > +
> > +       if (pmd_put_pte(vma, pmd, addr)) {
> > +               // fallback
> > +               return 1;
> > +       }
> 
> No { } for a single line if. The comment could go just before the if.
> 
> > +
> > +       start = addr & PMD_MASK;
> > +       end = (addr + PMD_SIZE) & PMD_MASK;
> > +
> > +       /* If PMD entry is owner, clear the ownership, and decrease RSS state
> > +        * and pgtable_bytes.
> > +        */
> 
> Please follow the standard comment style:
> 
> /*
>   * Some text
>   * More text
>   */
> 

Got it.

> > +       if (cow_pte_owner_is_same(pmd, pmd)) {
> > +               set_cow_pte_owner(pmd, NULL);
> > +               cow_pte_rss(mm, vma, pmd, start, end, false /* dec */);
> > +               mm_dec_nr_ptes(mm);
> > +       }
> > +
> > +       pmd_clear(pmd);
> > +       return 0;
> > +}
> > +
> > +/* If alloc set means it won't break COW. For this case, it will just decrease
> > + * the reference count. The address needs to be at the beginning of the PTE page
> > + * since COW PTE is copy-on-write the entire PTE.
> > + * If pmd is NULL, it will get the pmd from vma and check it is cowing.
> > + */
> > +int handle_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
> > +               unsigned long addr, bool alloc)
> > +{
> > +       pgd_t *pgd;
> > +       p4d_t *p4d;
> > +       pud_t *pud;
> > +       struct mm_struct *mm = vma->vm_mm;
> > +       int ret = 0;
> > +       spinlock_t *ptl = NULL;
> > +
> > +       if (!pmd) {
> > +               pgd = pgd_offset(mm, addr);
> > +               if (pgd_none_or_clear_bad(pgd))
> > +                       return 0;
> > +               p4d = p4d_offset(pgd, addr);
> > +               if (p4d_none_or_clear_bad(p4d))
> > +                       return 0;
> > +               pud = pud_offset(p4d, addr);
> > +               if (pud_none_or_clear_bad(pud))
> > +                       return 0;
> > +               pmd = pmd_offset(pud, addr);
> > +               if (pmd_none(*pmd) || pmd_write(*pmd))
> > +                       return 0;
> > +       }
> > +
> > +       // TODO: handle COW PTE with swap
> > +       BUG_ON(is_swap_pmd(*pmd));
> > +       BUG_ON(pmd_trans_huge(*pmd));
> > +       BUG_ON(pmd_devmap(*pmd));
> > +
> > +       BUG_ON(pmd_none(*pmd));
> > +       BUG_ON(pmd_write(*pmd));
> 
> So many BUG_ON()s? All this has a cost during execution.

I will consider it again.

> > +
> > +       ptl = pte_lockptr(mm, pmd);
> > +       spin_lock(ptl);
> > +       if (!alloc)
> > +               ret = zap_cow_pte(vma, pmd, addr);
> > +       else
> > +               ret = break_cow_pte(vma, pmd, addr);
> 
> Better as
> 
> 	if (alloc)
> 		break_cow_pte()
> 	else
> 		zap_cow_pte()

Great!
Thanks.

> > +       spin_unlock(ptl);
> > +
> > +       return ret;
> > +}
> > +
> >   /*
> >    * These routines also need to handle stuff like marking pages dirty
> >    * and/or accessed for architectures that don't do it in hardware (most
> > @@ -4825,6 +5028,19 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> >                                  return 0;
> >                          }
> >                  }
> > +
> > +               /* When the PMD entry is set with write protection, it needs to
> > +                * handle the on-demand PTE. It will allocate a new PTE and copy
> > +                * the old one, then set this entry writeable and decrease the
> > +                * reference count at COW PTE.
> > +                */
> > +               if (test_bit(MMF_COW_PGTABLE, &mm->flags) &&
> > +                   !pmd_none(vmf.orig_pmd) && !pmd_write(vmf.orig_pmd)) {
> > +                       if (handle_cow_pte(vmf.vma, vmf.pmd, vmf.real_address,
> > +                          (cow_pte_refcount_read(&vmf.orig_pmd) > 1) ?
> > +                          true : false) < 0)
> 
> (condition ? true : false) is exactly the same as (condition)
> 

I knew that. ;-)

Again, thanks!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table
  2022-05-21  4:02   ` Matthew Wilcox
@ 2022-05-21  5:01     ` Chih-En Lin
  0 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21  5:01 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

On Sat, May 21, 2022 at 05:02:43AM +0100, Matthew Wilcox wrote:
> On Fri, May 20, 2022 at 02:31:24AM +0800, Chih-En Lin wrote:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 8834e38c06a4..5dcbd7f6c361 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -221,6 +221,7 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >  	int _last_cpupid;
> >  #endif
> > +	pmd_t *cow_pte_owner; /* cow pte: pmd */
> 
> This is definitely the wrong place.  I think it could replace _pt_pad_1,
> since it's a pointer to a PMD and so the bottom bit will definitely
> be clear.
>

I will figure out how to use _pt_pad_1.
It seems related to the compound page (or folio?).

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE
  2022-05-21  4:08   ` Matthew Wilcox
@ 2022-05-21  5:10     ` Chih-En Lin
  0 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21  5:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, David Hildenbrand,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

On Sat, May 21, 2022 at 05:08:09AM +0100, Matthew Wilcox wrote:
> On Fri, May 20, 2022 at 02:31:26AM +0800, Chih-En Lin wrote:
> > +++ b/include/linux/mm_types.h
> > @@ -221,6 +221,7 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> >  	int _last_cpupid;
> >  #endif
> > +	atomic_t cow_pgtable_refcount; /* COW page table */
> >  	pmd_t *cow_pte_owner; /* cow pte: pmd */
> >  } _struct_page_alignment;
> 
> Oh.  You need another 4 bytes.  Hmm.
> 
> Can you share _refcount?
> 
> Using _pt_pad_2 should be possible, but some care will be needed to make
> sure it's (a) in a union with an unsigned long to keep the alignment
> as expected, and (b) is definitely zero before the page is freed (or
> the page allocator will squawk at you).

Sharing _refcount may be better. I will try that first, and if
anything prevents _refcount from being used, I will consider _pt_pad_2.

Thanks!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (5 preceding siblings ...)
  2022-05-19 18:31 ` [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table Chih-En Lin
@ 2022-05-21  8:59 ` Qi Zheng
  2022-05-21 19:08   ` Chih-En Lin
  2022-05-21 16:07 ` David Hildenbrand
  7 siblings, 1 reply; 35+ messages in thread
From: Qi Zheng @ 2022-05-21  8:59 UTC (permalink / raw)
  To: Chih-En Lin, David Hildenbrand
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, linux-kernel,
	Kaiyang Zhao, Huichun Feng, Jim Huang, Andrew Morton, linux-mm



On 2022/5/20 2:31 AM, Chih-En Lin wrote:
> When creating the user process, it usually uses the Copy-On-Write (COW)
> mechanism to save the memory usage and the cost of time for copying.
> COW defers the work of copying private memory and shares it across the
> processes as read-only. If either process wants to write in these
> memories, it will page fault and copy the shared memory, so the process
> will now get its private memory right here, which is called break COW.
> 
> Presently this kind of technology is only used as the mapping memory.
> It still needs to copy the entire page table from the parent.
> It might cost a lot of time and memory to copy each page table when the
> parent already has a lot of page tables allocated. For example, here is
> the state table for mapping the 1 GB memory of forking.
> 
> 	    mmap before fork         mmap after fork
> MemTotal:       32746776 kB             32746776 kB
> MemFree:        31468152 kB             31463244 kB
> AnonPages:       1073836 kB              1073628 kB
> Mapped:            39520 kB                39992 kB
> PageTables:         3356 kB                 5432 kB
> 
> This patch introduces Copy-On-Write to the page table. This patch only
> implements the COW on the PTE level. It's based on the paper
> On-Demand Fork [1]. Summary of the implementation for the paper:
> 
> - Only implements the COW to the anonymous mapping
> - Only do COW to the PTE table which the range is all covered by a
>    single VMA.
> - Use the reference count to control the COW PTE table lifetime.
>    Decrease the counter when breaking COW or dereference the COW PTE
>    table. When the counter reduces to zero, free the PTE table.
> 

Hi,

To reclaim empty user PTE tables, I also introduced a reference
count (pte_ref) for user PTE tables in my patches [1][2]. It is used
to track the usage of each user PTE table.

The following holders take a pte_ref:
  - Each !pte_none() entry, such as a regular page table entry that
    maps a physical page, a swap entry, a migration entry, etc.
  - Each visitor to the PTE page table entries, such as a page table
    walker.

With COW PTE, a new holder (the process using the COW PTE) is added.

It's funny; it leads me to see yet another use for pte_ref.
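
A simplified sketch of the pairing (the helpers in the series handle
more cases; the names here are only for illustration):

	static inline void pte_table_get(struct page *pte_page)
	{
		atomic_inc(&pte_page->pte_ref);
	}

	static inline void pte_table_put(struct mm_struct *mm, pmd_t *pmd,
					 unsigned long addr, struct page *pte_page)
	{
		if (atomic_dec_and_test(&pte_page->pte_ref))
			free_pte_table(mm, pmd, addr);	/* stand-in for the real free path */
	}

With COW PTE, the child's PMD entry would just hold one more reference
on the same PTE table page.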

Thanks,
Qi

[1] [RFC PATCH 00/18] Try to free user PTE page table pages
     link: 
https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
     (percpu_ref version)

[2] [PATCH v3 00/15] Free user PTE page table pages
     link: 
https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
     (atomic count version)

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
                   ` (6 preceding siblings ...)
  2022-05-21  8:59 ` [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Qi Zheng
@ 2022-05-21 16:07 ` David Hildenbrand
  2022-05-21 18:50   ` Chih-En Lin
  2022-05-21 20:12   ` Matthew Wilcox
  7 siblings, 2 replies; 35+ messages in thread
From: David Hildenbrand @ 2022-05-21 16:07 UTC (permalink / raw)
  To: Chih-En Lin, Andrew Morton, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, linux-kernel,
	Kaiyang Zhao, Huichun Feng, Jim Huang

On 19.05.22 20:31, Chih-En Lin wrote:
> When creating the user process, it usually uses the Copy-On-Write (COW)
> mechanism to save the memory usage and the cost of time for copying.
> COW defers the work of copying private memory and shares it across the
> processes as read-only. If either process wants to write in these
> memories, it will page fault and copy the shared memory, so the process
> will now get its private memory right here, which is called break COW.

Yes. Lately we've been dealing with advanced COW+GUP pinnings (which
resulted in PageAnonExclusive, which should hit upstream soon), and
hearing about COW of page tables (and wondering how it will interact
with the mapcount, refcount, PageAnonExclusive of anonymous pages) makes
me feel a bit uneasy :)

> 
> Presently this kind of technology is only used as the mapping memory.
> It still needs to copy the entire page table from the parent.
> It might cost a lot of time and memory to copy each page table when the
> parent already has a lot of page tables allocated. For example, here is
> the state table for mapping the 1 GB memory of forking.
> 
> 	    mmap before fork         mmap after fork
> MemTotal:       32746776 kB             32746776 kB
> MemFree:        31468152 kB             31463244 kB
> AnonPages:       1073836 kB              1073628 kB
> Mapped:            39520 kB                39992 kB
> PageTables:         3356 kB                 5432 kB


I'm missing the most important point: why do we care and why should we
care to make our COW/fork implementation even more complicated?

Yes, we might save some page tables and we might reduce the fork() time,
however, which specific workload really benefits from this and why do we
really care about that workload? Without even hearing about an example
user in this cover letter (unless I missed it), I naturally wonder about
relevance in practice.

I assume it really only matters if we fork() relatively large processes,
like databases for snapshotting. However, fork() is already a pretty
severe performance hit due to COW, and there are alternatives getting
developed as a replacement for such use cases (e.g., uffd-wp).

I'm also missing a performance evaluation: I'd expect some simple
workloads that use fork() might be even slower after fork() with this
change.

(I don't have time to read the paper, I'd expect an independent summary
in the cover letter)


I have tons of questions regarding rmap, accounting, GUP, page table
walkers, OOM situations in page walkers, but at this point I am not
(yet) convinced that the added complexity is really worth it. So I'd
appreciate some additional information.



[...]

> TODO list:
> - Handle the swap

Scary if that's not easy to handle :/

> - Rewrite the TLB flush for zapping the COW PTE table.
> - Experiment COW to the entire page table. (Now just for PTE level)
> - Bug in some case from copy_pte_range()::vm_normal_page()::print_bad_pte().
> - Bug of Bad RSS counter in multiple times COW PTE table forking.



-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-21 16:07 ` David Hildenbrand
@ 2022-05-21 18:50   ` Chih-En Lin
  2022-05-21 20:28     ` David Hildenbrand
  2022-05-21 20:12   ` Matthew Wilcox
  1 sibling, 1 reply; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21 18:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, linux-kernel,
	Kaiyang Zhao, Huichun Feng, Jim Huang

On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
> On 19.05.22 20:31, Chih-En Lin wrote:
> > When creating the user process, it usually uses the Copy-On-Write (COW)
> > mechanism to save the memory usage and the cost of time for copying.
> > COW defers the work of copying private memory and shares it across the
> > processes as read-only. If either process wants to write in these
> > memories, it will page fault and copy the shared memory, so the process
> > will now get its private memory right here, which is called break COW.
> 
> Yes. Lately we've been dealing with advanced COW+GUP pinnings (which
> resulted in PageAnonExclusive, which should hit upstream soon), and
> hearing about COW of page tables (and wondering how it will interact
> with the mapcount, refcount, PageAnonExclusive of anonymous pages) makes
> me feel a bit uneasy :)

I saw that patch series and realized how complicated handling COW of
the physical page is [1][2][3][4]. So the COW page table restricts the
sharing to the page table itself: any modification to a physical page
will trigger a break-COW of the page table.

The present implementation only accounts the physical pages to the RSS
of the owner process of the COW PTE table. Generally the owner is the
parent process. The state of the page, like refcount and mapcount,
does not change under the COW page table.

But if any situation requires the COW page table to consider the state
of the physical page, it could become a headache. ;-)

> > 
> > Presently this kind of technology is only used as the mapping memory.
> > It still needs to copy the entire page table from the parent.
> > It might cost a lot of time and memory to copy each page table when the
> > parent already has a lot of page tables allocated. For example, here is
> > the state table for mapping the 1 GB memory of forking.
> > 
> > 	    mmap before fork         mmap after fork
> > MemTotal:       32746776 kB             32746776 kB
> > MemFree:        31468152 kB             31463244 kB
> > AnonPages:       1073836 kB              1073628 kB
> > Mapped:            39520 kB                39992 kB
> > PageTables:         3356 kB                 5432 kB
> 
> 
> I'm missing the most important point: why do we care and why should we
> care to make our COW/fork implementation even more complicated?
> 
> Yes, we might save some page tables and we might reduce the fork() time,
> however, which specific workload really benefits from this and why do we
> really care about that workload? Without even hearing about an example
> user in this cover letter (unless I missed it), I naturally wonder about
> relevance in practice.
> 
> I assume it really only matters if we fork() relatively large processes,
> like databases for snapshotting. However, fork() is already a pretty
> severe performance hit due to COW, and there are alternatives getting
> developed as a replacement for such use cases (e.g., uffd-wp).
> 
> I'm also missing a performance evaluation: I'd expect some simple
> workloads that use fork() might be even slower after fork() with this
> change.
> 

The paper lists benchmarks of the time cost of On-demand Fork. For
example, on Redis, the mean fork() time when taking a snapshot is
7.40 ms with the default fork() and 0.12 ms with On-demand Fork (the
COW PTE table). But some other cases, like the response latency
distribution of the Apache HTTP Server, do not benefit significantly
from On-demand Fork.

For the COW page table in this patch, I also used perf to analyze the
cost, but it does not look much different from the default fork().

Here is the report; mmap-sfork is the COW page table version:

 Performance counter stats for './mmap-fork' (100 runs):

            373.92 msec task-clock                #    0.992 CPUs utilized            ( +-  0.09% )
                 1      context-switches          #    2.656 /sec                     ( +-  6.03% )
                 0      cpu-migrations            #    0.000 /sec
               881      page-faults               #    2.340 K/sec                    ( +-  0.02% )
     1,860,460,792      cycles                    #    4.941 GHz                      ( +-  0.08% )
     1,451,024,912      instructions              #    0.78  insn per cycle           ( +-  0.00% )
       310,129,843      branches                  #  823.559 M/sec                    ( +-  0.01% )
         1,552,469      branch-misses             #    0.50% of all branches          ( +-  0.38% )

          0.377007 +- 0.000480 seconds time elapsed  ( +-  0.13% )

 Performance counter stats for './mmap-sfork' (100 runs):

            373.04 msec task-clock                #    0.992 CPUs utilized            ( +-  0.10% )
                 1      context-switches          #    2.660 /sec                     ( +-  6.58% )
                 0      cpu-migrations            #    0.000 /sec
               877      page-faults               #    2.333 K/sec                    ( +-  0.08% )
     1,851,843,683      cycles                    #    4.926 GHz                      ( +-  0.08% )
     1,451,763,414      instructions              #    0.78  insn per cycle           ( +-  0.00% )
       310,270,268      branches                  #  825.352 M/sec                    ( +-  0.01% )
         1,649,486      branch-misses             #    0.53% of all branches          ( +-  0.49% )

          0.376095 +- 0.000478 seconds time elapsed  ( +-  0.13% )

So COW of the page table may reduce the fork time, but it does so by
shifting the copy work to the later operations that modify the
physical pages.
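
For reference, the test program is essentially of this shape
(simplified, not the exact source used above):

	#include <string.h>
	#include <sys/mman.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define SIZE (1UL << 30)	/* 1 GB, as in the cover letter */

	int main(void)
	{
		char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		memset(p, 1, SIZE);	/* populate the page tables */

		for (int i = 0; i < 10; i++) {
			pid_t pid = fork();

			if (pid == 0)
				_exit(0);	/* child exits right away */
			waitpid(pid, NULL, 0);
		}
		return 0;
	}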

> (I don't have time to read the paper, I'd expect an independent summary
> in the cover letter)

Sure, I will add more performance evaluations and descriptions in the
next version.

> I have tons of questions regarding rmap, accounting, GUP, page table
> walkers, OOM situations in page walkers, but at this point I am not
> (yet) convinced that the added complexity is really worth it. So I'd
> appreciate some additional information.

It seems like I have a lot of work to do. ;-)

> 
> [...]
> 
> > TODO list:
> > - Handle the swap
> 
> Scary if that's not easy to handle :/

;-)

> -- 
> Thanks,
> 
> David / dhildenb
>

Thanks!

[1] https://lore.kernel.org/all/20220131162940.210846-1-david@redhat.com/T/
[2] https://lore.kernel.org/linux-mm/20220315104741.63071-2-david@redhat.com/T/
[3] https://lore.kernel.org/linux-mm/51afa7a7-15c5-8769-78db-ed2d134792f4@redhat.com/T/
[4] https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-21  8:59 ` [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Qi Zheng
@ 2022-05-21 19:08   ` Chih-En Lin
  0 siblings, 0 replies; 35+ messages in thread
From: Chih-En Lin @ 2022-05-21 19:08 UTC (permalink / raw)
  To: Qi Zheng
  Cc: David Hildenbrand, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, linux-kernel,
	Kaiyang Zhao, Huichun Feng, Jim Huang, Andrew Morton, linux-mm

On Sat, May 21, 2022 at 04:59:19PM +0800, Qi Zheng wrote:
> Hi,
> 
> To reclaim empty user PTE tables, I also introduced a reference
> count (pte_ref) for user PTE tables in my patches [1][2]. It is used
> to track the usage of each user PTE table.
> 
> The following holders take a pte_ref:
>  - Each !pte_none() entry, such as a regular page table entry that
>    maps a physical page, a swap entry, a migration entry, etc.
>  - Each visitor to the PTE page table entries, such as a page table
>    walker.
> 
> With COW PTE, a new holder (the process using the COW PTE) is added.
> 
> It's funny; it leads me to see yet another use for pte_ref.
> 
> Thanks,
> Qi
> 
> [1] [RFC PATCH 00/18] Try to free user PTE page table pages
>     link: https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
>     (percpu_ref version)
> 
> [2] [PATCH v3 00/15] Free user PTE page table pages
>     link: https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>     (atomic count version)
> 
> -- 
> Thanks,
> Qi

Hi,

I saw your patches a few months ago.
Actually, my independent study at school is about tracing the page
table, and one of the topics is your patch set. Your pte_ref work has
been really helpful.
It's great to see you have more ideas for pte_ref.

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-21 16:07 ` David Hildenbrand
  2022-05-21 18:50   ` Chih-En Lin
@ 2022-05-21 20:12   ` Matthew Wilcox
  2022-05-21 20:22     ` David Hildenbrand
  2022-05-21 22:19     ` Andy Lutomirski
  1 sibling, 2 replies; 35+ messages in thread
From: Matthew Wilcox @ 2022-05-21 20:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Chih-En Lin, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner, Vlastimil Babka,
	William Kucharski, John Hubbard, Yunsheng Lin, Arnd Bergmann,
	Suren Baghdasaryan, Colin Cross, Feng Tang, Eric W. Biederman,
	Mike Rapoport, Geert Uytterhoeven, Anshuman Khandual,
	Aneesh Kumar K.V, Daniel Axtens, Jonathan Marek,
	Christophe Leroy, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
> I'm missing the most important point: why do we care and why should we
> care to make our COW/fork implementation even more complicated?
> 
> Yes, we might save some page tables and we might reduce the fork() time,
> however, which specific workload really benefits from this and why do we
> really care about that workload? Without even hearing about an example
> user in this cover letter (unless I missed it), I naturally wonder about
> relevance in practice.

As I get older (and crankier), I get less convinced that fork() is
really the right solution for implementing system().  I feel that a
better model is to create a process with zero threads, but have an fd
to it.  Then manipulate the child process through its fd (eg mmap
ld.so, open new fds in that process's fdtable, etc).  Closing the fd
launches a new thread in the process (ensuring nobody has an fd to a
running process, particularly one which is setuid).
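
Purely to illustrate the shape of such an API, a hypothetical sketch
(none of these calls exist; process_create(), process_mmap() and
process_dup2() are invented names):

/* Hypothetical only -- invented syscalls to show the model, not a proposal. */
int pfd = process_create(0);               /* empty process, zero threads   */
process_mmap(pfd, NULL, ldso_len,          /* map ld.so into the child      */
	     PROT_READ | PROT_EXEC, MAP_PRIVATE, ldso_fd, 0);
process_dup2(pfd, logfile_fd, 2);          /* populate the child's fd table */
close(pfd);                                /* last fd closed: thread starts */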

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-21 20:12   ` Matthew Wilcox
@ 2022-05-21 20:22     ` David Hildenbrand
  2022-05-21 22:19     ` Andy Lutomirski
  1 sibling, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2022-05-21 20:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Chih-En Lin, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner, Vlastimil Babka,
	William Kucharski, John Hubbard, Yunsheng Lin, Arnd Bergmann,
	Suren Baghdasaryan, Colin Cross, Feng Tang, Eric W. Biederman,
	Mike Rapoport, Geert Uytterhoeven, Anshuman Khandual,
	Aneesh Kumar K.V, Daniel Axtens, Jonathan Marek,
	Christophe Leroy, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Andy Lutomirski, Sebastian Andrzej Siewior,
	Fenghua Yu, linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

On 21.05.22 22:12, Matthew Wilcox wrote:
> On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
>> I'm missing the most important point: why do we care and why should we
>> care to make our COW/fork implementation even more complicated?
>>
>> Yes, we might save some page tables and we might reduce the fork() time,
>> however, which specific workload really benefits from this and why do we
>> really care about that workload? Without even hearing about an example
>> user in this cover letter (unless I missed it), I naturally wonder about
>> relevance in practice.
> 
> As I get older (and crankier), I get less convinced that fork() is
> really the right solution for implementing system().

Heh, I couldn't agree more. IMHO, fork() is mostly a blast from the
past. There *are* still a lot of users, and there are a couple of sane
use cases.

Consequently, I am not convinced that it is something to optimize for,
especially if it adds additional complexity. For the use case of
snapshotting, we have better mechanisms nowadays (uffd-wp) that avoid
messing with copying address spaces.

Calling fork()/system() from a big, performance-sensitive process is
usually a bad idea.

Note: there is an (for me) interesting paper about this topic from 2019
("A fork() in the road"), although it might be a bit biased coming from
Microsoft research :). It comes to a similar conclusion regarding fork
and how it should or shouldn't dictate our OS design.

[1] https://www.microsoft.com/en-us/research/publication/a-fork-in-the-road/

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-21 18:50   ` Chih-En Lin
@ 2022-05-21 20:28     ` David Hildenbrand
  0 siblings, 0 replies; 35+ messages in thread
From: David Hildenbrand @ 2022-05-21 20:28 UTC (permalink / raw)
  To: Chih-En Lin
  Cc: Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Christian Brauner,
	Matthew Wilcox (Oracle),
	Vlastimil Babka, William Kucharski, John Hubbard, Yunsheng Lin,
	Arnd Bergmann, Suren Baghdasaryan, Colin Cross, Feng Tang,
	Eric W. Biederman, Mike Rapoport, Geert Uytterhoeven,
	Anshuman Khandual, Aneesh Kumar K.V, Daniel Axtens,
	Jonathan Marek, Christophe Leroy, Pasha Tatashin, Peter Xu,
	Andrea Arcangeli, Thomas Gleixner, Andy Lutomirski,
	Sebastian Andrzej Siewior, Fenghua Yu, linux-kernel,
	Kaiyang Zhao, Huichun Feng, Jim Huang

On 21.05.22 20:50, Chih-En Lin wrote:
> On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
>> On 19.05.22 20:31, Chih-En Lin wrote:
>>> When creating the user process, it usually uses the Copy-On-Write (COW)
>>> mechanism to save the memory usage and the cost of time for copying.
>>> COW defers the work of copying private memory and shares it across the
>>> processes as read-only. If either process wants to write in these
>>> memories, it will page fault and copy the shared memory, so the process
>>> will now get its private memory right here, which is called break COW.
>>
>> Yes. Lately we've been dealing with advanced COW+GUP pinnings (which
>> resulted in PageAnonExclusive, which should hit upstream soon), and
>> hearing about COW of page tables (and wondering how it will interact
>> with the mapcount, refcount, PageAnonExclusive of anonymous pages) makes
>> me feel a bit uneasy :)
> 
> I saw that patch series and learned how complicated handling COW of
> the physical page is [1][2][3][4]. So the COW page table tends to
> restrict the sharing to the page table only. This means any modification
> to the physical page will trigger the break COW of the page table.
> 
> Presently the implementation only updates the physical page information
> in the RSS of the owner process of the COW PTE. Generally the owner is
> the parent process. And the state of the page, like refcount and
> mapcount, will not change under the COW page table.
> 
> But if any situation leads to the COW page table needing to consider
> the state of the physical page, it might get troublesome. ;-)

I haven't looked into the details of how GUP deals with these COW page
tables. But I suspect there might be problems with page pinning:
skipping copy_present_page() even for R/O pages is usually problematic
with R/O pinnings of pages. I might be just wrong.

> 
>>>
>>> Presently this kind of technology is only used as the mapping memory.
>>> It still needs to copy the entire page table from the parent.
>>> It might cost a lot of time and memory to copy each page table when the
>>> parent already has a lot of page tables allocated. For example, here is
>>> the state table for mapping the 1 GB memory of forking.
>>>
>>> 	    mmap before fork         mmap after fork
>>> MemTotal:       32746776 kB             32746776 kB
>>> MemFree:        31468152 kB             31463244 kB
>>> AnonPages:       1073836 kB              1073628 kB
>>> Mapped:            39520 kB                39992 kB
>>> PageTables:         3356 kB                 5432 kB
>>
>>
>> I'm missing the most important point: why do we care and why should we
>> care to make our COW/fork implementation even more complicated?
>>
>> Yes, we might save some page tables and we might reduce the fork() time,
>> however, which specific workload really benefits from this and why do we
>> really care about that workload? Without even hearing about an example
>> user in this cover letter (unless I missed it), I naturally wonder about
>> relevance in practice.
>>
>> I assume it really only matters if we fork() relatively large processes,
>> like databases for snapshotting. However, fork() is already a pretty
>> severe performance hit due to COW, and there are alternatives getting
>> developed as a replacement for such use cases (e.g., uffd-wp).
>>
>> I'm also missing a performance evaluation: I'd expect some simple
>> workloads that use fork() might be even slower after fork() with this
>> change.
>>
> 
> The paper mentions a list of benchmarks of the time cost for On-Demand
> Fork. For example, on Redis, the mean time of fork() when taking the
> snapshot: default fork() took 7.40 ms; On-Demand Fork (COW PTE table)
> took 0.12 ms. But some other cases, like the response latency
> distribution of the Apache HTTP Server, do not benefit significantly
> from On-Demand Fork.

Thanks. I expected that snapshotting would pop up and be one of the most
prominent users that could benefit. However, for that specific use case
I am convinced that uffd-wp is the better choice and fork() is just the
old way of doing it, from back when there was nothing better at hand.
QEMU already implements snapshotting of VMs that way, and I remember
that Redis also intended to implement support for uffd-wp. Not sure what
happened with that and whether anything is missing to make it work.
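
For reference, the write-protect flow a process would use on its own
anonymous memory looks roughly like the minimal sketch below (error
handling omitted; exact feature availability depends on the kernel
version and on /proc/sys/vm/unprivileged_userfaultfd):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Write-protect [addr, addr + len) via uffd-wp instead of fork()ing. */
static int wp_snapshot_start(void *addr, unsigned long len)
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	ioctl(uffd, UFFDIO_API, &api);
	ioctl(uffd, UFFDIO_REGISTER, &reg);
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
	/*
	 * A monitor thread then reads write-fault events from uffd, saves
	 * the old page contents for the snapshot, and removes the write
	 * protection on that page so the faulting thread can continue.
	 */
	return uffd;
}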

> 
> For the COW page table from this patch, I also used perf to analyze
> the time cost. But it doesn't look different from the default fork().

Interesting, thanks for sharing.

> 
> Here is the report, the mmap-sfork is COW page table version:
> 
>  Performance counter stats for './mmap-fork' (100 runs):
> 
>             373.92 msec task-clock                #    0.992 CPUs utilized            ( +-  0.09% )
>                  1      context-switches          #    2.656 /sec                     ( +-  6.03% )
>                  0      cpu-migrations            #    0.000 /sec
>                881      page-faults               #    2.340 K/sec                    ( +-  0.02% )
>      1,860,460,792      cycles                    #    4.941 GHz                      ( +-  0.08% )
>      1,451,024,912      instructions              #    0.78  insn per cycle           ( +-  0.00% )
>        310,129,843      branches                  #  823.559 M/sec                    ( +-  0.01% )
>          1,552,469      branch-misses             #    0.50% of all branches          ( +-  0.38% )
> 
>           0.377007 +- 0.000480 seconds time elapsed  ( +-  0.13% )
> 
>  Performance counter stats for './mmap-sfork' (100 runs):
> 
>             373.04 msec task-clock                #    0.992 CPUs utilized            ( +-  0.10% )
>                  1      context-switches          #    2.660 /sec                     ( +-  6.58% )
>                  0      cpu-migrations            #    0.000 /sec
>                877      page-faults               #    2.333 K/sec                    ( +-  0.08% )
>      1,851,843,683      cycles                    #    4.926 GHz                      ( +-  0.08% )
>      1,451,763,414      instructions              #    0.78  insn per cycle           ( +-  0.00% )
>        310,270,268      branches                  #  825.352 M/sec                    ( +-  0.01% )
>          1,649,486      branch-misses             #    0.53% of all branches          ( +-  0.49% )
> 
>           0.376095 +- 0.000478 seconds time elapsed  ( +-  0.13% )
> 
> So, COW of the page table may reduce the time of forking, but it does
> so by transferring the copy work to the later operations that modify
> the physical pages.

Right.

> 
>> I have tons of questions regarding rmap, accounting, GUP, page table
>> walkers, OOM situations in page walkers, but at this point I am not
>> (yet) convinced that the added complexity is really worth it. So I'd
>> appreciate some additional information.
> 
> It seems like I have a lot of work to do. ;-)

Messing with page tables and COW is usually like opening a can of worms :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-21 20:12   ` Matthew Wilcox
  2022-05-21 20:22     ` David Hildenbrand
@ 2022-05-21 22:19     ` Andy Lutomirski
  2022-05-22  0:31       ` Matthew Wilcox
  1 sibling, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2022-05-21 22:19 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand
  Cc: Chih-En Lin, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner, Vlastimil Babka,
	William Kucharski, John Hubbard, Yunsheng Lin, Arnd Bergmann,
	Suren Baghdasaryan, Colin Cross, Feng Tang, Eric W. Biederman,
	Mike Rapoport, Geert Uytterhoeven, Anshuman Khandual,
	Aneesh Kumar K.V, Daniel Axtens, Jonathan Marek,
	Christophe Leroy, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Sebastian Andrzej Siewior, Fenghua Yu,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

On 5/21/22 13:12, Matthew Wilcox wrote:
> On Sat, May 21, 2022 at 06:07:27PM +0200, David Hildenbrand wrote:
>> I'm missing the most important point: why do we care and why should we
>> care to make our COW/fork implementation even more complicated?
>>
>> Yes, we might save some page tables and we might reduce the fork() time,
>> however, which specific workload really benefits from this and why do we
>> really care about that workload? Without even hearing about an example
>> user in this cover letter (unless I missed it), I naturally wonder about
>> relevance in practice.
> 
> As I get older (and crankier), I get less convinced that fork() is
> really the right solution for implementing system().  I feel that a
> better model is to create a process with zero threads, but have an fd
> to it.  Then manipulate the child process through its fd (eg mmap
> ld.so, open new fds in that process's fdtable, etc).  Closing the fd
> launches a new thread in the process (ensuring nobody has an fd to a
> running process, particularly one which is setuid).

Heh, I learned serious programming on Windows, and I thought fork() was 
entertaining, cool, and a bad idea when I first learned about it.  (I 
admit I did think the fact that POSIX fork and exec had many fewer 
arguments than CreateProcess was a good thing.)  Don't even get me 
started on setuid -- if I had my way, distros would set NO_NEW_PRIVS on 
boot for the entire system.

I can see a rather different use for this type of shared-pagetable 
technology, though: monstrous MAP_SHARED mappings.  For database and 
some VM users, multiple processes will map the same file.  If there was 
a way to ensure appropriate alignment (or at least encourage it) and a 
way to handle mappings that don't cover the whole file, then having 
multiple mappings share the same page tables could be a decent 
efficiency gain.  This doesn't even need COW -- it's "just" pagetable 
sharing.
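
There is already one narrow precedent in the tree: hugetlbfs can share
PMD page tables between processes that MAP_SHARED the same file over
PUD_SIZE-aligned ranges (see huge_pmd_share()).  A rough sketch of the
usage pattern that can hit that path today, only as a hint of what the
general case might look like:

#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch: cooperating processes each map the same hugetlbfs file with
 * MAP_SHARED.  Where the mapping covers whole PUD_SIZE-aligned chunks,
 * the kernel may share the underlying PMD page tables between them.
 */
static void *map_shared_hugetlb(int hugetlbfs_fd, size_t len)
{
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		    hugetlbfs_fd, 0);
}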

It's probably a pipe dream, but I like to imagine that the bookkeeping 
that would enable this would also enable a much less ad-hoc concept of 
who owns which pagetable page.  Then things like x86's KPTI LDT mappings 
would be less disgusting under the hood.

Android would probably like a similar feature for MAP_ANONYMOUS, or one 
that could otherwise enable Zygote to share paging structures (ideally 
without fork(), although that's my dream, not necessarily Android's). 
This is more complex, since COW is involved.  It is also possibly less 
valuable -- possibly the entire benefit, and then some, would be achieved 
by using huge pages for Zygote and arranging for a CoW of one normal-size 
page out of a hugepage COW mapping to copy only that one page.

--Andy

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-21 22:19     ` Andy Lutomirski
@ 2022-05-22  0:31       ` Matthew Wilcox
  2022-05-22 15:20         ` Andy Lutomirski
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2022-05-22  0:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Hildenbrand, Chih-En Lin, Andrew Morton, linux-mm,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Christian Brauner, Vlastimil Babka,
	William Kucharski, John Hubbard, Yunsheng Lin, Arnd Bergmann,
	Suren Baghdasaryan, Colin Cross, Feng Tang, Eric W. Biederman,
	Mike Rapoport, Geert Uytterhoeven, Anshuman Khandual,
	Aneesh Kumar K.V, Daniel Axtens, Jonathan Marek,
	Christophe Leroy, Pasha Tatashin, Peter Xu, Andrea Arcangeli,
	Thomas Gleixner, Sebastian Andrzej Siewior, Fenghua Yu,
	linux-kernel, Kaiyang Zhao, Huichun Feng, Jim Huang

On Sat, May 21, 2022 at 03:19:24PM -0700, Andy Lutomirski wrote:
> I can see a rather different use for this type of shared-pagetable
> technology, though: monstrous MAP_SHARED mappings.  For database and some VM
> users, multiple processes will map the same file.  If there was a way to
> ensure appropriate alignment (or at least encourage it) and a way to handle
> mappings that don't cover the whole file, then having multiple mappings
> share the same page tables could be a decent efficiency gain.  This doesn't
> even need COW -- it's "just" pagetable sharing.

The mshare proposal did not get a warm reception at LSFMM ;-(

The conceptual model doesn't seem to work for the MM developers who were
in the room.  "Fear" was the most-used word.  Not sure how we're going
to get to a model of sharing page tables that doesn't scare people.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-22  0:31       ` Matthew Wilcox
@ 2022-05-22 15:20         ` Andy Lutomirski
  2022-05-22 19:40           ` Matthew Wilcox
  0 siblings, 1 reply; 35+ messages in thread
From: Andy Lutomirski @ 2022-05-22 15:20 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: David Hildenbrand, Chih-En Lin, Andrew Morton, linux-mm,
	Ingo Molnar, Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Christian Brauner, Vlastimil Babka, William Kucharski,
	John Hubbard, Yunsheng Lin, Arnd Bergmann, Suren Baghdasaryan,
	Colin Cross, Feng Tang, Eric W. Biederman, Mike Rapoport,
	Geert Uytterhoeven, Anshuman Khandual, Aneesh Kumar K.V,
	Daniel Axtens, Jonathan Marek, Christophe Leroy, Pasha Tatashin,
	Peter Xu, Andrea Arcangeli, Thomas Gleixner,
	Sebastian Andrzej Siewior, Fenghua Yu, Linux Kernel Mailing List,
	Kaiyang Zhao, Huichun Feng, Jim Huang



On Sat, May 21, 2022, at 5:31 PM, Matthew Wilcox wrote:
> On Sat, May 21, 2022 at 03:19:24PM -0700, Andy Lutomirski wrote:
>> I can see a rather different use for this type of shared-pagetable
>> technology, though: monstrous MAP_SHARED mappings.  For database and some VM
>> users, multiple processes will map the same file.  If there was a way to
>> ensure appropriate alignment (or at least encourage it) and a way to handle
>> mappings that don't cover the whole file, then having multiple mappings
>> share the same page tables could be a decent efficiency gain.  This doesn't
>> even need COW -- it's "just" pagetable sharing.
>
> The mshare proposal did not get a warm reception at LSFMM ;-(
>
> The conceptual model doesn't seem to work for the MM developers who were
> in the room.  "Fear" was the most-used word.  Not sure how we're going
> to get to a model of sharing page tables that doesn't scare people.

FWIW, I didn’t like mshare.  mshare was weird: it seemed to have one mm own some page tables and other mms share them.  I’m talking about having a *file* own page tables and mms map them.  This seems less fear-inducing to me.  Circular dependencies are impossible, mmap calls don’t need to propagate, etc.

It would still be quite a change, though.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table
  2022-05-22 15:20         ` Andy Lutomirski
@ 2022-05-22 19:40           ` Matthew Wilcox
  0 siblings, 0 replies; 35+ messages in thread
From: Matthew Wilcox @ 2022-05-22 19:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Hildenbrand, Chih-En Lin, Andrew Morton, linux-mm,
	Ingo Molnar, Peter Zijlstra (Intel),
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Christian Brauner, Vlastimil Babka, William Kucharski,
	John Hubbard, Yunsheng Lin, Arnd Bergmann, Suren Baghdasaryan,
	Colin Cross, Feng Tang, Eric W. Biederman, Mike Rapoport,
	Geert Uytterhoeven, Anshuman Khandual, Aneesh Kumar K.V,
	Daniel Axtens, Jonathan Marek, Christophe Leroy, Pasha Tatashin,
	Peter Xu, Andrea Arcangeli, Thomas Gleixner,
	Sebastian Andrzej Siewior, Fenghua Yu, Linux Kernel Mailing List,
	Kaiyang Zhao, Huichun Feng, Jim Huang

On Sun, May 22, 2022 at 08:20:05AM -0700, Andy Lutomirski wrote:
> On Sat, May 21, 2022, at 5:31 PM, Matthew Wilcox wrote:
> > On Sat, May 21, 2022 at 03:19:24PM -0700, Andy Lutomirski wrote:
> >> I can see a rather different use for this type of shared-pagetable
> >> technology, though: monstrous MAP_SHARED mappings.  For database and some VM
> >> users, multiple processes will map the same file.  If there was a way to
> >> ensure appropriate alignment (or at least encourage it) and a way to handle
> >> mappings that don't cover the whole file, then having multiple mappings
> >> share the same page tables could be a decent efficiency gain.  This doesn't
> >> even need COW -- it's "just" pagetable sharing.
> >
> > The mshare proposal did not get a warm reception at LSFMM ;-(
> >
> > The conceptual model doesn't seem to work for the MM developers who were
> > in the room.  "Fear" was the most-used word.  Not sure how we're going
> > to get to a model of sharing page tables that doesn't scare people.
> 
> FWIW, I didn’t like mshare.  mshare was weird: it seemed to have
> one mm own some page tables and other mms share them.  I’m talking
> about having a *file* own page tables and mms map them.  This seems less
> fear-inducing to me.  Circular dependencies are impossible, mmap calls
> don’t need to propagate, etc.

OK, so that doesn't work for our use case.  We need an object to own page
tables that can be shared between different (co-operating) processes.
Because we need the property that calling mprotect() changes the
protection in all processes at the same time.

Obviously we want that object to be referenced by a file descriptor, and
it can also have a name.  That object doesn't have to be an mm_struct.
Maybe that would be enough of a change to remove the fear.
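
Spelled out as a purely hypothetical sketch, just to make the desired
semantics concrete (all of these names are invented; no such API exists):

/* Hypothetical -- invented calls illustrating an fd-referenced object
 * that owns page tables and is mapped by cooperating processes. */
int ptfd = pgtable_object_create("db-region", 512UL << 30); /* create and name the object */
void *a  = pgtable_object_map(ptfd, NULL);                  /* each process maps it       */
pgtable_object_mprotect(ptfd, 0, 1UL << 30, PROT_READ);     /* applies to all mappers     */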

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2022-05-22 19:43 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-19 18:31 [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 1/6] mm: Add a new mm flag for Copy-On-Write PTE table Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 2/6] mm: clone3: Add CLONE_COW_PGTABLE flag Chih-En Lin
2022-05-20 14:13   ` Christophe Leroy
2022-05-21  3:50     ` Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 3/6] mm, pgtable: Add ownership for the PTE table Chih-En Lin
2022-05-19 23:07   ` kernel test robot
2022-05-20  0:08   ` kernel test robot
2022-05-20 14:15   ` Christophe Leroy
2022-05-21  4:03     ` Chih-En Lin
2022-05-21  4:02   ` Matthew Wilcox
2022-05-21  5:01     ` Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 4/6] mm: Add COW PTE fallback function Chih-En Lin
2022-05-20  0:20   ` kernel test robot
2022-05-20 14:21   ` Christophe Leroy
2022-05-21  4:15     ` Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 5/6] mm, pgtable: Add the reference counter for COW PTE Chih-En Lin
2022-05-20 14:30   ` Christophe Leroy
2022-05-21  4:22     ` Chih-En Lin
2022-05-21  4:08   ` Matthew Wilcox
2022-05-21  5:10     ` Chih-En Lin
2022-05-19 18:31 ` [RFC PATCH 6/6] mm: Expand Copy-On-Write to PTE table Chih-En Lin
2022-05-20 14:49   ` Christophe Leroy
2022-05-21  4:38     ` Chih-En Lin
2022-05-21  8:59 ` [External] [RFC PATCH 0/6] Introduce Copy-On-Write to Page Table Qi Zheng
2022-05-21 19:08   ` Chih-En Lin
2022-05-21 16:07 ` David Hildenbrand
2022-05-21 18:50   ` Chih-En Lin
2022-05-21 20:28     ` David Hildenbrand
2022-05-21 20:12   ` Matthew Wilcox
2022-05-21 20:22     ` David Hildenbrand
2022-05-21 22:19     ` Andy Lutomirski
2022-05-22  0:31       ` Matthew Wilcox
2022-05-22 15:20         ` Andy Lutomirski
2022-05-22 19:40           ` Matthew Wilcox
