* [PATCH v3 00/15] Free user PTE page table pages
@ 2021-11-10  8:40 Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Hi,

This patch series aims to free user PTE page table pages when all of their
PTE entries are empty.

The story begins with malloc libraries (e.g. jemalloc or tcmalloc) that
typically allocate a large amount of virtual address space with mmap() and do
not unmap it; when they want to release physical memory, they use
madvise(MADV_DONTNEED). But madvise() does not free the page tables, so a
process that touches an enormous virtual address space can accumulate a large
number of page tables.

The following figures are a memory usage snapshot of a process that actually
ran on one of our servers:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

As we can see, the PTE page tables take up 110g while the RES is 590g. In
theory, the process needs only about 1.2g of PTE page tables to map that much
physical memory. The reason the PTE page tables occupy so much memory is that
madvise(MADV_DONTNEED) only clears the PTE entries and frees the physical
memory, but does not free the PTE page table pages themselves. So we can free
those empty PTE page tables to save memory. In the case above, about 108g can
be saved (best case), and the larger the difference between VIRT and RES, the
more memory we save.

In this patch series, we add a pte_refcount field to the struct page of a PTE
page table to track how many users the page table page has. As with the
ordinary page refcount, a user of a PTE page table page must hold a refcount
on it before accessing it, and the page is freed when the last refcount is
dropped.

Testing:

The following code snippet shows the effect of the optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.6 MB
VmPTE                102640 kB                         248 kB

I have also tested the stability with LTP[1] for several weeks and have not
seen any crashes so far.

Page fault performance can be affected by the allocation and freeing of PTE
page table pages. The following are the results of a micro benchmark[2]:

root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

threads         before (pf/min)                     after (pf/min)
    1                32,085,255                         31,880,833 (-0.64%)
    8               101,674,967                        100,588,311 (-1.17%)
   16               113,207,000                        112,801,832 (-0.36%)

(The "pf/min" columns show how many page faults complete in one minute.)

Page fault performance is at most ~1% slower than before.

And there are no obvious changes in perf hot spots:

before:
  19.29%  [kernel]  [k] clear_page_rep
  16.12%  [kernel]  [k] do_user_addr_fault
   9.57%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.16%  [kernel]  [k] get_page_from_freelist
   5.03%  [kernel]  [k] __handle_mm_fault
   3.53%  [kernel]  [k] __rcu_read_unlock
   3.45%  [kernel]  [k] handle_mm_fault
   3.38%  [kernel]  [k] down_read_trylock
   2.74%  [kernel]  [k] free_unref_page_list
   2.17%  [kernel]  [k] up_read
   1.93%  [kernel]  [k] charge_memcg
   1.73%  [kernel]  [k] try_charge_memcg
   1.71%  [kernel]  [k] __alloc_pages
   1.69%  [kernel]  [k] ___perf_sw_event
   1.44%  [kernel]  [k] get_mem_cgroup_from_mm

after:
  18.19%  [kernel]  [k] clear_page_rep
  16.28%  [kernel]  [k] do_user_addr_fault
   8.39%  [kernel]  [k] _raw_spin_unlock_irqrestore
   5.12%  [kernel]  [k] get_page_from_freelist
   4.81%  [kernel]  [k] __handle_mm_fault
   4.68%  [kernel]  [k] down_read_trylock
   3.80%  [kernel]  [k] handle_mm_fault
   3.59%  [kernel]  [k] get_mem_cgroup_from_mm
   2.49%  [kernel]  [k] free_unref_page_list
   2.41%  [kernel]  [k] up_read
   2.16%  [kernel]  [k] charge_memcg
   1.92%  [kernel]  [k] __rcu_read_unlock
   1.88%  [kernel]  [k] ___perf_sw_event
   1.70%  [kernel]  [k] pte_get_unless_zero

This series is based on next-20211108.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c

Changelog in v2 -> v3:
 - Refactored this patch series:
        - [PATCH v3 6/15]: Introduce the new dummy helpers first
        - [PATCH v3 7-12/15]: Convert each subsystem individually
        - [PATCH v3 13/15]: Implement the actual logic to the dummy helpers
   Thanks to David and Jason for their advice.
 - Add a document.

Changelog in v1 -> v2:
 - Change pte_install() to pmd_install().
 - Fix some typos and code style problems.
 - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9],[PATCH v2 6/9]
   and [PATCH v2 7/9].

Qi Zheng (15):
  mm: do code cleanups to filemap_map_pmd()
  mm: introduce is_huge_pmd() helper
  mm: move pte_offset_map_lock() to pgtable.h
  mm: rework the parameter of lock_page_or_retry()
  mm: add pmd_installed_type return for __pte_alloc() and other friends
  mm: introduce refcount for user PTE page table page
  mm/pte_ref: add support for user PTE page table page allocation
  mm/pte_ref: initialize the refcount of the withdrawn PTE page table
    page
  mm/pte_ref: add support for the map/unmap of user PTE page table page
  mm/pte_ref: add support for page fault path
  mm/pte_ref: take a refcount before accessing the PTE page table page
  mm/pte_ref: update the pmd entry in move_normal_pmd()
  mm/pte_ref: free user PTE page table pages
  Documentation: add document for pte_ref
  mm/pte_ref: use mmu_gather to free PTE page table pages

 Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
 arch/x86/Kconfig             |   2 +-
 fs/proc/task_mmu.c           |  24 +++-
 fs/userfaultfd.c             |   9 +-
 include/linux/huge_mm.h      |  10 +-
 include/linux/mm.h           | 170 ++++-------------------------
 include/linux/mm_types.h     |   6 +-
 include/linux/pagemap.h      |   8 +-
 include/linux/pgtable.h      | 152 +++++++++++++++++++++++++-
 include/linux/pte_ref.h      | 146 +++++++++++++++++++++++++
 include/linux/rmap.h         |   2 +
 kernel/events/uprobes.c      |   2 +
 mm/Kconfig                   |   4 +
 mm/Makefile                  |   4 +-
 mm/damon/vaddr.c             |  12 +-
 mm/debug_vm_pgtable.c        |   5 +-
 mm/filemap.c                 |  45 +++++---
 mm/gup.c                     |  25 ++++-
 mm/hmm.c                     |   5 +-
 mm/huge_memory.c             |   3 +-
 mm/internal.h                |   4 +-
 mm/khugepaged.c              |  21 +++-
 mm/ksm.c                     |   6 +-
 mm/madvise.c                 |  21 +++-
 mm/memcontrol.c              |  12 +-
 mm/memory-failure.c          |  11 +-
 mm/memory.c                  | 254 ++++++++++++++++++++++++++++++++-----------
 mm/mempolicy.c               |   6 +-
 mm/migrate.c                 |  54 ++++-----
 mm/mincore.c                 |   7 +-
 mm/mlock.c                   |   1 +
 mm/mmu_gather.c              |  40 +++----
 mm/mprotect.c                |  11 +-
 mm/mremap.c                  |  14 ++-
 mm/page_vma_mapped.c         |   4 +
 mm/pagewalk.c                |  15 ++-
 mm/pgtable-generic.c         |   1 +
 mm/pte_ref.c                 | 141 ++++++++++++++++++++++++
 mm/rmap.c                    |  10 ++
 mm/swapfile.c                |   3 +
 mm/userfaultfd.c             |  40 +++++--
 41 files changed, 1186 insertions(+), 340 deletions(-)
 create mode 100644 Documentation/vm/pte_ref.rst
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

-- 
2.11.0


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd()
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Currently the same few lines are repeated twice in
filemap_map_pmd(). Deduplicate them and fix some code
style issues.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/filemap.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index daa0e23a6ee6..07c654202870 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3203,11 +3203,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	struct mm_struct *mm = vmf->vma->vm_mm;
 
 	/* Huge page is mapped? No need to proceed. */
-	if (pmd_trans_huge(*vmf->pmd)) {
-		unlock_page(page);
-		put_page(page);
-		return true;
-	}
+	if (pmd_trans_huge(*vmf->pmd))
+		goto out;
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
 		vm_fault_t ret = do_set_pmd(vmf, page);
@@ -3222,13 +3219,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
 	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd)) {
-		unlock_page(page);
-		put_page(page);
-		return true;
-	}
+	if (pmd_devmap_trans_unstable(vmf->pmd))
+		goto out;
 
 	return false;
+
+out:
+	unlock_page(page);
+	put_page(page);
+	return true;
 }
 
 static struct page *next_uptodate_page(struct page *page,
-- 
2.11.0



* [PATCH v3 02/15] mm: introduce is_huge_pmd() helper
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10 12:29   ` Jason Gunthorpe
  2021-11-10  8:40 ` [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h Qi Zheng
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Currently the following check, which determines whether *pmd is a huge
pmd, is repeated in several places in the code:

	is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)

Introduce the is_huge_pmd() helper to deduplicate it.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/huge_mm.h | 10 +++++++---
 mm/huge_memory.c        |  3 +--
 mm/memory.c             |  5 ++---
 mm/mprotect.c           |  2 +-
 mm/mremap.c             |  3 +--
 5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..b37a89180846 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -199,8 +199,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
-					|| pmd_devmap(*____pmd))	\
+		if (is_huge_pmd(*____pmd))				\
 			__split_huge_pmd(__vma, __pmd, __address,	\
 						false, NULL);		\
 	}  while (0)
@@ -232,11 +231,16 @@ static inline int is_swap_pmd(pmd_t pmd)
 	return !pmd_none(pmd) && !pmd_present(pmd);
 }
 
+static inline int is_huge_pmd(pmd_t pmd)
+{
+	return is_swap_pmd(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd);
+}
+
 /* mmap_lock must be held on entry */
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
-	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_huge_pmd(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..e76ee2e1e423 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1832,8 +1832,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
-			pmd_devmap(*pmd)))
+	if (likely(is_huge_pmd(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 855486fff526..b00cd60fc368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1146,8 +1146,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
-			|| pmd_devmap(*src_pmd)) {
+		if (is_huge_pmd(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
 			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1441,7 +1440,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_huge_pmd(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e552f5e0ccbd..2d5064a4631c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -257,7 +257,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(&range);
 		}
 
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_huge_pmd(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index 002eec83e91e..c6e9da09dd0a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,8 +532,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
-		    pmd_devmap(*old_pmd)) {
+		if (is_huge_pmd(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE &&
 			    move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
 					   old_pmd, new_pmd, need_rmap_locks))
-- 
2.11.0



* [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry() Qi Zheng
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

pte_offset_map() lives in include/linux/pgtable.h, so move its
friend pte_offset_map_lock() to pgtable.h as well.

pte_lockptr() is required by pte_offset_map_lock(), so
also move {pte,pmd,pud}_lockptr() to pgtable.h.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h      | 149 ------------------------------------------------
 include/linux/pgtable.h | 149 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+), 149 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..706da081b9f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2284,70 +2284,6 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU */
 
-#if USE_SPLIT_PTE_PTLOCKS
-#if ALLOC_SPLIT_PTLOCKS
-void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
-extern void ptlock_free(struct page *page);
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return page->ptl;
-}
-#else /* ALLOC_SPLIT_PTLOCKS */
-static inline void ptlock_cache_init(void)
-{
-}
-
-static inline bool ptlock_alloc(struct page *page)
-{
-	return true;
-}
-
-static inline void ptlock_free(struct page *page)
-{
-}
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return &page->ptl;
-}
-#endif /* ALLOC_SPLIT_PTLOCKS */
-
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_page(*pmd));
-}
-
-static inline bool ptlock_init(struct page *page)
-{
-	/*
-	 * prep_new_page() initialize page->private (and therefore page->ptl)
-	 * with 0. Make sure nobody took it in use in between.
-	 *
-	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache, which share storage with page->ptl.
-	 */
-	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-	if (!ptlock_alloc(page))
-		return false;
-	spin_lock_init(ptlock_ptr(page));
-	return true;
-}
-
-#else	/* !USE_SPLIT_PTE_PTLOCKS */
-/*
- * We use mm->page_table_lock to guard all pagetable pages of the mm.
- */
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
-static inline void ptlock_free(struct page *page) {}
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
 static inline void pgtable_init(void)
 {
 	ptlock_cache_init();
@@ -2370,20 +2306,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
-({							\
-	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
-	pte_t *__pte = pte_offset_map(pmd, address);	\
-	*(ptlp) = __ptl;				\
-	spin_lock(__ptl);				\
-	__pte;						\
-})
-
-#define pte_unmap_unlock(pte, ptl)	do {		\
-	spin_unlock(ptl);				\
-	pte_unmap(pte);					\
-} while (0)
-
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
 #define pte_alloc_map(mm, pmd, address)			\
@@ -2397,58 +2319,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
 		NULL: pte_offset_kernel(pmd, address))
 
-#if USE_SPLIT_PMD_PTLOCKS
-
-static struct page *pmd_to_page(pmd_t *pmd)
-{
-	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-	return virt_to_page((void *)((unsigned long) pmd & mask));
-}
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_to_page(pmd));
-}
-
-static inline bool pmd_ptlock_init(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page->pmd_huge_pte = NULL;
-#endif
-	return ptlock_init(page);
-}
-
-static inline void pmd_ptlock_free(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
-#endif
-	ptlock_free(page);
-}
-
-#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
-
-#else
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
-
-#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
-
-#endif
-
-static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
-{
-	spinlock_t *ptl = pmd_lockptr(mm, pmd);
-	spin_lock(ptl);
-	return ptl;
-}
-
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
 	if (!pmd_ptlock_init(page))
@@ -2465,25 +2335,6 @@ static inline void pgtable_pmd_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-/*
- * No scalability reason to split PUD locks yet, but follow the same pattern
- * as the PMD locks to make it easier if we decide to.  The VM should not be
- * considered ready to switch to split PUD locks yet; there may be places
- * which need to be converted from page_table_lock.
- */
-static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
-{
-	return &mm->page_table_lock;
-}
-
-static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
-{
-	spinlock_t *ptl = pud_lockptr(mm, pud);
-
-	spin_lock(ptl);
-	return ptl;
-}
-
 extern void __init pagecache_init(void);
 extern void __init free_area_init_memoryless_node(int nid);
 extern void free_initmem(void);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..c8f045705c1e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -84,6 +84,141 @@ static inline unsigned long pud_index(unsigned long address)
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
 #endif
 
+#if USE_SPLIT_PTE_PTLOCKS
+#if ALLOC_SPLIT_PTLOCKS
+void __init ptlock_cache_init(void);
+extern bool ptlock_alloc(struct page *page);
+extern void ptlock_free(struct page *page);
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return page->ptl;
+}
+#else /* ALLOC_SPLIT_PTLOCKS */
+static inline void ptlock_cache_init(void)
+{
+}
+
+static inline bool ptlock_alloc(struct page *page)
+{
+	return true;
+}
+
+static inline void ptlock_free(struct page *page)
+{
+}
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return &page->ptl;
+}
+#endif /* ALLOC_SPLIT_PTLOCKS */
+
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_page(*pmd));
+}
+
+static inline bool ptlock_init(struct page *page)
+{
+	/*
+	 * prep_new_page() initialize page->private (and therefore page->ptl)
+	 * with 0. Make sure nobody took it in use in between.
+	 *
+	 * It can happen if arch try to use slab for page table allocation:
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
+	 */
+	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
+	if (!ptlock_alloc(page))
+		return false;
+	spin_lock_init(ptlock_ptr(page));
+	return true;
+}
+
+#else	/* !USE_SPLIT_PTE_PTLOCKS */
+/*
+ * We use mm->page_table_lock to guard all pagetable pages of the mm.
+ */
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+static inline void ptlock_cache_init(void) {}
+static inline bool ptlock_init(struct page *page) { return true; }
+static inline void ptlock_free(struct page *page) {}
+#endif /* USE_SPLIT_PTE_PTLOCKS */
+
+#if USE_SPLIT_PMD_PTLOCKS
+
+static struct page *pmd_to_page(pmd_t *pmd)
+{
+	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+	return virt_to_page((void *)((unsigned long) pmd & mask));
+}
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_to_page(pmd));
+}
+
+static inline bool pmd_ptlock_init(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page->pmd_huge_pte = NULL;
+#endif
+	return ptlock_init(page);
+}
+
+static inline void pmd_ptlock_free(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+#endif
+	ptlock_free(page);
+}
+
+#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
+
+#else
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+
+static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline void pmd_ptlock_free(struct page *page) {}
+
+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
+
+#endif
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl = pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
+	return ptl;
+}
+
+/*
+ * No scalability reason to split PUD locks yet, but follow the same pattern
+ * as the PMD locks to make it easier if we decide to.  The VM should not be
+ * considered ready to switch to split PUD locks yet; there may be places
+ * which need to be converted from page_table_lock.
+ */
+static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
+{
+	return &mm->page_table_lock;
+}
+
+static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
+{
+	spinlock_t *ptl = pud_lockptr(mm, pud);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #ifndef pte_offset_kernel
 static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 {
@@ -102,6 +237,20 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_unmap(pte) ((void)(pte))	/* NOP */
 #endif
 
+#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
+({							\
+	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
+	pte_t *__pte = pte_offset_map(pmd, address);	\
+	*(ptlp) = __ptl;				\
+	spin_lock(__ptl);				\
+	__pte;						\
+})
+
+#define pte_unmap_unlock(pte, ptl)	do {		\
+	spin_unlock(ptl);				\
+	pte_unmap(pte);					\
+} while (0)
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
-- 
2.11.0



* [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry()
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (2 preceding siblings ...)
  2021-11-10  8:40 ` [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends Qi Zheng
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

The next patch needs the vmf in lock_page_or_retry(),
so pass it in directly.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pagemap.h | 8 +++-----
 mm/filemap.c            | 6 ++++--
 mm/memory.c             | 4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6a30916b76e5..94f9547b4411 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -709,8 +709,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 
 void __folio_lock(struct folio *folio);
 int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-				unsigned int flags);
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
 void unlock_page(struct page *page);
 void folio_unlock(struct folio *folio);
 
@@ -772,14 +771,13 @@ static inline int lock_page_killable(struct page *page)
  * Return value and mmap_lock implications depend on flags; see
  * __folio_lock_or_retry().
  */
-static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm,
-				     unsigned int flags)
+static inline bool lock_page_or_retry(struct page *page, struct vm_fault *vmf)
 {
 	struct folio *folio;
 	might_sleep();
 
 	folio = page_folio(page);
-	return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+	return folio_trylock(folio) || __folio_lock_or_retry(folio, vmf);
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 07c654202870..ff8d19b7ce1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1695,9 +1695,11 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
  * with the folio locked and the mmap_lock unperturbed.
  */
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-			 unsigned int flags)
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 {
+	unsigned int flags = vmf->flags;
+	struct mm_struct *mm = vmf->vma->vm_mm;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index b00cd60fc368..bec6a5d5ee7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3443,7 +3443,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mmu_notifier_range range;
 
-	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+	if (!lock_page_or_retry(page, vmf))
 		return VM_FAULT_RETRY;
 	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
 				vma->vm_mm, vmf->address & PAGE_MASK,
@@ -3576,7 +3576,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+	locked = lock_page_or_retry(page, vmf);
 
 	delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
 	if (!locked) {
-- 
2.11.0



* [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (3 preceding siblings ...)
  2021-11-10  8:40 ` [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry() Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 06/15] mm: introduce refcount for user PTE page table page Qi Zheng
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

When we call __pte_alloc() or one of its friends, a huge pmd might
be created concurrently by a different thread. This is why
pmd_trans_unstable() currently has to be called after __pte_alloc()
or its friends return.

This patch adds a pmd_installed_type return value to __pte_alloc() and
its friends, so that a huge pmd can be detected from the return value
instead of calling pmd_trans_unstable() again.

No functional change is intended; this is just preparation for
later patches.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h    | 20 +++++++++++++++++---
 mm/debug_vm_pgtable.c |  2 +-
 mm/filemap.c          | 11 +++++++----
 mm/gup.c              |  2 +-
 mm/internal.h         |  3 ++-
 mm/memory.c           | 39 ++++++++++++++++++++++++++-------------
 mm/migrate.c          | 17 ++---------------
 mm/mremap.c           |  2 +-
 mm/userfaultfd.c      | 24 +++++++++++++++---------
 9 files changed, 72 insertions(+), 48 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 706da081b9f8..52f36fde2f11 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2306,13 +2306,27 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+enum pmd_installed_type {
+	INSTALLED_PTE,
+	INSTALLED_HUGE_PMD,
+};
+
+static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+{
+	if (unlikely(pmd_none(*(pmd))))
+		return __pte_alloc(mm, pmd);
+	if (unlikely(is_huge_pmd(*pmd)))
+		return INSTALLED_HUGE_PMD;
+
+	return INSTALLED_PTE;
+}
+#define pte_alloc pte_alloc
 
 #define pte_alloc_map(mm, pmd, address)			\
-	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+	(pte_alloc(mm, pmd) < 0 ? NULL : pte_offset_map(pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	(pte_alloc(mm, pmd) ?			\
+	(pte_alloc(mm, pmd) < 0 ?			\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 228e3954b90c..b8322c55e65d 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1170,7 +1170,7 @@ static int __init init_args(struct pgtable_debug_args *args)
 	args->start_pmdp = pmd_offset(args->pudp, 0UL);
 	WARN_ON(!args->start_pmdp);
 
-	if (pte_alloc(args->mm, args->pmdp)) {
+	if (pte_alloc(args->mm, args->pmdp) < 0) {
 		pr_err("Failed to allocate pte entries\n");
 		ret = -ENOMEM;
 		goto error;
diff --git a/mm/filemap.c b/mm/filemap.c
index ff8d19b7ce1d..23363f8ddbbe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,12 +3217,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		}
 	}
 
-	if (pmd_none(*vmf->pmd))
-		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+	if (pmd_none(*vmf->pmd)) {
+		int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
-	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd))
+		if (unlikely(ret == INSTALLED_HUGE_PMD))
+			goto out;
+	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+		/* See comment in handle_pte_fault() */
 		goto out;
+	}
 
 	return false;
 
diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..2def775232a3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -699,7 +699,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		} else {
 			spin_unlock(ptl);
 			split_huge_pmd(vma, pmd, address);
-			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
+			ret = pte_alloc(mm, pmd) < 0 ? -ENOMEM : 0;
 		}
 
 		return ret ? ERR_PTR(ret) :
diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..474d6e3443f8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -67,7 +67,8 @@ bool __folio_end_writeback(struct folio *folio);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
+				    pgtable_t *pte);
 
 static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
 {
diff --git a/mm/memory.c b/mm/memory.c
index bec6a5d5ee7c..8a39c0e58324 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -437,8 +437,10 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 }
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
+				    pgtable_t *pte)
 {
+	int ret = INSTALLED_PTE;
 	spinlock_t *ptl = pmd_lock(mm, pmd);
 
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
@@ -459,20 +461,26 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 		pmd_populate(mm, pmd, *pte);
 		*pte = NULL;
+	} else if (is_huge_pmd(*pmd)) {
+		/* See comment in handle_pte_fault() */
+		ret = INSTALLED_HUGE_PMD;
 	}
 	spin_unlock(ptl);
+
+	return ret;
 }
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 {
+	enum pmd_installed_type ret;
 	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
 
-	pmd_install(mm, pmd, &new);
+	ret = pmd_install(mm, pmd, &new);
 	if (new)
 		pte_free(mm, new);
-	return 0;
+	return ret;
 }
 
 int __pte_alloc_kernel(pmd_t *pmd)
@@ -1813,7 +1821,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 
 	/* Allocate the PTE if necessary; takes PMD lock once only. */
 	ret = -ENOMEM;
-	if (pte_alloc(mm, pmd))
+	if (pte_alloc(mm, pmd) < 0)
 		goto out;
 
 	while (pages_to_write_in_pmd) {
@@ -3713,6 +3721,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct page *page;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	int alloc_ret;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -3728,11 +3737,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 *
 	 * Here we only have mmap_read_lock(mm).
 	 */
-	if (pte_alloc(vma->vm_mm, vmf->pmd))
+	alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);
+	if (alloc_ret < 0)
 		return VM_FAULT_OOM;
-
 	/* See comment in handle_pte_fault() */
-	if (unlikely(pmd_trans_unstable(vmf->pmd)))
+	if (unlikely(alloc_ret == INSTALLED_HUGE_PMD))
 		return 0;
 
 	/* Use the zero-page for reads */
@@ -4023,6 +4032,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	}
 
 	if (pmd_none(*vmf->pmd)) {
+		int alloc_ret;
+
 		if (PageTransCompound(page)) {
 			ret = do_set_pmd(vmf, page);
 			if (ret != VM_FAULT_FALLBACK)
@@ -4030,14 +4041,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		}
 
 		if (vmf->prealloc_pte)
-			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
-		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
-			return VM_FAULT_OOM;
-	}
+			alloc_ret = pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+		else
+			alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);
 
-	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd))
+		if (unlikely(alloc_ret != INSTALLED_PTE))
+			return alloc_ret < 0 ? VM_FAULT_OOM : 0;
+	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+		/* See comment in handle_pte_fault() */
 		return 0;
+	}
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				      vmf->address, &vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index cf25b00f03c8..bdfdfd3b50be 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2731,21 +2731,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
 		goto abort;
 
-	/*
-	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
-	 * pte_offset_map() on pmds where a huge pmd might be created
-	 * from a different thread.
-	 *
-	 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
-	 * parallel threads are excluded by other means.
-	 *
-	 * Here we only have mmap_read_lock(mm).
-	 */
-	if (pte_alloc(mm, pmdp))
-		goto abort;
-
-	/* See the comment in pte_alloc_one_map() */
-	if (unlikely(pmd_trans_unstable(pmdp)))
+	/* See the comment in do_anonymous_page() */
+	if (unlikely(pte_alloc(mm, pmdp) != INSTALLED_PTE))
 		goto abort;
 
 	if (unlikely(anon_vma_prepare(vma)))
diff --git a/mm/mremap.c b/mm/mremap.c
index c6e9da09dd0a..fc5c56858883 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -551,7 +551,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				continue;
 		}
 
-		if (pte_alloc(new_vma->vm_mm, new_pmd))
+		if (pte_alloc(new_vma->vm_mm, new_pmd) < 0)
 			break;
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
 			  new_pmd, new_addr, need_rmap_locks);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..2cea08e7f076 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -592,15 +592,21 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			err = -EEXIST;
 			break;
 		}
-		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
-			err = -ENOMEM;
-			break;
-		}
-		/* If an huge pmd materialized from under us fail */
-		if (unlikely(pmd_trans_huge(*dst_pmd))) {
-			err = -EFAULT;
-			break;
+
+		if (unlikely(pmd_none(dst_pmdval))) {
+			int ret = __pte_alloc(dst_mm, dst_pmd);
+
+			/*
+			 * If there is not enough memory or an huge pmd
+			 * materialized from under us
+			 */
+			if (unlikely(ret < 0)) {
+				err = -ENOMEM;
+				break;
+			} else if (unlikely(ret == INSTALLED_HUGE_PMD)) {
+				err = -EFAULT;
+				break;
+			}
 		}
 
 		BUG_ON(pmd_none(*dst_pmd));
-- 
2.11.0



* [PATCH v3 06/15] mm: introduce refcount for user PTE page table page
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (4 preceding siblings ...)
  2021-11-10  8:40 ` [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation Qi Zheng
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

1. Preface
==========

Nowadays, in pursuit of high performance, applications mostly use
high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
physical memory for the following reasons::

 First of all, we should hold as few write locks on mmap_lock as possible,
 since the mmap_lock semaphore has long been a contention point in the
 memory management subsystem. mmap()/munmap() hold the write lock, while
 madvise(MADV_DONTNEED or MADV_FREE) holds the read lock, so using
 madvise() instead of munmap() to release physical memory reduces
 contention on the mmap_lock.

 Secondly, after madvise() releases the physical memory, there is no need
 to rebuild the vma or allocate page tables again when the same virtual
 address is accessed again, which also saves time.

The following is the largest user PTE page table memory that can be
allocated by a single user process in a 32-bit and a 64-bit system.

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(for 32-bit, take 3G user address space, 4K page size as an example;
 for 64-bit, take 48-bit address width, 4K page size as an example.)

After using madvise(), everything looks good, but as can be seen from the
above table, a single process can create a large number of PTE page tables
on a 64-bit system, since neither MADV_DONTNEED nor MADV_FREE releases
page table memory. Until the process exits or calls munmap(), the kernel
cannot reclaim these pages even if the PTE page tables map nothing.

Therefore, we decided to introduce a reference count to manage the PTE
page table life cycle, so that some free PTE page table memory in the
system can be dynamically released.

2. The reference count of user PTE page table pages
===================================================

We introduce two members for the struct page of the user PTE page table
page::

 union {
	pgtable_t pmd_huge_pte; /* protected by page->ptl */
	pmd_t *pmd;             /* PTE page only */
 };
 union {
	struct mm_struct *pt_mm; /* x86 pgds only */
	atomic_t pt_frag_refcount; /* powerpc */
	atomic_t pte_refcount;  /* PTE page only */
 };

The pmd member records the pmd entry that maps the user PTE page table
page, and the pte_refcount member keeps track of how many references are
held on the user PTE page table page.

The following hold a reference on the user PTE page table page::

 Any !pte_none() entry, such as a regular page table entry mapping a
 physical page, or a swap entry, or a migration entry, etc.

 Any visitor to the PTE page table entries, such as a page table walker.

Any ``!pte_none()`` entry or visitor can be regarded as a user of its
PTE page table page. When the ``pte_refcount`` drops to 0, no one is
using the PTE page table page anymore, so the free PTE page table page
can be released back to the system.
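
The lifecycle rule above can be modeled in userspace with an atomic counter.
This is not kernel code: fake_pte_page, fake_pte_get() and fake_pte_put()
are hypothetical stand-ins for the page-table page and its get/put helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct fake_pte_page {
	atomic_int pte_refcount;	/* entries + visitors */
	bool freed;			/* stands in for freeing the page */
};

/* Take a reference before touching the PTE page table page. */
static void fake_pte_get(struct fake_pte_page *p)
{
	atomic_fetch_add(&p->pte_refcount, 1);
}

/* Drop a reference; the last one releases the page. */
static void fake_pte_put(struct fake_pte_page *p)
{
	if (atomic_fetch_sub(&p->pte_refcount, 1) == 1)
		p->freed = true;
}
```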

3. Helpers
==========

+---------------------+-------------------------------------------------+
| pte_ref_init        | Initialize the pte_refcount and pmd             |
+---------------------+-------------------------------------------------+
| pte_to_pmd          | Get the corresponding pmd                       |
+---------------------+-------------------------------------------------+
| pte_update_pmd      | Update the corresponding pmd                    |
+---------------------+-------------------------------------------------+
| pte_get             | Increment a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_get_many        | Add a value to a pte_refcount                   |
+---------------------+-------------------------------------------------+
| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
+---------------------+-------------------------------------------------+
| pte_try_get         | Try to increment a pte_refcount                 |
+---------------------+-------------------------------------------------+
| pte_tryget_map      | Try to increment a pte_refcount before          |
|                     | pte_offset_map()                                |
+---------------------+-------------------------------------------------+
| pte_tryget_map_lock | Try to increment a pte_refcount before          |
|                     | pte_offset_map_lock()                           |
+---------------------+-------------------------------------------------+
| pte_put             | Decrement a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_put_many        | Subtract a value from a pte_refcount            |
+---------------------+-------------------------------------------------+
| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
+---------------------+-------------------------------------------------+
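
A caller of pte_try_get() has to handle each of its outcomes. The sketch
below models that dispatch in userspace (hypothetical consume() helper;
the enum values mirror the ones this series introduces, including the
patch's TRYGET_SUCCESSED spelling): allocate on NONE, bail out on a huge
pmd, retry on ZERO (the table is being freed concurrently), and proceed
once a reference is held.

```c
#include <assert.h>
#include <string.h>

enum pte_tryget_type {
	TRYGET_SUCCESSED,	/* spelling follows the patch */
	TRYGET_FAILED_ZERO,
	TRYGET_FAILED_NONE,
	TRYGET_FAILED_HUGE_PMD,
};

static const char *consume(enum pte_tryget_type ret)
{
	switch (ret) {
	case TRYGET_FAILED_NONE:
		return "allocate";	/* no PTE table yet: __pte_alloc() */
	case TRYGET_FAILED_HUGE_PMD:
		return "bail";		/* huge pmd materialized under us */
	case TRYGET_FAILED_ZERO:
		return "retry";		/* table being freed: re-read pmd */
	default:
		return "proceed";	/* reference held: safe to map PTEs */
	}
}
```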

4. About this commit
====================

This commit only introduces dummy helpers; the actual logic will
be implemented in future commits.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm_types.h |  6 +++-
 include/linux/pte_ref.h  | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/Makefile              |  4 +--
 mm/pte_ref.c             | 55 ++++++++++++++++++++++++++++++
 4 files changed, 149 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb8c6f5f19bc..c599008d54fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -149,11 +149,15 @@ struct page {
 		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
-			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			union {
+				pgtable_t pmd_huge_pte; /* protected by page->ptl */
+				pmd_t *pmd;             /* PTE page only */
+			};
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
 				atomic_t pt_frag_refcount; /* powerpc */
+				atomic_t pte_refcount;  /* PTE page only */
 			};
 #if ALLOC_SPLIT_PTLOCKS
 			spinlock_t *ptl;
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..b6d8335bdc59
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#include <linux/pgtable.h>
+
+enum pte_tryget_type {
+	TRYGET_SUCCESSED,
+	TRYGET_FAILED_ZERO,
+	TRYGET_FAILED_NONE,
+	TRYGET_FAILED_HUGE_PMD,
+};
+
+bool pte_get_unless_zero(pmd_t *pmd);
+enum pte_tryget_type pte_try_get(pmd_t *pmd);
+void pte_put_vmf(struct vm_fault *vmf);
+
+static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
+{
+}
+
+static inline pmd_t *pte_to_pmd(pte_t *pte)
+{
+	return NULL;
+}
+
+static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd)
+{
+}
+
+static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
+{
+}
+
+/*
+ * pte_get - Increment refcount for the PTE page table.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Similar to the mechanism of page refcount, the user of PTE page table
+ * should hold a refcount to it before accessing.
+ */
+static inline void pte_get(pmd_t *pmd)
+{
+	pte_get_many(pmd, 1);
+}
+
+static inline pte_t *pte_tryget_map(pmd_t *pmd, unsigned long address)
+{
+	if (pte_try_get(pmd))
+		return NULL;
+
+	return pte_offset_map(pmd, address);
+}
+
+static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
+					 unsigned long address, spinlock_t **ptlp)
+{
+	if (pte_try_get(pmd))
+		return NULL;
+
+	return pte_offset_map_lock(mm, pmd, address, ptlp);
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, unsigned int nr)
+{
+}
+
+/*
+ * pte_put - Decrement refcount for the PTE page table.
+ * @mm: the mm_struct of the target address space.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ * @addr: the start address of the tlb range to be flushed.
+ *
+ * The PTE page table page will be freed when the last refcount is dropped.
+ */
+static inline void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+	pte_put_many(mm, pmd, addr, 1);
+}
+
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index d6c0042e3aa0..ea679bf75a5f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,8 +38,8 @@ mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o
-
+			   pgtable-generic.o rmap.o vmalloc.o \
+			   pte_ref.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..de109905bc8f
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+
+#include <linux/pte_ref.h>
+#include <linux/mm.h>
+
+/*
+ * pte_get_unless_zero - Increment refcount for the PTE page table
+ *			 unless it is zero.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ */
+bool pte_get_unless_zero(pmd_t *pmd)
+{
+	return true;
+}
+
+/*
+ * pte_try_get - Try to increment refcount for the PTE page table.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Return true if the increment succeeded. Otherwise return false.
+ *
+ * Before Operating the PTE page table, we need to hold a refcount
+ * to protect against the concurrent release of the PTE page table.
+ * But we will fail in the following case:
+ * 	- The content mapped in @pmd is not a PTE page
+ * 	- The refcount of the PTE page table is zero, it will be freed
+ */
+enum pte_tryget_type pte_try_get(pmd_t *pmd)
+{
+	if (unlikely(pmd_none(*pmd)))
+		return TRYGET_FAILED_NONE;
+	if (unlikely(is_huge_pmd(*pmd)))
+		return TRYGET_FAILED_HUGE_PMD;
+
+	return TRYGET_SUCCESSED;
+}
+
+/*
+ * pte_put_vmf - Decrement refcount for the PTE page table.
+ * @vmf: fault information
+ *
+ * The mmap_lock may be unlocked in advance in some cases
+ * in handle_pte_fault(), then the pmd entry will no longer
+ * be stable. For example, the corresponds of the PTE page may
+ * be replaced(e.g. mremap), so we should ensure the pte_put()
+ * is performed in the critical section of the mmap_lock.
+ */
+void pte_put_vmf(struct vm_fault *vmf)
+{
+}
-- 
2.11.0



* [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (5 preceding siblings ...)
  2021-11-10  8:40 ` [PATCH v3 06/15] mm: introduce refcount for user PTE page table page Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page Qi Zheng
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

When the PTE page table page is allocated and installed into the
pmd entry, it takes an initial reference count to prevent the PTE
page table page from being released by other threads, and the caller
of pte_alloc() (or one of its friends) must drop this reference count
when it is done with the page table.
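
The pattern this patch applies at each call site can be sketched as follows.
This is a userspace model, not kernel code: fake_pte_alloc(), fake_pte_put()
and fault_path() are hypothetical names for the allocate-use-put sequence.

```c
#include <assert.h>

static int pte_refcount;	/* models the page's pte_refcount */

/* Allocate (or find) the PTE table; the reference is taken here. */
static int fake_pte_alloc(void)
{
	pte_refcount++;
	return 0;
}

/* The caller drops the reference when done with the table. */
static void fake_pte_put(void)
{
	pte_refcount--;
}

static int fault_path(void)
{
	if (fake_pte_alloc() < 0)
		return -1;
	/* ... map and use PTE entries under the lock ... */
	fake_pte_put();		/* balance the reference from pte_alloc() */
	return 0;
}
```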

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h    |  7 +++++--
 mm/debug_vm_pgtable.c |  1 +
 mm/filemap.c          |  8 ++++++--
 mm/gup.c              | 10 +++++++---
 mm/memory.c           | 51 +++++++++++++++++++++++++++++++++++++++++----------
 mm/migrate.c          |  9 ++++++---
 mm/mlock.c            |  1 +
 mm/mremap.c           |  1 +
 mm/userfaultfd.c      | 16 +++++++++++++++-
 9 files changed, 83 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52f36fde2f11..753a9435e0d0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,6 +26,7 @@
 #include <linux/err.h>
 #include <linux/page-flags.h>
 #include <linux/page_ref.h>
+#include <linux/pte_ref.h>
 #include <linux/memremap.h>
 #include <linux/overflow.h>
 #include <linux/sizes.h>
@@ -2313,9 +2314,11 @@ enum pmd_installed_type {
 
 static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 {
-	if (unlikely(pmd_none(*(pmd))))
+	enum pte_tryget_type ret = pte_try_get(pmd);
+
+	if (ret == TRYGET_FAILED_NONE || ret == TRYGET_FAILED_ZERO)
 		return __pte_alloc(mm, pmd);
-	if (unlikely(is_huge_pmd(*pmd)))
+	else if (ret == TRYGET_FAILED_HUGE_PMD)
 		return INSTALLED_HUGE_PMD;
 
 	return INSTALLED_PTE;
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index b8322c55e65d..52f006654664 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1048,6 +1048,7 @@ static void __init destroy_args(struct pgtable_debug_args *args)
 
 	/* Free page table entries */
 	if (args->start_ptep) {
+		pte_put(args->mm, args->start_pmdp, args->vaddr);
 		pte_free(args->mm, args->start_ptep);
 		mm_dec_nr_ptes(args->mm);
 	}
diff --git a/mm/filemap.c b/mm/filemap.c
index 23363f8ddbbe..1e7e9e4fd759 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,6 +3217,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		}
 	}
 
+retry:
 	if (pmd_none(*vmf->pmd)) {
 		int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
@@ -3225,6 +3226,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
 		/* See comment in handle_pte_fault() */
 		goto out;
+	} else if (pte_try_get(vmf->pmd) == TRYGET_FAILED_ZERO) {
+		goto retry;
 	}
 
 	return false;
@@ -3301,7 +3304,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	pgoff_t last_pgoff = start_pgoff;
-	unsigned long addr;
+	unsigned long addr, start;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct page *head, *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
@@ -3317,7 +3320,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	start = addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
 		page = find_subpage(head, xas.xa_index);
@@ -3348,6 +3351,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		put_page(head);
 	} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_put(vma->vm_mm, vmf->pmd, start);
 out:
 	rcu_read_unlock();
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
diff --git a/mm/gup.c b/mm/gup.c
index 2def775232a3..e084111103f0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -694,7 +694,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			spin_unlock(ptl);
 			ret = 0;
 			split_huge_pmd(vma, pmd, address);
-			if (pmd_trans_unstable(pmd))
+			if (pte_try_get(pmd) == TRYGET_FAILED_HUGE_PMD)
 				ret = -EBUSY;
 		} else {
 			spin_unlock(ptl);
@@ -702,8 +702,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			ret = pte_alloc(mm, pmd) < 0 ? -ENOMEM : 0;
 		}
 
-		return ret ? ERR_PTR(ret) :
-			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+		if (ret)
+			return ERR_PTR(ret);
+
+		page = follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+		pte_put(mm, pmd, address);
+		return page;
 	}
 	page = follow_trans_huge_pmd(vma, address, pmd, flags);
 	spin_unlock(ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 8a39c0e58324..0b9af38cfa11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -441,10 +441,13 @@ enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
 				    pgtable_t *pte)
 {
 	int ret = INSTALLED_PTE;
-	spinlock_t *ptl = pmd_lock(mm, pmd);
+	spinlock_t *ptl;
 
+retry:
+	ptl = pmd_lock(mm, pmd);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm_inc_nr_ptes(mm);
+		pte_ref_init(*pte, pmd, 1);
 		/*
 		 * Ensure all pte setup (eg. pte page lock and page clearing) are
 		 * visible before the pte is made visible to other CPUs by being
@@ -464,6 +467,9 @@ enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
 	} else if (is_huge_pmd(*pmd)) {
 		/* See comment in handle_pte_fault() */
 		ret = INSTALLED_HUGE_PMD;
+	} else if (!pte_get_unless_zero(pmd)) {
+		spin_unlock(ptl);
+		goto retry;
 	}
 	spin_unlock(ptl);
 
@@ -1028,6 +1034,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
 	struct page *prealloc = NULL;
+	unsigned long start = addr;
 
 again:
 	progress = 0;
@@ -1108,6 +1115,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	pte_unmap(orig_src_pte);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
+	pte_put(dst_mm, dst_pmd, start);
 	cond_resched();
 
 	if (ret == -EIO) {
@@ -1778,6 +1786,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 		goto out;
 	retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
 	pte_unmap_unlock(pte, ptl);
+	pte_put(mm, pte_to_pmd(pte), addr);
 out:
 	return retval;
 }
@@ -1810,6 +1819,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 	unsigned long remaining_pages_total = *num;
 	unsigned long pages_to_write_in_pmd;
 	int ret;
+	unsigned long start = addr;
 more:
 	ret = -EFAULT;
 	pmd = walk_to_pmd(mm, addr);
@@ -1836,7 +1846,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 				pte_unmap_unlock(start_pte, pte_lock);
 				ret = err;
 				remaining_pages_total -= pte_idx;
-				goto out;
+				goto put;
 			}
 			addr += PAGE_SIZE;
 			++curr_page_idx;
@@ -1845,9 +1855,13 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 		pages_to_write_in_pmd -= batch_size;
 		remaining_pages_total -= batch_size;
 	}
-	if (remaining_pages_total)
+	if (remaining_pages_total) {
+		pte_put(mm, pmd, start);
 		goto more;
+	}
 	ret = 0;
+put:
+	pte_put(mm, pmd, start);
 out:
 	*num = remaining_pages_total;
 	return ret;
@@ -2075,6 +2089,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 
 out_unlock:
 	pte_unmap_unlock(pte, ptl);
+	pte_put(mm, pte_to_pmd(pte), addr);
 	return VM_FAULT_NOPAGE;
 }
 
@@ -2275,6 +2290,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, unsigned long end,
 			unsigned long pfn, pgprot_t prot)
 {
+	unsigned long start = addr;
 	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 	int err = 0;
@@ -2294,6 +2310,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(mapped_pte, ptl);
+	pte_put(mm, pmd, start);
 	return err;
 }
 
@@ -2503,6 +2520,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				     pte_fn_t fn, void *data, bool create,
 				     pgtbl_mod_mask *mask)
 {
+	unsigned long start = addr;
 	pte_t *pte, *mapped_pte;
 	int err = 0;
 	spinlock_t *ptl;
@@ -2536,8 +2554,11 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 
-	if (mm != &init_mm)
+	if (mm != &init_mm) {
 		pte_unmap_unlock(mapped_pte, ptl);
+		if (create)
+			pte_put(mm, pmd, start);
+	}
 	return err;
 }
 
@@ -3761,7 +3782,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			return handle_userfault(vmf, VM_UFFD_MISSING);
+			ret = handle_userfault(vmf, VM_UFFD_MISSING);
+			goto put;
 		}
 		goto setpte;
 	}
@@ -3804,7 +3826,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		put_page(page);
-		return handle_userfault(vmf, VM_UFFD_MISSING);
+		ret = handle_userfault(vmf, VM_UFFD_MISSING);
+		goto put;
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3817,14 +3840,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return ret;
+	goto put;
 release:
 	put_page(page);
 	goto unlock;
 oom_free_page:
 	put_page(page);
 oom:
-	return VM_FAULT_OOM;
+	ret = VM_FAULT_OOM;
+put:
+	pte_put(vma->vm_mm, vmf->pmd, vmf->address);
+	return ret;
 }
 
 /*
@@ -4031,7 +4057,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
-	if (pmd_none(*vmf->pmd)) {
+retry:
+	ret = pte_try_get(vmf->pmd);
+	if (ret == TRYGET_FAILED_NONE) {
 		int alloc_ret;
 
 		if (PageTransCompound(page)) {
@@ -4047,9 +4075,11 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 		if (unlikely(alloc_ret != INSTALLED_PTE))
 			return alloc_ret < 0 ? VM_FAULT_OOM : 0;
-	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+	} else if (ret == TRYGET_FAILED_HUGE_PMD) {
 		/* See comment in handle_pte_fault() */
 		return 0;
+	} else if (ret == TRYGET_FAILED_ZERO) {
+		goto retry;
 	}
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -4063,6 +4093,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 	update_mmu_tlb(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_put(vma->vm_mm, vmf->pmd, vmf->address);
 	return ret;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index bdfdfd3b50be..26f16a4836d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2736,9 +2736,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto abort;
 
 	if (unlikely(anon_vma_prepare(vma)))
-		goto abort;
+		goto put;
 	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
-		goto abort;
+		goto put;
 
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
@@ -2764,7 +2764,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 			 * device memory.
 			 */
 			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
-			goto abort;
+			goto put;
 		}
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
@@ -2811,11 +2811,14 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	}
 
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(mm, pmdp, addr);
 	*src = MIGRATE_PFN_MIGRATE;
 	return;
 
 unlock_abort:
 	pte_unmap_unlock(ptep, ptl);
+put:
+	pte_put(mm, pmdp, addr);
 abort:
 	*src &= ~MIGRATE_PFN_MIGRATE;
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index e263d62ae2d0..a4ef20ba9627 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -398,6 +398,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			break;
 	}
 	pte_unmap_unlock(pte, ptl);
+	pte_put(vma->vm_mm, pte_to_pmd(pte), start);
 	return start;
 }
 
diff --git a/mm/mremap.c b/mm/mremap.c
index fc5c56858883..f80c628db25d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -555,6 +555,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			break;
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
 			  new_pmd, new_addr, need_rmap_locks);
+		pte_put(new_vma->vm_mm, new_pmd, new_addr);
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 2cea08e7f076..37df899a1b9d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -574,6 +574,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
+		enum pte_tryget_type tryget_type;
 
 		BUG_ON(dst_addr >= dst_start + len);
 
@@ -583,6 +584,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			break;
 		}
 
+again:
+		/*
+		 * After the management of the PTE page changes to the refcount
+		 * mode, the PTE page may be released by another thread(rcu mode),
+		 * so the rcu lock is held here to prevent the PTE page from
+		 * being released.
+		 */
+		rcu_read_lock();
 		dst_pmdval = pmd_read_atomic(dst_pmd);
 		/*
 		 * If the dst_pmd is mapped as THP don't
@@ -593,7 +602,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			break;
 		}
 
-		if (unlikely(pmd_none(dst_pmdval))) {
+		tryget_type = pte_try_get(&dst_pmdval);
+		rcu_read_unlock();
+		if (unlikely(tryget_type == TRYGET_FAILED_NONE)) {
 			int ret = __pte_alloc(dst_mm, dst_pmd);
 
 			/*
@@ -607,6 +618,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 				err = -EFAULT;
 				break;
 			}
+		} else if (unlikely(tryget_type == TRYGET_FAILED_ZERO)) {
+			goto again;
 		}
 
 		BUG_ON(pmd_none(*dst_pmd));
@@ -614,6 +627,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
 				       src_addr, &page, mcopy_mode, wp_copy);
+		pte_put(dst_mm, dst_pmd, dst_addr);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (6 preceding siblings ...)
  2021-11-10  8:40 ` [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10  8:40 ` [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user " Qi Zheng
  2021-11-10  8:52 ` [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

When we split a PMD-mapped THP into a PTE-mapped THP, we should
initialize the refcount of the withdrawn PTE page table page to
HPAGE_PMD_NR. This ensures that we can release the PTE page table
page when it becomes empty (i.e. when its refcount drops to 0).

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/pgtable-generic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..523053e09dfa 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -186,6 +186,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 							  struct page, lru);
 	if (pmd_huge_pte(mm, pmdp))
 		list_del(&pgtable->lru);
+	pte_ref_init(pgtable, pmdp, HPAGE_PMD_NR);
 	return pgtable;
 }
 #endif
-- 
2.11.0



* [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user PTE page table page
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (7 preceding siblings ...)
  2021-11-10  8:40 ` [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page Qi Zheng
@ 2021-11-10  8:40 ` Qi Zheng
  2021-11-10  8:52 ` [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:40 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

A !pte_none() entry takes a reference on the user PTE page table
page, whether it is a regular page table entry that maps a physical
page, a swap entry, a migration entry, etc.

So when a pte_none() entry becomes mapped, the refcount of the PTE
page table page needs to be increased. When a !pte_none() entry
becomes none, the refcount of the PTE page table page needs to be
decreased.

For swap or migration cases, which only change the content of the
PTE entry, we keep the refcount unchanged.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 kernel/events/uprobes.c |  2 ++
 mm/filemap.c            |  3 +++
 mm/madvise.c            |  5 +++++
 mm/memory.c             | 42 +++++++++++++++++++++++++++++++++++-------
 mm/migrate.c            |  1 +
 mm/mremap.c             |  7 +++++++
 mm/rmap.c               | 10 ++++++++++
 mm/userfaultfd.c        |  2 ++
 8 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..96dd2959e1ac 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -200,6 +200,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	if (new_page)
 		set_pte_at_notify(mm, addr, pvmw.pte,
 				  mk_pte(new_page, vma->vm_page_prot));
+	else
+		pte_put(mm, pte_to_pmd(pvmw.pte), addr);
 
 	page_remove_rmap(old_page, false);
 	if (!page_mapped(old_page))
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e7e9e4fd759..aa47ee11a3d8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3309,6 +3309,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct page *head, *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	unsigned int nr_get = 0;
 
 	rcu_read_lock();
 	head = first_map_page(mapping, &xas, end_pgoff);
@@ -3342,6 +3343,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			ret = VM_FAULT_NOPAGE;
 
 		do_set_pte(vmf, page, addr);
+		nr_get++;
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, addr, vmf->pte);
 		unlock_page(head);
@@ -3351,6 +3353,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		put_page(head);
 	} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_get_many(vmf->pmd, nr_get);
 	pte_put(vma->vm_mm, vmf->pmd, start);
 out:
 	rcu_read_unlock();
diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..82fc40b6dcbf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -580,6 +580,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	int nr_swap = 0;
 	unsigned long next;
+	unsigned int nr_put = 0;
+	unsigned long start = addr;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -612,6 +614,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			nr_swap--;
 			free_swap_and_cache(entry);
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			nr_put++;
 			continue;
 		}
 
@@ -696,6 +699,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
+	if (nr_put)
+		pte_put_many(mm, pmd, start, nr_put);
 	cond_resched();
 next:
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 0b9af38cfa11..ea4d651ac8c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -878,6 +878,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_swp_clear_uffd_wp(pte);
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -946,6 +947,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		/* Uffd-wp needs to be delivered to dest pte as well */
 		pte = pte_wrprotect(pte_mkuffd_wp(pte));
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -998,6 +1000,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		pte = pte_clear_uffd_wp(pte);
 
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -1335,6 +1338,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	unsigned int nr_put = 0;
+	unsigned long start = addr;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1359,6 +1364,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			nr_put++;
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1392,6 +1398,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			nr_put++;
 			rss[mm_counter(page)]--;
 
 			if (is_device_private_entry(entry))
@@ -1416,6 +1423,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		nr_put++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
@@ -1442,6 +1450,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto again;
 	}
 
+	if (nr_put)
+		pte_put_many(mm, pmd, start, nr_put);
+
 	return addr;
 }
 
@@ -1759,6 +1770,7 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
 	inc_mm_counter_fast(mm, mm_counter_file(page));
 	page_add_file_rmap(page, false);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	pte_get(pte_to_pmd(pte));
 	return 0;
 }
 
@@ -2085,6 +2097,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	set_pte_at(mm, addr, pte, entry);
+	pte_get(pte_to_pmd(pte));
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
 out_unlock:
@@ -2291,6 +2304,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long pfn, pgprot_t prot)
 {
 	unsigned long start = addr;
+	unsigned int nr_get = 0;
 	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 	int err = 0;
@@ -2306,10 +2320,12 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			break;
 		}
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
+		nr_get++;
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(mapped_pte, ptl);
+	pte_get_many(pmd, nr_get);
 	pte_put(mm, pmd, start);
 	return err;
 }
@@ -2524,6 +2540,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	pte_t *pte, *mapped_pte;
 	int err = 0;
 	spinlock_t *ptl;
+	unsigned int nr_put = 0, nr_get = 0;
 
 	if (create) {
 		mapped_pte = pte = (mm == &init_mm) ?
@@ -2531,6 +2548,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			pte_alloc_map_lock(mm, pmd, addr, &ptl);
 		if (!pte)
 			return -ENOMEM;
+		nr_put++;
 	} else {
 		mapped_pte = pte = (mm == &init_mm) ?
 			pte_offset_kernel(pmd, addr) :
@@ -2543,11 +2561,17 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	if (fn) {
 		do {
-			if (create || !pte_none(*pte)) {
+			if (create) {
 				err = fn(pte++, addr, data);
-				if (err)
-					break;
+				if (mm != &init_mm && !pte_none(*(pte-1)))
+					nr_get++;
+			} else if (!pte_none(*pte)) {
+				err = fn(pte++, addr, data);
+				if (mm != &init_mm && pte_none(*(pte-1)))
+					nr_put++;
 			}
+			if (err)
+				break;
 		} while (addr += PAGE_SIZE, addr != end);
 	}
 	*mask |= PGTBL_PTE_MODIFIED;
@@ -2556,8 +2580,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	if (mm != &init_mm) {
 		pte_unmap_unlock(mapped_pte, ptl);
-		if (create)
-			pte_put(mm, pmd, start);
+		pte_get_many(pmd, nr_get);
+		if (nr_put)
+			pte_put_many(mm, pmd, start, nr_put);
 	}
 	return err;
 }
@@ -3835,6 +3860,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	pte_get(vmf->pmd);
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, vmf->address, vmf->pte);
@@ -4086,10 +4112,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+	if (likely(pte_none(*vmf->pte))) {
 		do_set_pte(vmf, page, vmf->address);
-	else
+		pte_get(vmf->pmd);
+	} else {
 		ret = VM_FAULT_NOPAGE;
+	}
 
 	update_mmu_tlb(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index 26f16a4836d8..c03ac25f42a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2807,6 +2807,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	} else {
 		/* No need to invalidate - it was non-present before */
 		set_pte_at(mm, addr, ptep, entry);
+		pte_get(pmdp);
 		update_mmu_cache(vma, addr, ptep);
 	}
 
diff --git a/mm/mremap.c b/mm/mremap.c
index f80c628db25d..088a7a75cb4b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -141,6 +141,8 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
 	unsigned long len = old_end - old_addr;
+	unsigned int nr_put = 0, nr_get = 0;
+	unsigned long old_start = old_addr;
 
 	/*
 	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -181,6 +183,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 			continue;
 
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
+		nr_put++;
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -197,6 +200,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 		set_pte_at(mm, new_addr, new_pte, pte);
+		nr_get++;
 	}
 
 	arch_leave_lazy_mmu_mode();
@@ -206,6 +210,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	pte_get_many(new_pmd, nr_get);
+	if (nr_put)
+		pte_put_many(mm, old_pmd, old_start, nr_put);
 	if (need_rmap_locks)
 		drop_rmap_locks(vma);
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 2908d637bcad..630ce8a036b5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1404,6 +1404,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	unsigned int nr_put = 0;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -1551,6 +1552,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
+			nr_put++;
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1564,6 +1566,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				/* We have to invalidate as we cleared the pte */
 				mmu_notifier_invalidate_range(mm, address,
 							address + PAGE_SIZE);
+				pte_put(mm, pvmw.pmd, address);
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
@@ -1575,6 +1578,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);
 					dec_mm_counter(mm, MM_ANONPAGES);
+					nr_put++;
 					goto discard;
 				}
 
@@ -1630,6 +1634,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * See Documentation/vm/mmu_notifier.rst
 			 */
 			dec_mm_counter(mm, mm_counter_file(page));
+			nr_put++;
 		}
 discard:
 		/*
@@ -1641,6 +1646,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 */
 		page_remove_rmap(subpage, PageHuge(page));
 		put_page(page);
+		if (nr_put) {
+			pte_put_many(mm, pvmw.pmd, address, nr_put);
+			nr_put = 0;
+		}
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
@@ -1871,6 +1880,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
+			pte_put(mm, pvmw.pmd, address);
 		} else {
 			swp_entry_t entry;
 			pte_t swp_pte;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 37df899a1b9d..b87c61b94065 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -110,6 +110,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		lru_cache_add_inactive_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	pte_get(dst_pmd);
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
@@ -204,6 +205,7 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	if (!pte_none(*dst_pte))
 		goto out_unlock;
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	pte_get(dst_pmd);
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
 	ret = 0;
-- 
2.11.0



* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (8 preceding siblings ...)
  2021-11-10  8:40 ` [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user " Qi Zheng
@ 2021-11-10  8:52 ` Qi Zheng
  9 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10  8:52 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming

Hi all,

I’m sorry, something went wrong when sending this patch set. I will
resend the whole series later.

Thanks,
Qi

On 11/10/21 4:40 PM, Qi Zheng wrote:
> Hi,
> 
> This patch series aims to free user PTE page table pages when all PTE entries
> are empty.
> 
> This story begins with some malloc libraries (e.g. jemalloc or tcmalloc)
> that usually allocate large amounts of virtual address space with mmap() and
> do not unmap it. They use madvise(MADV_DONTNEED) to free the physical memory
> when they want. But the page tables are not freed by madvise(), so a process
> that touches an enormous virtual address space can accumulate many page tables.
> 
> The following figures are a memory usage snapshot of one process which actually
> happened on our server:
> 
>          VIRT:  55t
>          RES:   590g
>          VmPTE: 110g
> 
> As we can see, the PTE page tables occupy 110g, while the RES is 590g. In
> theory, the process only needs 1.2g of PTE page tables to map that physical
> memory. The reason the PTE page tables occupy so much memory is that
> madvise(MADV_DONTNEED) only empties the PTEs and frees the physical memory,
> but doesn't free the PTE page table pages. So we can free those empty PTE
> page tables to save memory. In the above case, we can save about 108g of
> memory (best case). And the larger the difference between VIRT and RES, the
> more memory we save.
> 
> In this patch series, we add a pte_refcount field to the struct page of a
> PTE page table page to track how many users the page has. Similar to the
> page refcount mechanism, a user of a PTE page table page should hold a
> refcount on it before accessing it. The PTE page table page is freed when
> the last refcount is dropped.
> 
> Testing:
> 
> The following code snippet can show the effect of optimization:
> 
>          mmap 50G
>          while (1) {
>                  for (; i < 1024 * 25; i++) {
>                          touch 2M memory
>                          madvise MADV_DONTNEED 2M
>                  }
>          }
> 
> As we can see, the memory usage of VmPTE is reduced:
> 
>                          before                          after
> VIRT                   50.0 GB                        50.0 GB
> RES                     3.1 MB                         3.6 MB
> VmPTE                102640 kB                         248 kB
> 
> I have also tested stability with LTP[1] for several weeks and have not seen
> any crash so far.
> 
> Page fault performance can be affected by the allocation/freeing of PTE page
> table pages. The following is the result of a micro benchmark[2]:
> 
> root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:
> 
> threads         before (pf/min)                     after (pf/min)
>      1                32,085,255                         31,880,833 (-0.64%)
>      8               101,674,967                        100,588,311 (-1.17%)
>     16               113,207,000                        112,801,832 (-0.36%)
> 
> ("pf/min" means how many page faults occur in one minute.)
> 
> Page fault performance is ~1% slower than before.
> 
> And there are no obvious changes in perf hot spots:
> 
> before:
>    19.29%  [kernel]  [k] clear_page_rep
>    16.12%  [kernel]  [k] do_user_addr_fault
>     9.57%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     6.16%  [kernel]  [k] get_page_from_freelist
>     5.03%  [kernel]  [k] __handle_mm_fault
>     3.53%  [kernel]  [k] __rcu_read_unlock
>     3.45%  [kernel]  [k] handle_mm_fault
>     3.38%  [kernel]  [k] down_read_trylock
>     2.74%  [kernel]  [k] free_unref_page_list
>     2.17%  [kernel]  [k] up_read
>     1.93%  [kernel]  [k] charge_memcg
>     1.73%  [kernel]  [k] try_charge_memcg
>     1.71%  [kernel]  [k] __alloc_pages
>     1.69%  [kernel]  [k] ___perf_sw_event
>     1.44%  [kernel]  [k] get_mem_cgroup_from_mm
> 
> after:
>    18.19%  [kernel]  [k] clear_page_rep
>    16.28%  [kernel]  [k] do_user_addr_fault
>     8.39%  [kernel]  [k] _raw_spin_unlock_irqrestore
>     5.12%  [kernel]  [k] get_page_from_freelist
>     4.81%  [kernel]  [k] __handle_mm_fault
>     4.68%  [kernel]  [k] down_read_trylock
>     3.80%  [kernel]  [k] handle_mm_fault
>     3.59%  [kernel]  [k] get_mem_cgroup_from_mm
>     2.49%  [kernel]  [k] free_unref_page_list
>     2.41%  [kernel]  [k] up_read
>     2.16%  [kernel]  [k] charge_memcg
>     1.92%  [kernel]  [k] __rcu_read_unlock
>     1.88%  [kernel]  [k] ___perf_sw_event
>     1.70%  [kernel]  [k] pte_get_unless_zero
> 
> This series is based on next-20211108.
> 
> Comments and suggestions are welcome.
> 
> Thanks,
> Qi.
> 
> [1] https://github.com/linux-test-project/ltp
> [2] https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c
> 
> Changelog in v2 -> v3:
>   - Refactored this patch series:
>          - [PATCH v3 6/15]: Introduce the new dummy helpers first
>          - [PATCH v3 7-12/15]: Convert each subsystem individually
>          - [PATCH v3 13/15]: Implement the actual logic to the dummy helpers
>     And thanks for the advice from David and Jason.
>   - Add a document.
> 
> Changelog in v1 -> v2:
>   - Change pte_install() to pmd_install().
>   - Fix some typo and code style problems.
>   - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9],[PATCH v2 6/9]
>     and [PATCH v2 7/9].
> 
> Qi Zheng (15):
>    mm: do code cleanups to filemap_map_pmd()
>    mm: introduce is_huge_pmd() helper
>    mm: move pte_offset_map_lock() to pgtable.h
>    mm: rework the parameter of lock_page_or_retry()
>    mm: add pmd_installed_type return for __pte_alloc() and other friends
>    mm: introduce refcount for user PTE page table page
>    mm/pte_ref: add support for user PTE page table page allocation
>    mm/pte_ref: initialize the refcount of the withdrawn PTE page table
>      page
>    mm/pte_ref: add support for the map/unmap of user PTE page table page
>    mm/pte_ref: add support for page fault path
>    mm/pte_ref: take a refcount before accessing the PTE page table page
>    mm/pte_ref: update the pmd entry in move_normal_pmd()
>    mm/pte_ref: free user PTE page table pages
>    Documentation: add document for pte_ref
>    mm/pte_ref: use mmu_gather to free PTE page table pages
> 
>   Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
>   arch/x86/Kconfig             |   2 +-
>   fs/proc/task_mmu.c           |  24 +++-
>   fs/userfaultfd.c             |   9 +-
>   include/linux/huge_mm.h      |  10 +-
>   include/linux/mm.h           | 170 ++++-------------------------
>   include/linux/mm_types.h     |   6 +-
>   include/linux/pagemap.h      |   8 +-
>   include/linux/pgtable.h      | 152 +++++++++++++++++++++++++-
>   include/linux/pte_ref.h      | 146 +++++++++++++++++++++++++
>   include/linux/rmap.h         |   2 +
>   kernel/events/uprobes.c      |   2 +
>   mm/Kconfig                   |   4 +
>   mm/Makefile                  |   4 +-
>   mm/damon/vaddr.c             |  12 +-
>   mm/debug_vm_pgtable.c        |   5 +-
>   mm/filemap.c                 |  45 +++++---
>   mm/gup.c                     |  25 ++++-
>   mm/hmm.c                     |   5 +-
>   mm/huge_memory.c             |   3 +-
>   mm/internal.h                |   4 +-
>   mm/khugepaged.c              |  21 +++-
>   mm/ksm.c                     |   6 +-
>   mm/madvise.c                 |  21 +++-
>   mm/memcontrol.c              |  12 +-
>   mm/memory-failure.c          |  11 +-
>   mm/memory.c                  | 254 ++++++++++++++++++++++++++++++++-----------
>   mm/mempolicy.c               |   6 +-
>   mm/migrate.c                 |  54 ++++-----
>   mm/mincore.c                 |   7 +-
>   mm/mlock.c                   |   1 +
>   mm/mmu_gather.c              |  40 +++----
>   mm/mprotect.c                |  11 +-
>   mm/mremap.c                  |  14 ++-
>   mm/page_vma_mapped.c         |   4 +
>   mm/pagewalk.c                |  15 ++-
>   mm/pgtable-generic.c         |   1 +
>   mm/pte_ref.c                 | 141 ++++++++++++++++++++++++
>   mm/rmap.c                    |  10 ++
>   mm/swapfile.c                |   3 +
>   mm/userfaultfd.c             |  40 +++++--
>   41 files changed, 1186 insertions(+), 340 deletions(-)
>   create mode 100644 Documentation/vm/pte_ref.rst
>   create mode 100644 include/linux/pte_ref.h
>   create mode 100644 mm/pte_ref.c
> 


* Re: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper
  2021-11-10  8:40 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
@ 2021-11-10 12:29   ` Jason Gunthorpe
  2021-11-10 12:58     ` Qi Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 12:29 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 04:40:44PM +0800, Qi Zheng wrote:
> Currently we sometimes have the following check repeated in the
> code:
> 
> 	is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)
> 
> which determines whether *pmd is a huge pmd, so introduce an
> is_huge_pmd() helper to deduplicate them.

Isn't this pmd_leaf() ?

Jason


* Re: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper
  2021-11-10 12:29   ` Jason Gunthorpe
@ 2021-11-10 12:58     ` Qi Zheng
  2021-11-10 12:59       ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2021-11-10 12:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming


On 11/10/21 8:29 PM, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 04:40:44PM +0800, Qi Zheng wrote:
>> Currently we sometimes have the following check repeated in the
>> code:
>>
>> 	is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)
>>
>> which determines whether *pmd is a huge pmd, so introduce an
>> is_huge_pmd() helper to deduplicate them.
> 
> Isn't this pmd_leaf() ?

Currently, the implementation of pmd_leaf() does not include a
pmd_devmap() check. But considering the semantics of pmd_leaf(),
a "devmap" pmd should also count as a "leaf" pmd. Maybe we should
modify pmd_leaf() to make it more semantically consistent?

By the way, something went wrong when sending this patch set, so
I have re-sent the complete series; please comment over there.

Thanks,
Qi

> 
> Jason
> 


* Re: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper
  2021-11-10 12:58     ` Qi Zheng
@ 2021-11-10 12:59       ` Jason Gunthorpe
  0 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 12:59 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 08:58:35PM +0800, Qi Zheng wrote:
> 
> On 11/10/21 8:29 PM, Jason Gunthorpe wrote:
> > On Wed, Nov 10, 2021 at 04:40:44PM +0800, Qi Zheng wrote:
> > > Currently we sometimes have the following check repeated in the
> > > code:
> > > 
> > > 	is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)
> > > 
> > > which determines whether *pmd is a huge pmd, so introduce an
> > > is_huge_pmd() helper to deduplicate them.
> > 
> > Isn't this pmd_leaf() ?
> 
> Currently, the implementation of pmd_leaf() does not include a
> pmd_devmap() check.

Are you sure? I thought x86 did via the tricky bit checks?

> But considering the semantics of pmd_leaf(), a "devmap" pmd should
> also count as a "leaf" pmd. Maybe we should modify pmd_leaf() to
> make it more semantically consistent?

I would prefer that..

Jason


* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 12:51                           ` David Hildenbrand
@ 2021-11-11 13:01                             ` Qi Zheng
  0 siblings, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-11 13:01 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 8:51 PM, David Hildenbrand wrote:

>>>>
>>>> In the performance test shown on the cover, we repeatedly performed
>>>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>>>> you said above.
>>>>
>>>> We did find a small amount of performance regression, but I think it is
>>>> acceptable, and no new perf hotspots have been added.
>>>
>>> That test always accesses 2MiB and does it from a single thread. Things
>>> might (IMHO will) look different when only accessing individual pages
>>> and doing the access from one/multiple separate threads (that's what
>>
>> No, it includes multi-threading:
>>
> 
> Oh sorry, I totally skipped [2].
> 
>> 	while (1) {
>> 		char *c;
>> 		char *start = mmap_area[cpu];
>> 		char *end = mmap_area[cpu] + FAULT_LENGTH;
>> 		pthread_barrier_wait(&barrier);
>> 		//printf("fault into %p-%p\n",start, end);
>>
>> 		for (c = start; c < end; c += PAGE_SIZE)
>> 			*c = 0;
>>
>> 		pthread_barrier_wait(&barrier);
>> 		for (i = 0; cpu==0 && i < num; i++)
>> 			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
>> 		pthread_barrier_wait(&barrier);
>> 	}
>>
>> Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
>> memory of the threads on the other CPUs.
>>
> 
> I'll have a more detailed look at the benchmark. At a quick glance,

Thank you for your time :)

> looks like the threads are also accessing a full 2MiB range, one page at
> a time, and one thread is zapping the whole 2MiB range. A single CPU
> only accesses memory within one 2MiB range IIRC.
> 
> Having multiple threads just access individual pages within a single 2
> MiB region, and having one thread zap that memory (e.g., simulate
> swapout) could be another benchmark.

LGTM, I will simulate more scenarios for testing.

> 
> We have to make sure to run with THP disabled (e.g., using
> madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
> eventually), because otherwise you might just be populating+zapping THPs
> if they would otherwise be allowed in the environment.

Yes, I turned off THP during testing:

root@~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 12:32                         ` Qi Zheng
@ 2021-11-11 12:51                           ` David Hildenbrand
  2021-11-11 13:01                             ` Qi Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2021-11-11 12:51 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 11.11.21 13:32, Qi Zheng wrote:
> 
> 
> On 11/11/21 8:20 PM, David Hildenbrand wrote:
>> On 11.11.21 13:00, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>>>> On 11.11.21 12:08, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>>>> table in your process you have to exclude each and every page table walker.
>>>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>>>
>>>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>>>
>>>>>>>> It will become a frequent activity if we reclaim an empty PTE page table
>>>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>>>> left with empty page tables to reclaim.
>>>>>>>>
>>>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>>>> user space or automatically from the kernel.
>>>>>>>
>>>>>>> Whether we introduce a special rwsem or scan for empty page
>>>>>>> tables, there are two problems:
>>>>>>>
>>>>>>> 	#1. When to trigger the scanning or releasing?
>>>>>>
>>>>>> For example when reclaiming memory, when scanning page tables in
>>>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>>>> originally looked into). But it certainly requires more locking thought
>>>>>> to avoid stopping essentially any page table walker.
>>>>>>
>>>>>>> 	#2. To release a single 4K page table page, 512 page table
>>>>>>> 	    entries need to be scanned.
>>>>>>
>>>>>> It would happen only when we actually trigger reclaim of page tables
>>>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>>>
>>>>>> For example, khugepaged already scans the page tables either way.
>>>>>>
>>>>>>>
>>>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>>>> kernel, that is great. But the timing is not easy to confirm: do we
>>>>>>> scan and reclaim on every zap or try_to_unmap?
>>>>>>>
>>>>>>> For #2, refcount has advantages.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>>>> to deal with complicated locking.
>>>>>>>>>
>>>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>>>
>>>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>>>> means a table cannot become a leaf. I don't know if there is space
>>>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>>>> too..
>>>>>>>>>
>>>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>>>> least let's have thought about a locking solution before merging
>>>>>>>>> refcounts :)
>>>>>>>>
>>>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>>>
>>>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>>>> space.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>>>> that the release of page table pages can be delayed in some cases.
>>>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>>>
>>>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>>>> no more entries), the longer you wait with reclaim, the longer others
>>>>>> have to wait for populating a fresh page table because the "page table
>>>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>>>> increased for a while, and only drop it after a while. But when? And
>>>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>>>
>>>>>
>>>>> For running VMs with memory ballooning after inflating the balloon, is
>>>>> this a hot path? Even if it is, it already involves releasing and
>>>>> re-allocating physical pages. The overhead after introducing
>>>>> pte_refcount is that we also need to release and re-allocate the page
>>>>> table page. But 2MB of physical pages corresponds to only one 4KiB PTE
>>>>> page table page. So maybe the overhead is not big.
>>>>
>>>> The cases that come to my mind are
>>>>
>>>> a) Swapping on shared memory with concurrent access
>>>> b) Reclaim on file-backed memory with concurrent access
>>>> c) Free page reporting as implemented by virtio-balloon
>>>>
>>>> In all of these cases, you can have someone immediately re-access the
>>>> page table and re-populate it.
>>>
>>> In the performance test shown on the cover, we repeatedly performed
>>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>>> you said above.
>>>
>>> We did find a small amount of performance regression, but I think it is
>>> acceptable, and no new perf hotspots have been added.
>>
>> That test always accesses 2MiB and does it from a single thread. Things
>> might (IMHO will) look different when only accessing individual pages
>> and doing the access from one/multiple separate threads (that's what
> 
> No, it includes multi-threading:
> 

Oh sorry, I totally skipped [2].

> 	while (1) {
> 		char *c;
> 		char *start = mmap_area[cpu];
> 		char *end = mmap_area[cpu] + FAULT_LENGTH;
> 		pthread_barrier_wait(&barrier);
> 		//printf("fault into %p-%p\n",start, end);
> 
> 		for (c = start; c < end; c += PAGE_SIZE)
> 			*c = 0;
> 
> 		pthread_barrier_wait(&barrier);
> 		for (i = 0; cpu==0 && i < num; i++)
> 			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
> 		pthread_barrier_wait(&barrier);
> 	}
> 
> Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
> memory of the threads on the other CPUs.
> 

I'll have a more detailed look at the benchmark. At a quick glance,
looks like the threads are also accessing a full 2MiB range, one page at
a time, and one thread is zapping the whole 2MiB range. A single CPU
only accesses memory within one 2MiB range IIRC.

Having multiple threads just access individual pages within a single 2
MiB region, and having one thread zap that memory (e.g., simulate
swapout) could be another benchmark.

We have to make sure to run with THP disabled (e.g., using
madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
eventually), because otherwise you might just be populating+zapping THPs
if they would otherwise be allowed in the environment.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 12:20                       ` David Hildenbrand
@ 2021-11-11 12:32                         ` Qi Zheng
  2021-11-11 12:51                           ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2021-11-11 12:32 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 8:20 PM, David Hildenbrand wrote:
> On 11.11.21 13:00, Qi Zheng wrote:
>>
>>
>> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>>> On 11.11.21 12:08, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>>> table in your process you have to exclude each and every page table walker.
>>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>>
>>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>>
>>>>>>> It will become a frequent activity if we reclaim an empty PTE page table
>>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>>> left with empty page tables to reclaim.
>>>>>>>
>>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>>> user space or automatically from the kernel.
>>>>>>
>>>>>> Whether we introduce a special rwsem or scan for empty page
>>>>>> tables, there are two problems:
>>>>>>
>>>>>> 	#1. When to trigger the scanning or releasing?
>>>>>
>>>>> For example when reclaiming memory, when scanning page tables in
>>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>>> originally looked into). But it certainly requires more locking thought
>>>>> to avoid stopping essentially any page table walker.
>>>>>
>>>>>> 	#2. To release a single 4K page table page, 512 page table
>>>>>> 	    entries need to be scanned.
>>>>>
>>>>> It would happen only when we actually trigger reclaim of page tables
>>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>>
>>>>> For example, khugepaged already scans the page tables either way.
>>>>>
>>>>>>
>>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>>> kernel, that is great. But the timing is not easy to confirm: do we
>>>>>> scan and reclaim on every zap or try_to_unmap?
>>>>>>
>>>>>> For #2, refcount has advantages.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>>> to deal with complicated locking.
>>>>>>>>
>>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>>
>>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>>> means a table cannot become a leaf. I don't know if there is space
>>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>>> too..
>>>>>>>>
>>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>>> least let's have thought about a locking solution before merging
>>>>>>>> refcounts :)
>>>>>>>
>>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>>
>>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>>> space.
>>>>>>>
>>>>>>
>>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>>> that the release of page table pages can be delayed in some cases.
>>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>>
>>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>>> no more entries), the longer you wait with reclaim, the longer others
>>>>> have to wait for populating a fresh page table because the "page table
>>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>>> increased for a while, and only drop it after a while. But when? And
>>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>>
>>>>
>>>> For running VMs with memory ballooning after inflating the balloon, is
>>>> this a hot path? Even if it is, it already involves releasing and
>>>> re-allocating physical pages. The overhead after introducing
>>>> pte_refcount is that we also need to release and re-allocate the page
>>>> table page. But 2MB of physical pages corresponds to only one 4KiB PTE
>>>> page table page. So maybe the overhead is not big.
>>>
>>> The cases that come to my mind are
>>>
>>> a) Swapping on shared memory with concurrent access
>>> b) Reclaim on file-backed memory with concurrent access
>>> c) Free page reporting as implemented by virtio-balloon
>>>
>>> In all of these cases, you can have someone immediately re-access the
>>> page table and re-populate it.
>>
>> In the performance test shown on the cover, we repeatedly performed
>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>> you said above.
>>
>> We did find a small amount of performance regression, but I think it is
>> acceptable, and no new perf hotspots have been added.
> 
> That test always accesses 2MiB and does it from a single thread. Things
> might (IMHO will) look different when only accessing individual pages
> and doing the access from one/multiple separate threads (that's what

No, it includes multi-threading:

	while (1) {
		char *c;
		char *start = mmap_area[cpu];
		char *end = mmap_area[cpu] + FAULT_LENGTH;
		pthread_barrier_wait(&barrier);
		//printf("fault into %p-%p\n",start, end);

		for (c = start; c < end; c += PAGE_SIZE)
			*c = 0;

		pthread_barrier_wait(&barrier);
		for (i = 0; cpu==0 && i < num; i++)
			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
		pthread_barrier_wait(&barrier);
	}

Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
memory of the threads on the other CPUs.

> a),b) and c) essentially do, they don't do it in the pattern you
> measured. What you measured rather matches a typical memory allocator).
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 12:00                     ` Qi Zheng
@ 2021-11-11 12:20                       ` David Hildenbrand
  2021-11-11 12:32                         ` Qi Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2021-11-11 12:20 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 11.11.21 13:00, Qi Zheng wrote:
> 
> 
> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>> On 11.11.21 12:08, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>> table in your process you have to exclude each and every page table walker.
>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>
>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>
>>>>>> It will become a frequent activity if we reclaim an empty PTE page table
>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>> left with empty page tables to reclaim.
>>>>>>
>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>> user space or automatically from the kernel.
>>>>>
>>>>> Whether we introduce a special rwsem or scan for empty page
>>>>> tables, there are two problems:
>>>>>
>>>>> 	#1. When to trigger the scanning or releasing?
>>>>
>>>> For example when reclaiming memory, when scanning page tables in
>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>> originally looked into). But it certainly requires more locking thought
>>>> to avoid stopping essentially any page table walker.
>>>>
>>>>> 	#2. To release a single 4K page table page, 512 page table
>>>>> 	    entries need to be scanned.
>>>>
>>>> It would happen only when we actually trigger reclaim of page tables
>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>
>>>> For example, khugepaged already scans the page tables either way.
>>>>
>>>>>
>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>> kernel, that is great. But the timing is not easy to confirm: do we
>>>>> scan and reclaim on every zap or try_to_unmap?
>>>>>
>>>>> For #2, refcount has advantages.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>> to deal with complicated locking.
>>>>>>>
>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>
>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>> means a table cannot become a leaf. I don't know if there is space
>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>> too..
>>>>>>>
>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>> least let's have thought about a locking solution before merging
>>>>>>> refcounts :)
>>>>>>
>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>
>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>> space.
>>>>>>
>>>>>
>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>> that the release of page table pages can be delayed in some cases.
>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>
>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>> no more entries), the longer you wait with reclaim, the longer others
>>>> have to wait for populating a fresh page table because the "page table
>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>> increased for a while, and only drop it after a while. But when? And
>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>
>>>
>>> For running VMs with memory ballooning after inflating the balloon, is
>>> this a hot path? Even if it is, it already involves releasing and
>>> re-allocating physical pages. The overhead after introducing
>>> pte_refcount is that we also need to release and re-allocate the page
>>> table page. But 2MB of physical pages corresponds to only one 4KiB PTE
>>> page table page. So maybe the overhead is not big.
>>
>> The cases that come to my mind are
>>
>> a) Swapping on shared memory with concurrent access
>> b) Reclaim on file-backed memory with concurrent access
>> c) Free page reporting as implemented by virtio-balloon
>>
>> In all of these cases, you can have someone immediately re-access the
>> page table and re-populate it.
> 
> In the performance test shown on the cover, we repeatedly performed
> touch and madvise(MADV_DONTNEED) actions, which simulated the case
> you said above.
> 
> We did find a small amount of performance regression, but I think it is
> acceptable, and no new perf hotspots have been added.

That test always accesses 2MiB and does it from a single thread. Things
might (IMHO will) look different when only accessing individual pages
and doing the access from one/multiple separate threads (that's what
a),b) and c) essentially do, they don't do it in the pattern you
measured. What you measured rather matches a typical memory allocator).


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 11:19                   ` David Hildenbrand
@ 2021-11-11 12:00                     ` Qi Zheng
  2021-11-11 12:20                       ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2021-11-11 12:00 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 7:19 PM, David Hildenbrand wrote:
> On 11.11.21 12:08, Qi Zheng wrote:
>>
>>
>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>> table in your process you have to exclude each and every page table walker.
>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>
>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>
>>>>> It will become a frequent activity if we reclaim an empty PTE page table
>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>> left with empty page tables to reclaim.
>>>>>
>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>> user space or automatically from the kernel.
>>>>
>>>> Whether we introduce a special rwsem or scan for empty page
>>>> tables, there are two problems:
>>>>
>>>> 	#1. When to trigger the scanning or releasing?
>>>
>>> For example when reclaiming memory, when scanning page tables in
>>> khugepaged, or triggered by user space (note that this is the approach I
>>> originally looked into). But it certainly requires more locking thought
>>> to avoid stopping essentially any page table walker.
>>>
>>>> 	#2. To release a single 4K page table page, 512 page table
>>>> 	    entries need to be scanned.
>>>
>>> It would happen only when we actually trigger reclaim of page tables
>>> (again, someone has to trigger it), so it's barely an issue.
>>>
>>> For example, khugepaged already scans the page tables either way.
>>>
>>>>
>>>> For #1, if the scanning is triggered manually from user space, the
>>>> kernel is relatively passive, and the user does not fully know the best
>>>> timing to scan. If the scanning is triggered automatically from the
>>>> kernel, that is great. But the timing is not easy to confirm: do we
>>>> scan and reclaim on every zap or try_to_unmap?
>>>>
>>>> For #2, refcount has advantages.
>>>>
>>>>>
>>>>>>
>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>> to deal with complicated locking.
>>>>>>
>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>
>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>> means a table cannot become a leaf. I don't know if there is space
>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>> too..
>>>>>>
>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>> least let's have thought about a locking solution before merging
>>>>>> refcounts :)
>>>>>
>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>
>>>>> Also, it adds complexity for something that is only a problem in some
>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>> might be good with just triggering page table reclaim manually from user
>>>>> space.
>>>>>
>>>>
>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>> that the release of page table pages can be delayed in some cases.
>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>
>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>> no more entries), the longer you wait with reclaim, the longer others
>>> have to wait for populating a fresh page table because the "page table
>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>> increased for a while, and only drop it after a while. But when? And
>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>
>>
>> For running VMs with memory ballooning after inflating the balloon, is
>> this a hot path? Even if it is, it already involves releasing and
>> re-allocating physical pages. The overhead after introducing
>> pte_refcount is that we also need to release and re-allocate the page
>> table page. But 2MB of physical pages corresponds to only one 4KiB PTE
>> page table page. So maybe the overhead is not big.
> 
> The cases that come to my mind are
> 
> a) Swapping on shared memory with concurrent access
> b) Reclaim on file-backed memory with concurrent access
> c) Free page reporting as implemented by virtio-balloon
> 
> In all of these cases, you can have someone immediately re-access the
> page table and re-populate it.

In the performance test shown on the cover, we repeatedly performed
touch and madvise(MADV_DONTNEED) actions, which simulated the case
you said above.

We did find a small amount of performance regression, but I think it is
acceptable, and no new perf hotspots have been added.

> 
> For something mostly static (balloon inflation, memory allocator), it's
> not that big of a deal I guess.
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 11:08                 ` Qi Zheng
@ 2021-11-11 11:19                   ` David Hildenbrand
  2021-11-11 12:00                     ` Qi Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2021-11-11 11:19 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 11.11.21 12:08, Qi Zheng wrote:
> 
> 
> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>> On 11.11.21 04:58, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>> table in your process, you have to exclude each and every page table walker.
>>>>>> Or did I misinterpret what you were saying?
>>>>>
>>>>> That is one possible design, it favours fast walking and penalizes
>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>> leaf pointer" so it isn't an everyday activity.
>>>>
>>>> It will be if we reclaim empty PTE page tables as soon as they turn
>>>> empty, which can happen somewhat frequently: not only when zapping, but also
>>>> during writeback/swapping. So while writing back / swapping you might be
>>>> left with empty page tables to reclaim.
>>>>
>>>> Of course, this is the current approach. Another approach that doesn't
>>>> require additional refcounts is scanning page tables for empty ones and
>>>> reclaiming them. This scanning can either be triggered manually from
>>>> user space or automatically from the kernel.
>>>
>>> Whether we introduce a special rwsem or scan for empty page
>>> tables, there are two problems:
>>>
>>> 	#1. When to trigger the scanning or releasing?
>>
>> For example when reclaiming memory, when scanning page tables in
>> khugepaged, or triggered by user space (note that this is the approach I
>> originally looked into). But it certainly requires more locking thought
>> to avoid stopping essentially any page table walker.
>>
>>> 	#2. Every time we release a 4K page table page, 512 page table
>>> 	    entries need to be scanned.
>>
>> It would happen only when we actually trigger reclaim of page tables
>> (again, someone has to trigger it), so it's barely an issue.
>>
>> For example, khugepaged already scans the page tables either way.
>>
>>>
>>> For #1, if the scanning is triggered manually from user space, the
>>> kernel is relatively passive, and the user does not fully know the best
>>> timing to scan. If the scanning is triggered automatically from the
>>> kernel, that is great. But the timing is not easy to determine: would
>>> we scan and reclaim on every zap or try_to_unmap?
>>>
>>> For #2, the refcount approach has the advantage of not needing the
>>> scan at all.
>>>
>>>>
>>>>>
>>>>> There is some advantage with this thinking because it harmonizes well
>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>> to deal with complicated locking.
>>>>>
>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>
>>>>> Perhaps the ideal thing would be to stick an rwsem in the PMD: read
>>>>> means a table cannot become a leaf. I don't know if there is space
>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>> too..
>>>>>
>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>> least let's have thought about a locking solution before merging
>>>>> refcounts :)
>>>>
>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>> it just reclaims "automatically" once possible -- page table empty and
>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>> to reclaim an empty page table immediately once it turns empty.
>>>>
>>>> Also, it adds complexity for something that is only a problem in some
>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>> memory allocators after freeing a lot of memory or running VMs with
>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>> might be good with just triggering page table reclaim manually from user
>>>> space.
>>>>
>>>
>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>> that the release of page table pages can be delayed in some cases.
>>> Similar to the lazyfree mechanism in MADV_FREE?
>>
>> The issue AFAIU is that once your refcount hits 0 (no more references,
>> no more entries), the longer you wait with reclaim, the longer others
>> have to wait for populating a fresh page table because the "page table
>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>> increased for a while, and only drop it after a while. But when? And
>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>
> 
> For running VMs with memory ballooning after inflating the balloon, is
> this a hot path? Even if it is, it already involves releasing and
> re-allocating physical pages. The overhead added by pte_refcount is
> that we also need to release and re-allocate the page table page. But
> 2MiB of physical memory corresponds to only 4KiB of PTE page table, so
> maybe the overhead is not big.

The cases that come to my mind are

a) Swapping on shared memory with concurrent access
b) Reclaim on file-backed memory with concurrent access
c) Free page reporting as implemented by virtio-balloon

In all of these cases, you can have someone immediately re-access the
page table and re-populate it.

For something mostly static (balloon inflation, memory allocator), it's
not that big of a deal I guess.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11  9:22               ` David Hildenbrand
@ 2021-11-11 11:08                 ` Qi Zheng
  2021-11-11 11:19                   ` David Hildenbrand
  0 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2021-11-11 11:08 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 5:22 PM, David Hildenbrand wrote:
> On 11.11.21 04:58, Qi Zheng wrote:
>>
>>
>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>> is a step into the right direction. If you want to modify *some* page
>>>>> table in your process, you have to exclude each and every page table walker.
>>>>> Or did I misinterpret what you were saying?
>>>>
>>>> That is one possible design, it favours fast walking and penalizes
>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>> refcount) and still logically be using a lock instead of a refcount
>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>> leaf pointer" so it isn't an everyday activity.
>>>
>>> It will be if we reclaim empty PTE page tables as soon as they turn
>>> empty, which can happen somewhat frequently: not only when zapping, but also
>>> during writeback/swapping. So while writing back / swapping you might be
>>> left with empty page tables to reclaim.
>>>
>>> Of course, this is the current approach. Another approach that doesn't
>>> require additional refcounts is scanning page tables for empty ones and
>>> reclaiming them. This scanning can either be triggered manually from
>>> user space or automatically from the kernel.
>>
>> Whether we introduce a special rwsem or scan for empty page
>> tables, there are two problems:
>>
>> 	#1. When to trigger the scanning or releasing?
> 
> For example when reclaiming memory, when scanning page tables in
> khugepaged, or triggered by user space (note that this is the approach I
> originally looked into). But it certainly requires more locking thought
> to avoid stopping essentially any page table walker.
> 
>> 	#2. Every time we release a 4K page table page, 512 page table
>> 	    entries need to be scanned.
> 
> It would happen only when we actually trigger reclaim of page tables
> (again, someone has to trigger it), so it's barely an issue.
> 
> For example, khugepaged already scans the page tables either way.
> 
>>
>> For #1, if the scanning is triggered manually from user space, the
>> kernel is relatively passive, and the user does not fully know the best
>> timing to scan. If the scanning is triggered automatically from the
>> kernel, that is great. But the timing is not easy to determine: would
>> we scan and reclaim on every zap or try_to_unmap?
>>
>> For #2, the refcount approach has the advantage of not needing the
>> scan at all.
>>
>>>
>>>>
>>>> There is some advantage with this thinking because it harmonizes well
>>>> with the other stuff that wants to convert tables into leafs, but has
>>>> to deal with complicated locking.
>>>>
>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>> help with freeing pages. It also puts more atomics in normal fast
>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>
>>>> Perhaps the ideal thing would be to stick an rwsem in the PMD: read
>>>> means a table cannot become a leaf. I don't know if there is space
>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>> too..
>>>>
>>>> I wouldn't be so quick to say one is better than the other, but at
>>>> least let's have thought about a locking solution before merging
>>>> refcounts :)
>>>
>>> Yes, absolutely. I can see the beauty in the current approach, because
>>> it just reclaims "automatically" once possible -- page table empty and
>>> nobody is walking it. The downside is that it doesn't always make sense
>>> to reclaim an empty page table immediately once it turns empty.
>>>
>>> Also, it adds complexity for something that is only a problem in some
>>> corner cases -- sparse memory mappings, especially relevant for some
>>> memory allocators after freeing a lot of memory or running VMs with
>>> memory ballooning after inflating the balloon. Some of these use cases
>>> might be good with just triggering page table reclaim manually from user
>>> space.
>>>
>>
>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>> that the release of page table pages can be delayed in some cases.
>> Similar to the lazyfree mechanism in MADV_FREE?
> 
> The issue AFAIU is that once your refcount hits 0 (no more references,
> no more entries), the longer you wait with reclaim, the longer others
> have to wait for populating a fresh page table because the "page table
> to be reclaimed" is still stuck around. You'd have to keep the refcount
> increased for a while, and only drop it after a while. But when? And
> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
> 

For running VMs with memory ballooning after inflating the balloon, is
this a hot path? Even if it is, it already involves releasing and
re-allocating physical pages. The overhead added by pte_refcount is
that we also need to release and re-allocate the page table page. But
2MiB of physical memory corresponds to only 4KiB of PTE page table, so
maybe the overhead is not big.
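
As a quick sanity check of that ratio, here is a userspace sketch
(hypothetical helper names; it assumes x86-64 geometry with 4 KiB base
pages, 8-byte PTEs, and 512 entries per PTE table):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 geometry: 4 KiB base pages, 512 8-byte PTEs per table. */
#define PAGE_SIZE_BYTES 4096ULL
#define PTRS_PER_PTE    512ULL

/* One 4 KiB PTE table maps 512 * 4 KiB = 2 MiB of virtual address space. */
static uint64_t pte_table_coverage(void)
{
	return PTRS_PER_PTE * PAGE_SIZE_BYTES;
}

/* Bytes of PTE page tables needed to fully map a virtual range. */
static uint64_t pte_tables_bytes(uint64_t virt_bytes)
{
	return virt_bytes / pte_table_coverage() * PAGE_SIZE_BYTES;
}
```

For the numbers in the cover letter, VIRT = 55 TiB works out to
55 TiB / 2 MiB * 4 KiB = 110 GiB of PTE tables (matching VmPTE: 110g),
while RES = 590 GiB would need only roughly 1.2 GiB.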

In fact, the performance test shown in the cover letter covers exactly this case:

	test program: 
https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c

Thanks,
Qi

> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11  3:58             ` Qi Zheng
@ 2021-11-11  9:22               ` David Hildenbrand
  2021-11-11 11:08                 ` Qi Zheng
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2021-11-11  9:22 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 11.11.21 04:58, Qi Zheng wrote:
> 
> 
> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>> is a step into the right direction. If you want to modify *some* page
>>>> table in your process, you have to exclude each and every page table walker.
>>>> Or did I misinterpret what you were saying?
>>>
>>> That is one possible design, it favours fast walking and penalizes
>>> mutation. We could also stick a lock in the PMD (instead of a
>>> refcount) and still logically be using a lock instead of a refcount
>>> scheme. Remember modify here is "want to change a table pointer into a
>>> leaf pointer" so it isn't an everyday activity.
>>
>> It will be if we reclaim empty PTE page tables as soon as they turn
>> empty, which can happen somewhat frequently: not only when zapping, but also
>> during writeback/swapping. So while writing back / swapping you might be
>> left with empty page tables to reclaim.
>>
>> Of course, this is the current approach. Another approach that doesn't
>> require additional refcounts is scanning page tables for empty ones and
>> reclaiming them. This scanning can either be triggered manually from
>> user space or automatically from the kernel.
> 
> Whether we introduce a special rwsem or scan for empty page
> tables, there are two problems:
> 
> 	#1. When to trigger the scanning or releasing?

For example when reclaiming memory, when scanning page tables in
khugepaged, or triggered by user space (note that this is the approach I
originally looked into). But it certainly requires more locking thought
to avoid stopping essentially any page table walker.

> 	#2. Every time we release a 4K page table page, 512 page table
> 	    entries need to be scanned.

It would happen only when we actually trigger reclaim of page tables
(again, someone has to trigger it), so it's barely an issue.

For example, khugepaged already scans the page tables either way.

> 
> For #1, if the scanning is triggered manually from user space, the
> kernel is relatively passive, and the user does not fully know the best
> timing to scan. If the scanning is triggered automatically from the
> kernel, that is great. But the timing is not easy to determine: would
> we scan and reclaim on every zap or try_to_unmap?
> 
> For #2, the refcount approach has the advantage of not needing the
> scan at all.
> 
>>
>>>
>>> There is some advantage with this thinking because it harmonizes well
>>> with the other stuff that wants to convert tables into leafs, but has
>>> to deal with complicated locking.
>>>
>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>> help with freeing pages. It also puts more atomics in normal fast
>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>
>>> Perhaps the ideal thing would be to stick an rwsem in the PMD: read
>>> means a table cannot become a leaf. I don't know if there is space
>>> for another atomic in the PMD level, and we'd have to use a hitching
>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>> too..
>>>
>>> I wouldn't be so quick to say one is better than the other, but at
>>> least let's have thought about a locking solution before merging
>>> refcounts :)
>>
>> Yes, absolutely. I can see the beauty in the current approach, because
>> it just reclaims "automatically" once possible -- page table empty and
>> nobody is walking it. The downside is that it doesn't always make sense
>> to reclaim an empty page table immediately once it turns empty.
>>
>> Also, it adds complexity for something that is only a problem in some
>> corner cases -- sparse memory mappings, especially relevant for some
>> memory allocators after freeing a lot of memory or running VMs with
>> memory ballooning after inflating the balloon. Some of these use cases
>> might be good with just triggering page table reclaim manually from user
>> space.
>>
> 
> Yes, this is indeed a problem. Perhaps some flags can be introduced so
> that the release of page table pages can be delayed in some cases.
> Similar to the lazyfree mechanism in MADV_FREE?

The issue AFAIU is that once your refcount hits 0 (no more references,
no more entries), the longer you wait with reclaim, the longer others
have to wait for populating a fresh page table because the "page table
to be reclaimed" is still stuck around. You'd have to keep the refcount
increased for a while, and only drop it after a while. But when? And
how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
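
The free-on-zero behavior under discussion can be modeled in userspace
(all names here are invented for illustration; this sketches the shape
of the scheme, not the actual patch): each mapped PTE and each active
walker holds a reference, and the table page is freed the moment the
count drops to zero, which is exactly why delaying reclaim would mean
artificially keeping the count raised.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model of a PTE-table page with a refcount. */
struct pte_table {
	atomic_int pte_refcount;	/* mapped entries + active walkers */
	/* ... the 512 PTEs would live here ... */
};

/* A walker pins the table, unless it already hit zero and is being freed. */
static bool pte_table_try_get(struct pte_table *t)
{
	int old = atomic_load(&t->pte_refcount);

	do {
		if (old == 0)
			return false;	/* too late: reclaim already won */
	} while (!atomic_compare_exchange_weak(&t->pte_refcount, &old, old + 1));
	return true;
}

/* Dropping the last reference frees the table page immediately; there
 * is no grace period in this model. */
static bool pte_table_put(struct pte_table *t)
{
	if (atomic_fetch_sub(&t->pte_refcount, 1) == 1) {
		free(t);
		return true;	/* table page reclaimed */
	}
	return false;
}
```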


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 17:37           ` David Hildenbrand
  2021-11-10 17:49             ` Jason Gunthorpe
@ 2021-11-11  3:58             ` Qi Zheng
  2021-11-11  9:22               ` David Hildenbrand
  1 sibling, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2021-11-11  3:58 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>> is a step into the right direction. If you want to modify *some* page
>>> table in your process, you have to exclude each and every page table walker.
>>> Or did I misinterpret what you were saying?
>>
>> That is one possible design, it favours fast walking and penalizes
>> mutation. We could also stick a lock in the PMD (instead of a
>> refcount) and still logically be using a lock instead of a refcount
>> scheme. Remember modify here is "want to change a table pointer into a
>> leaf pointer" so it isn't an everyday activity.
> 
> It will be if we reclaim empty PTE page tables as soon as they turn
> empty, which can happen somewhat frequently: not only when zapping, but also
> during writeback/swapping. So while writing back / swapping you might be
> left with empty page tables to reclaim.
> 
> Of course, this is the current approach. Another approach that doesn't
> require additional refcounts is scanning page tables for empty ones and
> reclaiming them. This scanning can either be triggered manually from
> user space or automatically from the kernel.

Whether we introduce a special rwsem or scan for empty page
tables, there are two problems:

	#1. When to trigger the scanning or releasing?
	#2. Every time we release a 4K page table page, 512 page table
	    entries need to be scanned.

For #1, if the scanning is triggered manually from user space, the
kernel is relatively passive, and the user does not fully know the best
timing to scan. If the scanning is triggered automatically from the
kernel, that is great. But the timing is not easy to determine: would
we scan and reclaim on every zap or try_to_unmap?

For #2, the refcount approach has the advantage of not needing the
scan at all.
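
The cost in #2 can be sketched like this (a userspace stand-in with
invented names; `pte_none()` is modeled as an all-zero entry): without a
refcount, all 512 entries of a candidate 4K table must be scanned before
the page can be released, whereas a refcount answers "is it empty?" in
O(1).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PTE 512

typedef uint64_t pte_t;	/* stand-in for the real pte_t */

/* Stand-in for pte_none(): an all-zero entry maps nothing. */
static bool pte_none(pte_t pte)
{
	return pte == 0;
}

/* The scan from #2: every one of the 512 entries is checked before the
 * 4K page table page may be released. */
static bool pte_table_is_empty(const pte_t *table)
{
	for (size_t i = 0; i < PTRS_PER_PTE; i++)
		if (!pte_none(table[i]))
			return false;
	return true;
}
```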

> 
>>
>> There is some advantage with this thinking because it harmonizes well
>> with the other stuff that wants to convert tables into leafs, but has
>> to deal with complicated locking.
>>
>> On the other hand, refcounts are a degenerate kind of rwsem and only
>> help with freeing pages. It also puts more atomics in normal fast
>> paths since we are refcounting each PTE, not read locking the PMD.
>>
>> Perhaps the ideal thing would be to stick an rwsem in the PMD: read
>> means a table cannot become a leaf. I don't know if there is space
>> for another atomic in the PMD level, and we'd have to use a hitching
>> post/hashed waitq scheme too since there surely isn't room for a waitq
>> too..
>>
>> I wouldn't be so quick to say one is better than the other, but at
>> least let's have thought about a locking solution before merging
>> refcounts :)
> 
> Yes, absolutely. I can see the beauty in the current approach, because
> it just reclaims "automatically" once possible -- page table empty and
> nobody is walking it. The downside is that it doesn't always make sense
> to reclaim an empty page table immediately once it turns empty.
> 
> Also, it adds complexity for something that is only a problem in some
> corner cases -- sparse memory mappings, especially relevant for some
> memory allocators after freeing a lot of memory or running VMs with
> memory ballooning after inflating the balloon. Some of these use cases
> might be good with just triggering page table reclaim manually from user
> space.
> 

Yes, this is indeed a problem. Perhaps some flags can be introduced so
that the release of page table pages can be delayed in some cases.
Similar to the lazyfree mechanism in MADV_FREE?

Thanks,
Qi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 17:37           ` David Hildenbrand
@ 2021-11-10 17:49             ` Jason Gunthorpe
  2021-11-11  3:58             ` Qi Zheng
  1 sibling, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 17:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 06:37:46PM +0100, David Hildenbrand wrote:

> It will be if we reclaim empty PTE page tables as soon as they turn
> empty, which can happen somewhat frequently.

What do you think is the frequency of unlocked page table walks that
this would have to block on?

> Also, it adds complexity for something that is only a problem in some
> corner cases -- sparse memory mappings, especially relevant for some
> memory allocators after freeing a lot of memory or running VMs with
> memory ballooning after inflating the balloon. Some of these use cases
> might be good with just triggering page table reclaim manually from user
> space.

Right, this is why it would be nice if the complexity could address
more than one problem, like the existing complex locking around the
thp stuff..

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 16:39         ` Jason Gunthorpe
@ 2021-11-10 17:37           ` David Hildenbrand
  2021-11-10 17:49             ` Jason Gunthorpe
  2021-11-11  3:58             ` Qi Zheng
  0 siblings, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2021-11-10 17:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

>> It would still be a fairly coarse-grained locking, I am not sure if that
>> is a step into the right direction. If you want to modify *some* page
>> table in your process, you have to exclude each and every page table walker.
>> Or did I misinterpret what you were saying?
> 
> That is one possible design, it favours fast walking and penalizes
> mutation. We could also stick a lock in the PMD (instead of a
> refcount) and still logically be using a lock instead of a refcount
> scheme. Remember modify here is "want to change a table pointer into a
> leaf pointer" so it isn't an everyday activity.

It will be if we reclaim empty PTE page tables as soon as they turn
empty, which can happen somewhat frequently: not only when zapping, but also
during writeback/swapping. So while writing back / swapping you might be
left with empty page tables to reclaim.

Of course, this is the current approach. Another approach that doesn't
require additional refcounts is scanning page tables for empty ones and
reclaiming them. This scanning can either be triggered manually from
user space or automatically from the kernel.

> 
> There is some advantage with this thinking because it harmonizes well
> with the other stuff that wants to convert tables into leafs, but has
> to deal with complicated locking.
> 
> On the other hand, refcounts are a degenerate kind of rwsem and only
> help with freeing pages. It also puts more atomics in normal fast
> paths since we are refcounting each PTE, not read locking the PMD.
> 
> Perhaps the ideal thing would be to stick an rwsem in the PMD: read
> means a table cannot become a leaf. I don't know if there is space
> for another atomic in the PMD level, and we'd have to use a hitching
> post/hashed waitq scheme too since there surely isn't room for a waitq
> too..
> 
> I wouldn't be so quick to say one is better than the other, but at
> least let's have thought about a locking solution before merging
> refcounts :)

Yes, absolutely. I can see the beauty in the current approach, because
it just reclaims "automatically" once possible -- page table empty and
nobody is walking it. The downside is that it doesn't always make sense
to reclaim an empty page table immediately once it turns empty.

Also, it adds complexity for something that is only a problem in some
corner cases -- sparse memory mappings, especially relevant for some
memory allocators after freeing a lot of memory or running VMs with
memory ballooning after inflating the balloon. Some of these use cases
might be good with just triggering page table reclaim manually from user
space.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 16:53           ` David Hildenbrand
@ 2021-11-10 16:56             ` Jason Gunthorpe
  0 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 16:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Qi Zheng, akpm, tglx, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming

On Wed, Nov 10, 2021 at 05:53:57PM +0100, David Hildenbrand wrote:
> On 10.11.21 17:49, Matthew Wilcox wrote:
> > On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
> >>> I'd suggest to make this new lock a special rwsem which allows either
> >>> concurrent read access OR concurrent PTL access, but not both. This
> >>
> >> I looked into such a lock recently in similar context and something like
> >> that does not exist yet (and fairness will be challenging). You either
> >> have a single reader or multiple writer. I'd be interested if someone
> >> knows of something like that.
> > 
> > We've talked about having such a lock before for filesystems which want
> > to permit either many direct-IO accesses or many buffered-IO accesses, but
> > want to exclude a mixture of direct-IO and buffered-IO.  The name we came
> > up with for such a lock was the red-blue lock.  Either Team Red has the
> > lock, or Team Blue has the lock (or it's free).  Indicate free with value
> > zero, Team Red with positive numbers and Team Blue with negative numbers.
> > If we need to indicate an opposing team is waiting for the semaphore,
> > we can use a high bit (1 << 30) to indicate no new team members can
> > acquire the lock.  Not sure whether anybody ever coded it up.
> 
> Interesting, thanks for sharing!
> 
> My extensive Google search didn't reveal anything back then (~3 months
> ago) -- only questions on popular coding websites asking essentially for
> the same thing without any helpful replies. But maybe I used the wrong
> keywords (e.g., "multiple reader, multiple writer", I certainly didn't
> search for "red-blue lock", but I do like the name :) ).
> 
> Fairness might still be the biggest issue, but I am certainly no locking
> expert.

Fairness could use the same basic logic as the writer preference over
readers in the rwsem.

The atomic implementation would be with atomic_dec_unless_positive()
and atomic_inc_unless_negative(); without fairness it looks
straightforward.
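
A minimal userspace sketch of that counter, using C11 atomics to mimic
the kernel's atomic_inc_unless_negative()/atomic_dec_unless_positive()
(function and type names are invented; no fairness, no waiter bit, no
sleeping, as noted): 0 means free, a positive count means that many red
holders, a negative count that many blue holders.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Red-blue lock state: 0 = free, N > 0 = N red holders,
 * N < 0 = -N blue holders. Mixtures of red and blue are excluded. */
typedef struct {
	atomic_int count;
} redblue_lock_t;

/* Red side: increment unless the count is negative, i.e. a hand-rolled
 * equivalent of atomic_inc_unless_negative(). */
static bool redblue_trylock_red(redblue_lock_t *l)
{
	int old = atomic_load(&l->count);

	do {
		if (old < 0)
			return false;	/* Team Blue holds the lock */
	} while (!atomic_compare_exchange_weak(&l->count, &old, old + 1));
	return true;
}

/* Blue side: decrement unless the count is positive, mirroring
 * atomic_dec_unless_positive(). */
static bool redblue_trylock_blue(redblue_lock_t *l)
{
	int old = atomic_load(&l->count);

	do {
		if (old > 0)
			return false;	/* Team Red holds the lock */
	} while (!atomic_compare_exchange_weak(&l->count, &old, old - 1));
	return true;
}

static void redblue_unlock_red(redblue_lock_t *l)
{
	atomic_fetch_sub(&l->count, 1);
}

static void redblue_unlock_blue(redblue_lock_t *l)
{
	atomic_fetch_add(&l->count, 1);
}
```

A real version would still need the waiting/fairness machinery Matthew
mentions (e.g. a high bit to block new members of the holding team).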

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 16:49         ` Matthew Wilcox
@ 2021-11-10 16:53           ` David Hildenbrand
  2021-11-10 16:56             ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: David Hildenbrand @ 2021-11-10 16:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Qi Zheng, akpm, tglx, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming

On 10.11.21 17:49, Matthew Wilcox wrote:
> On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
>>> I'd suggest to make this new lock a special rwsem which allows either
>>> concurrent read access OR concurrent PTL access, but not both. This
>>
>> I looked into such a lock recently in similar context and something like
>> that does not exist yet (and fairness will be challenging). You either
>> have a single reader or multiple writer. I'd be interested if someone
>> knows of something like that.
> 
> We've talked about having such a lock before for filesystems which want
> to permit either many direct-IO accesses or many buffered-IO accesses, but
> want to exclude a mixture of direct-IO and buffered-IO.  The name we came
> up with for such a lock was the red-blue lock.  Either Team Red has the
> lock, or Team Blue has the lock (or it's free).  Indicate free with value
> zero, Team Red with positive numbers and Team Blue with negative numbers.
> If we need to indicate an opposing team is waiting for the semaphore,
> we can use a high bit (1 << 30) to indicate no new team members can
> acquire the lock.  Not sure whether anybody ever coded it up.

Interesting, thanks for sharing!

My extensive Google search didn't reveal anything back then (~3 months
ago) -- only questions on popular coding websites asking essentially for
the same thing without any helpful replies. But maybe I used the wrong
keywords (e.g., "multiple reader, multiple writer", I certainly didn't
search for "red-blue lock", but I do like the name :) ).

Fairness might still be the biggest issue, but I am certainly no locking
expert.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 15:37       ` David Hildenbrand
  2021-11-10 16:39         ` Jason Gunthorpe
@ 2021-11-10 16:49         ` Matthew Wilcox
  2021-11-10 16:53           ` David Hildenbrand
  1 sibling, 1 reply; 36+ messages in thread
From: Matthew Wilcox @ 2021-11-10 16:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Qi Zheng, akpm, tglx, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming

On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
> > I'd suggest to make this new lock a special rwsem which allows either
> > concurrent read access OR concurrent PTL access, but not both. This
> 
> I looked into such a lock recently in similar context and something like
> that does not exist yet (and fairness will be challenging). You either
have a single writer or multiple readers. I'd be interested if someone
> knows of something like that.

We've talked about having such a lock before for filesystems which want
to permit either many direct-IO accesses or many buffered-IO accesses, but
want to exclude a mixture of direct-IO and buffered-IO.  The name we came
up with for such a lock was the red-blue lock.  Either Team Red has the
lock, or Team Blue has the lock (or it's free).  Indicate free with value
zero, Team Red with positive numbers and Team Blue with negative numbers.
If we need to indicate an opposing team is waiting for the semaphore,
we can use a high bit (1 << 30) to indicate no new team members can
acquire the lock.  Not sure whether anybody ever coded it up.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 15:37       ` David Hildenbrand
@ 2021-11-10 16:39         ` Jason Gunthorpe
  2021-11-10 17:37           ` David Hildenbrand
  2021-11-10 16:49         ` Matthew Wilcox
  1 sibling, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 16:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:

> >  - Fully locked. Under the PTLs (gup slow is an example)
> >  - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
> >  - hw-locked. Barriered with TLB flush (gup fast is an example)
> 
> Three additions as far as I can tell:
> 
> 1. Fully locked currently needs the read mmap lock OR the rmap lock in
> >    read. PTLs on their own are not sufficient AFAICT.

I think the reality is we don't have any fully locked walkers.. Even
gup slow is semi-lockless until it reaches lower levels, then it takes
the PTLs. (I forgot about that detail!)

The mmap lock is being used to protect the higher levels of the page
table. It is part of its complicated dual purpose.

> 2. #1 and #2 can currently only walk within VMA ranges.

AFAICT this is an artifact of re-using the mmap lock to protect the
page table and not being able to obtain the mmap lock in all the
places the page table structure is manipulated.

> 3. We can theoretically walk page tables outside VMA ranges with the
> write mmap lock, because page tables get removed with the mmap lock in
> read mode and heavy-weight operations (VMA layout, khugepaged) are
> performed under the write mmap lock.

Yes, again, an artifact of the current locking.
 
> The rmap locks protect from modifications where we want to exclude rmap
> walkers similarly to how we grab the mmap lock in write, where the PTLs
> are not sufficient.

It is a similar dual purpose locking as the mmap sem :(

> > #1 should be completely safe as the PTLs will protect everything
> > #2 is safe so long as the write side is held during any layout changes
> > #3 interacts with the TLB flush, and is also safe with zap
> > 
> > rmap itself is a #1 page table walker, ie it gets the PTLs under
> > page_vma_mapped_walk().
> 
> When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?

Ah, here I was thinking about a lock that can protect all the
levels. Today we are abusing the mmap lock to act as the pud_lock, for
instance.

> Because the PMD-PTLs are usually not taken in case we know there is a
> page table (nothing would currently change it without heavy
> locking).

This only works with the lockless walkers, and relies on the read mmap
sem/etc to also mean 'a PTE table cannot become a leaf PMD'

> For example, walk_page_range() requires the mmap lock in read and grabs
> the PTE-PTLs.

Yes, that is a semi-locked reader.

> It would still be a fairly coarse-grained locking, I am not sure if that
> is a step into the right direction. If you want to modify *some* page
> table in your process you have to exclude each and every page table walker.
> Or did I mis-interpret what you were saying?

That is one possible design, it favours fast walking and penalizes
mutation. We could also stick a lock in the PMD (instead of a
refcount) and still logically be using a lock instead of a refcount
scheme. Remember modify here is "want to change a table pointer into a
leaf pointer" so it isn't an every day activity..

There is some advantage with this thinking because it harmonizes well
with the other stuff that wants to convert tables into leafs, but has
to deal with complicated locking.

On the other hand, refcounts are a degenerate kind of rwsem and only
help with freeing pages. It also puts more atomics in normal fast
paths since we are refcounting each PTE, not read locking the PMD.

Perhaps the ideal thing would be to stick an rwsem in the PMD. Read
means a table cannot become a leaf. I don't know if there is space
for another atomic in the PMD level, and we'd have to use a hitching
post/hashed waitq scheme too since there surely isn't room for a waitq
too..

I wouldn't be so quick to say one is better than the other, but at
least let's have thought about a locking solution before merging
refcounts :)

Jason


* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 14:38     ` Jason Gunthorpe
@ 2021-11-10 15:37       ` David Hildenbrand
  2021-11-10 16:39         ` Jason Gunthorpe
  2021-11-10 16:49         ` Matthew Wilcox
  0 siblings, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2021-11-10 15:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 10.11.21 15:38, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
>> On 10.11.21 13:56, Jason Gunthorpe wrote:
>>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>>
>>>> In this patch series, we add a pte_refcount field to the struct page of a
>>>> page table page to track how many users the PTE page table has. Similar to
>>>> the page refcount mechanism, a user of a PTE page table should hold a
>>>> refcount on it before accessing it. The PTE page table page will be freed
>>>> when the last refcount is dropped.
>>>
>>> So, this approach basically adds two atomics on every PTE map
>>>
>>> If I have it right the reason that zap cannot clean the PTEs today is
>>> because zap cannot obtain the mmap lock due to a lock ordering issue
>>> with the inode lock vs mmap lock.
>>
>> There are different ways to zap: madvise(DONTNEED) vs
>> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
>> coming: a process page table walker or the rmap.
> 
> AFAIK rmap is the same issue, it can't lock the mmap_sem
> 
>> The way locking currently works doesn't allow removing a page table
>> just by holding the mmap lock, not even in write mode. 
> 
> I'm not sure I understand this. If the goal is to free the PTE tables
> then the main concern is use-after-free on page table walkers (which
> this series is addressing). Ignoring bugs, we have only three ways to
> read the page table:

Yes, use-after-free and reuse-while-freeing are the two challenges AFAICS.

> 
>  - Fully locked. Under the PTLs (gup slow is an example)
>  - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
>  - hw-locked. Barriered with TLB flush (gup fast is an example)

Three additions as far as I can tell:

1. Fully locked currently needs the read mmap lock OR the rmap lock in
   read. PTLs on their own are not sufficient AFAICT.
2. #1 and #2 can currently only walk within VMA ranges.
3. We can theoretically walk page tables outside VMA ranges with the
write mmap lock, because page tables get removed with the mmap lock in
read mode and heavy-weight operations (VMA layout, khugepaged) are
performed under the write mmap lock.

The rmap locks protect from modifications where we want to exclude rmap
walkers similarly to how we grab the mmap lock in write, where the PTLs
are not sufficient.

See mm/mremap.c:move_ptes() as an example which performs VMA layout +
page table modifications. See khugepagd which doesn't perform VMA layout
modifications but page table modifications.

> 
> #1 should be completely safe as the PTLs will protect everything
> #2 is safe so long as the write side is held during any layout changes
> #3 interacts with the TLB flush, and is also safe with zap
> 
> rmap itself is a #1 page table walker, ie it gets the PTLs under
> page_vma_mapped_walk().

When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?

Because the PMD-PTLs are usually not taken in case we know there is a
page table (nothing would currently change it without heavy locking).
And if they are taken, they are only held while allocating/checking a
PMDE, not while actually *using* the page table that's mapped in that entry.

For example, walk_page_range() requires the mmap lock in read and grabs
the PTE-PTLs.

> 
> The sin we have committed here is that both the mmap lock and the PTLs
> are being used to protect the page table itself with a very
> complicated dual semantic.
> 
> Splitting the sleeping mmap lock into 'covers vma' and 'covers page
> tables' lets us solve the lock ordering and semi-locked can become
> more fully locked by the new lock, instead of by abusing mmap sem.

It would still be a fairly coarse-grained locking, I am not sure if that
is a step into the right direction. If you want to modify *some* page
table in your process you have to exclude each and every page table walker.
Or did I mis-interpret what you were saying?

> 
> I'd suggest making this new lock a special rwsem which allows either
> concurrent read access OR concurrent PTL access, but not both. This

I looked into such a lock recently in similar context and something like
that does not exist yet (and fairness will be challenging). You either
have a single writer or multiple readers. I'd be interested if someone
knows of something like that.


-- 
Thanks,

David / dhildenb



* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 13:25   ` David Hildenbrand
  2021-11-10 13:59     ` Qi Zheng
@ 2021-11-10 14:38     ` Jason Gunthorpe
  2021-11-10 15:37       ` David Hildenbrand
  1 sibling, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 14:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
> On 10.11.21 13:56, Jason Gunthorpe wrote:
> > On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> > 
> >> In this patch series, we add a pte_refcount field to the struct page of a
> >> page table page to track how many users the PTE page table has. Similar to
> >> the page refcount mechanism, a user of a PTE page table should hold a
> >> refcount on it before accessing it. The PTE page table page will be freed
> >> when the last refcount is dropped.
> > 
> > So, this approach basically adds two atomics on every PTE map
> > 
> > If I have it right the reason that zap cannot clean the PTEs today is
> > because zap cannot obtain the mmap lock due to a lock ordering issue
> > with the inode lock vs mmap lock.
> 
> There are different ways to zap: madvise(DONTNEED) vs
> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
> coming: a process page table walker or the rmap.

AFAIK rmap is the same issue, it can't lock the mmap_sem

> The way locking currently works doesn't allow removing a page table
> just by holding the mmap lock, not even in write mode. 

I'm not sure I understand this. If the goal is to free the PTE tables
then the main concern is use-after-free on page table walkers (which
this series is addressing). Ignoring bugs, we have only three ways to
read the page table:

 - Fully locked. Under the PTLs (gup slow is an example)
 - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
 - hw-locked. Barriered with TLB flush (gup fast is an example)

#1 should be completely safe as the PTLs will protect everything
#2 is safe so long as the write side is held during any layout changes
#3 interacts with the TLB flush, and is also safe with zap

rmap itself is a #1 page table walker, ie it gets the PTLs under
page_vma_mapped_walk().

The sin we have committed here is that both the mmap lock and the PTLs
are being used to protect the page table itself with a very
complicated dual semantic.

Splitting the sleeping mmap lock into 'covers vma' and 'covers page
tables' lets us solve the lock ordering and semi-locked can become
more fully locked by the new lock, instead of by abusing mmap sem.

I'd suggest making this new lock a special rwsem which allows either
concurrent read access OR concurrent PTL access, but not both. This
way we don't degrade performance of the split PTLs, *and* when
something needs to change the page table structure it has a way to
properly exclude all the #2 lockless readers.

So every touch to the page table starts by obtaining this new lock,
depending on the access mode to be used. (PTL vs lockless read)

We can keep the existing THP logic where a leaf PMD can be transformed
to a non-leaf PMD in the semi-locked case, but the case where a
non-leaf PMD is transformed to a leaf PMD has to take the lock.

Jason


* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 13:25   ` David Hildenbrand
@ 2021-11-10 13:59     ` Qi Zheng
  2021-11-10 14:38     ` Jason Gunthorpe
  1 sibling, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10 13:59 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/10/21 9:25 PM, David Hildenbrand wrote:
> On 10.11.21 13:56, Jason Gunthorpe wrote:
>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>
>>> In this patch series, we add a pte_refcount field to the struct page of a
>>> page table page to track how many users the PTE page table has. Similar to
>>> the page refcount mechanism, a user of a PTE page table should hold a
>>> refcount on it before accessing it. The PTE page table page will be freed
>>> when the last refcount is dropped.
>>
>> So, this approach basically adds two atomics on every PTE map
>>
>> If I have it right the reason that zap cannot clean the PTEs today is
>> because zap cannot obtain the mmap lock due to a lock ordering issue
>> with the inode lock vs mmap lock.
> 
> There are different ways to zap: madvise(DONTNEED) vs
> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
> comming: a process page table walker or the rmap.
> 
> The way locking currently works doesn't allow removing a page table
> just by holding the mmap lock, not even in write mode. You'll also need
> to hold the respective rmap locks -- which implies that reclaiming page
> tables crossing VMAs is "problematic". Take a look at khugepaged which
> has to play quite some tricks to remove a page table.
> 
> And there are other ways we can create empty page tables via the rmap,
> like reclaim/writeback, although they are rather a secondary concern mostly.
> 
>>
>> If it could obtain the mmap lock then it could do the zap using the
>> write side as unmapping a vma does.
>>
>> Rather than adding a new "lock" to every PTE I wonder if it would be
>> more efficient to break up the mmap lock and introduce a specific
>> rwsem for the page table itself, in addition to the PTL. Currently the
>> mmap lock is protecting both the vma list and the page table.
> 
> There is the rmap side of things as well. At least the rmap won't
> reclaim alloc/free page tables, but it will walk page tables while
> holding the respective rmap lock.
> 
>>
>> I think that would allow the lock ordering issue to be resolved and
>> zap could obtain a page table rwsem.
>>
>> Compared to two atomics per PTE this would just be two atomic per
>> page table walk operation, it is conceptually a lot simpler, and would
>> allow freeing all the page table levels, not just PTEs.
> 
> Another alternative is to not do it in the kernel automatically, but
> instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get
> called by user space explicitly once it's reasonable. While this will
> work for the obvious madvise(DONTNEED) users -- like memory allocators
> -- that zap memory, it's a bit more complicated once shared memory is
> involved and we're fallocate(PUNCH_HOLE) memory. But it would at least
> work for many use cases that want to optimize memory consumption for
> sparse memory mappings.
> 
> Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area
> will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs
> are most certainly the most important piece.
> 

Totally agree!

Thanks,
Qi


* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 12:56 ` Jason Gunthorpe
  2021-11-10 13:25   ` David Hildenbrand
@ 2021-11-10 13:54   ` Qi Zheng
  1 sibling, 0 replies; 36+ messages in thread
From: Qi Zheng @ 2021-11-10 13:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/10/21 8:56 PM, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> 
>> In this patch series, we add a pte_refcount field to the struct page of a
>> page table page to track how many users the PTE page table has. Similar to
>> the page refcount mechanism, a user of a PTE page table should hold a
>> refcount on it before accessing it. The PTE page table page will be freed
>> when the last refcount is dropped.
> 
> So, this approach basically adds two atomics on every PTE map
> 
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

Currently, both MADV_DONTNEED and MADV_FREE obtain the read side of
mmap_lock instead of write side, which is the reason that 
jemalloc/tcmalloc prefer to use madvise() to release physical memory.

> 
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.

Even if it obtains the write side of mmap_lock, how can we make sure that
all the page table entries are empty? By traversing all 512 entries every time?

> 
> Rather than adding a new "lock" to every PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

Now each level of page table has its own spin lock. Can you explain the
working mechanism of this special rwsem more clearly?

If we can reduce the protection range of mmap_lock, it is indeed a great
thing, but I think it is very difficult, and it will not solve the
problem of how to check that all entries in the page table page are
empty.

> 
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
> 
> Compared to two atomics per PTE this would just be two atomic per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

The reason why only the PTE pages are freed for now is that they are the
largest consumer. This reference count could actually be applied to other
levels of page tables as well.

> 
> ?
> 
> Jason
> 


* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 12:56 ` Jason Gunthorpe
@ 2021-11-10 13:25   ` David Hildenbrand
  2021-11-10 13:59     ` Qi Zheng
  2021-11-10 14:38     ` Jason Gunthorpe
  2021-11-10 13:54   ` Qi Zheng
  1 sibling, 2 replies; 36+ messages in thread
From: David Hildenbrand @ 2021-11-10 13:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Qi Zheng
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 10.11.21 13:56, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> 
>> In this patch series, we add a pte_refcount field to the struct page of a
>> page table page to track how many users the PTE page table has. Similar to
>> the page refcount mechanism, a user of a PTE page table should hold a
>> refcount on it before accessing it. The PTE page table page will be freed
>> when the last refcount is dropped.
> 
> So, this approach basically adds two atomics on every PTE map
> 
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

There are different ways to zap: madvise(DONTNEED) vs
fallocate(PUNCH_HOLE). It depends on "from where" we're actually
coming: a process page table walker or the rmap.

The way locking currently works doesn't allow removing a page table
just by holding the mmap lock, not even in write mode. You'll also need
to hold the respective rmap locks -- which implies that reclaiming page
tables crossing VMAs is "problematic". Take a look at khugepaged which
has to play quite some tricks to remove a page table.

And there are other ways we can create empty page tables via the rmap,
like reclaim/writeback, although they are rather a secondary concern mostly.

> 
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.
> 
> Rather than adding a new "lock" to every PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

There is the rmap side of things as well. At least the rmap won't
reclaim alloc/free page tables, but it will walk page tables while
holding the respective rmap lock.

> 
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
> 
> Compared to two atomics per PTE this would just be two atomic per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

Another alternative is to not do it in the kernel automatically, but
instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get
called by user space explicitly once it's reasonable. While this will
work for the obvious madvise(DONTNEED) users -- like memory allocators
-- that zap memory, it's a bit more complicated once shared memory is
involved and we're fallocate(PUNCH_HOLE) memory. But it would at least
work for many use cases that want to optimize memory consumption for
sparse memory mappings.

Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area
will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs
are most certainly the most important piece.
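The arithmetic behind those numbers can be checked directly (assuming x86-64's 4 KiB base pages, 2 MiB PMD coverage, and 8-byte table entries):

```c
#include <assert.h>

/* Bytes of PTE tables needed to map a VA area of the given size:
 * one 8-byte PTE per 4 KiB page. */
static unsigned long long pte_table_bytes(unsigned long long area)
{
	return area / (4ULL << 10) * 8;
}

/* Bytes of PMD tables for the same area:
 * one 8-byte PMD entry per 2 MiB covered. */
static unsigned long long pmd_table_bytes(unsigned long long area)
{
	return area / (2ULL << 20) * 8;
}
```

For a 1 TiB area this gives 2 GiB of PTE tables but only 4 MiB of PMD tables, a 512x difference, which is why the series concentrates on the PTE level.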

-- 
Thanks,

David / dhildenb



* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 10:54 Qi Zheng
@ 2021-11-10 12:56 ` Jason Gunthorpe
  2021-11-10 13:25   ` David Hildenbrand
  2021-11-10 13:54   ` Qi Zheng
  0 siblings, 2 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 12:56 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:

> In this patch series, we add a pte_refcount field to the struct page of a
> page table page to track how many users the PTE page table has. Similar to
> the page refcount mechanism, a user of a PTE page table should hold a
> refcount on it before accessing it. The PTE page table page will be freed
> when the last refcount is dropped.

So, this approach basically adds two atomics on every PTE map

If I have it right the reason that zap cannot clean the PTEs today is
because zap cannot obtain the mmap lock due to a lock ordering issue
with the inode lock vs mmap lock.

If it could obtain the mmap lock then it could do the zap using the
write side as unmapping a vma does.

Rather than adding a new "lock" to every PTE I wonder if it would be
more efficient to break up the mmap lock and introduce a specific
rwsem for the page table itself, in addition to the PTL. Currently the
mmap lock is protecting both the vma list and the page table.

I think that would allow the lock ordering issue to be resolved and
zap could obtain a page table rwsem.

Compared to two atomics per PTE this would just be two atomic per
page table walk operation, it is conceptually a lot simpler, and would
allow freeing all the page table levels, not just PTEs.

?

Jason


* [PATCH v3 00/15] Free user PTE page table pages
@ 2021-11-10 10:54 Qi Zheng
  2021-11-10 12:56 ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Hi,

This patch series aims to free user PTE page table pages when all PTE entries
are empty.

The beginning of this story is that some malloc libraries (e.g. jemalloc or
tcmalloc) usually allocate a large amount of VA space by mmap() and do not
unmap it. They use madvise(MADV_DONTNEED) to free the physical memory when
they want. But the page tables are not freed by madvise(), so a process that
touches an enormous virtual address space can produce many page tables.

The following figures are a memory usage snapshot of one process which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

As we can see, the PTE page tables occupy 110g, while the RES is 590g. In
theory, the process only needs 1.2g of PTE page tables to map that physical
memory. The reason PTE page tables occupy so much memory is that
madvise(MADV_DONTNEED) only empties the PTEs and frees the physical memory,
but doesn't free the PTE page table pages. So we can free those empty PTE
page tables to save memory. In the above case, we can save about 108g of
memory (best case). And the larger the difference between the size of VIRT
and RES, the more memory we save.

In this patch series, we add a pte_refcount field to the struct page of a
page table page to track how many users the PTE page table has. Similar to
the page refcount mechanism, a user of a PTE page table should hold a
refcount on it before accessing it. The PTE page table page will be freed
when the last refcount is dropped.
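The refcount discipline described above can be sketched in userspace C. This is a simplified illustration of the idea, not the patch code; pte_try_get()/pte_put() are hypothetical stand-ins for the series' actual helpers:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the page-table page's struct page. */
struct pte_table_page {
	atomic_int pte_refcount; /* one reference per user of the PTE table */
	bool freed;
};

/* A user must take a reference before touching the PTE table.
 * Fails once the last reference is gone (the table is being freed). */
static bool pte_try_get(struct pte_table_page *p)
{
	int v = atomic_load(&p->pte_refcount);

	while (v > 0)
		if (atomic_compare_exchange_weak(&p->pte_refcount, &v, v + 1))
			return true;
	return false;
}

static void pte_put(struct pte_table_page *p)
{
	/* Dropping the last reference frees the page table page. */
	if (atomic_fetch_sub(&p->pte_refcount, 1) == 1)
		p->freed = true;
}
```

The "two atomics on every PTE map" cost Jason raises below corresponds to the pte_try_get()/pte_put() pair around each access.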

Testing:

The following code snippet can show the effect of optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }
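The snippet above can be approximated with a runnable program (assumed details: an anonymous private mapping and 2 MiB chunks; a chunk count of 1024 * 25 reproduces the 50G case -- watch VmPTE in /proc/<pid>/status while it runs):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

/* Touch and then MADV_DONTNEED each 2 MiB chunk of a fresh anonymous
 * mapping; returns 0 on success.  Faulting in each chunk allocates PTE
 * tables, and MADV_DONTNEED frees the pages but (without this series)
 * not the tables. */
static int churn(size_t chunks)
{
	const size_t chunk = 2UL << 20; /* 2 MiB */
	char *p = mmap(NULL, chunks * chunk, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t i;

	if (p == MAP_FAILED)
		return -1;
	for (i = 0; i < chunks; i++) {
		memset(p + i * chunk, 1, chunk);
		if (madvise(p + i * chunk, chunk, MADV_DONTNEED))
			return -1;
	}
	return munmap(p, chunks * chunk);
}
```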

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.6 MB
VmPTE                102640 kB                         248 kB

I have also tested stability with LTP[1] for several weeks and have not
seen any crash so far.

Page fault performance can be affected by the allocation/freeing of PTE
page table pages. The following is the test result using a micro
benchmark[2]:

root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

threads         before (pf/min)                     after (pf/min)
    1                32,085,255                         31,880,833 (-0.64%)
    8               101,674,967                        100,588,311 (-1.17%)
   16               113,207,000                        112,801,832 (-0.36%)

(The "pf/min" column shows how many page faults occur in one minute.)

Page fault performance is ~1% slower than before.

And there are no obvious changes in perf hot spots:

before:
  19.29%  [kernel]  [k] clear_page_rep
  16.12%  [kernel]  [k] do_user_addr_fault
   9.57%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.16%  [kernel]  [k] get_page_from_freelist
   5.03%  [kernel]  [k] __handle_mm_fault
   3.53%  [kernel]  [k] __rcu_read_unlock
   3.45%  [kernel]  [k] handle_mm_fault
   3.38%  [kernel]  [k] down_read_trylock
   2.74%  [kernel]  [k] free_unref_page_list
   2.17%  [kernel]  [k] up_read
   1.93%  [kernel]  [k] charge_memcg
   1.73%  [kernel]  [k] try_charge_memcg
   1.71%  [kernel]  [k] __alloc_pages
   1.69%  [kernel]  [k] ___perf_sw_event
   1.44%  [kernel]  [k] get_mem_cgroup_from_mm

after:
  18.19%  [kernel]  [k] clear_page_rep
  16.28%  [kernel]  [k] do_user_addr_fault
   8.39%  [kernel]  [k] _raw_spin_unlock_irqrestore
   5.12%  [kernel]  [k] get_page_from_freelist
   4.81%  [kernel]  [k] __handle_mm_fault
   4.68%  [kernel]  [k] down_read_trylock
   3.80%  [kernel]  [k] handle_mm_fault
   3.59%  [kernel]  [k] get_mem_cgroup_from_mm
   2.49%  [kernel]  [k] free_unref_page_list
   2.41%  [kernel]  [k] up_read
   2.16%  [kernel]  [k] charge_memcg
   1.92%  [kernel]  [k] __rcu_read_unlock
   1.88%  [kernel]  [k] ___perf_sw_event
   1.70%  [kernel]  [k] pte_get_unless_zero

This series is based on next-20211108.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c

Changelog in v2 -> v3:
 - Refactored this patch series:
        - [PATCH v3 6/15]: Introduce the new dummy helpers first
        - [PATCH v3 7-12/15]: Convert each subsystem individually
        - [PATCH v3 13/15]: Implement the actual logic to the dummy helpers
   And thanks for the advice from David and Jason.
 - Add a document.

Changelog in v1 -> v2:
 - Change pte_install() to pmd_install().
 - Fix some typo and code style problems.
 - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9],[PATCH v2 6/9]
   and [PATCH v2 7/9].

Qi Zheng (15):
  mm: do code cleanups to filemap_map_pmd()
  mm: introduce is_huge_pmd() helper
  mm: move pte_offset_map_lock() to pgtable.h
  mm: rework the parameter of lock_page_or_retry()
  mm: add pmd_installed_type return for __pte_alloc() and other friends
  mm: introduce refcount for user PTE page table page
  mm/pte_ref: add support for user PTE page table page allocation
  mm/pte_ref: initialize the refcount of the withdrawn PTE page table
    page
  mm/pte_ref: add support for the map/unmap of user PTE page table page
  mm/pte_ref: add support for page fault path
  mm/pte_ref: take a refcount before accessing the PTE page table page
  mm/pte_ref: update the pmd entry in move_normal_pmd()
  mm/pte_ref: free user PTE page table pages
  Documentation: add document for pte_ref
  mm/pte_ref: use mmu_gather to free PTE page table pages

 Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
 arch/x86/Kconfig             |   2 +-
 fs/proc/task_mmu.c           |  24 +++-
 fs/userfaultfd.c             |   9 +-
 include/linux/huge_mm.h      |  10 +-
 include/linux/mm.h           | 170 ++++-------------------------
 include/linux/mm_types.h     |   6 +-
 include/linux/pagemap.h      |   8 +-
 include/linux/pgtable.h      | 152 +++++++++++++++++++++++++-
 include/linux/pte_ref.h      | 146 +++++++++++++++++++++++++
 include/linux/rmap.h         |   2 +
 kernel/events/uprobes.c      |   2 +
 mm/Kconfig                   |   4 +
 mm/Makefile                  |   4 +-
 mm/damon/vaddr.c             |  12 +-
 mm/debug_vm_pgtable.c        |   5 +-
 mm/filemap.c                 |  45 +++++---
 mm/gup.c                     |  25 ++++-
 mm/hmm.c                     |   5 +-
 mm/huge_memory.c             |   3 +-
 mm/internal.h                |   4 +-
 mm/khugepaged.c              |  21 +++-
 mm/ksm.c                     |   6 +-
 mm/madvise.c                 |  21 +++-
 mm/memcontrol.c              |  12 +-
 mm/memory-failure.c          |  11 +-
 mm/memory.c                  | 254 ++++++++++++++++++++++++++++++++-----------
 mm/mempolicy.c               |   6 +-
 mm/migrate.c                 |  54 ++++-----
 mm/mincore.c                 |   7 +-
 mm/mlock.c                   |   1 +
 mm/mmu_gather.c              |  40 +++----
 mm/mprotect.c                |  11 +-
 mm/mremap.c                  |  14 ++-
 mm/page_vma_mapped.c         |   4 +
 mm/pagewalk.c                |  15 ++-
 mm/pgtable-generic.c         |   1 +
 mm/pte_ref.c                 | 141 ++++++++++++++++++++++++
 mm/rmap.c                    |  10 ++
 mm/swapfile.c                |   3 +
 mm/userfaultfd.c             |  40 +++++--
 41 files changed, 1186 insertions(+), 340 deletions(-)
 create mode 100644 Documentation/vm/pte_ref.rst
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

-- 
2.11.0



end of thread, other threads:[~2021-11-11 13:01 UTC | newest]

Thread overview: 36+ messages
2021-11-10  8:40 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
2021-11-10  8:40 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
2021-11-10  8:40 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
2021-11-10 12:29   ` Jason Gunthorpe
2021-11-10 12:58     ` Qi Zheng
2021-11-10 12:59       ` Jason Gunthorpe
2021-11-10  8:40 ` [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h Qi Zheng
2021-11-10  8:40 ` [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry() Qi Zheng
2021-11-10  8:40 ` [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends Qi Zheng
2021-11-10  8:40 ` [PATCH v3 06/15] mm: introduce refcount for user PTE page table page Qi Zheng
2021-11-10  8:40 ` [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation Qi Zheng
2021-11-10  8:40 ` [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page Qi Zheng
2021-11-10  8:40 ` [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user " Qi Zheng
2021-11-10  8:52 ` [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
2021-11-10 10:54 Qi Zheng
2021-11-10 12:56 ` Jason Gunthorpe
2021-11-10 13:25   ` David Hildenbrand
2021-11-10 13:59     ` Qi Zheng
2021-11-10 14:38     ` Jason Gunthorpe
2021-11-10 15:37       ` David Hildenbrand
2021-11-10 16:39         ` Jason Gunthorpe
2021-11-10 17:37           ` David Hildenbrand
2021-11-10 17:49             ` Jason Gunthorpe
2021-11-11  3:58             ` Qi Zheng
2021-11-11  9:22               ` David Hildenbrand
2021-11-11 11:08                 ` Qi Zheng
2021-11-11 11:19                   ` David Hildenbrand
2021-11-11 12:00                     ` Qi Zheng
2021-11-11 12:20                       ` David Hildenbrand
2021-11-11 12:32                         ` Qi Zheng
2021-11-11 12:51                           ` David Hildenbrand
2021-11-11 13:01                             ` Qi Zheng
2021-11-10 16:49         ` Matthew Wilcox
2021-11-10 16:53           ` David Hildenbrand
2021-11-10 16:56             ` Jason Gunthorpe
2021-11-10 13:54   ` Qi Zheng
