* [RFC PATCH 00/18] Try to free user PTE page table pages
@ 2022-04-29 13:35 Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call Qi Zheng
                   ` (18 more replies)
  0 siblings, 19 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Hi,

This patch series aims to free user PTE page table pages when no one is
using them.

The beginning of this story is that some malloc libraries (e.g. jemalloc or
tcmalloc) usually reserve a large amount of virtual address space with mmap()
and do not unmap it. When they want to release physical memory, they use
madvise(MADV_DONTNEED) instead. But madvise() does not free the page tables,
so a process that touches an enormous virtual address space can end up with
a huge number of page tables.

The following figures are a memory usage snapshot of one process which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

As we can see, the PTE page tables take 110g while RES is only 590g. In
theory, the process only needs about 1.2g of PTE page tables to map that
physical memory. The reason the PTE page tables occupy so much memory is
that madvise(MADV_DONTNEED) only clears the PTEs and frees the physical
memory, but does not free the PTE page table pages themselves. So we can
free those empty PTE page tables to save memory. In the above case, about
108g can be saved (best case), and the larger the difference between VIRT
and RES, the more memory we save.

In this patch series, we add a pte_ref field to the struct page of a PTE
page table page to track how many users the page has. Similar to the page
refcount mechanism, a user of a PTE page table must hold a refcount on it
before accessing it. The user PTE page table page may be freed when the last
refcount is dropped.

Different from my earlier patchset[1], the pte_ref is now a struct
percpu_ref. We switch it to atomic mode only in cases such as MADV_DONTNEED
and MADV_FREE that may clear user PTE page table entries, and then release
the user PTE page table page once we observe that the pte_ref is 0. The
advantage of this is that percpu mode has essentially no performance
overhead, while the empty PTE page tables can still be freed. In addition,
the implementation of this patchset is much simpler and more portable than
the earlier one[1].
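
To make the idea concrete, here is a minimal sketch (not the actual patch
code) of how a MADV_DONTNEED-style path could use the percpu_ref. It assumes
the pte_ref field introduced later in this series, and omits the locking,
pmd revalidation and TLB flushing that the real implementation needs:

	/*
	 * Hedged sketch only: each PTE page table page is assumed to carry
	 * a "struct percpu_ref *pte_ref"; locking and TLB flushing omitted.
	 */
	static void try_to_free_pte_sketch(struct mm_struct *mm, pmd_t *pmd)
	{
		pgtable_t page = pmd_pgtable(*pmd);

		/* Fold the per-cpu counters so the value can be checked. */
		percpu_ref_switch_to_atomic_sync(page->pte_ref);

		if (percpu_ref_is_zero(page->pte_ref)) {
			/* No mapped entries and no visitors: reclaim it. */
			pmd_clear(pmd);
			mm_dec_nr_ptes(mm);
			pte_free(mm, page);
		} else {
			/* Still in use: go back to cheap percpu mode. */
			percpu_ref_switch_to_percpu(page->pte_ref);
		}
	}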

Testing:

The following code snippet shows the effect of the optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 kB                          96 kB
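
For reference, the following is a minimal userspace sketch of the pseudocode
above (same 50G mapping and 2M stride; error handling kept to a minimum):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE	(50UL << 30)	/* 50 GiB of virtual address space */
#define STEP	(2UL << 20)	/* touch and drop 2 MiB at a time */

int main(void)
{
	char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	unsigned long i;

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	while (1) {
		for (i = 0; i < 1024 * 25; i++) {
			memset(buf + i * STEP, 0, STEP);	      /* touch 2M */
			madvise(buf + i * STEP, STEP, MADV_DONTNEED); /* free 2M */
		}
	}
	return 0;
}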

I have also tested stability with LTP[2] for several weeks and have not seen
any crash so far.

This series is based on v5.18-rc2.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://patchwork.kernel.org/project/linux-mm/cover/20211110105428.32458-1-zhengqi.arch@bytedance.com/
[2] https://github.com/linux-test-project/ltp

Qi Zheng (18):
  x86/mm/encrypt: add the missing pte_unmap() call
  percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync()
    returns
  percpu_ref: make percpu_ref_switch_lock per percpu_ref
  mm: convert to use ptep_clear() in pte_clear_not_present_full()
  mm: split the related definitions of pte_offset_map_lock() into
    pgtable.h
  mm: introduce CONFIG_FREE_USER_PTE
  mm: add pte_to_page() helper
  mm: introduce percpu_ref for user PTE page table page
  pte_ref: add pte_tryget() and {__,}pte_put() helper
  mm: add pte_tryget_map{_lock}() helper
  mm: convert to use pte_tryget_map_lock()
  mm: convert to use pte_tryget_map()
  mm: add try_to_free_user_pte() helper
  mm: use try_to_free_user_pte() in MADV_DONTNEED case
  mm: use try_to_free_user_pte() in MADV_FREE case
  pte_ref: add track_pte_{set, clear}() helper
  x86/mm: add x86_64 support for pte_ref
  Documentation: add document for pte_ref

 Documentation/vm/index.rst         |   1 +
 Documentation/vm/pte_ref.rst       | 210 ++++++++++++++++++++++++++
 arch/x86/Kconfig                   |   1 +
 arch/x86/include/asm/pgtable.h     |   7 +-
 arch/x86/mm/mem_encrypt_identity.c |  10 +-
 fs/proc/task_mmu.c                 |  16 +-
 fs/userfaultfd.c                   |  10 +-
 include/linux/mm.h                 | 162 ++------------------
 include/linux/mm_types.h           |   1 +
 include/linux/percpu-refcount.h    |   6 +-
 include/linux/pgtable.h            | 196 +++++++++++++++++++++++-
 include/linux/pte_ref.h            |  73 +++++++++
 include/linux/rmap.h               |   2 +
 include/linux/swapops.h            |   4 +-
 kernel/events/core.c               |   5 +-
 lib/percpu-refcount.c              |  86 +++++++----
 mm/Kconfig                         |  10 ++
 mm/Makefile                        |   2 +-
 mm/damon/vaddr.c                   |  30 ++--
 mm/debug_vm_pgtable.c              |   2 +-
 mm/filemap.c                       |   4 +-
 mm/gup.c                           |  20 ++-
 mm/hmm.c                           |   9 +-
 mm/huge_memory.c                   |   4 +-
 mm/internal.h                      |   3 +-
 mm/khugepaged.c                    |  18 ++-
 mm/ksm.c                           |   4 +-
 mm/madvise.c                       |  35 +++--
 mm/memcontrol.c                    |   8 +-
 mm/memory-failure.c                |  15 +-
 mm/memory.c                        | 187 +++++++++++++++--------
 mm/mempolicy.c                     |   4 +-
 mm/migrate.c                       |   8 +-
 mm/migrate_device.c                |  22 ++-
 mm/mincore.c                       |   5 +-
 mm/mlock.c                         |   5 +-
 mm/mprotect.c                      |   4 +-
 mm/mremap.c                        |  10 +-
 mm/oom_kill.c                      |   3 +-
 mm/page_table_check.c              |   2 +-
 mm/page_vma_mapped.c               |  59 +++++++-
 mm/pagewalk.c                      |   6 +-
 mm/pte_ref.c                       | 230 +++++++++++++++++++++++++++++
 mm/rmap.c                          |   9 ++
 mm/swap_state.c                    |   4 +-
 mm/swapfile.c                      |  18 ++-
 mm/userfaultfd.c                   |  11 +-
 mm/vmalloc.c                       |   2 +-
 48 files changed, 1203 insertions(+), 340 deletions(-)
 create mode 100644 Documentation/vm/pte_ref.rst
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

-- 
2.20.1



* [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns Qi Zheng
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

The paired pte_unmap() call is missing before sme_populate_pgd() returns.
Although this code only runs under CONFIG_X86_64, where pte_unmap() is
currently a no-op, add the paired pte_unmap() call for the correctness of
the code semantics.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/mm/mem_encrypt_identity.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index b43bc24d2bb6..6d323230320a 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -190,6 +190,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
 	pte = pte_offset_map(pmd, ppd->vaddr);
 	if (pte_none(*pte))
 		set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
+	pte_unmap(pte);
 }
 
 static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
-- 
2.20.1



* [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref Qi Zheng
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

In percpu_ref_call_confirm_rcu(), we call wake_up_all() before calling
percpu_ref_put(), which causes the value of the percpu_ref to still be
unstable when percpu_ref_switch_to_atomic_sync() returns.

	CPU0				CPU1

percpu_ref_switch_to_atomic_sync(&ref)
--> percpu_ref_switch_to_atomic(&ref)
    --> percpu_ref_get(ref);	/* put after confirmation */
	call_rcu(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu);

					percpu_ref_switch_to_atomic_rcu
					--> percpu_ref_call_confirm_rcu
					    --> data->confirm_switch = NULL;
						wake_up_all(&percpu_ref_switch_waitq);

    /* here waiting to wake up */
    wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch);
						(A)percpu_ref_put(ref);
/* The value of &ref is unstable! */
percpu_ref_is_zero(&ref)
						(B)percpu_ref_put(ref);

As shown above, assuming that the counts on each cpu add up to 0 before
calling percpu_ref_switch_to_atomic_sync(), we expect percpu_ref_is_zero()
to return true after the switch to atomic mode. But it actually returns
different values depending on whether the final put happens at point (A)
or point (B), which is not what we expect.

There are currently two users of percpu_ref_switch_to_atomic_sync() in the
kernel:

	i. mddev->writes_pending in drivers/md/md.c
	ii. q->q_usage_counter in block/blk-pm.c

Both follow the pattern shown above. In the worst case, percpu_ref_is_zero()
may fail to hold every time because of case (B). While this is unlikely to
occur in a production environment, it is still a problem.
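
For illustration, a minimal sketch of the pattern that both callers rely on
(generic names, not the actual md or blk-pm code):

	static bool drain_and_check(struct percpu_ref *ref)
	{
		percpu_ref_switch_to_atomic_sync(ref);

		/*
		 * Without this patch, the percpu_ref_put() in the RCU
		 * callback may still be pending here, so this check can
		 * observe either case (A) or case (B) from the diagram
		 * above.
		 */
		return percpu_ref_is_zero(ref);
	}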

This patch moves the percpu_ref_put() out of the RCU handler and calls it
after wait_event(), which makes the ref stable once
percpu_ref_switch_to_atomic_sync() returns. In the example above,
percpu_ref_is_zero() then sees a steady value of 0, which is what we expect.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/percpu-refcount.h |  4 ++-
 lib/percpu-refcount.c           | 56 +++++++++++++++++++++++----------
 2 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d73a1c08c3e3..75844939a965 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -98,6 +98,7 @@ struct percpu_ref_data {
 	percpu_ref_func_t	*confirm_switch;
 	bool			force_atomic:1;
 	bool			allow_reinit:1;
+	bool			sync:1;
 	struct rcu_head		rcu;
 	struct percpu_ref	*ref;
 };
@@ -123,7 +124,8 @@ int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 gfp_t gfp);
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-				 percpu_ref_func_t *confirm_switch);
+				 percpu_ref_func_t *confirm_switch,
+				 bool sync);
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index af9302141bcf..3a8906715e09 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -99,6 +99,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 	data->release = release;
 	data->confirm_switch = NULL;
 	data->ref = ref;
+	data->sync = false;
 	ref->data = data;
 	return 0;
 }
@@ -146,21 +147,33 @@ void percpu_ref_exit(struct percpu_ref *ref)
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
 
+static inline void percpu_ref_switch_to_atomic_post(struct percpu_ref *ref)
+{
+	struct percpu_ref_data *data = ref->data;
+
+	if (!data->allow_reinit)
+		__percpu_ref_exit(ref);
+
+	/* drop ref from percpu_ref_switch_to_atomic() */
+	percpu_ref_put(ref);
+}
+
 static void percpu_ref_call_confirm_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref_data *data = container_of(rcu,
 			struct percpu_ref_data, rcu);
 	struct percpu_ref *ref = data->ref;
+	bool need_put = true;
+
+	if (data->sync)
+		need_put = data->sync = false;
 
 	data->confirm_switch(ref);
 	data->confirm_switch = NULL;
 	wake_up_all(&percpu_ref_switch_waitq);
 
-	if (!data->allow_reinit)
-		__percpu_ref_exit(ref);
-
-	/* drop ref from percpu_ref_switch_to_atomic() */
-	percpu_ref_put(ref);
+	if (need_put)
+		percpu_ref_switch_to_atomic_post(ref);
 }
 
 static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
@@ -210,14 +223,19 @@ static void percpu_ref_noop_confirm_switch(struct percpu_ref *ref)
 }
 
 static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-					  percpu_ref_func_t *confirm_switch)
+					  percpu_ref_func_t *confirm_switch,
+					  bool sync)
 {
 	if (ref->percpu_count_ptr & __PERCPU_REF_ATOMIC) {
 		if (confirm_switch)
 			confirm_switch(ref);
+		if (sync)
+			percpu_ref_get(ref);
 		return;
 	}
 
+	ref->data->sync = sync;
+
 	/* switching from percpu to atomic */
 	ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
 
@@ -232,13 +250,16 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 	call_rcu(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu);
 }
 
-static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
+static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref, bool sync)
 {
 	unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
 	int cpu;
 
 	BUG_ON(!percpu_count);
 
+	if (sync)
+		percpu_ref_get(ref);
+
 	if (!(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC))
 		return;
 
@@ -261,7 +282,8 @@ static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 }
 
 static void __percpu_ref_switch_mode(struct percpu_ref *ref,
-				     percpu_ref_func_t *confirm_switch)
+				     percpu_ref_func_t *confirm_switch,
+				     bool sync)
 {
 	struct percpu_ref_data *data = ref->data;
 
@@ -276,9 +298,9 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 			    percpu_ref_switch_lock);
 
 	if (data->force_atomic || percpu_ref_is_dying(ref))
-		__percpu_ref_switch_to_atomic(ref, confirm_switch);
+		__percpu_ref_switch_to_atomic(ref, confirm_switch, sync);
 	else
-		__percpu_ref_switch_to_percpu(ref);
+		__percpu_ref_switch_to_percpu(ref, sync);
 }
 
 /**
@@ -302,14 +324,15 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
  * switching to atomic mode, this function can be called from any context.
  */
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
-				 percpu_ref_func_t *confirm_switch)
+				 percpu_ref_func_t *confirm_switch,
+				 bool sync)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = true;
-	__percpu_ref_switch_mode(ref, confirm_switch);
+	__percpu_ref_switch_mode(ref, confirm_switch, sync);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
@@ -325,8 +348,9 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
  */
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
 {
-	percpu_ref_switch_to_atomic(ref, NULL);
+	percpu_ref_switch_to_atomic(ref, NULL, true);
 	wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch);
+	percpu_ref_switch_to_atomic_post(ref);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
 
@@ -355,7 +379,7 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = false;
-	__percpu_ref_switch_mode(ref, NULL);
+	__percpu_ref_switch_mode(ref, NULL, false);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
@@ -390,7 +414,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 		  ref->data->release);
 
 	ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
-	__percpu_ref_switch_mode(ref, confirm_kill);
+	__percpu_ref_switch_mode(ref, confirm_kill, false);
 	percpu_ref_put(ref);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
@@ -470,7 +494,7 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 
 	ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
 	percpu_ref_get(ref);
-	__percpu_ref_switch_mode(ref, NULL);
+	__percpu_ref_switch_mode(ref, NULL, false);
 
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
-- 
2.20.1



* [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full() Qi Zheng
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Currently, percpu_ref uses the global percpu_ref_switch_lock to
protect the mode switching operation. When multiple percpu_refs
perform mode switching at the same time, this lock can become a
performance bottleneck.

This patch introduces a per-percpu_ref percpu_ref_switch_lock to
fix this.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/percpu-refcount.h |  2 ++
 lib/percpu-refcount.c           | 30 +++++++++++++++---------------
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 75844939a965..eb8695e578fd 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -110,6 +110,8 @@ struct percpu_ref {
 	 */
 	unsigned long		percpu_count_ptr;
 
+	spinlock_t percpu_ref_switch_lock;
+
 	/*
 	 * 'percpu_ref' is often embedded into user structure, and only
 	 * 'percpu_count_ptr' is required in fast path, move other fields
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 3a8906715e09..4336fd1bd77a 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -36,7 +36,6 @@
 
 #define PERCPU_COUNT_BIAS	(1LU << (BITS_PER_LONG - 1))
 
-static DEFINE_SPINLOCK(percpu_ref_switch_lock);
 static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq);
 
 static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
@@ -95,6 +94,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 		start_count++;
 
 	atomic_long_set(&data->count, start_count);
+	spin_lock_init(&ref->percpu_ref_switch_lock);
 
 	data->release = release;
 	data->confirm_switch = NULL;
@@ -137,11 +137,11 @@ void percpu_ref_exit(struct percpu_ref *ref)
 	if (!data)
 		return;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 	ref->percpu_count_ptr |= atomic_long_read(&ref->data->count) <<
 		__PERCPU_REF_FLAG_BITS;
 	ref->data = NULL;
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 
 	kfree(data);
 }
@@ -287,7 +287,7 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 {
 	struct percpu_ref_data *data = ref->data;
 
-	lockdep_assert_held(&percpu_ref_switch_lock);
+	lockdep_assert_held(&ref->percpu_ref_switch_lock);
 
 	/*
 	 * If the previous ATOMIC switching hasn't finished yet, wait for
@@ -295,7 +295,7 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
 	 * isn't in progress, this function can be called from any context.
 	 */
 	wait_event_lock_irq(percpu_ref_switch_waitq, !data->confirm_switch,
-			    percpu_ref_switch_lock);
+			    ref->percpu_ref_switch_lock);
 
 	if (data->force_atomic || percpu_ref_is_dying(ref))
 		__percpu_ref_switch_to_atomic(ref, confirm_switch, sync);
@@ -329,12 +329,12 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = true;
 	__percpu_ref_switch_mode(ref, confirm_switch, sync);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
 
@@ -376,12 +376,12 @@ void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	ref->data->force_atomic = false;
 	__percpu_ref_switch_mode(ref, NULL, false);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
 
@@ -407,7 +407,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 {
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	WARN_ONCE(percpu_ref_is_dying(ref),
 		  "%s called more than once on %ps!", __func__,
@@ -417,7 +417,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 	__percpu_ref_switch_mode(ref, confirm_kill, false);
 	percpu_ref_put(ref);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
 
@@ -438,12 +438,12 @@ bool percpu_ref_is_zero(struct percpu_ref *ref)
 		return false;
 
 	/* protect us from being destroyed */
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 	if (ref->data)
 		count = atomic_long_read(&ref->data->count);
 	else
 		count = ref->percpu_count_ptr >> __PERCPU_REF_FLAG_BITS;
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 
 	return count == 0;
 }
@@ -487,7 +487,7 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 	unsigned long __percpu *percpu_count;
 	unsigned long flags;
 
-	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	spin_lock_irqsave(&ref->percpu_ref_switch_lock, flags);
 
 	WARN_ON_ONCE(!percpu_ref_is_dying(ref));
 	WARN_ON_ONCE(__ref_is_percpu(ref, &percpu_count));
@@ -496,6 +496,6 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
 	percpu_ref_get(ref);
 	__percpu_ref_switch_mode(ref, NULL, false);
 
-	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+	spin_unlock_irqrestore(&ref->percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
-- 
2.20.1



* [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full()
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (2 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h Qi Zheng
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

After commit 08d5b29eac7d ("mm: ptep_clear() page table helper"),
ptep_clear() can be used to track the clearing of PTE page table
entries, but pte_clear_not_present_full() is not covered. Convert
it to use ptep_clear() as well; we will need this call in
subsequent patches.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pgtable.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..bed9a559d45b 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -423,7 +423,7 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
 					      pte_t *ptep,
 					      int full)
 {
-	pte_clear(mm, address, ptep);
+	ptep_clear(mm, address, ptep);
 }
 #endif
 
-- 
2.20.1



* [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (3 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full() Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE Qi Zheng
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

pte_offset_map_lock() and its friend pte_offset_map() live in mm.h
and pgtable.h respectively; it would be better to have them in one file.
Since they are all helper functions related to page tables, move
pte_offset_map_lock() to pgtable.h.

pte_lockptr() is required by pte_offset_map_lock(), so move it and
its friends {pmd,pud}_lockptr() to pgtable.h as well.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h      | 149 ----------------------------------------
 include/linux/pgtable.h | 149 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+), 149 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e34edb775334..0afd3b097e90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2252,70 +2252,6 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU */
 
-#if USE_SPLIT_PTE_PTLOCKS
-#if ALLOC_SPLIT_PTLOCKS
-void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
-extern void ptlock_free(struct page *page);
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return page->ptl;
-}
-#else /* ALLOC_SPLIT_PTLOCKS */
-static inline void ptlock_cache_init(void)
-{
-}
-
-static inline bool ptlock_alloc(struct page *page)
-{
-	return true;
-}
-
-static inline void ptlock_free(struct page *page)
-{
-}
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return &page->ptl;
-}
-#endif /* ALLOC_SPLIT_PTLOCKS */
-
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_page(*pmd));
-}
-
-static inline bool ptlock_init(struct page *page)
-{
-	/*
-	 * prep_new_page() initialize page->private (and therefore page->ptl)
-	 * with 0. Make sure nobody took it in use in between.
-	 *
-	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache, which share storage with page->ptl.
-	 */
-	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-	if (!ptlock_alloc(page))
-		return false;
-	spin_lock_init(ptlock_ptr(page));
-	return true;
-}
-
-#else	/* !USE_SPLIT_PTE_PTLOCKS */
-/*
- * We use mm->page_table_lock to guard all pagetable pages of the mm.
- */
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
-static inline void ptlock_free(struct page *page) {}
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
 static inline void pgtable_init(void)
 {
 	ptlock_cache_init();
@@ -2338,20 +2274,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
-({							\
-	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
-	pte_t *__pte = pte_offset_map(pmd, address);	\
-	*(ptlp) = __ptl;				\
-	spin_lock(__ptl);				\
-	__pte;						\
-})
-
-#define pte_unmap_unlock(pte, ptl)	do {		\
-	spin_unlock(ptl);				\
-	pte_unmap(pte);					\
-} while (0)
-
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
 #define pte_alloc_map(mm, pmd, address)			\
@@ -2365,58 +2287,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
 		NULL: pte_offset_kernel(pmd, address))
 
-#if USE_SPLIT_PMD_PTLOCKS
-
-static struct page *pmd_to_page(pmd_t *pmd)
-{
-	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-	return virt_to_page((void *)((unsigned long) pmd & mask));
-}
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_to_page(pmd));
-}
-
-static inline bool pmd_ptlock_init(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page->pmd_huge_pte = NULL;
-#endif
-	return ptlock_init(page);
-}
-
-static inline void pmd_ptlock_free(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
-#endif
-	ptlock_free(page);
-}
-
-#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
-
-#else
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
-
-#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
-
-#endif
-
-static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
-{
-	spinlock_t *ptl = pmd_lockptr(mm, pmd);
-	spin_lock(ptl);
-	return ptl;
-}
-
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
 	if (!pmd_ptlock_init(page))
@@ -2433,25 +2303,6 @@ static inline void pgtable_pmd_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-/*
- * No scalability reason to split PUD locks yet, but follow the same pattern
- * as the PMD locks to make it easier if we decide to.  The VM should not be
- * considered ready to switch to split PUD locks yet; there may be places
- * which need to be converted from page_table_lock.
- */
-static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
-{
-	return &mm->page_table_lock;
-}
-
-static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
-{
-	spinlock_t *ptl = pud_lockptr(mm, pud);
-
-	spin_lock(ptl);
-	return ptl;
-}
-
 extern void __init pagecache_init(void);
 extern void free_initmem(void);
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index bed9a559d45b..0928acca6b48 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -85,6 +85,141 @@ static inline unsigned long pud_index(unsigned long address)
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
 #endif
 
+#if USE_SPLIT_PTE_PTLOCKS
+#if ALLOC_SPLIT_PTLOCKS
+void __init ptlock_cache_init(void);
+extern bool ptlock_alloc(struct page *page);
+extern void ptlock_free(struct page *page);
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return page->ptl;
+}
+#else /* ALLOC_SPLIT_PTLOCKS */
+static inline void ptlock_cache_init(void)
+{
+}
+
+static inline bool ptlock_alloc(struct page *page)
+{
+	return true;
+}
+
+static inline void ptlock_free(struct page *page)
+{
+}
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return &page->ptl;
+}
+#endif /* ALLOC_SPLIT_PTLOCKS */
+
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_page(*pmd));
+}
+
+static inline bool ptlock_init(struct page *page)
+{
+	/*
+	 * prep_new_page() initialize page->private (and therefore page->ptl)
+	 * with 0. Make sure nobody took it in use in between.
+	 *
+	 * It can happen if arch try to use slab for page table allocation:
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
+	 */
+	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
+	if (!ptlock_alloc(page))
+		return false;
+	spin_lock_init(ptlock_ptr(page));
+	return true;
+}
+
+#else	/* !USE_SPLIT_PTE_PTLOCKS */
+/*
+ * We use mm->page_table_lock to guard all pagetable pages of the mm.
+ */
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+static inline void ptlock_cache_init(void) {}
+static inline bool ptlock_init(struct page *page) { return true; }
+static inline void ptlock_free(struct page *page) {}
+#endif /* USE_SPLIT_PTE_PTLOCKS */
+
+#if USE_SPLIT_PMD_PTLOCKS
+
+static struct page *pmd_to_page(pmd_t *pmd)
+{
+	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+	return virt_to_page((void *)((unsigned long) pmd & mask));
+}
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_to_page(pmd));
+}
+
+static inline bool pmd_ptlock_init(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page->pmd_huge_pte = NULL;
+#endif
+	return ptlock_init(page);
+}
+
+static inline void pmd_ptlock_free(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+#endif
+	ptlock_free(page);
+}
+
+#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
+
+#else /* !USE_SPLIT_PMD_PTLOCKS */
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+
+static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline void pmd_ptlock_free(struct page *page) {}
+
+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
+
+#endif /* USE_SPLIT_PMD_PTLOCKS */
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl = pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
+	return ptl;
+}
+
+/*
+ * No scalability reason to split PUD locks yet, but follow the same pattern
+ * as the PMD locks to make it easier if we decide to.  The VM should not be
+ * considered ready to switch to split PUD locks yet; there may be places
+ * which need to be converted from page_table_lock.
+ */
+static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
+{
+	return &mm->page_table_lock;
+}
+
+static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
+{
+	spinlock_t *ptl = pud_lockptr(mm, pud);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #ifndef pte_offset_kernel
 static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 {
@@ -103,6 +238,20 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_unmap(pte) ((void)(pte))	/* NOP */
 #endif
 
+#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
+({							\
+	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
+	pte_t *__pte = pte_offset_map(pmd, address);	\
+	*(ptlp) = __ptl;				\
+	spin_lock(__ptl);				\
+	__pte;						\
+})
+
+#define pte_unmap_unlock(pte, ptl)	do {		\
+	spin_unlock(ptl);				\
+	pte_unmap(pte);					\
+} while (0)
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
-- 
2.20.1



* [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (4 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 07/18] mm: add pte_to_page() helper Qi Zheng
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

This configuration option will be used to build the code needed to
free user PTE page table pages.

The PTE page table setting and clearing functions (such as set_pte_at())
live in the architecture's files, and these functions will be hooked to
implement FREE_USER_PTE, so architecture support is needed.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..af99ed626732 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -909,6 +909,16 @@ config ANON_VMA_NAME
 	  area from being merged with adjacent virtual memory areas due to the
 	  difference in their name.
 
+config ARCH_SUPPORTS_FREE_USER_PTE
+	def_bool n
+
+config FREE_USER_PTE
+	bool "Free user PTE page tables"
+	default y
+	depends on ARCH_SUPPORTS_FREE_USER_PTE && MMU && SMP
+	help
+	  Try to free user PTE page table page when its all entries are none.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.20.1



* [RFC PATCH 07/18] mm: add pte_to_page() helper
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (5 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page Qi Zheng
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Add a pte_to_page() helper, similar to pmd_to_page(), which
will be used to get the struct page of a PTE page table page.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pgtable.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0928acca6b48..d1218cb1013e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -85,6 +85,14 @@ static inline unsigned long pud_index(unsigned long address)
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
 #endif
 
+#ifdef CONFIG_FREE_USER_PTE
+static inline struct page *pte_to_page(pte_t *pte)
+{
+	unsigned long mask = ~(PTRS_PER_PTE * sizeof(pte_t) - 1);
+	return virt_to_page((void *)((unsigned long) pte & mask));
+}
+#endif
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
-- 
2.20.1



* [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (6 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 07/18] mm: add pte_to_page() helper Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper Qi Zheng
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

In pursuit of high performance, applications mostly use high-performance
user-mode memory allocators, such as jemalloc or tcmalloc. These memory
allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical
memory for the following reasons::

 First of all, we should take the write lock of mmap_lock as rarely as
 possible, since the mmap_lock semaphore has long been a contention point
 in the memory management subsystem. mmap()/munmap() take the write lock,
 while madvise(MADV_DONTNEED or MADV_FREE) takes the read lock, so using
 madvise() instead of munmap() to release physical memory reduces
 contention on the mmap_lock.

 Secondly, after using madvise() to release physical memory, there is no
 need to build a VMA and allocate page tables again when the same virtual
 address is accessed again, which also saves some time.

The following is the largest user PTE page table memory that can be
allocated by a single user process in a 32-bit and a 64-bit system.

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(for 32-bit, take 3G user address space, 4K page size as an example;
 for 64-bit, take 48-bit address width, 4K page size as an example.)

With madvise(), everything looks good, but as can be seen from the above
table, a single process can create a large number of PTE page tables on a
64-bit system, since neither MADV_DONTNEED nor MADV_FREE releases page table
memory. Until the process exits or calls munmap(), the kernel cannot reclaim
these pages even if the PTE page tables map nothing.

To fix this situation, this patchset introduces a percpu_ref for each user
PTE page table page. The following hold a percpu_ref::

 Any !pte_none() entry, such as a regular page table entry that maps a
 physical page, or a swap entry, or a migration entry, etc.

 Any visitor to the PTE page table entries, such as a page table walker.

Any ``!pte_none()`` entry and any visitor can be regarded as a user of its
PTE page table page. When the percpu_ref drops to 0 (we need to switch to
atomic mode first to check this), no one is using the PTE page table page
anymore, and the empty PTE page table page can be reclaimed.
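
As an illustration of this rule, a hedged sketch (not the patch code; the
real hooks are the track_pte_{set,clear}() helpers added later in this
series) of how setting and clearing an entry would get and put the page's
percpu_ref:

	static inline void sketch_track_pte_set(pte_t *ptep, pte_t pte)
	{
		if (!pte_none(pte))	/* a new !pte_none() entry appears */
			percpu_ref_get(pte_to_page(ptep)->pte_ref);
	}

	static inline void sketch_track_pte_clear(pte_t *ptep, pte_t old_pte)
	{
		if (!pte_none(old_pte))	/* a !pte_none() entry goes away */
			percpu_ref_put(pte_to_page(ptep)->pte_ref);
	}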

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h       |  9 +++++++-
 include/linux/mm_types.h |  1 +
 include/linux/pte_ref.h  | 29 +++++++++++++++++++++++++
 mm/Makefile              |  2 +-
 mm/pte_ref.c             | 47 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 86 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0afd3b097e90..1a6bc79c351b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/pgtable.h>
 #include <linux/kasan.h>
+#include <linux/pte_ref.h>
 
 struct mempolicy;
 struct anon_vma;
@@ -2260,11 +2261,16 @@ static inline void pgtable_init(void)
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
 {
-	if (!ptlock_init(page))
+	if (!pte_ref_init(page))
 		return false;
+	if (!ptlock_init(page))
+		goto free_pte_ref;
 	__SetPageTable(page);
 	inc_lruvec_page_state(page, NR_PAGETABLE);
 	return true;
+free_pte_ref:
+	pte_ref_free(page);
+	return false;
 }
 
 static inline void pgtable_pte_page_dtor(struct page *page)
@@ -2272,6 +2278,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	ptlock_free(page);
 	__ClearPageTable(page);
 	dec_lruvec_page_state(page, NR_PAGETABLE);
+	pte_ref_free(page);
 }
 
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..650bfb22b0e2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -153,6 +153,7 @@ struct page {
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
 				atomic_t pt_frag_refcount; /* powerpc */
+				struct percpu_ref *pte_ref; /* PTE page only */
 			};
 #if ALLOC_SPLIT_PTLOCKS
 			spinlock_t *ptl;
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..d3963a151ca5
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,29 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#ifdef CONFIG_FREE_USER_PTE
+
+bool pte_ref_init(pgtable_t pte);
+void pte_ref_free(pgtable_t pte);
+
+#else /* !CONFIG_FREE_USER_PTE */
+
+static inline bool pte_ref_init(pgtable_t pte)
+{
+	return true;
+}
+
+static inline void pte_ref_free(pgtable_t pte)
+{
+}
+
+#endif /* CONFIG_FREE_USER_PTE */
+
+#endif /* _LINUX_PTE_REF_H */
diff --git a/mm/Makefile b/mm/Makefile
index 4cc13f3179a5..b9711510f84f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -54,7 +54,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o mmap_lock.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o $(mmu-y) pte_ref.o
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..52e31be00de4
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+#include <linux/pgtable.h>
+#include <linux/pte_ref.h>
+#include <linux/percpu-refcount.h>
+#include <linux/slab.h>
+
+#ifdef CONFIG_FREE_USER_PTE
+
+static void no_op(struct percpu_ref *r) {}
+
+bool pte_ref_init(pgtable_t pte)
+{
+	struct percpu_ref *pte_ref;
+
+	pte_ref = kmalloc(sizeof(struct percpu_ref), GFP_KERNEL);
+	if (!pte_ref)
+		return false;
+	if (percpu_ref_init(pte_ref, no_op,
+			    PERCPU_REF_ALLOW_REINIT, GFP_KERNEL) < 0)
+		goto free_ref;
+	/* We want to start with the refcount at zero */
+	percpu_ref_put(pte_ref);
+
+	pte->pte_ref = pte_ref;
+	return true;
+free_ref:
+	kfree(pte_ref);
+	return false;
+}
+
+void pte_ref_free(pgtable_t pte)
+{
+	struct percpu_ref *ref = pte->pte_ref;
+	if (!ref)
+		return;
+
+	pte->pte_ref = NULL;
+	percpu_ref_exit(ref);
+	kfree(ref);
+}
+
+#endif /* CONFIG_FREE_USER_PTE */
-- 
2.20.1



* [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (7 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper Qi Zheng
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

The user PTE page table page may be freed when the last
percpu_ref reference is dropped. So we need to try to take a
reference on its percpu_ref before accessing the PTE page, to
prevent it from being freed while it is being accessed.

This patch adds pte_tryget() and {__,}pte_put() to help us
get and put the percpu_ref of user PTE page table pages.
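
A hedged sketch of the intended calling convention (not a converted call
site; later patches wrap this pattern into pte_tryget_map{_lock}()):

	static void walk_one_pte_page(struct mm_struct *mm, pmd_t *pmd,
				      unsigned long addr)
	{
		pgtable_t page;
		pte_t *pte;

		if (!pte_tryget(mm, pmd, addr))
			return;	/* no PTE page here, or it is being freed */

		page = pmd_pgtable(*pmd);
		pte = pte_offset_map(pmd, addr);
		/* ... access the PTE entries here ... */
		pte_unmap(pte);
		__pte_put(page);	/* drop the pte_tryget() reference */
	}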

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pte_ref.h | 23 ++++++++++++++++
 mm/pte_ref.c            | 58 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 81 insertions(+)

diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index d3963a151ca5..bfe620038699 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -12,6 +12,10 @@
 
 bool pte_ref_init(pgtable_t pte);
 void pte_ref_free(pgtable_t pte);
+void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
+bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
+void __pte_put(pgtable_t page);
+void pte_put(pte_t *ptep);
 
 #else /* !CONFIG_FREE_USER_PTE */
 
@@ -24,6 +28,25 @@ static inline void pte_ref_free(pgtable_t pte)
 {
 }
 
+static inline void free_user_pte(struct mm_struct *mm, pmd_t *pmd,
+				 unsigned long addr)
+{
+}
+
+static inline bool pte_tryget(struct mm_struct *mm, pmd_t *pmd,
+			      unsigned long addr)
+{
+	return true;
+}
+
+static inline void __pte_put(pgtable_t page)
+{
+}
+
+static inline void pte_put(pte_t *ptep)
+{
+}
+
 #endif /* CONFIG_FREE_USER_PTE */
 
 #endif /* _LINUX_PTE_REF_H */
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 52e31be00de4..5b382445561e 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -44,4 +44,62 @@ void pte_ref_free(pgtable_t pte)
 	kfree(ref);
 }
 
+void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) {}
+
+/*
+ * pte_tryget - try to get the pte_ref of the user PTE page table page
+ * @mm: pointer the target address space
+ * @pmd: pointer to a PMD.
+ * @addr: virtual address associated with pmd.
+ *
+ * Return: true if getting the pte_ref succeeded. And false otherwise.
+ *
+ * Before accessing the user PTE page table, we need to hold a refcount to
+ * protect against the concurrent release of the PTE page table.
+ * But we will fail in the following case:
+ * 	- The content mapped in @pmd is not a PTE page
+ * 	- The pte_ref is zero, it may be reclaimed
+ */
+bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+	bool retval = true;
+	pmd_t pmdval;
+	pgtable_t pte;
+
+	rcu_read_lock();
+	pmdval = READ_ONCE(*pmd);
+	pte = pmd_pgtable(pmdval);
+	if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval))) {
+		retval = false;
+	} else if (!percpu_ref_tryget(pte->pte_ref)) {
+		rcu_read_unlock();
+		/*
+		 * Also do free_user_pte() here to prevent missed reclaim due
+		 * to race condition.
+		 */
+		free_user_pte(mm, pmd, addr & PMD_MASK);
+		return false;
+	}
+	rcu_read_unlock();
+
+	return retval;
+}
+
+void __pte_put(pgtable_t page)
+{
+	percpu_ref_put(page->pte_ref);
+}
+
+void pte_put(pte_t *ptep)
+{
+	pgtable_t page;
+
+	if (pte_huge(*ptep))
+		return;
+
+	page = pte_to_page(ptep);
+	__pte_put(page);
+}
+EXPORT_SYMBOL(pte_put);
+
 #endif /* CONFIG_FREE_USER_PTE */
-- 
2.20.1



* [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (8 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock() Qi Zheng
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Currently, we use pte_offset_map{_lock}() to get the pte_t pointer
before accessing a PTE page table page. With FREE_USER_PTE, we also
need to call pte_tryget() before pte_offset_map{_lock}() to try to
take a reference on the PTE page table page and prevent it from
being freed while it is accessed.

This patch adds pte_tryget_map{_lock}() to help us do that. A
return value of NULL indicates that we failed to get the percpu_ref:
a concurrent thread is releasing this PTE page table page (or has
already released it). This case needs to be treated like pte_none().
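
A hedged sketch of how a hypothetical caller would handle the NULL return
(the real conversions follow in the next patches):

	static void example_walker(struct mm_struct *mm, pmd_t *pmd,
				   unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte;

		pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
		if (!pte)
			return;	/* treat it like pte_none(): nothing to do */

		/* ... operate on *pte while holding ptl ... */

		pte_unmap_unlock(pte, ptl);	/* also drops the tryget ref */
	}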

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pgtable.h | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index d1218cb1013e..6f205fee6348 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,6 +228,8 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 	return ptl;
 }
 
+#include <linux/pte_ref.h>
+
 #ifndef pte_offset_kernel
 static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 {
@@ -240,12 +242,38 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_offset_map(dir, address)				\
 	((pte_t *)kmap_atomic(pmd_page(*(dir))) +		\
 	 pte_index((address)))
-#define pte_unmap(pte) kunmap_atomic((pte))
+#define __pte_unmap(pte) kunmap_atomic((pte))
 #else
 #define pte_offset_map(dir, address)	pte_offset_kernel((dir), (address))
-#define pte_unmap(pte) ((void)(pte))	/* NOP */
+#define __pte_unmap(pte) ((void)(pte))	/* NOP */
 #endif
 
+#define pte_tryget_map(mm, pmd, address)		\
+({							\
+	pte_t *__pte = NULL;				\
+	if (pte_tryget(mm, pmd, address))		\
+		__pte = pte_offset_map(pmd, address);	\
+	__pte;						\
+})
+
+#define pte_unmap(pte)	do {				\
+	pte_put(pte);					\
+	__pte_unmap(pte);				\
+} while (0)
+
+#define pte_tryget_map_lock(mm, pmd, address, ptlp)	\
+({							\
+	spinlock_t *__ptl = NULL;			\
+	pte_t *__pte = NULL;				\
+	if (pte_tryget(mm, pmd, address)) {		\
+		__ptl = pte_lockptr(mm, pmd);		\
+		__pte = pte_offset_map(pmd, address);	\
+		*(ptlp) = __ptl;			\
+		spin_lock(__ptl);			\
+	}						\
+	__pte;						\
+})
+
 #define pte_offset_map_lock(mm, pmd, address, ptlp)	\
 ({							\
 	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
@@ -260,6 +288,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 	pte_unmap(pte);					\
 } while (0)
 
+#define __pte_unmap_unlock(pte, ptl)	do {		\
+	spin_unlock(ptl);				\
+	__pte_unmap(pte);				\
+} while (0)
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
-- 
2.20.1



* [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock()
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (9 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 12/18] mm: convert to use pte_tryget_map() Qi Zheng
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Use pte_tryget_map_lock() to try to take a reference on the
PTE page table page we want to access, which prevents the page
from being freed during the access.

In the following cases, the PTE page table page is stable:

 - its refcount is already held
 - there are no concurrent threads (e.g. the write lock of
   mmap_lock is held)
 - the PTE page table page is not yet visible to other threads
 - local cpu interrupts are disabled or the rcu read lock is
   held (e.g. the GUP fast path)
 - the page is a kernel PTE page table page

In those cases we keep using pte_offset_map_lock() and replace
pte_unmap_unlock() with __pte_unmap_unlock(), which does not
drop the refcount.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/proc/task_mmu.c    |  16 ++++--
 include/linux/mm.h    |   2 +-
 mm/damon/vaddr.c      |  30 ++++++----
 mm/debug_vm_pgtable.c |   2 +-
 mm/filemap.c          |   4 +-
 mm/gup.c              |   4 +-
 mm/khugepaged.c       |  10 +++-
 mm/ksm.c              |   4 +-
 mm/madvise.c          |  30 +++++++---
 mm/memcontrol.c       |   8 ++-
 mm/memory-failure.c   |   4 +-
 mm/memory.c           | 125 +++++++++++++++++++++++++++++-------------
 mm/mempolicy.c        |   4 +-
 mm/migrate_device.c   |  22 +++++---
 mm/mincore.c          |   5 +-
 mm/mlock.c            |   5 +-
 mm/mprotect.c         |   4 +-
 mm/mremap.c           |   5 +-
 mm/pagewalk.c         |   4 +-
 mm/swapfile.c         |  13 +++--
 mm/userfaultfd.c      |  11 +++-
 21 files changed, 219 insertions(+), 93 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f46060eb91b5..5fff96659e4f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -625,7 +625,9 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	 * keeps khugepaged out of here and from collapsing things
 	 * in here.
 	 */
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		goto out;
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		smaps_pte_entry(pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
@@ -1178,7 +1180,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
@@ -1515,7 +1519,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	 * We can assume that @vma always points to a valid one and @end never
 	 * goes beyond vma->vm_end.
 	 */
-	orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(walk->mm, pmdp, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		pagemap_entry_t pme;
 
@@ -1849,7 +1855,9 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
-	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
 		if (!page)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1a6bc79c351b..04f7a6c36dc7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2288,7 +2288,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
 	(pte_alloc(mm, pmd) ?			\
-		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
+		 NULL : pte_tryget_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index b2ec0aa1ff45..4aa9e252c081 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -372,10 +372,13 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 {
 	pte_t *pte;
 	spinlock_t *ptl;
+	pmd_t pmdval;
 
-	if (pmd_huge(*pmd)) {
+retry:
+	pmdval = READ_ONCE(*pmd);
+	if (pmd_huge(pmdval)) {
 		ptl = pmd_lock(walk->mm, pmd);
-		if (pmd_huge(*pmd)) {
+		if (pmd_huge(pmdval)) {
 			damon_pmdp_mkold(pmd, walk->mm, addr);
 			spin_unlock(ptl);
 			return 0;
@@ -383,9 +386,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 		spin_unlock(ptl);
 	}
 
-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+	if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval)))
 		return 0;
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		goto retry;
 	if (!pte_present(*pte))
 		goto out;
 	damon_ptep_mkold(pte, walk->mm, addr);
@@ -499,18 +504,21 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	struct page *page;
 	struct damon_young_walk_private *priv = walk->private;
+	pmd_t pmdval;
 
+retry:
+	pmdval = READ_ONCE(*pmd);
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pmd_huge(*pmd)) {
+	if (pmd_huge(pmdval)) {
 		ptl = pmd_lock(walk->mm, pmd);
-		if (!pmd_huge(*pmd)) {
+		if (!pmd_huge(pmdval)) {
 			spin_unlock(ptl);
 			goto regular_page;
 		}
-		page = damon_get_page(pmd_pfn(*pmd));
+		page = damon_get_page(pmd_pfn(pmdval));
 		if (!page)
 			goto huge_out;
-		if (pmd_young(*pmd) || !page_is_idle(page) ||
+		if (pmd_young(pmdval) || !page_is_idle(page) ||
 					mmu_notifier_test_young(walk->mm,
 						addr)) {
 			*priv->page_sz = ((1UL) << HPAGE_PMD_SHIFT);
@@ -525,9 +533,11 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 regular_page:
 #endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
 
-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+	if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval)))
 		return -EINVAL;
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		goto retry;
 	if (!pte_present(*pte))
 		goto out;
 	page = damon_get_page(pte_pfn(*pte));
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index db2abd9e415b..91c4400ca13c 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1303,7 +1303,7 @@ static int __init debug_vm_pgtable(void)
 	 * proper page table lock.
 	 */
 
-	args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
+	args.ptep = pte_tryget_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
 	pte_clear_tests(&args);
 	pte_advanced_tests(&args);
 	pte_unmap_unlock(args.ptep, ptl);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3a5ffb5587cd..fc156922147b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3368,7 +3368,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	}
 
 	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+	vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+	if (!vmf->pte)
+		goto out;
 	do {
 again:
 		page = folio_file_page(folio, xas.xa_index);
diff --git a/mm/gup.c b/mm/gup.c
index f598a037eb04..d2c24181fb04 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -451,7 +451,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
 
-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmd, address, &ptl);
+	if (!ptep)
+		return no_page_table(vma, flags);
 	pte = *ptep;
 	if (!pte_present(pte)) {
 		swp_entry_t entry;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a4e5eaf3eb01..3776cc315294 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1227,7 +1227,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	pte = pte_tryget_map_lock(mm, pmd, address, &ptl);
+	if (!pte) {
+		result = SCAN_PMD_NULL;
+		goto out;
+	}
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
@@ -1505,7 +1509,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		page_remove_rmap(page, vma, false);
 	}
 
-	pte_unmap_unlock(start_pte, ptl);
+	__pte_unmap_unlock(start_pte, ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (count) {
@@ -1521,7 +1525,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	return;
 
 abort:
-	pte_unmap_unlock(start_pte, ptl);
+	__pte_unmap_unlock(start_pte, ptl);
 	goto drop_hpage;
 }
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 063a48eeb5ee..64a5f965cfc5 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1138,7 +1138,9 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 				addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 
-	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out_mn;
 	if (!pte_same(*ptep, orig_pte)) {
 		pte_unmap_unlock(ptep, ptl);
 		goto out_mn;
diff --git a/mm/madvise.c b/mm/madvise.c
index 1873616a37d2..8123397f14c8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -207,7 +207,9 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		struct page *page;
 		spinlock_t *ptl;
 
-		orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+		orig_pte = pte_tryget_map_lock(vma->vm_mm, pmd, start, &ptl);
+		if (!orig_pte)
+			break;
 		pte = *(orig_pte + ((index - start) / PAGE_SIZE));
 		pte_unmap_unlock(orig_pte, ptl);
 
@@ -400,7 +402,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		return 0;
 #endif
 	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!orig_pte)
+		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
@@ -432,12 +436,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			if (split_huge_page(page)) {
 				unlock_page(page);
 				put_page(page);
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
 				break;
 			}
 			unlock_page(page);
 			put_page(page);
-			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+			if (!pte)
+				break;
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
@@ -477,7 +483,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	}
 
 	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(orig_pte, ptl);
+	if (orig_pte)
+		pte_unmap_unlock(orig_pte, ptl);
 	if (pageout)
 		reclaim_pages(&page_list);
 	cond_resched();
@@ -602,7 +609,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+	if (!orig_pte)
+		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -648,12 +657,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			if (split_huge_page(page)) {
 				unlock_page(page);
 				put_page(page);
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
+				orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
 				goto out;
 			}
 			unlock_page(page);
 			put_page(page);
-			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+			orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+			if (!pte)
+				goto out;
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;
@@ -707,7 +718,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
 	}
 	arch_leave_lazy_mmu_mode();
-	pte_unmap_unlock(orig_pte, ptl);
+	if (orig_pte)
+		pte_unmap_unlock(orig_pte, ptl);
 	cond_resched();
 next:
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 725f76723220..ad51ec9043b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5736,7 +5736,9 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 
 	if (pmd_trans_unstable(pmd))
 		return 0;
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (get_mctgt_type(vma, addr, *pte, NULL))
 			mc.precharge++;	/* increment precharge temporarily */
@@ -5955,7 +5957,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 retry:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
 		bool device = false;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index dcb6bb9cf731..5247932df3fa 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -637,8 +637,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 	if (pmd_trans_unstable(pmdp))
 		goto out;
 
-	mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
+	mapped_pte = ptep = pte_tryget_map_lock(walk->vma->vm_mm, pmdp,
 						addr, &ptl);
+	if (!mapped_pte)
+		goto out;
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
 					     hwp->pfn, &hwp->tk);
diff --git a/mm/memory.c b/mm/memory.c
index 76e3af9639d9..ca03006b32cb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1352,7 +1352,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
 	init_rss_vec(rss);
-	start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	start_pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+	if (!start_pte)
+		return end;
 	pte = start_pte;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
@@ -1846,7 +1848,9 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 		int pte_idx = 0;
 		const int batch_size = min_t(int, pages_to_write_in_pmd, 8);
 
-		start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
+		start_pte = pte_tryget_map_lock(mm, pmd, addr, &pte_lock);
+		if (!start_pte)
+			break;
 		for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
 			int err = insert_page_in_batch_locked(vma, pte,
 				addr, pages[curr_page_idx], prot);
@@ -2532,9 +2536,13 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 		if (!pte)
 			return -ENOMEM;
 	} else {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_offset_kernel(pmd, addr) :
-			pte_offset_map_lock(mm, pmd, addr, &ptl);
+		if (mm == &init_mm) {
+			mapped_pte = pte = pte_offset_kernel(pmd, addr);
+		} else {
+			mapped_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+			if (!mapped_pte)
+				return err;
+		}
 	}
 
 	BUG_ON(pmd_huge(*pmd));
@@ -2787,7 +2795,11 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
 		pte_t entry;
 
-		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		if (!vmf->pte) {
+			ret = false;
+			goto pte_unlock;
+		}
 		locked = true;
 		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 			/*
@@ -2815,7 +2827,11 @@ static inline bool cow_user_page(struct page *dst, struct page *src,
 			goto warn;
 
 		/* Re-validate under PTL if the page is still mapped */
-		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
+		if (!vmf->pte) {
+			ret = false;
+			goto pte_unlock;
+		}
 		locked = true;
 		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 			/* The PTE changed under us, update local tlb */
@@ -3005,6 +3021,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	pte_t entry;
 	int page_copied = 0;
 	struct mmu_notifier_range range;
+	vm_fault_t ret = VM_FAULT_OOM;
 
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
@@ -3048,7 +3065,12 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+	vmf->pte = pte_tryget_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
+	if (!vmf->pte) {
+		mmu_notifier_invalidate_range_only_end(&range);
+		ret = VM_FAULT_RETRY;
+		goto uncharge;
+	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
@@ -3129,12 +3151,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		put_page(old_page);
 	}
 	return page_copied ? VM_FAULT_WRITE : 0;
+uncharge:
+	mem_cgroup_uncharge(page_folio(new_page));
 oom_free_new:
 	put_page(new_page);
 oom:
 	if (old_page)
 		put_page(old_page);
-	return VM_FAULT_OOM;
+	return ret;
 }
 
 /**
@@ -3156,8 +3180,10 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
 {
 	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
-	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
+	vmf->pte = pte_tryget_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
 				       &vmf->ptl);
+	if (!vmf->pte)
+		return VM_FAULT_NOPAGE;
 	/*
 	 * We might have raced with another page fault while we released the
 	 * pte_offset_map_lock.
@@ -3469,6 +3495,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	struct vm_area_struct *vma = vmf->vma;
 	struct mmu_notifier_range range;
+	vm_fault_t ret = 0;
 
 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
 		return VM_FAULT_RETRY;
@@ -3477,16 +3504,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 				(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
 	mmu_notifier_invalidate_range_start(&range);
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+	vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 				&vmf->ptl);
+	if (!vmf->pte) {
+		ret = VM_FAULT_RETRY;
+		goto out;
+	}
 	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 		restore_exclusive_pte(vma, page, vmf->address, vmf->pte);
 
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
 	unlock_page(page);
 
 	mmu_notifier_invalidate_range_end(&range);
-	return 0;
+	return ret;
 }
 
 static inline bool should_try_to_free_swap(struct page *page,
@@ -3599,8 +3631,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+			vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd,
 					vmf->address, &vmf->ptl);
+			if (!vmf->pte) {
+				ret = VM_FAULT_OOM;
+				goto out;
+			}
 			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
 				ret = VM_FAULT_OOM;
 			goto unlock;
@@ -3666,8 +3702,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+	vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
+	if (!vmf->pte)
+		goto out_page;
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
 		goto out_nomap;
 
@@ -3781,6 +3819,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_SHARED)
 		return VM_FAULT_SIGBUS;
 
+retry:
 	/*
 	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
 	 * pte_offset_map() on pmds where a huge pmd might be created
@@ -3803,8 +3842,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 			!mm_forbids_zeropage(vma->vm_mm)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
 						vma->vm_page_prot));
-		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+		vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd,
 				vmf->address, &vmf->ptl);
+		if (!vmf->pte)
+			goto retry;
 		if (!pte_none(*vmf->pte)) {
 			update_mmu_tlb(vma, vmf->address, vmf->pte);
 			goto unlock;
@@ -3843,8 +3884,10 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+	vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 			&vmf->ptl);
+	if (!vmf->pte)
+		goto uncharge;
 	if (!pte_none(*vmf->pte)) {
 		update_mmu_cache(vma, vmf->address, vmf->pte);
 		goto release;
@@ -3875,6 +3918,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 release:
 	put_page(page);
 	goto unlock;
+uncharge:
+	mem_cgroup_uncharge(page_folio(page));
 oom_free_page:
 	put_page(page);
 oom:
@@ -4112,8 +4157,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	if (pmd_devmap_trans_unstable(vmf->pmd))
 		return 0;
 
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+	vmf->pte = pte_tryget_map_lock(vma->vm_mm, vmf->pmd,
 				      vmf->address, &vmf->ptl);
+	if (!vmf->pte)
+		return 0;
 	ret = 0;
 	/* Re-check under ptl */
 	if (likely(pte_none(*vmf->pte)))
@@ -4340,31 +4387,27 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
 	 */
 	if (!vma->vm_ops->fault) {
-		/*
-		 * If we find a migration pmd entry or a none pmd entry, which
-		 * should never happen, return SIGBUS
-		 */
-		if (unlikely(!pmd_present(*vmf->pmd)))
-			ret = VM_FAULT_SIGBUS;
-		else {
-			vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
+		vmf->pte = pte_tryget_map_lock(vmf->vma->vm_mm,
 						       vmf->pmd,
 						       vmf->address,
 						       &vmf->ptl);
-			/*
-			 * Make sure this is not a temporary clearing of pte
-			 * by holding ptl and checking again. A R/M/W update
-			 * of pte involves: take ptl, clearing the pte so that
-			 * we don't have concurrent modification by hardware
-			 * followed by an update.
-			 */
-			if (unlikely(pte_none(*vmf->pte)))
-				ret = VM_FAULT_SIGBUS;
-			else
-				ret = VM_FAULT_NOPAGE;
-
-			pte_unmap_unlock(vmf->pte, vmf->ptl);
+		if (!vmf->pte) {
+			ret = VM_FAULT_RETRY;
+			goto out;
 		}
+		/*
+			* Make sure this is not a temporary clearing of pte
+			* by holding ptl and checking again. A R/M/W update
+			* of pte involves: take ptl, clearing the pte so that
+			* we don't have concurrent modification by hardware
+			* followed by an update.
+			*/
+		if (unlikely(pte_none(*vmf->pte)))
+			ret = VM_FAULT_SIGBUS;
+		else
+			ret = VM_FAULT_NOPAGE;
+
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
 	} else if (!(vmf->flags & FAULT_FLAG_WRITE))
 		ret = do_read_fault(vmf);
 	else if (!(vma->vm_flags & VM_SHARED))
@@ -4372,6 +4415,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 	else
 		ret = do_shared_fault(vmf);
 
+out:
 	/* preallocated pagetable is unused: free it */
 	if (vmf->prealloc_pte) {
 		pte_free(vm_mm, vmf->prealloc_pte);
@@ -5003,13 +5047,16 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
 					(address & PAGE_MASK) + PAGE_SIZE);
 		mmu_notifier_invalidate_range_start(range);
 	}
-	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+	ptep = pte_tryget_map_lock(mm, pmd, address, ptlp);
+	if (!ptep)
+		goto invalid;
 	if (!pte_present(*ptep))
 		goto unlock;
 	*ptepp = ptep;
 	return 0;
 unlock:
 	pte_unmap_unlock(ptep, *ptlp);
+invalid:
 	if (range)
 		mmu_notifier_invalidate_range_end(range);
 out:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8c74107a2b15..a846666c64c3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -523,7 +523,9 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
-	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	mapped_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!mapped_pte)
+		return 0;
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 70c7dc05bbfc..260471f37470 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -64,21 +64,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	unsigned long addr = start, unmapped = 0;
 	spinlock_t *ptl;
 	pte_t *ptep;
+	pmd_t pmdval;
 
 again:
-	if (pmd_none(*pmdp))
+	pmdval = READ_ONCE(*pmdp);
+	if (pmd_none(pmdval))
 		return migrate_vma_collect_hole(start, end, -1, walk);
 
-	if (pmd_trans_huge(*pmdp)) {
+	if (pmd_trans_huge(pmdval)) {
 		struct page *page;
 
 		ptl = pmd_lock(mm, pmdp);
-		if (unlikely(!pmd_trans_huge(*pmdp))) {
+		if (unlikely(!pmd_trans_huge(pmdval))) {
 			spin_unlock(ptl);
 			goto again;
 		}
 
-		page = pmd_page(*pmdp);
+		page = pmd_page(pmdval);
 		if (is_huge_zero_page(page)) {
 			spin_unlock(ptl);
 			split_huge_pmd(vma, pmdp, addr);
@@ -99,16 +101,18 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (ret)
 				return migrate_vma_collect_skip(start, end,
 								walk);
-			if (pmd_none(*pmdp))
+			if (pmd_none(pmdval))
 				return migrate_vma_collect_hole(start, end, -1,
 								walk);
 		}
 	}
 
-	if (unlikely(pmd_bad(*pmdp)))
+	if (unlikely(pmd_bad(pmdval)))
 		return migrate_vma_collect_skip(start, end, walk);
 
-	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmdp, addr, &ptl);
+	if (!ptep)
+		goto again;
 	arch_enter_lazy_mmu_mode();
 
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
@@ -588,7 +592,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 			entry = pte_mkwrite(pte_mkdirty(entry));
 	}
 
-	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmdp, addr, &ptl);
+	if (!ptep)
+		goto abort;
 
 	if (check_stable_address_space(mm))
 		goto unlock_abort;
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..337f8a45ded0 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -105,6 +105,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
 
+again:
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		memset(vec, 1, nr);
@@ -117,7 +118,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	ptep = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto again;
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		pte_t pte = *ptep;
 
diff --git a/mm/mlock.c b/mm/mlock.c
index 716caf851043..89f7de636efc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -314,6 +314,7 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *start_pte, *pte;
 	struct page *page;
 
+again:
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (!pmd_present(*pmd))
@@ -328,7 +329,9 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
 		goto out;
 	}
 
-	start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	start_pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!start_pte)
+		goto again;
 	for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..aa09cd34ea30 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -63,7 +63,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	 * from under us even if the mmap_lock is only hold for
 	 * reading.
 	 */
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 
 	/* Get target node for single threaded private VMAs */
 	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
diff --git a/mm/mremap.c b/mm/mremap.c
index 303d3290b938..d5ea5ce8a22a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -167,7 +167,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	 * We don't have to worry about the ordering of src and dst
 	 * pte locks because exclusive mmap_lock prevents deadlock.
 	 */
-	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	old_pte = pte_tryget_map_lock(mm, old_pmd, old_addr, &old_ptl);
+	if (!old_pte)
+		goto drop_lock;
 	new_pte = pte_offset_map(new_pmd, new_addr);
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
@@ -206,6 +208,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+drop_lock:
 	if (need_rmap_locks)
 		drop_rmap_locks(vma);
 }
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9b3db11a4d1d..264b717e24ef 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -50,7 +50,9 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		err = walk_pte_range_inner(pte, addr, end, walk);
 		pte_unmap(pte);
 	} else {
-		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+		pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+		if (!pte)
+			return end;
 		err = walk_pte_range_inner(pte, addr, end, walk);
 		pte_unmap_unlock(pte, ptl);
 	}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63c61f8b2611..710fbeec9e58 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1790,10 +1790,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	if (unlikely(!page))
 		return -ENOMEM;
 
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte) {
+		ret = -EAGAIN;
+		goto out;
+	}
 	if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
 		ret = 0;
-		goto out;
+		goto unlock;
 	}
 
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
@@ -1808,8 +1812,9 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	swap_free(entry);
-out:
+unlock:
 	pte_unmap_unlock(pte, ptl);
+out:
 	if (page != swapcache) {
 		unlock_page(page);
 		put_page(page);
@@ -1897,7 +1902,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
-		if (ret)
+		if (ret && ret != -EAGAIN)
 			return ret;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0cb8e5ef1713..c1bce9cf5657 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -79,7 +79,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 			_dst_pte = pte_mkwrite(_dst_pte);
 	}
 
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	dst_pte = pte_tryget_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!dst_pte)
+		return -EAGAIN;
 
 	if (vma_is_shmem(dst_vma)) {
 		/* serialize against truncate with the page table lock */
@@ -194,7 +196,9 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 
 	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
 					 dst_vma->vm_page_prot));
-	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	dst_pte = pte_tryget_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+	if (!dst_pte)
+		return -EAGAIN;
 	if (dst_vma->vm_file) {
 		/* the shmem MAP_PRIVATE case requires checking the i_size */
 		inode = dst_vma->vm_file->f_inode;
@@ -587,6 +591,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			break;
 		}
 
+again:
 		dst_pmdval = pmd_read_atomic(dst_pmd);
 		/*
 		 * If the dst_pmd is mapped as THP don't
@@ -612,6 +617,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
 				       src_addr, &page, mcopy_mode, wp_copy);
+		if (err == -EAGAIN)
+			goto again;
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 12/18] mm: convert to use pte_tryget_map()
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (10 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock() Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Qi Zheng
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Use pte_tryget_map() to try to take a reference on the PTE
page table page we want to access, which prevents the page
from being freed while it is being accessed.

In unuse_pte_range() there are multiple locations where
pte_offset_map() is called, and handling the failure case at
each of them would be awkward, so pte_tryget() is done up front
in unuse_pmd_range().

In the following cases, the PTE page table page is already stable:

 - the refcount of the PTE page table page is already held
 - there are no concurrent threads (e.g. the write lock of
   mmap_lock is held)
 - the PTE page table page is not yet visible to other threads
 - local CPU interrupts are disabled or the RCU read lock is
   held (e.g. the GUP fast path)
 - the PTE page table page is a kernel PTE page table page

In those cases we keep using pte_offset_map() and replace
pte_unmap() with __pte_unmap(), which does not drop the
refcount.
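
The lockless call sites follow a retry pattern roughly like the
sketch below (example_map_pte() is hypothetical and only illustrates
the pattern; a failed tryget means the PTE page table page has been
freed, so the pmd is re-read and the walk restarted):

	static pte_t *example_map_pte(struct mm_struct *mm, pmd_t *pmd,
				      unsigned long addr)
	{
		pmd_t pmdval;
		pte_t *pte;

	retry:
		pmdval = READ_ONCE(*pmd);
		if (pmd_none(pmdval) || unlikely(pmd_bad(pmdval)))
			return NULL;

		/* Fails if the PTE page table page was freed concurrently. */
		pte = pte_tryget_map(mm, pmd, addr);
		if (!pte)
			goto retry;

		/* The caller pairs this with pte_unmap() to drop the refcount. */
		return pte;
	}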

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/mm/mem_encrypt_identity.c | 11 ++++++++---
 fs/userfaultfd.c                   | 10 +++++++---
 include/linux/mm.h                 |  2 +-
 include/linux/swapops.h            |  4 ++--
 kernel/events/core.c               |  5 ++++-
 mm/gup.c                           | 16 +++++++++++-----
 mm/hmm.c                           |  9 +++++++--
 mm/huge_memory.c                   |  4 ++--
 mm/khugepaged.c                    |  8 +++++---
 mm/memory-failure.c                | 11 ++++++++---
 mm/memory.c                        | 19 +++++++++++++------
 mm/migrate.c                       |  8 ++++++--
 mm/mremap.c                        |  5 ++++-
 mm/page_table_check.c              |  2 +-
 mm/page_vma_mapped.c               | 13 ++++++++++---
 mm/pagewalk.c                      |  2 +-
 mm/swap_state.c                    |  4 ++--
 mm/swapfile.c                      |  9 ++++++---
 mm/vmalloc.c                       |  2 +-
 19 files changed, 99 insertions(+), 45 deletions(-)

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index 6d323230320a..37a3f4da7bd2 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -171,26 +171,31 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	pmd_t pmdval;
 
 	pud = sme_prepare_pgd(ppd);
 	if (!pud)
 		return;
 
 	pmd = pmd_offset(pud, ppd->vaddr);
-	if (pmd_none(*pmd)) {
+retry:
+	pmdval = READ_ONCE(*pmd);
+	if (pmd_none(pmdval)) {
 		pte = ppd->pgtable_area;
 		memset(pte, 0, sizeof(*pte) * PTRS_PER_PTE);
 		ppd->pgtable_area += sizeof(*pte) * PTRS_PER_PTE;
 		set_pmd(pmd, __pmd(PMD_FLAGS | __pa(pte)));
 	}
 
-	if (pmd_large(*pmd))
+	if (pmd_large(pmdval))
 		return;
 
 	pte = pte_offset_map(pmd, ppd->vaddr);
+	if (!pte)
+		goto retry;
 	if (pte_none(*pte))
 		set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
-	pte_unmap(pte);
+	__pte_unmap(pte);
 }
 
 static void __init __sme_map_range_pmd(struct sme_populate_pgd_data *ppd)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..c83fc73f29c0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -309,6 +309,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 * This is to deal with the instability (as in
 	 * pmd_trans_unstable) of the pmd.
 	 */
+retry:
 	_pmd = READ_ONCE(*pmd);
 	if (pmd_none(_pmd))
 		goto out;
@@ -324,10 +325,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	}
 
 	/*
-	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
-	 * and use the standard pte_offset_map() instead of parsing _pmd.
+	 * After we tryget successfully, the pmd is stable (as in
+	 * !pmd_trans_unstable) so we can re-read it and use the standard
+	 * pte_offset_map() instead of parsing _pmd.
 	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_tryget_map(mm, pmd, address);
+	if (!pte)
+		goto retry;
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
 	 * changes under us.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04f7a6c36dc7..cc8fb009bab7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2284,7 +2284,7 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
 #define pte_alloc_map(mm, pmd, address)			\
-	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+	(pte_alloc(mm, pmd) ? NULL : pte_tryget_map(mm, pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
 	(pte_alloc(mm, pmd) ?			\
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d356ab4047f7..b671ecd6b5e7 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -214,7 +214,7 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl);
-extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+extern bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					unsigned long address);
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
 		struct mm_struct *mm, pte_t *pte);
@@ -236,7 +236,7 @@ static inline int is_migration_entry(swp_entry_t swp)
 
 static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 					spinlock_t *ptl) { }
-static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+static inline bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 					 unsigned long address) { }
 static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
 		struct mm_struct *mm, pte_t *pte) { }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 23bb19716ad3..443b0af075e6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7215,6 +7215,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 		return pud_leaf_size(pud);
 
 	pmdp = pmd_offset_lockless(pudp, pud, addr);
+retry:
 	pmd = READ_ONCE(*pmdp);
 	if (!pmd_present(pmd))
 		return 0;
@@ -7222,7 +7223,9 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
 	if (pmd_leaf(pmd))
 		return pmd_leaf_size(pmd);
 
-	ptep = pte_offset_map(&pmd, addr);
+	ptep = pte_tryget_map(mm, &pmd, addr);
+	if (!ptep)
+		goto retry;
 	pte = ptep_get_lockless(ptep);
 	if (pte_present(pte))
 		size = pte_leaf_size(pte);
diff --git a/mm/gup.c b/mm/gup.c
index d2c24181fb04..114a7e7f871b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -470,7 +470,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		if (!is_migration_entry(entry))
 			goto no_page;
 		pte_unmap_unlock(ptep, ptl);
-		migration_entry_wait(mm, pmd, address);
+		if (!migration_entry_wait(mm, pmd, address))
+			return no_page_table(vma, flags);
 		goto retry;
 	}
 	if ((flags & FOLL_NUMA) && pte_protnone(pte))
@@ -805,6 +806,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	pmd_t *pmd;
 	pte_t *pte;
 	int ret = -EFAULT;
+	pmd_t pmdval;
 
 	/* user gate pages are read-only */
 	if (gup_flags & FOLL_WRITE)
@@ -822,10 +824,14 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	if (pud_none(*pud))
 		return -EFAULT;
 	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
+retry:
+	pmdval = READ_ONCE(*pmd);
+	if (!pmd_present(pmdval))
 		return -EFAULT;
-	VM_BUG_ON(pmd_trans_huge(*pmd));
-	pte = pte_offset_map(pmd, address);
+	VM_BUG_ON(pmd_trans_huge(pmdval));
+	pte = pte_tryget_map(mm, pmd, address);
+	if (!pte)
+		goto retry;
 	if (pte_none(*pte))
 		goto unmap;
 	*vma = get_gate_vma(mm);
@@ -2223,7 +2229,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 pte_unmap:
 	if (pgmap)
 		put_dev_pagemap(pgmap);
-	pte_unmap(ptem);
+	__pte_unmap(ptem);
 	return ret;
 }
 #else
diff --git a/mm/hmm.c b/mm/hmm.c
index af71aac3140e..0cf45092efca 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -279,7 +279,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		if (is_migration_entry(entry)) {
 			pte_unmap(ptep);
 			hmm_vma_walk->last = addr;
-			migration_entry_wait(walk->mm, pmdp, addr);
+			if (!migration_entry_wait(walk->mm, pmdp, addr))
+				return -EAGAIN;
 			return -EBUSY;
 		}
 
@@ -384,12 +385,16 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 	}
 
-	ptep = pte_offset_map(pmdp, addr);
+	ptep = pte_tryget_map(walk->mm, pmdp, addr);
+	if (!ptep)
+		goto again;
 	for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) {
 		int r;
 
 		r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, hmm_pfns);
 		if (r) {
+			if (r == -EAGAIN)
+				goto again;
 			/* hmm_vma_handle_pte() did pte_unmap() */
 			return r;
 		}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c468fee595ff..73ac2e9c9193 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1932,7 +1932,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 		pte = pte_offset_map(&_pmd, haddr);
 		VM_BUG_ON(!pte_none(*pte));
 		set_pte_at(mm, haddr, pte, entry);
-		pte_unmap(pte);
+		__pte_unmap(pte);
 	}
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
@@ -2086,7 +2086,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		set_pte_at(mm, addr, pte, entry);
 		if (!pmd_migration)
 			atomic_inc(&page[i]._mapcount);
-		pte_unmap(pte);
+		__pte_unmap(pte);
 	}
 
 	if (!pmd_migration) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3776cc315294..f540d7983b2d 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1003,7 +1003,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 			.pmd = pmd,
 		};
 
-		vmf.pte = pte_offset_map(pmd, address);
+		vmf.pte = pte_tryget_map(mm, pmd, address);
+		if (!vmf.pte)
+			return false;
 		vmf.orig_pte = *vmf.pte;
 		if (!is_swap_pte(vmf.orig_pte)) {
 			pte_unmap(vmf.pte);
@@ -1145,7 +1147,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spin_unlock(pte_ptl);
 
 	if (unlikely(!isolated)) {
-		pte_unmap(pte);
+		__pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
@@ -1168,7 +1170,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl,
 			&compound_pagelist);
-	pte_unmap(pte);
+	__pte_unmap(pte);
 	/*
 	 * spin_lock() below is not the equivalent of smp_wmb(), but
 	 * the smp_wmb() inside __SetPageUptodate() can be reused to
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5247932df3fa..2a840ddfc34e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -304,6 +304,7 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	pmd_t pmdval;
 
 	VM_BUG_ON_VMA(address == -EFAULT, vma);
 	pgd = pgd_offset(vma->vm_mm, address);
@@ -318,11 +319,15 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
 	if (pud_devmap(*pud))
 		return PUD_SHIFT;
 	pmd = pmd_offset(pud, address);
-	if (!pmd_present(*pmd))
+retry:
+	pmdval = READ_ONCE(*pmd);
+	if (!pmd_present(pmdval))
 		return 0;
-	if (pmd_devmap(*pmd))
+	if (pmd_devmap(pmdval))
 		return PMD_SHIFT;
-	pte = pte_offset_map(pmd, address);
+	pte = pte_tryget_map(vma->vm_mm, pmd, address);
+	if (!pte)
+		goto retry;
 	if (pte_present(*pte) && pte_devmap(*pte))
 		ret = PAGE_SHIFT;
 	pte_unmap(pte);
diff --git a/mm/memory.c b/mm/memory.c
index ca03006b32cb..aa2bac561d5e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1091,7 +1091,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 
 	arch_leave_lazy_mmu_mode();
 	spin_unlock(src_ptl);
-	pte_unmap(orig_src_pte);
+	__pte_unmap(orig_src_pte);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
 	cond_resched();
@@ -3566,8 +3566,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
-			migration_entry_wait(vma->vm_mm, vmf->pmd,
-					     vmf->address);
+			if (!migration_entry_wait(vma->vm_mm, vmf->pmd,
+					     vmf->address))
+				ret = VM_FAULT_RETRY;
 		} else if (is_device_exclusive_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = remove_device_exclusive_entry(vmf);
@@ -4507,7 +4508,9 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		flags |= TNF_MIGRATED;
 	} else {
 		flags |= TNF_MIGRATE_FAIL;
-		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		vmf->pte = pte_tryget_map(vma->vm_mm, vmf->pmd, vmf->address);
+		if (!vmf->pte)
+			return VM_FAULT_RETRY;
 		spin_lock(vmf->ptl);
 		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4617,7 +4620,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
 
-	if (unlikely(pmd_none(*vmf->pmd))) {
+retry:
+	if (unlikely(pmd_none(READ_ONCE(*vmf->pmd)))) {
 		/*
 		 * Leave __pte_alloc() until later: because vm_ops->fault may
 		 * want to allocate huge page, and if we expose page table
@@ -4646,7 +4650,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 * mmap_lock read mode and khugepaged takes it in write mode.
 		 * So now it's safe to run pte_offset_map().
 		 */
-		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		vmf->pte = pte_tryget_map(vmf->vma->vm_mm, vmf->pmd,
+					  vmf->address);
+		if (!vmf->pte)
+			goto retry;
 		vmf->orig_pte = *vmf->pte;
 
 		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c31ee1e1c9b..125fbe300df2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -301,12 +301,16 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 	pte_unmap_unlock(ptep, ptl);
 }
 
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+bool migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long address)
 {
 	spinlock_t *ptl = pte_lockptr(mm, pmd);
-	pte_t *ptep = pte_offset_map(pmd, address);
+	pte_t *ptep = pte_tryget_map(mm, pmd, address);
+	if (!ptep)
+		return false;
 	__migration_entry_wait(mm, ptep, ptl);
+
+	return true;
 }
 
 void migration_entry_wait_huge(struct vm_area_struct *vma,
diff --git a/mm/mremap.c b/mm/mremap.c
index d5ea5ce8a22a..71022d42f577 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -170,7 +170,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	old_pte = pte_tryget_map_lock(mm, old_pmd, old_addr, &old_ptl);
 	if (!old_pte)
 		goto drop_lock;
-	new_pte = pte_offset_map(new_pmd, new_addr);
+	new_pte = pte_tryget_map(mm, new_pmd, new_addr);
+	if (!new_pte)
+		goto unmap_drop_lock;
 	new_ptl = pte_lockptr(mm, new_pmd);
 	if (new_ptl != old_ptl)
 		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
@@ -207,6 +209,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
+unmap_drop_lock:
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 drop_lock:
 	if (need_rmap_locks)
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 2458281bff89..185e84f22c6c 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -251,7 +251,7 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm,
 		pte_t *ptep = pte_offset_map(&pmd, addr);
 		unsigned long i;
 
-		pte_unmap(ptep);
+		__pte_unmap(ptep);
 		for (i = 0; i < PTRS_PER_PTE; i++) {
 			__page_table_check_pte_clear(mm, addr, *ptep);
 			addr += PAGE_SIZE;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 14a5cda73dee..8ecf8fd7cf5e 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -15,7 +15,9 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 
 static bool map_pte(struct page_vma_mapped_walk *pvmw)
 {
-	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+	pvmw->pte = pte_tryget_map(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address);
+	if (!pvmw->pte)
+		return false;
 	if (!(pvmw->flags & PVMW_SYNC)) {
 		if (pvmw->flags & PVMW_MIGRATION) {
 			if (!is_swap_pte(*pvmw->pte))
@@ -203,6 +205,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		}
 
 		pvmw->pmd = pmd_offset(pud, pvmw->address);
+retry:
 		/*
 		 * Make sure the pmd value isn't cached in a register by the
 		 * compiler and used as a stale value after we've observed a
@@ -251,8 +254,12 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
-		if (!map_pte(pvmw))
-			goto next_pte;
+		if (!map_pte(pvmw)) {
+			if (!pvmw->pte)
+				goto retry;
+			else
+				goto next_pte;
+		}
 this_pte:
 		if (check_pte(pvmw))
 			return true;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 264b717e24ef..adb5dacbd537 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -48,7 +48,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	if (walk->no_vma) {
 		pte = pte_offset_map(pmd, addr);
 		err = walk_pte_range_inner(pte, addr, end, walk);
-		pte_unmap(pte);
+		__pte_unmap(pte);
 	} else {
 		pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
 		if (!pte)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 013856004825..5b70c2c815ef 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -743,7 +743,7 @@ static void swap_ra_info(struct vm_fault *vmf,
 			SWAP_RA_VAL(faddr, win, 0));
 
 	if (win == 1) {
-		pte_unmap(orig_pte);
+		__pte_unmap(orig_pte);
 		return;
 	}
 
@@ -768,7 +768,7 @@ static void swap_ra_info(struct vm_fault *vmf,
 	for (pfn = start; pfn != end; pfn++)
 		*tpte++ = *pte++;
 #endif
-	pte_unmap(orig_pte);
+	__pte_unmap(orig_pte);
 }
 
 /**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 710fbeec9e58..f1c64fc15e24 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1845,7 +1845,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			continue;
 
 		offset = swp_offset(entry);
-		pte_unmap(pte);
+		__pte_unmap(pte);
 		swap_map = &si->swap_map[offset];
 		page = lookup_swap_cache(entry, vma, addr);
 		if (!page) {
@@ -1880,7 +1880,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 try_next:
 		pte = pte_offset_map(pmd, addr);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap(pte - 1);
+	__pte_unmap(pte - 1);
 
 	ret = 0;
 out:
@@ -1901,8 +1901,11 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
+		if (!pte_tryget(vma->vm_mm, pmd, addr))
+			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type);
-		if (ret && ret != -EAGAIN)
+		__pte_put(pmd_pgtable(*pmd));
+		if (ret)
 			return ret;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e163372d3967..080aa78bdaff 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -694,7 +694,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
 	pte = *ptep;
 	if (pte_present(pte))
 		page = pte_page(pte);
-	pte_unmap(ptep);
+	__pte_unmap(ptep);
 
 	return page;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (11 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 12/18] mm: convert to use pte_tryget_map() Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-30 13:35   ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case Qi Zheng
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Normally, the percpu_ref of a user PTE page table page is in
percpu mode. This patch adds try_to_free_user_pte(), which
switches the percpu_ref to atomic mode and checks whether the
count has dropped to 0. A count of 0 means that no one is using
the user PTE page table page, so it can be reclaimed safely.
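
The expected calling pattern, as a rough sketch (zap_example() is
hypothetical; the real call sites are added by the following
patches):

	static void zap_example(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr)
	{
		/* the PTEs under this pmd have been cleared and the TLB flushed */

		/*
		 * Switch the pte_ref to atomic mode and check it: if it has
		 * dropped to 0, the PTE page table page is freed via RCU,
		 * otherwise switch it back to percpu mode (switch_back == true).
		 */
		try_to_free_user_pte(mm, pmd, addr, true);
	}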

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pte_ref.h |  7 +++
 mm/pte_ref.c            | 99 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index bfe620038699..379c3b45a6ab 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -16,6 +16,8 @@ void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
 bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
 void __pte_put(pgtable_t page);
 void pte_put(pte_t *ptep);
+void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+			  bool switch_back);
 
 #else /* !CONFIG_FREE_USER_PTE */
 
@@ -47,6 +49,11 @@ static inline void pte_put(pte_t *ptep)
 {
 }
 
+static inline void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd,
+					unsigned long addr, bool switch_back)
+{
+}
+
 #endif /* CONFIG_FREE_USER_PTE */
 
 #endif /* _LINUX_PTE_REF_H */
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 5b382445561e..bf9629272c71 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -8,6 +8,9 @@
 #include <linux/pte_ref.h>
 #include <linux/percpu-refcount.h>
 #include <linux/slab.h>
+#include <linux/hugetlb.h>
+#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
 
 #ifdef CONFIG_FREE_USER_PTE
 
@@ -44,8 +47,6 @@ void pte_ref_free(pgtable_t pte)
 	kfree(ref);
 }
 
-void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) {}
-
 /*
  * pte_tryget - try to get the pte_ref of the user PTE page table page
  * @mm: pointer the target address space
@@ -102,4 +103,98 @@ void pte_put(pte_t *ptep)
 }
 EXPORT_SYMBOL(pte_put);
 
+#ifdef CONFIG_DEBUG_VM
+void pte_free_debug(pmd_t pmd)
+{
+	pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd);
+	int i = 0;
+
+	for (i = 0; i < PTRS_PER_PTE; i++)
+		BUG_ON(!pte_none(*ptep++));
+}
+#else
+static inline void pte_free_debug(pmd_t pmd)
+{
+}
+#endif
+
+static inline void pte_free_rcu(struct rcu_head *rcu)
+{
+	struct page *page = container_of(rcu, struct page, rcu_head);
+
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
+/*
+ * free_user_pte - free the user PTE page table page
+ * @mm: pointer the target address space
+ * @pmd: pointer to a PMD
+ * @addr: start address of the tlb range to be flushed
+ *
+ * Context: The pmd range has been unmapped and TLB purged. And the user PTE
+ *	    page table page will be freed by rcu handler.
+ */
+void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	spinlock_t *ptl;
+	pmd_t pmdval;
+
+	ptl = pmd_lock(mm, pmd);
+	pmdval = *pmd;
+	if (pmd_none(pmdval) || pmd_leaf(pmdval)) {
+		spin_unlock(ptl);
+		return;
+	}
+	pmd_clear(pmd);
+	flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+	spin_unlock(ptl);
+
+	pte_free_debug(pmdval);
+	mm_dec_nr_ptes(mm);
+	call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
+}
+
+/*
+ * try_to_free_user_pte - try to free the user PTE page table page
+ * @mm: pointer the target address space
+ * @pmd: pointer to a PMD
+ * @addr: virtual address associated with pmd
+ * @switch_back: indicates if switching back to percpu mode is required
+ */
+void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+			  bool switch_back)
+{
+	pgtable_t pte;
+
+	if (&init_mm == mm)
+		return;
+
+	if (!pte_tryget(mm, pmd, addr))
+		return;
+	pte = pmd_pgtable(*pmd);
+	percpu_ref_switch_to_atomic_sync(pte->pte_ref);
+	rcu_read_lock();
+	/*
+	 * Here we can safely put the pte_ref because we already hold the rcu
+	 * lock, which guarantees that the user PTE page table page will not
+	 * be released.
+	 */
+	__pte_put(pte);
+	if (percpu_ref_is_zero(pte->pte_ref)) {
+		rcu_read_unlock();
+		free_user_pte(mm, pmd, addr & PMD_MASK);
+		return;
+	}
+	rcu_read_unlock();
+
+	if (switch_back) {
+		if (pte_tryget(mm, pmd, addr)) {
+			percpu_ref_switch_to_percpu(pte->pte_ref);
+			__pte_put(pte);
+		}
+	}
+}
+
 #endif /* CONFIG_FREE_USER_PTE */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (12 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case Qi Zheng
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Immediately after a successful MADV_DONTNEED operation, the
physical pages have been unmapped from the PTE page table
entries, so this is a good time to call try_to_free_user_pte()
and try to free the PTE page table page itself.
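
For reference, a minimal userspace sequence that exercises this path
could look like the sketch below (illustrative only; it assumes a
4K-page x86-64 system, where one PTE page table page maps 2M):

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2UL << 20;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;

		/* Fault in the pages, allocating PTE page table pages. */
		memset(buf, 1, len);

		/* Zap the PTEs; the now-empty PTE pages can then be freed. */
		madvise(buf, len, MADV_DONTNEED);
		return 0;
	}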

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/internal.h |  3 ++-
 mm/memory.c   | 43 +++++++++++++++++++++++++++++--------------
 mm/oom_kill.c |  3 ++-
 3 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index cf16280ce132..f93a9170d2e3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -77,7 +77,8 @@ struct zap_details;
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
-			     struct zap_details *details);
+			     struct zap_details *details,
+			     bool free_pte);
 
 void page_cache_ra_order(struct readahead_control *, struct file_ra_state *,
 		unsigned int order);
diff --git a/mm/memory.c b/mm/memory.c
index aa2bac561d5e..75a0e16a095a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1339,7 +1339,8 @@ static inline bool should_zap_page(struct zap_details *details, struct page *pag
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
-				struct zap_details *details)
+				struct zap_details *details,
+				bool free_pte)
 {
 	struct mm_struct *mm = tlb->mm;
 	int force_flush = 0;
@@ -1348,6 +1349,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	unsigned long start = addr;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1455,13 +1457,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto again;
 	}
 
+	if (free_pte)
+		try_to_free_user_pte(mm, pmd, start, true);
+
 	return addr;
 }
 
 static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
-				struct zap_details *details)
+				struct zap_details *details,
+				bool free_pte)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -1496,7 +1502,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		 */
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			goto next;
-		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+		next = zap_pte_range(tlb, vma, pmd, addr, next, details,
+				     free_pte);
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
@@ -1507,7 +1514,8 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, p4d_t *p4d,
 				unsigned long addr, unsigned long end,
-				struct zap_details *details)
+				struct zap_details *details,
+				bool free_pte)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -1525,7 +1533,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 		}
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		next = zap_pmd_range(tlb, vma, pud, addr, next, details);
+		next = zap_pmd_range(tlb, vma, pud, addr, next, details,
+				     free_pte);
 next:
 		cond_resched();
 	} while (pud++, addr = next, addr != end);
@@ -1536,7 +1545,8 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
-				struct zap_details *details)
+				struct zap_details *details,
+				bool free_pte)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -1546,7 +1556,8 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		next = zap_pud_range(tlb, vma, p4d, addr, next, details);
+		next = zap_pud_range(tlb, vma, p4d, addr, next, details,
+				     free_pte);
 	} while (p4d++, addr = next, addr != end);
 
 	return addr;
@@ -1555,7 +1566,8 @@ static inline unsigned long zap_p4d_range(struct mmu_gather *tlb,
 void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
-			     struct zap_details *details)
+			     struct zap_details *details,
+			     bool free_pte)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -1567,7 +1579,8 @@ void unmap_page_range(struct mmu_gather *tlb,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
+		next = zap_p4d_range(tlb, vma, pgd, addr, next, details,
+				     free_pte);
 	} while (pgd++, addr = next, addr != end);
 	tlb_end_vma(tlb, vma);
 }
@@ -1576,7 +1589,8 @@ void unmap_page_range(struct mmu_gather *tlb,
 static void unmap_single_vma(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr,
-		struct zap_details *details)
+		struct zap_details *details,
+		bool free_pte)
 {
 	unsigned long start = max(vma->vm_start, start_addr);
 	unsigned long end;
@@ -1612,7 +1626,8 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 				i_mmap_unlock_write(vma->vm_file->f_mapping);
 			}
 		} else
-			unmap_page_range(tlb, vma, start, end, details);
+			unmap_page_range(tlb, vma, start, end, details,
+					 free_pte);
 	}
 }
 
@@ -1644,7 +1659,7 @@ void unmap_vmas(struct mmu_gather *tlb,
 				start_addr, end_addr);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
-		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL, false);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
@@ -1669,7 +1684,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
 	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
-		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+		unmap_single_vma(&tlb, vma, start, range.end, NULL, true);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
@@ -1695,7 +1710,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
-	unmap_single_vma(&tlb, vma, address, range.end, details);
+	unmap_single_vma(&tlb, vma, address, range.end, details, true);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7ec38194f8e1..c4c25a7add7b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -549,7 +549,8 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 				ret = false;
 				continue;
 			}
-			unmap_page_range(&tlb, vma, range.start, range.end, NULL);
+			unmap_page_range(&tlb, vma, range.start, range.end,
+					 NULL, false);
 			mmu_notifier_invalidate_range_end(&range);
 			tlb_finish_mmu(&tlb);
 		}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (13 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper Qi Zheng
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Unlike the MADV_DONTNEED case, MADV_FREE just marks the physical page as
lazyfree instead of unmapping it immediately; the physical page will not
be unmapped until system memory becomes tight. So we convert the
percpu_ref of the related user PTE page table page to atomic mode in
madvise_free_pte_range(), and then check whether it is 0 on the reclaim
path (via try_to_unmap_one()). If it is 0, we can safely reclaim the PTE
page table page at that point.
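
Condensed from the diff below, the two halves of the flow look roughly like
this (locking, RCU and the PVMW_MADV_FREE flag plumbing are omitted):

	/* madvise(MADV_FREE): keep the pte_ref in atomic mode if any page
	 * was marked lazyfree, so that reclaim can observe it later */
	madvise_free_pte_range()
		...
		mark_page_lazyfree(page);
		have_lazyfree = true;
		...
		pte_unmap_unlock(orig_pte, ptl);
		try_to_free_user_pte(mm, pmd, start, !have_lazyfree);

	/* reclaim: once the last entry of the range has been unmapped, the
	 * pte_ref can be observed as zero and the PTE table can be freed */
	page_vma_mapped_walk()
		...
		pmdval = READ_ONCE(*pvmw->pmd);
		if (!pmd_none(pmdval) && !pmd_leaf(pmdval) &&
		    percpu_ref_is_zero(pmd_pgtable(pmdval)->pte_ref))
			free_user_pte(mm, pvmw->pmd, pvmw->address);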

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/rmap.h |  2 ++
 mm/madvise.c         |  7 ++++++-
 mm/page_vma_mapped.c | 46 ++++++++++++++++++++++++++++++++++++++++++--
 mm/rmap.c            |  9 +++++++++
 4 files changed, 61 insertions(+), 3 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 17230c458341..a3174d3bf118 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -204,6 +204,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Used for MADV_FREE page */
+#define PVMW_MADV_FREE		(1 << 2)
 
 struct page_vma_mapped_walk {
 	unsigned long pfn;
diff --git a/mm/madvise.c b/mm/madvise.c
index 8123397f14c8..bd4bcaad5a9f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -598,7 +598,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *orig_pte, *pte, ptent;
 	struct page *page;
 	int nr_swap = 0;
+	bool have_lazyfree = false;
 	unsigned long next;
+	unsigned long start = addr;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -709,6 +711,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 		}
 		mark_page_lazyfree(page);
+		have_lazyfree = true;
 	}
 out:
 	if (nr_swap) {
@@ -718,8 +721,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
 	}
 	arch_leave_lazy_mmu_mode();
-	if (orig_pte)
+	if (orig_pte) {
 		pte_unmap_unlock(orig_pte, ptl);
+		try_to_free_user_pte(mm, pmd, start, !have_lazyfree);
+	}
 	cond_resched();
 next:
 	return 0;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8ecf8fd7cf5e..00bc09f57f48 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -266,8 +266,30 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 next_pte:
 		do {
 			pvmw->address += PAGE_SIZE;
-			if (pvmw->address >= end)
-				return not_found(pvmw);
+			if (pvmw->address >= end) {
+				not_found(pvmw);
+
+				if (pvmw->flags & PVMW_MADV_FREE) {
+					pgtable_t pte;
+					pmd_t pmdval;
+
+					pvmw->flags &= ~PVMW_MADV_FREE;
+					rcu_read_lock();
+					pmdval = READ_ONCE(*pvmw->pmd);
+					if (pmd_none(pmdval) || pmd_leaf(pmdval)) {
+						rcu_read_unlock();
+						return false;
+					}
+					pte = pmd_pgtable(pmdval);
+					if (percpu_ref_is_zero(pte->pte_ref)) {
+						rcu_read_unlock();
+						free_user_pte(mm, pvmw->pmd, pvmw->address);
+					} else {
+						rcu_read_unlock();
+					}
+				}
+				return false;
+			}
 			/* Did we cross page table boundary? */
 			if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
 				if (pvmw->ptl) {
@@ -275,6 +297,26 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 					pvmw->ptl = NULL;
 				}
 				pte_unmap(pvmw->pte);
+				if (pvmw->flags & PVMW_MADV_FREE) {
+					pgtable_t pte;
+					pmd_t pmdval;
+
+					pvmw->flags &= ~PVMW_MADV_FREE;
+					rcu_read_lock();
+					pmdval = READ_ONCE(*pvmw->pmd);
+					if (pmd_none(pmdval) || pmd_leaf(pmdval)) {
+						rcu_read_unlock();
+						pvmw->pte = NULL;
+						goto restart;
+					}
+					pte = pmd_pgtable(pmdval);
+					if (percpu_ref_is_zero(pte->pte_ref)) {
+						rcu_read_unlock();
+						free_user_pte(mm, pvmw->pmd, pvmw->address);
+					} else {
+						rcu_read_unlock();
+					}
+				}
 				pvmw->pte = NULL;
 				goto restart;
 			}
diff --git a/mm/rmap.c b/mm/rmap.c
index fedb82371efe..f978d324d4f9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1616,6 +1616,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);
 					dec_mm_counter(mm, MM_ANONPAGES);
+					if (IS_ENABLED(CONFIG_FREE_USER_PTE))
+						pvmw.flags |= PVMW_MADV_FREE;
 					goto discard;
 				}
 
@@ -1627,6 +1629,13 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				folio_set_swapbacked(folio);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
+				if (IS_ENABLED(CONFIG_FREE_USER_PTE) &&
+				    pte_tryget(mm, pvmw.pmd, address)) {
+					pgtable_t pte_page = pmd_pgtable(*pvmw.pmd);
+
+					percpu_ref_switch_to_percpu(pte_page->pte_ref);
+					__pte_put(pte_page);
+				}
 				break;
 			}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (14 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref Qi Zheng
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

track_pte_set() is used to track the setting of a PTE page table entry:
the percpu_ref of the PTE page table page is incremented when the entry
changes from pte_none() to !pte_none().

track_pte_clear() is used to track the clearing of a PTE page table
entry: the percpu_ref of the PTE page table page is decremented when the
entry changes from !pte_none() to pte_none().

In this way, the usage of a PTE page table page can be tracked through
its percpu_ref.
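
A sketch of the expected hook placement on the arch side, mirroring the
x86_64 wiring added in a later patch of this series (the existing
page_table_check hooks are omitted here):

	static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
				      pte_t *ptep, pte_t pte)
	{
		/* takes a pte_ref when *ptep goes from pte_none() to !pte_none() */
		track_pte_set(mm, addr, ptep, pte);
		set_pte(ptep, pte);
	}

	static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
					       unsigned long addr, pte_t *ptep)
	{
		pte_t pte = native_ptep_get_and_clear(ptep);

		/* drops a pte_ref when a !pte_none() entry is cleared */
		track_pte_clear(mm, addr, ptep, pte);
		return pte;
	}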

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pte_ref.h | 14 ++++++++++++++
 mm/pte_ref.c            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index 379c3b45a6ab..6ab740e1b989 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -18,6 +18,10 @@ void __pte_put(pgtable_t page);
 void pte_put(pte_t *ptep);
 void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
 			  bool switch_back);
+void track_pte_set(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		   pte_t pte);
+void track_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte);
 
 #else /* !CONFIG_FREE_USER_PTE */
 
@@ -54,6 +58,16 @@ static inline void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd,
 {
 }
 
+static inline void track_pte_set(struct mm_struct *mm, unsigned long addr,
+				 pte_t *ptep, pte_t pte)
+{
+}
+
+static inline void track_pte_clear(struct mm_struct *mm, unsigned long addr,
+				   pte_t *ptep, pte_t pte)
+{
+}
+
 #endif /* CONFIG_FREE_USER_PTE */
 
 #endif /* _LINUX_PTE_REF_H */
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index bf9629272c71..e92510deda0b 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -197,4 +197,34 @@ void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
 	}
 }
 
+void track_pte_set(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		   pte_t pte)
+{
+	pgtable_t page;
+
+	if (&init_mm == mm || pte_huge(pte))
+		return;
+
+	page = pte_to_page(ptep);
+	BUG_ON(percpu_ref_is_zero(page->pte_ref));
+	if (pte_none(*ptep) && !pte_none(pte))
+		percpu_ref_get(page->pte_ref);
+}
+EXPORT_SYMBOL(track_pte_set);
+
+void track_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
+		     pte_t pte)
+{
+	pgtable_t page;
+
+	if (&init_mm == mm || pte_huge(pte))
+		return;
+
+	page = pte_to_page(ptep);
+	BUG_ON(percpu_ref_is_zero(page->pte_ref));
+	if (!pte_none(pte))
+		percpu_ref_put(page->pte_ref);
+}
+EXPORT_SYMBOL(track_pte_clear);
+
 #endif /* CONFIG_FREE_USER_PTE */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (15 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-29 13:35 ` [RFC PATCH 18/18] Documentation: add document " Qi Zheng
  2022-05-17  8:30 ` [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
  18 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Add pte_ref hooks into the routines that modify user PTE page tables,
and select ARCH_SUPPORTS_FREE_USER_PTE, so that the pte_ref code can be
compiled and work on this architecture.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 arch/x86/Kconfig               | 1 +
 arch/x86/include/asm/pgtable.h | 7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b0142e01002e..c1046fc15882 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
 	select SWIOTLB
 	select ARCH_HAS_ELFCORE_COMPAT
 	select ZONE_DMA32
+	select ARCH_SUPPORTS_FREE_USER_PTE
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..08d0aa5ce8d4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -23,6 +23,7 @@
 #include <asm/coco.h>
 #include <asm-generic/pgtable_uffd.h>
 #include <linux/page_table_check.h>
+#include <linux/pte_ref.h>
 
 extern pgd_t early_top_pgt[PTRS_PER_PGD];
 bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
@@ -1010,6 +1011,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t pte)
 {
 	page_table_check_pte_set(mm, addr, ptep, pte);
+	track_pte_set(mm, addr, ptep, pte);
 	set_pte(ptep, pte);
 }
 
@@ -1055,6 +1057,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 {
 	pte_t pte = native_ptep_get_and_clear(ptep);
 	page_table_check_pte_clear(mm, addr, pte);
+	track_pte_clear(mm, addr, ptep, pte);
 	return pte;
 }
 
@@ -1071,6 +1074,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 		 */
 		pte = native_local_ptep_get_and_clear(ptep);
 		page_table_check_pte_clear(mm, addr, pte);
+		track_pte_clear(mm, addr, ptep, pte);
 	} else {
 		pte = ptep_get_and_clear(mm, addr, ptep);
 	}
@@ -1081,7 +1085,8 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
 static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep)
 {
-	if (IS_ENABLED(CONFIG_PAGE_TABLE_CHECK))
+	if (IS_ENABLED(CONFIG_PAGE_TABLE_CHECK)
+	    || IS_ENABLED(CONFIG_FREE_USER_PTE))
 		ptep_get_and_clear(mm, addr, ptep);
 	else
 		pte_clear(mm, addr, ptep);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 18/18] Documentation: add document for pte_ref
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (16 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref Qi Zheng
@ 2022-04-29 13:35 ` Qi Zheng
  2022-04-30 13:19   ` Bagas Sanjaya
  2022-05-17  8:30 ` [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
  18 siblings, 1 reply; 27+ messages in thread
From: Qi Zheng @ 2022-04-29 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

This commit adds a document for pte_ref under `Documentation/vm/`.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/vm/index.rst   |   1 +
 Documentation/vm/pte_ref.rst | 210 +++++++++++++++++++++++++++++++++++
 2 files changed, 211 insertions(+)
 create mode 100644 Documentation/vm/pte_ref.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..ee71baccc2e7 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -31,6 +31,7 @@ algorithms.  If you are looking for advice on simply allocating memory, see the
    page_frags
    page_owner
    page_table_check
+   pte_ref
    remap_file_pages
    slub
    split_page_table_lock
diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
new file mode 100644
index 000000000000..0ac1e5a408d7
--- /dev/null
+++ b/Documentation/vm/pte_ref.rst
@@ -0,0 +1,210 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================================
+pte_ref: Tracking references to each user PTE page table page
+============================================================================
+
+Preface
+=======
+
+Now in order to pursue high performance, applications mostly use some
+high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
+These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
+physical memory for the following reasons::
+
+ First of all, we should hold as few write locks of mmap_lock as possible,
+ since the mmap_lock semaphore has long been a contention point in the
+ memory management subsystem. The mmap()/munmap() hold the write lock, and
+ the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
+ madvise() instead of munmap() to released physical memory can reduce the
+ competition of the mmap_lock.
+
+ Secondly, after using madvise() to release physical memory, there is no
+ need to build vma and allocate page tables again when accessing the same
+ virtual address again, which can also save some time.
+
+The following is the largest user PTE page table memory that can be
+allocated by a single user process in a 32-bit and a 64-bit system.
+
++---------------------------+--------+---------+
+|                           | 32-bit | 64-bit  |
++===========================+========+=========+
+| user PTE page table pages | 3 MiB  | 512 GiB |
++---------------------------+--------+---------+
+| user PMD page table pages | 3 KiB  | 1 GiB   |
++---------------------------+--------+---------+
+
+(for 32-bit, take 3G user address space, 4K page size as an example;
+ for 64-bit, take 48-bit address width, 4K page size as an example.)
+
+After using madvise(), everything looks good, but as can be seen from the
+above table, a single process can create a large number of PTE page tables
+on a 64-bit system, since neither MADV_DONTNEED nor MADV_FREE releases page
+table memory. And before the process exits or calls munmap(), the kernel
+cannot reclaim these pages even if these PTE page tables no longer map
+anything.
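+
+The problem is easy to reproduce with a small illustrative test program (not
+part of this series) that maps a region, touches it, and then frees the
+physical memory with madvise(MADV_DONTNEED)::
+
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <string.h>
+ #include <unistd.h>
+ #include <sys/mman.h>
+
+ int main(void)
+ {
+         size_t size = 1UL << 30;   /* 1 GiB of virtual address space */
+         char cmd[64];
+         char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+         if (buf == MAP_FAILED)
+                 return 1;
+
+         memset(buf, 1, size);                /* populate the PTE entries */
+         madvise(buf, size, MADV_DONTNEED);   /* free only the physical pages */
+
+         /* VmRSS drops back, but VmPTE still accounts the empty PTE tables */
+         snprintf(cmd, sizeof(cmd), "grep -E 'VmRSS|VmPTE' /proc/%d/status",
+                  getpid());
+         system(cmd);
+         return 0;
+ }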
+
+To fix this situation, we introduce a reference count for each user PTE page
+table page. Then we can track whether a user PTE page table page is in use
+and reclaim the user PTE page table pages that do not map anything at the
+right time.
+
+Introduction
+============
+
+The ``pte_ref``, which is the reference count of a user PTE page table page,
+is of type ``percpu_ref``. It is used to track the usage of each user PTE
+page table page.
+
+Who will hold the pte_ref?
+--------------------------
+
+The following will hold a pte_ref::
+
+ Any !pte_none() entry, such as a regular page table entry that maps a
+ physical page, a swap entry, a migration entry, etc.
+
+ Any visitor to the PTE page table entries, such as a page table walker.
+
+Both ``!pte_none()`` entries and visitors can be regarded as users of the PTE
+page table page. When the pte_ref drops to 0, no one is using the PTE page
+table page any more, and this empty PTE page table page can then be
+reclaimed.
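+
+A minimal sketch of the visitor side, using the helpers introduced in this
+series (error handling and the surrounding walker context are omitted)::
+
+ if (pte_tryget(mm, pmd, addr)) {
+         pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+
+         /* the PTE page table page cannot be freed while the pte_ref is held */
+         /* ... walk the PTE entries ... */
+
+         pte_unmap_unlock(pte, ptl);
+         pte_put(pte);
+ }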
+
+About mode switching
+--------------------
+
+When a user PTE page table page is allocated, its ``pte_ref`` is initialized
+in percpu mode, which introduces essentially no performance overhead. When we
+want to reclaim the PTE page, the ``pte_ref`` is switched to atomic mode and
+then checked against zero (a condensed sketch follows the list below)::
+
+ - If it is zero, we can safely reclaim the page immediately;
+ - If it is not zero but we expect that the PTE page can be reclaimed
+   automatically when no one is using it, we can keep its ``pte_ref`` in
+   atomic mode (e.g. the MADV_FREE case);
+ - If it is not zero and we will simply retry at the next opportunity, we can
+   switch the ``pte_ref`` back to percpu mode (e.g. the MADV_DONTNEED case).
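+
+A condensed sketch of this decision, as implemented by try_to_free_user_pte()
+in this series (RCU protection and locking details are omitted)::
+
+ if (!pte_tryget(mm, pmd, addr))
+         return;
+ pte = pmd_pgtable(*pmd);
+ percpu_ref_switch_to_atomic_sync(pte->pte_ref);
+ __pte_put(pte);
+ if (percpu_ref_is_zero(pte->pte_ref)) {
+         free_user_pte(mm, pmd, addr & PMD_MASK);  /* reclaim immediately */
+ } else if (switch_back && pte_tryget(mm, pmd, addr)) {
+         /* retry at the next opportunity (MADV_DONTNEED) */
+         percpu_ref_switch_to_percpu(pte->pte_ref);
+         __pte_put(pte);
+ }
+ /* otherwise the pte_ref stays in atomic mode (MADV_FREE) */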
+
+Competitive relationship
+------------------------
+
+Currently, user page tables are only released by ``free_pgtables()``, either
+when the process exits or when ``unmap_region()`` is called (e.g. on the
+``munmap()`` path). So other threads only need to ensure mutual exclusion
+against these paths to guarantee that the page table is not released. For
+example::
+
+	thread A			thread B
+	page table walker		munmap
+	=================		======
+
+	mmap_read_lock()
+	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+		pte_offset_map_lock()
+		*walk page table*
+		pte_unmap_unlock()
+	}
+	mmap_read_unlock()
+
+					mmap_write_lock_killable()
+					detach_vmas_to_be_unmapped()
+					unmap_region()
+					--> free_pgtables()
+
+But after we introduce the ``pte_ref`` for the user PTE page table page, this
+existing balance is broken: the page can be released as soon as its
+``pte_ref`` drops to 0. Therefore, the following case may happen::
+
+	thread A		thread B			thread C
+	page table walker	madvise(MADV_DONTNEED)		page fault
+	=================	======================		==========
+
+	mmap_read_lock()
+	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+
+				mmap_read_lock()
+				unmap_page_range()
+				--> zap_pte_range()
+				    /* the pte_ref is reduced to 0 */
+				    --> free PTE page table page
+
+								mmap_read_lock()
+								/* may allocate
+								 * a new huge
+								 * pmd or a new
+								 * PTE page
+								 */
+
+		/* broken!! */
+		pte_offset_map_lock()
+
+As we can see, threads A, B and C all hold the read lock of mmap_lock, so they
+can execute concurrently. When thread B releases the PTE page table page, the
+value in the corresponding pmd entry becomes unstable: it may be none, a huge
+pmd, or map a new PTE page table page again. This can corrupt the system and
+even cause a panic.
+
+So, as described in the section "Who will hold the pte_ref?", the page table
+walker (visitor) also needs to take a ``pte_ref`` on the user PTE page table
+page before walking the page table (the helper ``pte_tryget_map{_lock}()``
+does this for us), after which the system becomes orderly again::
+
+	thread A		thread B
+	page table walker	madvise(MADV_DONTNEED)
+	=================	======================
+
+	mmap_read_lock()
+	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+		pte_tryget()
+		--> percpu_ref_tryget
+		*if successfully, then:*
+
+				mmap_read_lock()
+				unmap_page_range()
+				--> zap_pte_range()
+				    /* the pte_refcount is reduced to 1 */
+
+		pte_offset_map_lock()
+		*walk page table*
+		pte_unmap_unlock()
+
+There is also a lock-less scenario(such as fast GUP). Fortunately, we don't need
+to do any additional operations to ensure that the system is in order. Take fast
+GUP as an example::
+
+	thread A		thread B
+	fast GUP		madvise(MADV_DONTNEED)
+	========		======================
+
+	get_user_pages_fast_only()
+	--> local_irq_save();
+				call_rcu(pte_free_rcu)
+	    gup_pgd_range();
+	    local_irq_restore();
+	    			/* do pte_free_rcu() */
+
+Helpers
+=======
+
++----------------------+------------------------------------------------+
+| pte_ref_init         | Initialize the pte_ref                         |
++----------------------+------------------------------------------------+
+| pte_ref_free         | Free the pte_ref                               |
++----------------------+------------------------------------------------+
+| pte_tryget           | Try to hold a pte_ref                          |
++----------------------+------------------------------------------------+
+| pte_put              | Decrement a pte_ref                            |
++----------------------+------------------------------------------------+
+| pte_tryget_map       | Do pte_tryget and pte_offset_map               |
++----------------------+------------------------------------------------+
+| pte_tryget_map_lock  | Do pte_tryget and pte_offset_map_lock          |
++----------------------+------------------------------------------------+
+| free_user_pte        | Free the user PTE page table page              |
++----------------------+------------------------------------------------+
+| try_to_free_user_pte | Try to free the user PTE page table page       |
++----------------------+------------------------------------------------+
+| track_pte_set        | Track the setting of user PTE page table page  |
++----------------------+------------------------------------------------+
+| track_pte_clear      | Track the clearing of user PTE page table page |
++----------------------+------------------------------------------------+
+
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 18/18] Documentation: add document for pte_ref
  2022-04-29 13:35 ` [RFC PATCH 18/18] Documentation: add document " Qi Zheng
@ 2022-04-30 13:19   ` Bagas Sanjaya
  2022-04-30 13:32     ` Qi Zheng
  0 siblings, 1 reply; 27+ messages in thread
From: Bagas Sanjaya @ 2022-04-30 13:19 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming

Hi Qi,

On Fri, Apr 29, 2022 at 09:35:52PM +0800, Qi Zheng wrote:
> +Now in order to pursue high performance, applications mostly use some
> +high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
> +These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
> +physical memory for the following reasons::
> +
> + First of all, we should hold as few write locks of mmap_lock as possible,
> + since the mmap_lock semaphore has long been a contention point in the
> + memory management subsystem. The mmap()/munmap() hold the write lock, and
> + the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
> + madvise() instead of munmap() to released physical memory can reduce the
> + competition of the mmap_lock.
> +
> + Secondly, after using madvise() to release physical memory, there is no
> + need to build vma and allocate page tables again when accessing the same
> + virtual address again, which can also save some time.
> +

I think we can use enumerated list, like below:

-- >8 --

diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
index 0ac1e5a408d7c6..67b18e74fcb367 100644
--- a/Documentation/vm/pte_ref.rst
+++ b/Documentation/vm/pte_ref.rst
@@ -10,18 +10,18 @@ Preface
 Now in order to pursue high performance, applications mostly use some
 high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
 These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
-physical memory for the following reasons::
-
- First of all, we should hold as few write locks of mmap_lock as possible,
- since the mmap_lock semaphore has long been a contention point in the
- memory management subsystem. The mmap()/munmap() hold the write lock, and
- the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
- madvise() instead of munmap() to released physical memory can reduce the
- competition of the mmap_lock.
-
- Secondly, after using madvise() to release physical memory, there is no
- need to build vma and allocate page tables again when accessing the same
- virtual address again, which can also save some time.
+physical memory for the following reasons:
+
+1. We should hold as few write locks of mmap_lock as possible,
+   since the mmap_lock semaphore has long been a contention point in the
+   memory management subsystem. The mmap()/munmap() hold the write lock, and
+   the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
+   madvise() instead of munmap() to released physical memory can reduce the
+   competition of the mmap_lock.
+
+2. After using madvise() to release physical memory, there is no
+   need to build vma and allocate page tables again when accessing the same
+   virtual address again, which can also save some time.
 
 The following is the largest user PTE page table memory that can be
 allocated by a single user process in a 32-bit and a 64-bit system.

> +The following is the largest user PTE page table memory that can be
> +allocated by a single user process in a 32-bit and a 64-bit system.
> +

We can say "assuming 4K page size" here,

> ++---------------------------+--------+---------+
> +|                           | 32-bit | 64-bit  |
> ++===========================+========+=========+
> +| user PTE page table pages | 3 MiB  | 512 GiB |
> ++---------------------------+--------+---------+
> +| user PMD page table pages | 3 KiB  | 1 GiB   |
> ++---------------------------+--------+---------+
> +
> +(for 32-bit, take 3G user address space, 4K page size as an example;
> + for 64-bit, take 48-bit address width, 4K page size as an example.)
> +

... instead of here.

> +There is also a lock-less scenario(such as fast GUP). Fortunately, we don't need
> +to do any additional operations to ensure that the system is in order. Take fast
> +GUP as an example::
> +
> +	thread A		thread B
> +	fast GUP		madvise(MADV_DONTNEED)
> +	========		======================
> +
> +	get_user_pages_fast_only()
> +	--> local_irq_save();
> +				call_rcu(pte_free_rcu)
> +	    gup_pgd_range();
> +	    local_irq_restore();
> +	    			/* do pte_free_rcu() */
> +

I see whitespace warning circa do pte_free_rcu() line above when
applying this series.

Thanks.

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 18/18] Documentation: add document for pte_ref
  2022-04-30 13:19   ` Bagas Sanjaya
@ 2022-04-30 13:32     ` Qi Zheng
  0 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-30 13:32 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg, tj,
	dennis, ming.lei, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming



On 2022/4/30 9:19 PM, Bagas Sanjaya wrote:
> Hi Qi,
> 
> On Fri, Apr 29, 2022 at 09:35:52PM +0800, Qi Zheng wrote:
>> +Now in order to pursue high performance, applications mostly use some
>> +high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
>> +These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
>> +physical memory for the following reasons::
>> +
>> + First of all, we should hold as few write locks of mmap_lock as possible,
>> + since the mmap_lock semaphore has long been a contention point in the
>> + memory management subsystem. The mmap()/munmap() hold the write lock, and
>> + the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
>> + madvise() instead of munmap() to released physical memory can reduce the
>> + competition of the mmap_lock.
>> +
>> + Secondly, after using madvise() to release physical memory, there is no
>> + need to build vma and allocate page tables again when accessing the same
>> + virtual address again, which can also save some time.
>> +
> 
> I think we can use enumerated list, like below:

Thanks for your review, LGTM, will do.

> 
> -- >8 --
> 
> diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
> index 0ac1e5a408d7c6..67b18e74fcb367 100644
> --- a/Documentation/vm/pte_ref.rst
> +++ b/Documentation/vm/pte_ref.rst
> @@ -10,18 +10,18 @@ Preface
>   Now in order to pursue high performance, applications mostly use some
>   high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
>   These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
> -physical memory for the following reasons::
> -
> - First of all, we should hold as few write locks of mmap_lock as possible,
> - since the mmap_lock semaphore has long been a contention point in the
> - memory management subsystem. The mmap()/munmap() hold the write lock, and
> - the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
> - madvise() instead of munmap() to released physical memory can reduce the
> - competition of the mmap_lock.
> -
> - Secondly, after using madvise() to release physical memory, there is no
> - need to build vma and allocate page tables again when accessing the same
> - virtual address again, which can also save some time.
> +physical memory for the following reasons:
> +
> +1. We should hold as few write locks of mmap_lock as possible,
> +   since the mmap_lock semaphore has long been a contention point in the
> +   memory management subsystem. The mmap()/munmap() hold the write lock, and
> +   the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
> +   madvise() instead of munmap() to released physical memory can reduce the
> +   competition of the mmap_lock.
> +
> +2. After using madvise() to release physical memory, there is no
> +   need to build vma and allocate page tables again when accessing the same
> +   virtual address again, which can also save some time.
>   
>   The following is the largest user PTE page table memory that can be
>   allocated by a single user process in a 32-bit and a 64-bit system.
> 
>> +The following is the largest user PTE page table memory that can be
>> +allocated by a single user process in a 32-bit and a 64-bit system.
>> +
> 
> We can say "assuming 4K page size" here,
> 
>> ++---------------------------+--------+---------+
>> +|                           | 32-bit | 64-bit  |
>> ++===========================+========+=========+
>> +| user PTE page table pages | 3 MiB  | 512 GiB |
>> ++---------------------------+--------+---------+
>> +| user PMD page table pages | 3 KiB  | 1 GiB   |
>> ++---------------------------+--------+---------+
>> +
>> +(for 32-bit, take 3G user address space, 4K page size as an example;
>> + for 64-bit, take 48-bit address width, 4K page size as an example.)
>> +
> 
> ... instead of here.

will do.

> 
>> +There is also a lock-less scenario(such as fast GUP). Fortunately, we don't need
>> +to do any additional operations to ensure that the system is in order. Take fast
>> +GUP as an example::
>> +
>> +	thread A		thread B
>> +	fast GUP		madvise(MADV_DONTNEED)
>> +	========		======================
>> +
>> +	get_user_pages_fast_only()
>> +	--> local_irq_save();
>> +				call_rcu(pte_free_rcu)
>> +	    gup_pgd_range();
>> +	    local_irq_restore();
>> +	    			/* do pte_free_rcu() */
>> +
> 
> I see whitespace warning circa do pte_free_rcu() line above when
> applying this series.

will fix.

Thanks,
Qi

> 
> Thanks.
> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper
  2022-04-29 13:35 ` [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Qi Zheng
@ 2022-04-30 13:35   ` Qi Zheng
  0 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-04-30 13:35 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, david, jgg, tj, dennis, ming.lei
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming



On 2022/4/29 9:35 PM, Qi Zheng wrote:
> Normally, the percpu_ref of the user PTE page table page is in
> percpu mode. This patch add try_to_free_user_pte() to switch
> the percpu_ref to atomic mode and check if it is 0. If the
> percpu_ref is 0, which means that no one is using the user PTE
> page table page, then we can safely reclaim it.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>   include/linux/pte_ref.h |  7 +++
>   mm/pte_ref.c            | 99 ++++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 104 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
> index bfe620038699..379c3b45a6ab 100644
> --- a/include/linux/pte_ref.h
> +++ b/include/linux/pte_ref.h
> @@ -16,6 +16,8 @@ void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
>   bool pte_tryget(struct mm_struct *mm, pmd_t *pmd, unsigned long addr);
>   void __pte_put(pgtable_t page);
>   void pte_put(pte_t *ptep);
> +void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +			  bool switch_back);
>   
>   #else /* !CONFIG_FREE_USER_PTE */
>   
> @@ -47,6 +49,11 @@ static inline void pte_put(pte_t *ptep)
>   {
>   }
>   
> +static inline void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd,
> +					unsigned long addr, bool switch_back)
> +{
> +}
> +
>   #endif /* CONFIG_FREE_USER_PTE */
>   
>   #endif /* _LINUX_PTE_REF_H */
> diff --git a/mm/pte_ref.c b/mm/pte_ref.c
> index 5b382445561e..bf9629272c71 100644
> --- a/mm/pte_ref.c
> +++ b/mm/pte_ref.c
> @@ -8,6 +8,9 @@
>   #include <linux/pte_ref.h>
>   #include <linux/percpu-refcount.h>
>   #include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +#include <asm/tlbflush.h>
> +#include <asm/pgalloc.h>
>   
>   #ifdef CONFIG_FREE_USER_PTE
>   
> @@ -44,8 +47,6 @@ void pte_ref_free(pgtable_t pte)
>   	kfree(ref);
>   }
>   
> -void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr) {}
> -
>   /*
>    * pte_tryget - try to get the pte_ref of the user PTE page table page
>    * @mm: pointer the target address space
> @@ -102,4 +103,98 @@ void pte_put(pte_t *ptep)
>   }
>   EXPORT_SYMBOL(pte_put);
>   
> +#ifdef CONFIG_DEBUG_VM
> +void pte_free_debug(pmd_t pmd)
> +{
> +	pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd);
> +	int i = 0;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++)
> +		BUG_ON(!pte_none(*ptep++));
> +}
> +#else
> +static inline void pte_free_debug(pmd_t pmd)
> +{
> +}
> +#endif
> +
> +static inline void pte_free_rcu(struct rcu_head *rcu)
> +{
> +	struct page *page = container_of(rcu, struct page, rcu_head);
> +
> +	pgtable_pte_page_dtor(page);
> +	__free_page(page);
> +}
> +
> +/*
> + * free_user_pte - free the user PTE page table page
> + * @mm: pointer the target address space
> + * @pmd: pointer to a PMD
> + * @addr: start address of the tlb range to be flushed
> + *
> + * Context: The pmd range has been unmapped and TLB purged. And the user PTE
> + *	    page table page will be freed by rcu handler.
> + */
> +void free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	spinlock_t *ptl;
> +	pmd_t pmdval;
> +
> +	ptl = pmd_lock(mm, pmd);
> +	pmdval = *pmd;
> +	if (pmd_none(pmdval) || pmd_leaf(pmdval)) {
> +		spin_unlock(ptl);
> +		return;
> +	}
> +	pmd_clear(pmd);
> +	flush_tlb_range(&vma, addr, addr + PMD_SIZE);
> +	spin_unlock(ptl);
> +
> +	pte_free_debug(pmdval);
> +	mm_dec_nr_ptes(mm);
> +	call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
> +}
> +
> +/*
> + * try_to_free_user_pte - try to free the user PTE page table page
> + * @mm: pointer the target address space
> + * @pmd: pointer to a PMD
> + * @addr: virtual address associated with pmd
> + * @switch_back: indicates if switching back to percpu mode is required
> + */
> +void try_to_free_user_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +			  bool switch_back)
> +{
> +	pgtable_t pte;
> +
> +	if (&init_mm == mm)
> +		return;
> +
> +	if (!pte_tryget(mm, pmd, addr))
> +		return;
> +	pte = pmd_pgtable(*pmd);
> +	percpu_ref_switch_to_atomic_sync(pte->pte_ref);
> +	rcu_read_lock();
> +	/*
> +	 * Here we can safely put the pte_ref because we already hold the rcu
> +	 * lock, which guarantees that the user PTE page table page will not
> +	 * be released.
> +	 */
> +	__pte_put(pte);
> +	if (percpu_ref_is_zero(pte->pte_ref)) {
> +		rcu_read_unlock();
> +		free_user_pte(mm, pmd, addr & PMD_MASK);
> +		return;
> +	}
> +	rcu_read_unlock();
> +
> +	if (switch_back) {
> +		if (pte_tryget(mm, pmd, addr)) {
> +			percpu_ref_switch_to_percpu(pte->pte_ref);
> +			__pte_put(pte);
> +		}
> +	}

We shouldn't switch back to percpu mode here; it will drastically reduce
performance.

> +}
> +
>   #endif /* CONFIG_FREE_USER_PTE */

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 00/18] Try to free user PTE page table pages
  2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
                   ` (17 preceding siblings ...)
  2022-04-29 13:35 ` [RFC PATCH 18/18] Documentation: add document " Qi Zheng
@ 2022-05-17  8:30 ` Qi Zheng
  2022-05-18 14:51   ` David Hildenbrand
  18 siblings, 1 reply; 27+ messages in thread
From: Qi Zheng @ 2022-05-17  8:30 UTC (permalink / raw)
  To: david
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming,
	akpm, tglx, kirill.shutemov, jgg, tj, dennis, ming.lei



On 2022/4/29 9:35 PM, Qi Zheng wrote:
> Hi,
> 
> This patch series aims to try to free user PTE page table pages when no one is
> using it.
> 
> The beginning of this story is that some malloc libraries(e.g. jemalloc or
> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those
> VAs. They will use madvise(MADV_DONTNEED) to free physical memory if they want.
> But the page tables do not be freed by madvise(), so it can produce many
> page tables when the process touches an enormous virtual address space.
> 
> The following figures are a memory usage snapshot of one process which actually
> happened on our server:
> 
>          VIRT:  55t
>          RES:   590g
>          VmPTE: 110g
> 
> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
> theory, the process only need 1.2g PTE page tables to map those physical
> memory. The reason why PTE page tables occupy a lot of memory is that
> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
> doesn't free the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above cases, we can save memory about 108g(best
> case). And the larger the difference between the size of VIRT and RES, the
> more memory we save.
> 
> In this patch series, we add a pte_ref field to the struct page of page table
> to track how many users of user PTE page table. Similar to the mechanism of page
> refcount, the user of PTE page table should hold a refcount to it before
> accessing. The user PTE page table page may be freed when the last refcount is
> dropped.
> 
> Different from the idea of another patchset of mine before[1], the pte_ref
> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
> entryies, and then release the user PTE page table page when checking that
> pte_ref is 0. The advantage of this is that there is basically no performance
> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
> code implementation of this patchset is much simpler and more portable than the
> another patchset[1].

Hi David,

I learned from the LWN article[1] that you led a session at the LSFMM on
the problems posed by the lack of page-table reclaim (And thank you very
much for mentioning some of my work in this direction). So I want to
know, what are the further plans of the community for this problem?

For the way of adding pte_ref to each PTE page table page, I currently
posted two versions: atomic count version[2] and percpu_ref version(This
patchset).

For the atomic count version:
- Advantage: PTE pages can be freed as soon as the reference count drops
              to 0.
- Disadvantage: The addition and subtraction of pte_ref are atomic
                 operations, which have a certain performance overhead,
                 but should not become a performance bottleneck until the
                 mmap_lock contention problem is resolved.

For the percpu_ref version:
- Advantage: In the percpu mode, the addition and subtraction of the
              pte_ref are all operations on local cpu variables, there
              is basically no performance overhead.
Disadvantage: Need to explicitly convert the pte_ref to atomic mode so
               that the unused PTE pages can be freed.

There are still many places to optimize the code implementation of these
two versions. But before I do further work, I would like to hear your
and the community's views and suggestions on these two versions.

Thanks,
Qi

[1]: https://lwn.net/Articles/893726 (Ways to reclaim unused page-table 
pages)
[2]: 
https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/

> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 00/18] Try to free user PTE page table pages
  2022-05-17  8:30 ` [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
@ 2022-05-18 14:51   ` David Hildenbrand
  2022-05-18 14:56     ` Matthew Wilcox
  2022-05-19  3:58     ` Qi Zheng
  0 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2022-05-18 14:51 UTC (permalink / raw)
  To: Qi Zheng
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming,
	akpm, tglx, kirill.shutemov, jgg, tj, dennis, ming.lei

On 17.05.22 10:30, Qi Zheng wrote:
> 
> 
> On 2022/4/29 9:35 PM, Qi Zheng wrote:
>> Hi,
>>
>> This patch series aims to try to free user PTE page table pages when no one is
>> using it.
>>
>> The beginning of this story is that some malloc libraries(e.g. jemalloc or
>> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those
>> VAs. They will use madvise(MADV_DONTNEED) to free physical memory if they want.
>> But the page tables do not be freed by madvise(), so it can produce many
>> page tables when the process touches an enormous virtual address space.
>>
>> The following figures are a memory usage snapshot of one process which actually
>> happened on our server:
>>
>>          VIRT:  55t
>>          RES:   590g
>>          VmPTE: 110g
>>
>> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
>> theory, the process only need 1.2g PTE page tables to map those physical
>> memory. The reason why PTE page tables occupy a lot of memory is that
>> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
>> doesn't free the PTE page table pages. So we can free those empty PTE page
>> tables to save memory. In the above cases, we can save memory about 108g(best
>> case). And the larger the difference between the size of VIRT and RES, the
>> more memory we save.
>>
>> In this patch series, we add a pte_ref field to the struct page of page table
>> to track how many users of user PTE page table. Similar to the mechanism of page
>> refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The user PTE page table page may be freed when the last refcount is
>> dropped.
>>
>> Different from the idea of another patchset of mine before[1], the pte_ref
>> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
>> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
>> entryies, and then release the user PTE page table page when checking that
>> pte_ref is 0. The advantage of this is that there is basically no performance
>> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
>> code implementation of this patchset is much simpler and more portable than the
>> another patchset[1].
> 
> Hi David,
> 
> I learned from the LWN article[1] that you led a session at the LSFMM on
> the problems posed by the lack of page-table reclaim (And thank you very
> much for mentioning some of my work in this direction). So I want to
> know, what are the further plans of the community for this problem?

Hi,

yes, I talked about the involved challenges, especially, how malicious
user space can trigger allocation of almost exclusively page tables and
essentially consume a lot of unmovable+unswappable memory and even store
secrets in the page table structure.

Empty PTE tables is one such case we care about, but there is more. Even
with your approach, we can still end up with many page tables that are
allocated on higher levels (e.g., PMD tables) or page tables that are
not empty (especially, filled with the shared zeropage).

Ideally, we'd have some mechanism that can reclaim also other
reclaimable page tables (e.g., filled with shared zeropage). One idea
was to add reclaimable page tables to the LRU list and to then
scan+reclaim them on demand. There are multiple challenges involved,
obviously. One is how to synchronize against concurrent page table
walkers, another one is how to invalidate MMU notifiers from reclaim
context. It would most probably involve storing required information in
the memmap to be able to lock+synchronize.

Having that said, adding infrastructure that might not be easy to extend
to the more general case of reclaiming other reclaimable page tables on
multiple levels (esp PMD tables) might not be what we want. OTOH, it
gets the job done for the one case we care about.

It's really hard to tell what to do because reclaiming page tables and
eventually handling malicious user space correctly is far from trivial :)

I'll be on vacation until end of May, I'll come back to this mail once
I'm back.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 00/18] Try to free user PTE page table pages
  2022-05-18 14:51   ` David Hildenbrand
@ 2022-05-18 14:56     ` Matthew Wilcox
  2022-05-19  4:03       ` Qi Zheng
  2022-05-19  3:58     ` Qi Zheng
  1 sibling, 1 reply; 27+ messages in thread
From: Matthew Wilcox @ 2022-05-18 14:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming, akpm, tglx, kirill.shutemov, jgg, tj, dennis,
	ming.lei

On Wed, May 18, 2022 at 04:51:06PM +0200, David Hildenbrand wrote:
> yes, I talked about the involved challenges, especially, how malicious
> user space can trigger allocation of almost elusively page tables and
> essentially consume a lot of unmovable+unswappable memory and even store
> secrets in the page table structure.

There are a lot of ways for userspace to consume a large amount of
kernel memory.  For example, one can open a file and set file locks on
alternate bytes.  We generally handle this by accounting the memory to
the process and let the OOM killer, rlimits, memcg or other mechanism
take care of it.  Just because page tables are (generally) reclaimable
doesn't mean we need to treat them specially.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 00/18] Try to free user PTE page table pages
  2022-05-18 14:51   ` David Hildenbrand
  2022-05-18 14:56     ` Matthew Wilcox
@ 2022-05-19  3:58     ` Qi Zheng
  1 sibling, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-05-19  3:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming,
	akpm, tglx, kirill.shutemov, jgg, tj, dennis, ming.lei



On 2022/5/18 10:51 PM, David Hildenbrand wrote:
> On 17.05.22 10:30, Qi Zheng wrote:
>>
>>
>> On 2022/4/29 9:35 PM, Qi Zheng wrote:
>>> Hi, >>>
>>> Different from the idea of another patchset of mine before[1], the pte_ref
>>> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
>>> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
>>> entryies, and then release the user PTE page table page when checking that
>>> pte_ref is 0. The advantage of this is that there is basically no performance
>>> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
>>> code implementation of this patchset is much simpler and more portable than the
>>> another patchset[1].
>>
>> Hi David,
>>
>> I learned from the LWN article[1] that you led a session at the LSFMM on
>> the problems posed by the lack of page-table reclaim (And thank you very
>> much for mentioning some of my work in this direction). So I want to
>> know, what are the further plans of the community for this problem?
> 
> Hi,
> 
> yes, I talked about the involved challenges, especially, how malicious
> user space can trigger allocation of almost elusively page tables and
> essentially consume a lot of unmovable+unswappable memory and even store
> secrets in the page table structure.

It is indeed difficult to deal with malicious user space programs,
because as long as a single entry in a PTE page table page still maps a
physical page, the entire PTE page cannot be freed.

So maybe we should first solve the problems encountered in engineering
practice. We have hit the situation I mentioned in the cover letter
several times on our servers:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

These are not malicious programs; they just use jemalloc/tcmalloc
normally (jemalloc/tcmalloc currently tend to use mmap+madvise instead
of mmap+munmap to improve performance). And we checked and found that
most of the PTE tables behind that VmPTE are empty.
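
For reference, that mmap+madvise allocator pattern looks roughly like
this from userspace (the 64M size is arbitrary):

        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 64UL << 20;        /* arbitrary 64M range */
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;

                /* fault in the pages, which also builds the PTE tables */
                memset(p, 1, len);

                /*
                 * Give back the physical pages but keep the VA range
                 * mapped; the PTE tables built above stay allocated.
                 */
                madvise(p, len, MADV_DONTNEED);

                /* the allocator can reuse [p, p+len) without a new mmap() */
                return 0;
        }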

Of course, normal workloads could in principle produce the same effect
as such malicious programs, but we have not seen such a case on our
servers.

> 
> Empty PTE tables are one such case we care about, but there is more. Even
> with your approach, we can still end up with many page tables that are
> allocated on higher levels (e.g., PMD tables) or page tables that are

Yes, currently my patch does not consider PMD tables. The reason is that
their maximum memory consumption is only 1G on a 64-bit system, so the
impact is much smaller than the 512G worst case for PTE tables.
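
Rough worst-case arithmetic for a fully populated 48-bit address space
with 4K pages and 512 entries per table:

        PTE tables: 256T / 2M per table = 128M tables * 4K = 512G
        PMD tables: 256T / 1G per table = 256K tables * 4K =   1G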

> not empty (especially, filled with the shared zeropage).

This case is indeed a problem, and a more difficult one. :(

> 
> Ideally, we'd have some mechanism that can reclaim also other
> reclaimable page tables (e.g., filled with shared zeropage). One idea
> was to add reclaimable page tables to the LRU list and to then
> scan+reclaim them on demand. There are multiple challenges involved,
> obviously. One is how to synchronize against concurrent page table

Agree. Currently, holding the read lock of mmap_lock is enough to keep
the PTE tables stable. Without the refcount method, and without changing
the locking that protects the PTE tables, the write lock of mmap_lock
would have to be held to guarantee synchronization, which has a huge
impact on performance.
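
With the percpu_ref approach, a page table walker pins the PTE table
instead of relying on mmap_lock, roughly as sketched below. The helper
names come from patches 09-12 of this series, but the exact signatures
used here are assumed:

        static int walk_one_pte_table(struct mm_struct *mm, pmd_t *pmd,
                                      unsigned long addr)
        {
                spinlock_t *ptl;
                pte_t *pte;

                /* assumed signature, mirroring pte_offset_map_lock() */
                pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
                if (!pte)
                        return 0;       /* PTE table already freed */

                /*
                 * ... operate on the PTEs here; the table cannot be
                 * freed concurrently while the reference is held ...
                 */

                pte_unmap_unlock(pte, ptl);
                pte_put(mm, pmd, addr); /* assumed signature for the put */
                return 0;
        }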

> walkers, another one is how to invalidate MMU notifiers from reclaim
> context. It would most probably involve storing required information in
> the memmap to be able to lock+synchronize.

This may also be a direction worth exploring.

> 
> Having that said, adding infrastructure that might not be easy to extend
> to the more general case of reclaiming other reclaimable page tables on
> multiple levels (esp PMD tables) might not be what we want. OTOH, it
> gets the job done for one case we care about.
> 
> It's really hard to tell what to do because reclaiming page tables and
> eventually handling malicious user space correctly is far from trivial :)

Yeah, agree :(

> 
> I'll be on vacation until end of May, I'll come back to this mail once
> I'm back.
> 

OK, thanks, and have a nice holiday.

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 00/18] Try to free user PTE page table pages
  2022-05-18 14:56     ` Matthew Wilcox
@ 2022-05-19  4:03       ` Qi Zheng
  0 siblings, 0 replies; 27+ messages in thread
From: Qi Zheng @ 2022-05-19  4:03 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming,
	akpm, tglx, kirill.shutemov, jgg, tj, dennis, ming.lei



On 2022/5/18 10:56 PM, Matthew Wilcox wrote:
> On Wed, May 18, 2022 at 04:51:06PM +0200, David Hildenbrand wrote:
>> yes, I talked about the involved challenges, especially, how malicious
>> user space can trigger allocation of almost exclusively page tables and
>> essentially consume a lot of unmovable+unswappable memory and even store
>> secrets in the page table structure.
> 
> There are a lot of ways for userspace to consume a large amount of
> kernel memory.  For example, one can open a file and set file locks on

Yes, malicious programs are really hard to guard against, so maybe we
should try to solve some common cases first (such as empty PTE tables).

> alternate bytes.  We generally handle this by accounting the memory to
> the process and let the OOM killer, rlimits, memcg or other mechanism
> take care of it.  Just because page tables are (generally) reclaimable
> doesn't mean we need to treat them specially.
> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-05-19  4:04 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-29 13:35 [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 01/18] x86/mm/encrypt: add the missing pte_unmap() call Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 02/18] percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync() returns Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 03/18] percpu_ref: make percpu_ref_switch_lock per percpu_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 04/18] mm: convert to use ptep_clear() in pte_clear_not_present_full() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 05/18] mm: split the related definitions of pte_offset_map_lock() into pgtable.h Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 06/18] mm: introduce CONFIG_FREE_USER_PTE Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 07/18] mm: add pte_to_page() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 08/18] mm: introduce percpu_ref for user PTE page table page Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 09/18] pte_ref: add pte_tryget() and {__,}pte_put() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 10/18] mm: add pte_tryget_map{_lock}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 11/18] mm: convert to use pte_tryget_map_lock() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 12/18] mm: convert to use pte_tryget_map() Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 13/18] mm: add try_to_free_user_pte() helper Qi Zheng
2022-04-30 13:35   ` Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 14/18] mm: use try_to_free_user_pte() in MADV_DONTNEED case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 15/18] mm: use try_to_free_user_pte() in MADV_FREE case Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 16/18] pte_ref: add track_pte_{set, clear}() helper Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 17/18] x86/mm: add x86_64 support for pte_ref Qi Zheng
2022-04-29 13:35 ` [RFC PATCH 18/18] Documentation: add document " Qi Zheng
2022-04-30 13:19   ` Bagas Sanjaya
2022-04-30 13:32     ` Qi Zheng
2022-05-17  8:30 ` [RFC PATCH 00/18] Try to free user PTE page table pages Qi Zheng
2022-05-18 14:51   ` David Hildenbrand
2022-05-18 14:56     ` Matthew Wilcox
2022-05-19  4:03       ` Qi Zheng
2022-05-19  3:58     ` Qi Zheng
