linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/15] Free user PTE page table pages
@ 2021-11-10 10:54 Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
                   ` (15 more replies)
  0 siblings, 16 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Hi,

This patch series aims to free user PTE page table pages when all PTE entries
are empty.

The beginning of this story is that some malloc libraries(e.g. jemalloc or
tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those VAs.
They will use madvise(MADV_DONTNEED) to free physical memory if they want.
But the page tables do not be freed by madvise(), so it can produce many
page tables when the process touches an enormous virtual address space.

The following figures are a memory usage snapshot of one process which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

As we can see, the PTE page tables size is 110g, while the RES is 590g. In
theory, the process only need 1.2g PTE page tables to map those physical
memory. The reason why PTE page tables occupy a lot of memory is that
madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
doesn't free the PTE page table pages. So we can free those empty PTE page
tables to save memory. In the above cases, we can save memory about 108g(best
case). And the larger the difference between the size of VIRT and RES, the
more memory we save.

In this patch series, we add a pte_refcount field to the struct page of page
table to track how many users of PTE page table. Similar to the mechanism of
page refcount, the user of PTE page table should hold a refcount to it before
accessing. The PTE page table page will be freed when the last refcount is
dropped.

Testing:

The following code snippet can show the effect of optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.6 MB
VmPTE                102640 kB                         248 kB

I also have tested the stability by LTP[1] for several weeks. I have not seen
any crash so far.

The performance of page fault can be affected because of the allocation/freeing
of PTE page table pages. The following is the test result by using a micro
benchmark[2]:

root@~# perf stat -e page-faults --repeat 5 ./multi-fault $threads:

threads         before (pf/min)                     after (pf/min)
    1                32,085,255                         31,880,833 (-0.64%)
    8               101,674,967                        100,588,311 (-1.17%)
   16               113,207,000                        112,801,832 (-0.36%)

(The "pfn/min" means how many page faults in one minute.)

The performance of page fault is ~1% slower than before.

And there are no obvious changes in perf hot spots:

before:
  19.29%  [kernel]  [k] clear_page_rep
  16.12%  [kernel]  [k] do_user_addr_fault
   9.57%  [kernel]  [k] _raw_spin_unlock_irqrestore
   6.16%  [kernel]  [k] get_page_from_freelist
   5.03%  [kernel]  [k] __handle_mm_fault
   3.53%  [kernel]  [k] __rcu_read_unlock
   3.45%  [kernel]  [k] handle_mm_fault
   3.38%  [kernel]  [k] down_read_trylock
   2.74%  [kernel]  [k] free_unref_page_list
   2.17%  [kernel]  [k] up_read
   1.93%  [kernel]  [k] charge_memcg
   1.73%  [kernel]  [k] try_charge_memcg
   1.71%  [kernel]  [k] __alloc_pages
   1.69%  [kernel]  [k] ___perf_sw_event
   1.44%  [kernel]  [k] get_mem_cgroup_from_mm

after:
  18.19%  [kernel]  [k] clear_page_rep
  16.28%  [kernel]  [k] do_user_addr_fault
   8.39%  [kernel]  [k] _raw_spin_unlock_irqrestore
   5.12%  [kernel]  [k] get_page_from_freelist
   4.81%  [kernel]  [k] __handle_mm_fault
   4.68%  [kernel]  [k] down_read_trylock
   3.80%  [kernel]  [k] handle_mm_fault
   3.59%  [kernel]  [k] get_mem_cgroup_from_mm
   2.49%  [kernel]  [k] free_unref_page_list
   2.41%  [kernel]  [k] up_read
   2.16%  [kernel]  [k] charge_memcg
   1.92%  [kernel]  [k] __rcu_read_unlock
   1.88%  [kernel]  [k] ___perf_sw_event
   1.70%  [kernel]  [k] pte_get_unless_zero

This series is based on next-20211108.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://github.com/linux-test-project/ltp
[2] https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c

Changelog in v2 -> v3:
 - Refactored this patch series:
        - [PATCH v3 6/15]: Introduce the new dummy helpers first
        - [PATCH v3 7-12/15]: Convert each subsystem individually
        - [PATCH v3 13/15]: Implement the actual logic to the dummy helpers
   And thanks for the advice from David and Jason.
 - Add a document.

Changelog in v1 -> v2:
 - Change pte_install() to pmd_install().
 - Fix some typo and code style problems.
 - Split [PATCH v1 5/7] into [PATCH v2 4/9], [PATCH v2 5/9],[PATCH v2 6/9]
   and [PATCH v2 7/9].

Qi Zheng (15):
  mm: do code cleanups to filemap_map_pmd()
  mm: introduce is_huge_pmd() helper
  mm: move pte_offset_map_lock() to pgtable.h
  mm: rework the parameter of lock_page_or_retry()
  mm: add pmd_installed_type return for __pte_alloc() and other friends
  mm: introduce refcount for user PTE page table page
  mm/pte_ref: add support for user PTE page table page allocation
  mm/pte_ref: initialize the refcount of the withdrawn PTE page table
    page
  mm/pte_ref: add support for the map/unmap of user PTE page table page
  mm/pte_ref: add support for page fault path
  mm/pte_ref: take a refcount before accessing the PTE page table page
  mm/pte_ref: update the pmd entry in move_normal_pmd()
  mm/pte_ref: free user PTE page table pages
  Documentation: add document for pte_ref
  mm/pte_ref: use mmu_gather to free PTE page table pages

 Documentation/vm/pte_ref.rst | 216 ++++++++++++++++++++++++++++++++++++
 arch/x86/Kconfig             |   2 +-
 fs/proc/task_mmu.c           |  24 +++-
 fs/userfaultfd.c             |   9 +-
 include/linux/huge_mm.h      |  10 +-
 include/linux/mm.h           | 170 ++++-------------------------
 include/linux/mm_types.h     |   6 +-
 include/linux/pagemap.h      |   8 +-
 include/linux/pgtable.h      | 152 +++++++++++++++++++++++++-
 include/linux/pte_ref.h      | 146 +++++++++++++++++++++++++
 include/linux/rmap.h         |   2 +
 kernel/events/uprobes.c      |   2 +
 mm/Kconfig                   |   4 +
 mm/Makefile                  |   4 +-
 mm/damon/vaddr.c             |  12 +-
 mm/debug_vm_pgtable.c        |   5 +-
 mm/filemap.c                 |  45 +++++---
 mm/gup.c                     |  25 ++++-
 mm/hmm.c                     |   5 +-
 mm/huge_memory.c             |   3 +-
 mm/internal.h                |   4 +-
 mm/khugepaged.c              |  21 +++-
 mm/ksm.c                     |   6 +-
 mm/madvise.c                 |  21 +++-
 mm/memcontrol.c              |  12 +-
 mm/memory-failure.c          |  11 +-
 mm/memory.c                  | 254 ++++++++++++++++++++++++++++++++-----------
 mm/mempolicy.c               |   6 +-
 mm/migrate.c                 |  54 ++++-----
 mm/mincore.c                 |   7 +-
 mm/mlock.c                   |   1 +
 mm/mmu_gather.c              |  40 +++----
 mm/mprotect.c                |  11 +-
 mm/mremap.c                  |  14 ++-
 mm/page_vma_mapped.c         |   4 +
 mm/pagewalk.c                |  15 ++-
 mm/pgtable-generic.c         |   1 +
 mm/pte_ref.c                 | 141 ++++++++++++++++++++++++
 mm/rmap.c                    |  10 ++
 mm/swapfile.c                |   3 +
 mm/userfaultfd.c             |  40 +++++--
 41 files changed, 1186 insertions(+), 340 deletions(-)
 create mode 100644 Documentation/vm/pte_ref.rst
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

-- 
2.11.0


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd()
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Currently we have two times the same few lines repeated
in filemap_map_pmd(). Deduplicate them and fix some code
style issues.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/filemap.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index daa0e23a6ee6..07c654202870 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3203,11 +3203,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	struct mm_struct *mm = vmf->vma->vm_mm;
 
 	/* Huge page is mapped? No need to proceed. */
-	if (pmd_trans_huge(*vmf->pmd)) {
-		unlock_page(page);
-		put_page(page);
-		return true;
-	}
+	if (pmd_trans_huge(*vmf->pmd))
+		goto out;
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
 		vm_fault_t ret = do_set_pmd(vmf, page);
@@ -3222,13 +3219,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
 	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd)) {
-		unlock_page(page);
-		put_page(page);
-		return true;
-	}
+	if (pmd_devmap_trans_unstable(vmf->pmd))
+		goto out;
 
 	return false;
+
+out:
+	unlock_page(page);
+	put_page(page);
+	return true;
 }
 
 static struct page *next_uptodate_page(struct page *page,
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 02/15] mm: introduce is_huge_pmd() helper
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-11 13:46   ` kernel test robot
  2021-11-10 10:54 ` [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h Qi Zheng
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Currently we have some times the following judgments repeated in the
code:

	is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)

which is to determine whether the *pmd is a huge pmd, so introduce
is_huge_pmd() helper to deduplicate them.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/huge_mm.h | 10 +++++++---
 mm/huge_memory.c        |  3 +--
 mm/memory.c             |  5 ++---
 mm/mprotect.c           |  2 +-
 mm/mremap.c             |  3 +--
 5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..b37a89180846 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -199,8 +199,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 #define split_huge_pmd(__vma, __pmd, __address)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)	\
-					|| pmd_devmap(*____pmd))	\
+		if (is_huge_pmd(*____pmd))				\
 			__split_huge_pmd(__vma, __pmd, __address,	\
 						false, NULL);		\
 	}  while (0)
@@ -232,11 +231,16 @@ static inline int is_swap_pmd(pmd_t pmd)
 	return !pmd_none(pmd) && !pmd_present(pmd);
 }
 
+static inline int is_huge_pmd(pmd_t pmd)
+{
+	return is_swap_pmd(pmd) || pmd_trans_huge(pmd) || pmd_devmap(pmd);
+}
+
 /* mmap_lock must be held on entry */
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
-	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_huge_pmd(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..e76ee2e1e423 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1832,8 +1832,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
-	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
-			pmd_devmap(*pmd)))
+	if (likely(is_huge_pmd(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 855486fff526..b00cd60fc368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1146,8 +1146,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
-			|| pmd_devmap(*src_pmd)) {
+		if (is_huge_pmd(*src_pmd)) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
 			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1441,7 +1440,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_huge_pmd(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE)
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			else if (zap_huge_pmd(tlb, vma, pmd, addr))
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e552f5e0ccbd..2d5064a4631c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -257,7 +257,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(&range);
 		}
 
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_huge_pmd(*pmd)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
diff --git a/mm/mremap.c b/mm/mremap.c
index 002eec83e91e..c6e9da09dd0a 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -532,8 +532,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
-		if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
-		    pmd_devmap(*old_pmd)) {
+		if (is_huge_pmd(*old_pmd)) {
 			if (extent == HPAGE_PMD_SIZE &&
 			    move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
 					   old_pmd, new_pmd, need_rmap_locks))
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry() Qi Zheng
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

pte_offset_map() is in include/linux/pgtable.h, so move its
friend pte_offset_map_lock() to pgtable.h together.

pte_lockptr() is required for pte_offset_map_lock(), so
also move {pte,pmd,pud}_lockptr() to pgtable.h.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h      | 149 ------------------------------------------------
 include/linux/pgtable.h | 149 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 149 insertions(+), 149 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..706da081b9f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2284,70 +2284,6 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU */
 
-#if USE_SPLIT_PTE_PTLOCKS
-#if ALLOC_SPLIT_PTLOCKS
-void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
-extern void ptlock_free(struct page *page);
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return page->ptl;
-}
-#else /* ALLOC_SPLIT_PTLOCKS */
-static inline void ptlock_cache_init(void)
-{
-}
-
-static inline bool ptlock_alloc(struct page *page)
-{
-	return true;
-}
-
-static inline void ptlock_free(struct page *page)
-{
-}
-
-static inline spinlock_t *ptlock_ptr(struct page *page)
-{
-	return &page->ptl;
-}
-#endif /* ALLOC_SPLIT_PTLOCKS */
-
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_page(*pmd));
-}
-
-static inline bool ptlock_init(struct page *page)
-{
-	/*
-	 * prep_new_page() initialize page->private (and therefore page->ptl)
-	 * with 0. Make sure nobody took it in use in between.
-	 *
-	 * It can happen if arch try to use slab for page table allocation:
-	 * slab code uses page->slab_cache, which share storage with page->ptl.
-	 */
-	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-	if (!ptlock_alloc(page))
-		return false;
-	spin_lock_init(ptlock_ptr(page));
-	return true;
-}
-
-#else	/* !USE_SPLIT_PTE_PTLOCKS */
-/*
- * We use mm->page_table_lock to guard all pagetable pages of the mm.
- */
-static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
-static inline void ptlock_free(struct page *page) {}
-#endif /* USE_SPLIT_PTE_PTLOCKS */
-
 static inline void pgtable_init(void)
 {
 	ptlock_cache_init();
@@ -2370,20 +2306,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
-({							\
-	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
-	pte_t *__pte = pte_offset_map(pmd, address);	\
-	*(ptlp) = __ptl;				\
-	spin_lock(__ptl);				\
-	__pte;						\
-})
-
-#define pte_unmap_unlock(pte, ptl)	do {		\
-	spin_unlock(ptl);				\
-	pte_unmap(pte);					\
-} while (0)
-
 #define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
 
 #define pte_alloc_map(mm, pmd, address)			\
@@ -2397,58 +2319,6 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
 		NULL: pte_offset_kernel(pmd, address))
 
-#if USE_SPLIT_PMD_PTLOCKS
-
-static struct page *pmd_to_page(pmd_t *pmd)
-{
-	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-	return virt_to_page((void *)((unsigned long) pmd & mask));
-}
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return ptlock_ptr(pmd_to_page(pmd));
-}
-
-static inline bool pmd_ptlock_init(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	page->pmd_huge_pte = NULL;
-#endif
-	return ptlock_init(page);
-}
-
-static inline void pmd_ptlock_free(struct page *page)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
-#endif
-	ptlock_free(page);
-}
-
-#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
-
-#else
-
-static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
-{
-	return &mm->page_table_lock;
-}
-
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
-
-#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
-
-#endif
-
-static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
-{
-	spinlock_t *ptl = pmd_lockptr(mm, pmd);
-	spin_lock(ptl);
-	return ptl;
-}
-
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
 	if (!pmd_ptlock_init(page))
@@ -2465,25 +2335,6 @@ static inline void pgtable_pmd_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-/*
- * No scalability reason to split PUD locks yet, but follow the same pattern
- * as the PMD locks to make it easier if we decide to.  The VM should not be
- * considered ready to switch to split PUD locks yet; there may be places
- * which need to be converted from page_table_lock.
- */
-static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
-{
-	return &mm->page_table_lock;
-}
-
-static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
-{
-	spinlock_t *ptl = pud_lockptr(mm, pud);
-
-	spin_lock(ptl);
-	return ptl;
-}
-
 extern void __init pagecache_init(void);
 extern void __init free_area_init_memoryless_node(int nid);
 extern void free_initmem(void);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..c8f045705c1e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -84,6 +84,141 @@ static inline unsigned long pud_index(unsigned long address)
 #define pgd_index(a)  (((a) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
 #endif
 
+#if USE_SPLIT_PTE_PTLOCKS
+#if ALLOC_SPLIT_PTLOCKS
+void __init ptlock_cache_init(void);
+extern bool ptlock_alloc(struct page *page);
+extern void ptlock_free(struct page *page);
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return page->ptl;
+}
+#else /* ALLOC_SPLIT_PTLOCKS */
+static inline void ptlock_cache_init(void)
+{
+}
+
+static inline bool ptlock_alloc(struct page *page)
+{
+	return true;
+}
+
+static inline void ptlock_free(struct page *page)
+{
+}
+
+static inline spinlock_t *ptlock_ptr(struct page *page)
+{
+	return &page->ptl;
+}
+#endif /* ALLOC_SPLIT_PTLOCKS */
+
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_page(*pmd));
+}
+
+static inline bool ptlock_init(struct page *page)
+{
+	/*
+	 * prep_new_page() initialize page->private (and therefore page->ptl)
+	 * with 0. Make sure nobody took it in use in between.
+	 *
+	 * It can happen if arch try to use slab for page table allocation:
+	 * slab code uses page->slab_cache, which share storage with page->ptl.
+	 */
+	VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
+	if (!ptlock_alloc(page))
+		return false;
+	spin_lock_init(ptlock_ptr(page));
+	return true;
+}
+
+#else	/* !USE_SPLIT_PTE_PTLOCKS */
+/*
+ * We use mm->page_table_lock to guard all pagetable pages of the mm.
+ */
+static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+static inline void ptlock_cache_init(void) {}
+static inline bool ptlock_init(struct page *page) { return true; }
+static inline void ptlock_free(struct page *page) {}
+#endif /* USE_SPLIT_PTE_PTLOCKS */
+
+#if USE_SPLIT_PMD_PTLOCKS
+
+static struct page *pmd_to_page(pmd_t *pmd)
+{
+	unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+	return virt_to_page((void *)((unsigned long) pmd & mask));
+}
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return ptlock_ptr(pmd_to_page(pmd));
+}
+
+static inline bool pmd_ptlock_init(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	page->pmd_huge_pte = NULL;
+#endif
+	return ptlock_init(page);
+}
+
+static inline void pmd_ptlock_free(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+#endif
+	ptlock_free(page);
+}
+
+#define pmd_huge_pte(mm, pmd) (pmd_to_page(pmd)->pmd_huge_pte)
+
+#else
+
+static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
+{
+	return &mm->page_table_lock;
+}
+
+static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline void pmd_ptlock_free(struct page *page) {}
+
+#define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
+
+#endif
+
+static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
+{
+	spinlock_t *ptl = pmd_lockptr(mm, pmd);
+	spin_lock(ptl);
+	return ptl;
+}
+
+/*
+ * No scalability reason to split PUD locks yet, but follow the same pattern
+ * as the PMD locks to make it easier if we decide to.  The VM should not be
+ * considered ready to switch to split PUD locks yet; there may be places
+ * which need to be converted from page_table_lock.
+ */
+static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
+{
+	return &mm->page_table_lock;
+}
+
+static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
+{
+	spinlock_t *ptl = pud_lockptr(mm, pud);
+
+	spin_lock(ptl);
+	return ptl;
+}
+
 #ifndef pte_offset_kernel
 static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 {
@@ -102,6 +237,20 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_unmap(pte) ((void)(pte))	/* NOP */
 #endif
 
+#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
+({							\
+	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
+	pte_t *__pte = pte_offset_map(pmd, address);	\
+	*(ptlp) = __ptl;				\
+	spin_lock(__ptl);				\
+	__pte;						\
+})
+
+#define pte_unmap_unlock(pte, ptl)	do {		\
+	spin_unlock(ptl);				\
+	pte_unmap(pte);					\
+} while (0)
+
 /* Find an entry in the second-level page table.. */
 #ifndef pmd_offset
 static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry()
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (2 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends Qi Zheng
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

We need the vmf in lock_page_or_retry() in the subsequent patch,
so pass in it directly.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/pagemap.h | 8 +++-----
 mm/filemap.c            | 6 ++++--
 mm/memory.c             | 4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 6a30916b76e5..94f9547b4411 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -709,8 +709,7 @@ static inline bool wake_page_match(struct wait_page_queue *wait_page,
 
 void __folio_lock(struct folio *folio);
 int __folio_lock_killable(struct folio *folio);
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-				unsigned int flags);
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf);
 void unlock_page(struct page *page);
 void folio_unlock(struct folio *folio);
 
@@ -772,14 +771,13 @@ static inline int lock_page_killable(struct page *page)
  * Return value and mmap_lock implications depend on flags; see
  * __folio_lock_or_retry().
  */
-static inline bool lock_page_or_retry(struct page *page, struct mm_struct *mm,
-				     unsigned int flags)
+static inline bool lock_page_or_retry(struct page *page, struct vm_fault *vmf)
 {
 	struct folio *folio;
 	might_sleep();
 
 	folio = page_folio(page);
-	return folio_trylock(folio) || __folio_lock_or_retry(folio, mm, flags);
+	return folio_trylock(folio) || __folio_lock_or_retry(folio, vmf);
 }
 
 /*
diff --git a/mm/filemap.c b/mm/filemap.c
index 07c654202870..ff8d19b7ce1d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1695,9 +1695,11 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
  * If neither ALLOW_RETRY nor KILLABLE are set, will always return true
  * with the folio locked and the mmap_lock unperturbed.
  */
-bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
-			 unsigned int flags)
+bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 {
+	unsigned int flags = vmf->flags;
+	struct mm_struct *mm = vmf->vma->vm_mm;
+
 	if (fault_flag_allow_retry_first(flags)) {
 		/*
 		 * CAUTION! In this case, mmap_lock is not released
diff --git a/mm/memory.c b/mm/memory.c
index b00cd60fc368..bec6a5d5ee7c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3443,7 +3443,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct mmu_notifier_range range;
 
-	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+	if (!lock_page_or_retry(page, vmf))
 		return VM_FAULT_RETRY;
 	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
 				vma->vm_mm, vmf->address & PAGE_MASK,
@@ -3576,7 +3576,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+	locked = lock_page_or_retry(page, vmf);
 
 	delayacct_clear_flag(current, DELAYACCT_PF_SWAPIN);
 	if (!locked) {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (3 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry() Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 06/15] mm: introduce refcount for user PTE page table page Qi Zheng
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

When we call __pte_alloc() or other friends, a huge pmd might
be created from a different thread. This is why
pmd_trans_unstable() will now be called after __pte_alloc()
or other friends return.

This patch add pmd_installed_type return for __pte_alloc() and other
friends, then we can check the huge pmd through the return value
instead of calling pmd_trans_unstable() again.

This patch has no functional change, just some preparations
for the future patches.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h    | 20 +++++++++++++++++---
 mm/debug_vm_pgtable.c |  2 +-
 mm/filemap.c          | 11 +++++++----
 mm/gup.c              |  2 +-
 mm/internal.h         |  3 ++-
 mm/memory.c           | 39 ++++++++++++++++++++++++++-------------
 mm/migrate.c          | 17 ++---------------
 mm/mremap.c           |  2 +-
 mm/userfaultfd.c      | 24 +++++++++++++++---------
 9 files changed, 72 insertions(+), 48 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 706da081b9f8..52f36fde2f11 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2306,13 +2306,27 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_alloc(mm, pmd) (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))
+enum pmd_installed_type {
+	INSTALLED_PTE,
+	INSTALLED_HUGE_PMD,
+};
+
+static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd)
+{
+	if (unlikely(pmd_none(*(pmd))))
+		return __pte_alloc(mm, pmd);
+	if (unlikely(is_huge_pmd(*pmd)))
+		return INSTALLED_HUGE_PMD;
+
+	return INSTALLED_PTE;
+}
+#define pte_alloc pte_alloc
 
 #define pte_alloc_map(mm, pmd, address)			\
-	(pte_alloc(mm, pmd) ? NULL : pte_offset_map(pmd, address))
+	(pte_alloc(mm, pmd) < 0 ? NULL : pte_offset_map(pmd, address))
 
 #define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
-	(pte_alloc(mm, pmd) ?			\
+	(pte_alloc(mm, pmd) < 0 ?			\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
 #define pte_alloc_kernel(pmd, address)			\
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 228e3954b90c..b8322c55e65d 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1170,7 +1170,7 @@ static int __init init_args(struct pgtable_debug_args *args)
 	args->start_pmdp = pmd_offset(args->pudp, 0UL);
 	WARN_ON(!args->start_pmdp);
 
-	if (pte_alloc(args->mm, args->pmdp)) {
+	if (pte_alloc(args->mm, args->pmdp) < 0) {
 		pr_err("Failed to allocate pte entries\n");
 		ret = -ENOMEM;
 		goto error;
diff --git a/mm/filemap.c b/mm/filemap.c
index ff8d19b7ce1d..23363f8ddbbe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,12 +3217,15 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		}
 	}
 
-	if (pmd_none(*vmf->pmd))
-		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
+	if (pmd_none(*vmf->pmd)) {
+		int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
-	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd))
+		if (unlikely(ret == INSTALLED_HUGE_PMD))
+			goto out;
+	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+		/* See comment in handle_pte_fault() */
 		goto out;
+	}
 
 	return false;
 
diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..2def775232a3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -699,7 +699,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		} else {
 			spin_unlock(ptl);
 			split_huge_pmd(vma, pmd, address);
-			ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
+			ret = pte_alloc(mm, pmd) < 0 ? -ENOMEM : 0;
 		}
 
 		return ret ? ERR_PTR(ret) :
diff --git a/mm/internal.h b/mm/internal.h
index 3b79a5c9427a..474d6e3443f8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -67,7 +67,8 @@ bool __folio_end_writeback(struct folio *folio);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
+enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
+				    pgtable_t *pte);
 
 static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
 {
diff --git a/mm/memory.c b/mm/memory.c
index bec6a5d5ee7c..8a39c0e58324 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -437,8 +437,10 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 }
 
-void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
+				    pgtable_t *pte)
 {
+	int ret = INSTALLED_PTE;
 	spinlock_t *ptl = pmd_lock(mm, pmd);
 
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
@@ -459,20 +461,26 @@ void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
 		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 		pmd_populate(mm, pmd, *pte);
 		*pte = NULL;
+	} else if (is_huge_pmd(*pmd)) {
+		/* See comment in handle_pte_fault() */
+		ret = INSTALLED_HUGE_PMD;
 	}
 	spin_unlock(ptl);
+
+	return ret;
 }
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 {
+	enum pmd_installed_type ret;
 	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
 
-	pmd_install(mm, pmd, &new);
+	ret = pmd_install(mm, pmd, &new);
 	if (new)
 		pte_free(mm, new);
-	return 0;
+	return ret;
 }
 
 int __pte_alloc_kernel(pmd_t *pmd)
@@ -1813,7 +1821,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 
 	/* Allocate the PTE if necessary; takes PMD lock once only. */
 	ret = -ENOMEM;
-	if (pte_alloc(mm, pmd))
+	if (pte_alloc(mm, pmd) < 0)
 		goto out;
 
 	while (pages_to_write_in_pmd) {
@@ -3713,6 +3721,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	struct page *page;
 	vm_fault_t ret = 0;
 	pte_t entry;
+	int alloc_ret;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -3728,11 +3737,11 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	 *
 	 * Here we only have mmap_read_lock(mm).
 	 */
-	if (pte_alloc(vma->vm_mm, vmf->pmd))
+	alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);
+	if (alloc_ret < 0)
 		return VM_FAULT_OOM;
-
 	/* See comment in handle_pte_fault() */
-	if (unlikely(pmd_trans_unstable(vmf->pmd)))
+	if (unlikely(alloc_ret == INSTALLED_HUGE_PMD))
 		return 0;
 
 	/* Use the zero-page for reads */
@@ -4023,6 +4032,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	}
 
 	if (pmd_none(*vmf->pmd)) {
+		int alloc_ret;
+
 		if (PageTransCompound(page)) {
 			ret = do_set_pmd(vmf, page);
 			if (ret != VM_FAULT_FALLBACK)
@@ -4030,14 +4041,16 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		}
 
 		if (vmf->prealloc_pte)
-			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
-		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
-			return VM_FAULT_OOM;
-	}
+			alloc_ret = pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+		else
+			alloc_ret = pte_alloc(vma->vm_mm, vmf->pmd);
 
-	/* See comment in handle_pte_fault() */
-	if (pmd_devmap_trans_unstable(vmf->pmd))
+		if (unlikely(alloc_ret != INSTALLED_PTE))
+			return alloc_ret < 0 ? VM_FAULT_OOM : 0;
+	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+		/* See comment in handle_pte_fault() */
 		return 0;
+	}
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				      vmf->address, &vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index cf25b00f03c8..bdfdfd3b50be 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2731,21 +2731,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
 		goto abort;
 
-	/*
-	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
-	 * pte_offset_map() on pmds where a huge pmd might be created
-	 * from a different thread.
-	 *
-	 * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
-	 * parallel threads are excluded by other means.
-	 *
-	 * Here we only have mmap_read_lock(mm).
-	 */
-	if (pte_alloc(mm, pmdp))
-		goto abort;
-
-	/* See the comment in pte_alloc_one_map() */
-	if (unlikely(pmd_trans_unstable(pmdp)))
+	/* See the comment in do_anonymous_page() */
+	if (unlikely(pte_alloc(mm, pmdp) != INSTALLED_PTE))
 		goto abort;
 
 	if (unlikely(anon_vma_prepare(vma)))
diff --git a/mm/mremap.c b/mm/mremap.c
index c6e9da09dd0a..fc5c56858883 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -551,7 +551,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 				continue;
 		}
 
-		if (pte_alloc(new_vma->vm_mm, new_pmd))
+		if (pte_alloc(new_vma->vm_mm, new_pmd) < 0)
 			break;
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
 			  new_pmd, new_addr, need_rmap_locks);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..2cea08e7f076 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -592,15 +592,21 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			err = -EEXIST;
 			break;
 		}
-		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
-			err = -ENOMEM;
-			break;
-		}
-		/* If an huge pmd materialized from under us fail */
-		if (unlikely(pmd_trans_huge(*dst_pmd))) {
-			err = -EFAULT;
-			break;
+
+		if (unlikely(pmd_none(dst_pmdval))) {
+			int ret = __pte_alloc(dst_mm, dst_pmd);
+
+			/*
+			 * If there is not enough memory or an huge pmd
+			 * materialized from under us
+			 */
+			if (unlikely(ret < 0)) {
+				err = -ENOMEM;
+				break;
+			} else if (unlikely(ret == INSTALLED_HUGE_PMD)) {
+				err = -EFAULT;
+				break;
+			}
 		}
 
 		BUG_ON(pmd_none(*dst_pmd));
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 06/15] mm: introduce refcount for user PTE page table page
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (4 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-11  0:37   ` kernel test robot
  2021-11-10 10:54 ` [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation Qi Zheng
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

1. Preface
==========

Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
physical memory for the following reasons::

 First of all, we should hold as few write locks of mmap_lock as possible,
 since the mmap_lock semaphore has long been a contention point in the
 memory management subsystem. The mmap()/munmap() hold the write lock, and
 the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
 madvise() instead of munmap() to released physical memory can reduce the
 competition of the mmap_lock.

 Secondly, after using madvise() to release physical memory, there is no
 need to build vma and allocate page tables again when accessing the same
 virtual address again, which can also save some time.

The following is the largest user PTE page table memory that can be
allocated by a single user process in a 32-bit and a 64-bit system.

+---------------------------+--------+---------+
|                           | 32-bit | 64-bit  |
+===========================+========+=========+
| user PTE page table pages | 3 MiB  | 512 GiB |
+---------------------------+--------+---------+
| user PMD page table pages | 3 KiB  | 1 GiB   |
+---------------------------+--------+---------+

(for 32-bit, take 3G user address space, 4K page size as an example;
 for 64-bit, take 48-bit address width, 4K page size as an example.)

After using madvise(), everything looks good, but as can be seen from the
above table, a single process can create a large number of PTE page tables
on a 64-bit system, since both of the MADV_DONTNEED and MADV_FREE will not
release page table memory. And before the process exits or calls munmap(),
the kernel cannot reclaim these pages even if these PTE page tables do not
map anything.

Therefore, we decided to introduce reference count to manage the PTE page
table life cycle, so that some free PTE page table memory in the system
can be dynamically released.

2. The reference count of user PTE page table pages
===================================================

We introduce two members for the struct page of the user PTE page table
page::

 union {
	pgtable_t pmd_huge_pte; /* protected by page->ptl */
	pmd_t *pmd;             /* PTE page only */
 };
 union {
	struct mm_struct *pt_mm; /* x86 pgds only */
	atomic_t pt_frag_refcount; /* powerpc */
	atomic_t pte_refcount;  /* PTE page only */
 };

The pmd member record the pmd entry that maps the user PTE page table page,
the pte_refcount member keep track of how many references to the user PTE
page table page.

The following people will hold a reference on the user PTE page table
page::

 The !pte_none() entry, such as regular page table entry that map physical
 pages, or swap entry, or migrate entry, etc.

 Visitor to the PTE page table entries, such as page table walker.

Any ``!pte_none()`` entry and visitor can be regarded as the user of its
PTE page table page. When the ``pte_refcount`` is reduced to 0, it means
that no one is using the PTE page table page, then this free PTE page
table page can be released back to the system at this time.

3. Helpers
==========

+---------------------+-------------------------------------------------+
| pte_ref_init        | Initialize the pte_refcount and pmd             |
+---------------------+-------------------------------------------------+
| pte_to_pmd          | Get the corresponding pmd                       |
+---------------------+-------------------------------------------------+
| pte_update_pmd      | Update the corresponding pmd                    |
+---------------------+-------------------------------------------------+
| pte_get             | Increment a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_get_many        | Add a value to a pte_refcount                   |
+---------------------+-------------------------------------------------+
| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
+---------------------+-------------------------------------------------+
| pte_try_get         | Try to increment a pte_refcount                 |
+---------------------+-------------------------------------------------+
| pte_tryget_map      | Try to increment a pte_refcount before          |
|                     | pte_offset_map()                                |
+---------------------+-------------------------------------------------+
| pte_tryget_map_lock | Try to increment a pte_refcount before          |
|                     | pte_offset_map_lock()                           |
+---------------------+-------------------------------------------------+
| pte_put             | Decrement a pte_refcount                        |
+---------------------+-------------------------------------------------+
| pte_put_many        | Sub a value to a pte_refcount                   |
+---------------------+-------------------------------------------------+
| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
+---------------------+-------------------------------------------------+

4. About this commit
====================
This commit just introduces some dummy helpers, the actual logic will
be implemented in future commits.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm_types.h |  6 +++-
 include/linux/pte_ref.h  | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/Makefile              |  4 +--
 mm/pte_ref.c             | 55 ++++++++++++++++++++++++++++++
 4 files changed, 149 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb8c6f5f19bc..c599008d54fe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -149,11 +149,15 @@ struct page {
 		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
-			pgtable_t pmd_huge_pte; /* protected by page->ptl */
+			union {
+				pgtable_t pmd_huge_pte; /* protected by page->ptl */
+				pmd_t *pmd;             /* PTE page only */
+			};
 			unsigned long _pt_pad_2;	/* mapping */
 			union {
 				struct mm_struct *pt_mm; /* x86 pgds only */
 				atomic_t pt_frag_refcount; /* powerpc */
+				atomic_t pte_refcount;  /* PTE page only */
 			};
 #if ALLOC_SPLIT_PTLOCKS
 			spinlock_t *ptl;
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
new file mode 100644
index 000000000000..b6d8335bdc59
--- /dev/null
+++ b/include/linux/pte_ref.h
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+#ifndef _LINUX_PTE_REF_H
+#define _LINUX_PTE_REF_H
+
+#include <linux/pgtable.h>
+
+enum pte_tryget_type {
+	TRYGET_SUCCESSED,
+	TRYGET_FAILED_ZERO,
+	TRYGET_FAILED_NONE,
+	TRYGET_FAILED_HUGE_PMD,
+};
+
+bool pte_get_unless_zero(pmd_t *pmd);
+enum pte_tryget_type pte_try_get(pmd_t *pmd);
+void pte_put_vmf(struct vm_fault *vmf);
+
+static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
+{
+}
+
+static inline pmd_t *pte_to_pmd(pte_t *pte)
+{
+	return NULL;
+}
+
+static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd)
+{
+}
+
+static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
+{
+}
+
+/*
+ * pte_get - Increment refcount for the PTE page table.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Similar to the mechanism of page refcount, the user of PTE page table
+ * should hold a refcount to it before accessing.
+ */
+static inline void pte_get(pmd_t *pmd)
+{
+	pte_get_many(pmd, 1);
+}
+
+static inline pte_t *pte_tryget_map(pmd_t *pmd, unsigned long address)
+{
+	if (pte_try_get(pmd))
+		return NULL;
+
+	return pte_offset_map(pmd, address);
+}
+
+static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
+					 unsigned long address, spinlock_t **ptlp)
+{
+	if (pte_try_get(pmd))
+		return NULL;
+
+	return pte_offset_map_lock(mm, pmd, address, ptlp);
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, unsigned int nr)
+{
+}
+
+/*
+ * pte_put - Decrement refcount for the PTE page table.
+ * @mm: the mm_struct of the target address space.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ * @addr: the start address of the tlb range to be flushed.
+ *
+ * The PTE page table page will be freed when the last refcount is dropped.
+ */
+static inline void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+	pte_put_many(mm, pmd, addr, 1);
+}
+
+#endif
diff --git a/mm/Makefile b/mm/Makefile
index d6c0042e3aa0..ea679bf75a5f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -38,8 +38,8 @@ mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o
-
+			   pgtable-generic.o rmap.o vmalloc.o \
+			   pte_ref.o
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
 mmu-$(CONFIG_MMU)	+= process_vm_access.o
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
new file mode 100644
index 000000000000..de109905bc8f
--- /dev/null
+++ b/mm/pte_ref.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, ByteDance. All rights reserved.
+ *
+ * 	Author: Qi Zheng <zhengqi.arch@bytedance.com>
+ */
+
+#include <linux/pte_ref.h>
+#include <linux/mm.h>
+
+/*
+ * pte_get_unless_zero - Increment refcount for the PTE page table
+ *			 unless it is zero.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ */
+bool pte_get_unless_zero(pmd_t *pmd)
+{
+	return true;
+}
+
+/*
+ * pte_try_get - Try to increment refcount for the PTE page table.
+ * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
+ *
+ * Return true if the increment succeeded. Otherwise return false.
+ *
+ * Before Operating the PTE page table, we need to hold a refcount
+ * to protect against the concurrent release of the PTE page table.
+ * But we will fail in the following case:
+ * 	- The content mapped in @pmd is not a PTE page
+ * 	- The refcount of the PTE page table is zero, it will be freed
+ */
+enum pte_tryget_type pte_try_get(pmd_t *pmd)
+{
+	if (unlikely(pmd_none(*pmd)))
+		return TRYGET_FAILED_NONE;
+	if (unlikely(is_huge_pmd(*pmd)))
+		return TRYGET_FAILED_HUGE_PMD;
+
+	return TRYGET_SUCCESSED;
+}
+
+/*
+ * pte_put_vmf - Decrement refcount for the PTE page table.
+ * @vmf: fault information
+ *
+ * The mmap_lock may be unlocked in advance in some cases
+ * in handle_pte_fault(), then the pmd entry will no longer
+ * be stable. For example, the corresponds of the PTE page may
+ * be replaced(e.g. mremap), so we should ensure the pte_put()
+ * is performed in the critical section of the mmap_lock.
+ */
+void pte_put_vmf(struct vm_fault *vmf)
+{
+}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (5 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 06/15] mm: introduce refcount for user PTE page table page Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-11 15:17   ` kernel test robot
  2021-11-10 10:54 ` [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page Qi Zheng
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

When the PTE page table page is allocated and installed into the
pmd entry, it needs to take an initial reference count to prevent
the release of PTE page table page by other threads, and the caller
of pte_alloc()(or other friends) needs to reduce this reference count.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h    |  7 +++++--
 mm/debug_vm_pgtable.c |  1 +
 mm/filemap.c          |  8 ++++++--
 mm/gup.c              | 10 +++++++---
 mm/memory.c           | 51 +++++++++++++++++++++++++++++++++++++++++----------
 mm/migrate.c          |  9 ++++++---
 mm/mlock.c            |  1 +
 mm/mremap.c           |  1 +
 mm/userfaultfd.c      | 16 +++++++++++++++-
 9 files changed, 83 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52f36fde2f11..753a9435e0d0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -26,6 +26,7 @@
 #include <linux/err.h>
 #include <linux/page-flags.h>
 #include <linux/page_ref.h>
+#include <linux/pte_ref.h>
 #include <linux/memremap.h>
 #include <linux/overflow.h>
 #include <linux/sizes.h>
@@ -2313,9 +2314,11 @@ enum pmd_installed_type {
 
 static inline int pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 {
-	if (unlikely(pmd_none(*(pmd))))
+	enum pte_tryget_type ret = pte_try_get(pmd);
+
+	if (ret == TRYGET_FAILED_NONE || ret == TRYGET_FAILED_ZERO)
 		return __pte_alloc(mm, pmd);
-	if (unlikely(is_huge_pmd(*pmd)))
+	else if (ret == TRYGET_FAILED_HUGE_PMD)
 		return INSTALLED_HUGE_PMD;
 
 	return INSTALLED_PTE;
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index b8322c55e65d..52f006654664 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1048,6 +1048,7 @@ static void __init destroy_args(struct pgtable_debug_args *args)
 
 	/* Free page table entries */
 	if (args->start_ptep) {
+		pte_put(args->mm, args->start_pmdp, args->vaddr);
 		pte_free(args->mm, args->start_ptep);
 		mm_dec_nr_ptes(args->mm);
 	}
diff --git a/mm/filemap.c b/mm/filemap.c
index 23363f8ddbbe..1e7e9e4fd759 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3217,6 +3217,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		}
 	}
 
+retry:
 	if (pmd_none(*vmf->pmd)) {
 		int ret = pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
@@ -3225,6 +3226,8 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
 		/* See comment in handle_pte_fault() */
 		goto out;
+	} else if (pte_try_get(vmf->pmd) == TRYGET_FAILED_ZERO) {
+		goto retry;
 	}
 
 	return false;
@@ -3301,7 +3304,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
 	pgoff_t last_pgoff = start_pgoff;
-	unsigned long addr;
+	unsigned long addr, start;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct page *head, *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
@@ -3317,7 +3320,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	start = addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
 		page = find_subpage(head, xas.xa_index);
@@ -3348,6 +3351,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		put_page(head);
 	} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_put(vma->vm_mm, vmf->pmd, start);
 out:
 	rcu_read_unlock();
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
diff --git a/mm/gup.c b/mm/gup.c
index 2def775232a3..e084111103f0 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -694,7 +694,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			spin_unlock(ptl);
 			ret = 0;
 			split_huge_pmd(vma, pmd, address);
-			if (pmd_trans_unstable(pmd))
+			if (pte_try_get(pmd) == TRYGET_FAILED_HUGE_PMD)
 				ret = -EBUSY;
 		} else {
 			spin_unlock(ptl);
@@ -702,8 +702,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			ret = pte_alloc(mm, pmd) < 0 ? -ENOMEM : 0;
 		}
 
-		return ret ? ERR_PTR(ret) :
-			follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+		if (ret)
+			return ERR_PTR(ret);
+
+		page = follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
+		pte_put(mm, pmd, address);
+		return page;
 	}
 	page = follow_trans_huge_pmd(vma, address, pmd, flags);
 	spin_unlock(ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 8a39c0e58324..0b9af38cfa11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -441,10 +441,13 @@ enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
 				    pgtable_t *pte)
 {
 	int ret = INSTALLED_PTE;
-	spinlock_t *ptl = pmd_lock(mm, pmd);
+	spinlock_t *ptl;
 
+retry:
+	ptl = pmd_lock(mm, pmd);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm_inc_nr_ptes(mm);
+		pte_ref_init(*pte, pmd, 1);
 		/*
 		 * Ensure all pte setup (eg. pte page lock and page clearing) are
 		 * visible before the pte is made visible to other CPUs by being
@@ -464,6 +467,9 @@ enum pmd_installed_type pmd_install(struct mm_struct *mm, pmd_t *pmd,
 	} else if (is_huge_pmd(*pmd)) {
 		/* See comment in handle_pte_fault() */
 		ret = INSTALLED_HUGE_PMD;
+	} else if (!pte_get_unless_zero(pmd)) {
+		spin_unlock(ptl);
+		goto retry;
 	}
 	spin_unlock(ptl);
 
@@ -1028,6 +1034,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
 	struct page *prealloc = NULL;
+	unsigned long start = addr;
 
 again:
 	progress = 0;
@@ -1108,6 +1115,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	pte_unmap(orig_src_pte);
 	add_mm_rss_vec(dst_mm, rss);
 	pte_unmap_unlock(orig_dst_pte, dst_ptl);
+	pte_put(dst_mm, dst_pmd, start);
 	cond_resched();
 
 	if (ret == -EIO) {
@@ -1778,6 +1786,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
 		goto out;
 	retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
 	pte_unmap_unlock(pte, ptl);
+	pte_put(mm, pte_to_pmd(pte), addr);
 out:
 	return retval;
 }
@@ -1810,6 +1819,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 	unsigned long remaining_pages_total = *num;
 	unsigned long pages_to_write_in_pmd;
 	int ret;
+	unsigned long start = addr;
 more:
 	ret = -EFAULT;
 	pmd = walk_to_pmd(mm, addr);
@@ -1836,7 +1846,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 				pte_unmap_unlock(start_pte, pte_lock);
 				ret = err;
 				remaining_pages_total -= pte_idx;
-				goto out;
+				goto put;
 			}
 			addr += PAGE_SIZE;
 			++curr_page_idx;
@@ -1845,9 +1855,13 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
 		pages_to_write_in_pmd -= batch_size;
 		remaining_pages_total -= batch_size;
 	}
-	if (remaining_pages_total)
+	if (remaining_pages_total) {
+		pte_put(mm, pmd, start);
 		goto more;
+	}
 	ret = 0;
+put:
+	pte_put(mm, pmd, start);
 out:
 	*num = remaining_pages_total;
 	return ret;
@@ -2075,6 +2089,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 
 out_unlock:
 	pte_unmap_unlock(pte, ptl);
+	pte_put(mm, pte_to_pmd(pte), addr);
 	return VM_FAULT_NOPAGE;
 }
 
@@ -2275,6 +2290,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, unsigned long end,
 			unsigned long pfn, pgprot_t prot)
 {
+	unsigned long start = addr;
 	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 	int err = 0;
@@ -2294,6 +2310,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(mapped_pte, ptl);
+	pte_put(mm, pmd, start);
 	return err;
 }
 
@@ -2503,6 +2520,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				     pte_fn_t fn, void *data, bool create,
 				     pgtbl_mod_mask *mask)
 {
+	unsigned long start = addr;
 	pte_t *pte, *mapped_pte;
 	int err = 0;
 	spinlock_t *ptl;
@@ -2536,8 +2554,11 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 
-	if (mm != &init_mm)
+	if (mm != &init_mm) {
 		pte_unmap_unlock(mapped_pte, ptl);
+		if (create)
+			pte_put(mm, pmd, start);
+	}
 	return err;
 }
 
@@ -3761,7 +3782,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		/* Deliver the page fault to userland, check inside PT lock */
 		if (userfaultfd_missing(vma)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			return handle_userfault(vmf, VM_UFFD_MISSING);
+			ret = handle_userfault(vmf, VM_UFFD_MISSING);
+			goto put;
 		}
 		goto setpte;
 	}
@@ -3804,7 +3826,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		put_page(page);
-		return handle_userfault(vmf, VM_UFFD_MISSING);
+		ret = handle_userfault(vmf, VM_UFFD_MISSING);
+		goto put;
 	}
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
@@ -3817,14 +3840,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	update_mmu_cache(vma, vmf->address, vmf->pte);
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return ret;
+	goto put;
 release:
 	put_page(page);
 	goto unlock;
 oom_free_page:
 	put_page(page);
 oom:
-	return VM_FAULT_OOM;
+	ret = VM_FAULT_OOM;
+put:
+	pte_put(vma->vm_mm, vmf->pmd, vmf->address);
+	return ret;
 }
 
 /*
@@ -4031,7 +4057,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
-	if (pmd_none(*vmf->pmd)) {
+retry:
+	ret = pte_try_get(vmf->pmd);
+	if (ret == TRYGET_FAILED_NONE) {
 		int alloc_ret;
 
 		if (PageTransCompound(page)) {
@@ -4047,9 +4075,11 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 		if (unlikely(alloc_ret != INSTALLED_PTE))
 			return alloc_ret < 0 ? VM_FAULT_OOM : 0;
-	} else if (pmd_devmap_trans_unstable(vmf->pmd)) {
+	} else if (ret == TRYGET_FAILED_HUGE_PMD) {
 		/* See comment in handle_pte_fault() */
 		return 0;
+	} else if (ret == TRYGET_FAILED_ZERO) {
+		goto retry;
 	}
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -4063,6 +4093,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 
 	update_mmu_tlb(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_put(vma->vm_mm, vmf->pmd, vmf->address);
 	return ret;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index bdfdfd3b50be..26f16a4836d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2736,9 +2736,9 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto abort;
 
 	if (unlikely(anon_vma_prepare(vma)))
-		goto abort;
+		goto put;
 	if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
-		goto abort;
+		goto put;
 
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
@@ -2764,7 +2764,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 			 * device memory.
 			 */
 			pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
-			goto abort;
+			goto put;
 		}
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
@@ -2811,11 +2811,14 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	}
 
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(mm, pmdp, addr);
 	*src = MIGRATE_PFN_MIGRATE;
 	return;
 
 unlock_abort:
 	pte_unmap_unlock(ptep, ptl);
+put:
+	pte_put(mm, pmdp, addr);
 abort:
 	*src &= ~MIGRATE_PFN_MIGRATE;
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index e263d62ae2d0..a4ef20ba9627 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -398,6 +398,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 			break;
 	}
 	pte_unmap_unlock(pte, ptl);
+	pte_put(vma->vm_mm, pte_to_pmd(pte), start);
 	return start;
 }
 
diff --git a/mm/mremap.c b/mm/mremap.c
index fc5c56858883..f80c628db25d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -555,6 +555,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			break;
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
 			  new_pmd, new_addr, need_rmap_locks);
+		pte_put(new_vma->vm_mm, new_pmd, new_addr);
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 2cea08e7f076..37df899a1b9d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -574,6 +574,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 	while (src_addr < src_start + len) {
 		pmd_t dst_pmdval;
+		enum pte_tryget_type tryget_type;
 
 		BUG_ON(dst_addr >= dst_start + len);
 
@@ -583,6 +584,14 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			break;
 		}
 
+again:
+		/*
+		 * After the management of the PTE page changes to the refcount
+		 * mode, the PTE page may be released by another thread(rcu mode),
+		 * so the rcu lock is held here to prevent the PTE page from
+		 * being released.
+		 */
+		rcu_read_lock();
 		dst_pmdval = pmd_read_atomic(dst_pmd);
 		/*
 		 * If the dst_pmd is mapped as THP don't
@@ -593,7 +602,9 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 			break;
 		}
 
-		if (unlikely(pmd_none(dst_pmdval))) {
+		tryget_type = pte_try_get(&dst_pmdval);
+		rcu_read_unlock();
+		if (unlikely(tryget_type == TRYGET_FAILED_NONE)) {
 			int ret = __pte_alloc(dst_mm, dst_pmd);
 
 			/*
@@ -607,6 +618,8 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 				err = -EFAULT;
 				break;
 			}
+		} else if (unlikely(tryget_type == TRYGET_FAILED_ZERO)) {
+			goto again;
 		}
 
 		BUG_ON(pmd_none(*dst_pmd));
@@ -614,6 +627,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 
 		err = mfill_atomic_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
 				       src_addr, &page, mcopy_mode, wp_copy);
+		pte_put(dst_mm, dst_pmd, dst_addr);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (6 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user " Qi Zheng
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

When we split the PMD-mapped THP to the PTE-mapped THP, we should
initialize the refcount of the withdrawn PTE page table page to
HPAGE_PMD_NR, which ensures that we can release the PTE page table
page when it is free(the refcount is 0).

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/pgtable-generic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..523053e09dfa 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -186,6 +186,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 							  struct page, lru);
 	if (pmd_huge_pte(mm, pmdp))
 		list_del(&pgtable->lru);
+	pte_ref_init(pgtable, pmdp, HPAGE_PMD_NR);
 	return pgtable;
 }
 #endif
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user PTE page table page
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (7 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 10/15] mm/pte_ref: add support for page fault path Qi Zheng
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

The !pte_none() entry will take a reference on the user PTE page
table page, such as regular page table entry that map physical
pages, or swap entry, or migrate entry, etc.

So a pte_none() entry is mapped, it needs to increase the refcount
of the PTE page table page. When a !pte_none() entry becomes none,
the refcount of the PTE page table page needs to be decreased.

For swap or migrate cases, which only change the content of
the PTE entry, we keep the refcount unchanged.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 kernel/events/uprobes.c |  2 ++
 mm/filemap.c            |  3 +++
 mm/madvise.c            |  5 +++++
 mm/memory.c             | 42 +++++++++++++++++++++++++++++++++++-------
 mm/migrate.c            |  1 +
 mm/mremap.c             |  7 +++++++
 mm/rmap.c               | 10 ++++++++++
 mm/userfaultfd.c        |  2 ++
 8 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..96dd2959e1ac 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -200,6 +200,8 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	if (new_page)
 		set_pte_at_notify(mm, addr, pvmw.pte,
 				  mk_pte(new_page, vma->vm_page_prot));
+	else
+		pte_put(mm, pte_to_pmd(pvmw.pte), addr);
 
 	page_remove_rmap(old_page, false);
 	if (!page_mapped(old_page))
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e7e9e4fd759..aa47ee11a3d8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3309,6 +3309,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct page *head, *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	unsigned int nr_get = 0;
 
 	rcu_read_lock();
 	head = first_map_page(mapping, &xas, end_pgoff);
@@ -3342,6 +3343,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			ret = VM_FAULT_NOPAGE;
 
 		do_set_pte(vmf, page, addr);
+		nr_get++;
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, addr, vmf->pte);
 		unlock_page(head);
@@ -3351,6 +3353,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		put_page(head);
 	} while ((head = next_map_page(mapping, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	pte_get_many(vmf->pmd, nr_get);
 	pte_put(vma->vm_mm, vmf->pmd, start);
 out:
 	rcu_read_unlock();
diff --git a/mm/madvise.c b/mm/madvise.c
index 0734db8d53a7..82fc40b6dcbf 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -580,6 +580,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	int nr_swap = 0;
 	unsigned long next;
+	unsigned int nr_put = 0;
+	unsigned long start = addr;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -612,6 +614,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			nr_swap--;
 			free_swap_and_cache(entry);
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			nr_put++;
 			continue;
 		}
 
@@ -696,6 +699,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
+	if (nr_put)
+		pte_put_many(mm, pmd, start, nr_put);
 	cond_resched();
 next:
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 0b9af38cfa11..ea4d651ac8c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -878,6 +878,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_swp_clear_uffd_wp(pte);
 	set_pte_at(dst_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -946,6 +947,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 		/* Uffd-wp needs to be delivered to dest pte as well */
 		pte = pte_wrprotect(pte_mkuffd_wp(pte));
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -998,6 +1000,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		pte = pte_clear_uffd_wp(pte);
 
 	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+	pte_get(pte_to_pmd(dst_pte));
 	return 0;
 }
 
@@ -1335,6 +1338,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	pte_t *start_pte;
 	pte_t *pte;
 	swp_entry_t entry;
+	unsigned int nr_put = 0;
+	unsigned long start = addr;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1359,6 +1364,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
+			nr_put++;
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
@@ -1392,6 +1398,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			nr_put++;
 			rss[mm_counter(page)]--;
 
 			if (is_device_private_entry(entry))
@@ -1416,6 +1423,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (unlikely(!free_swap_and_cache(entry)))
 			print_bad_pte(vma, addr, ptent, NULL);
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+		nr_put++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 
 	add_mm_rss_vec(mm, rss);
@@ -1442,6 +1450,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto again;
 	}
 
+	if (nr_put)
+		pte_put_many(mm, pmd, start, nr_put);
+
 	return addr;
 }
 
@@ -1759,6 +1770,7 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
 	inc_mm_counter_fast(mm, mm_counter_file(page));
 	page_add_file_rmap(page, false);
 	set_pte_at(mm, addr, pte, mk_pte(page, prot));
+	pte_get(pte_to_pmd(pte));
 	return 0;
 }
 
@@ -2085,6 +2097,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	set_pte_at(mm, addr, pte, entry);
+	pte_get(pte_to_pmd(pte));
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
 
 out_unlock:
@@ -2291,6 +2304,7 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long pfn, pgprot_t prot)
 {
 	unsigned long start = addr;
+	unsigned int nr_get = 0;
 	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 	int err = 0;
@@ -2306,10 +2320,12 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			break;
 		}
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
+		nr_get++;
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(mapped_pte, ptl);
+	pte_get_many(pmd, nr_get);
 	pte_put(mm, pmd, start);
 	return err;
 }
@@ -2524,6 +2540,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	pte_t *pte, *mapped_pte;
 	int err = 0;
 	spinlock_t *ptl;
+	unsigned int nr_put = 0, nr_get = 0;
 
 	if (create) {
 		mapped_pte = pte = (mm == &init_mm) ?
@@ -2531,6 +2548,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			pte_alloc_map_lock(mm, pmd, addr, &ptl);
 		if (!pte)
 			return -ENOMEM;
+		nr_put++;
 	} else {
 		mapped_pte = pte = (mm == &init_mm) ?
 			pte_offset_kernel(pmd, addr) :
@@ -2543,11 +2561,17 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	if (fn) {
 		do {
-			if (create || !pte_none(*pte)) {
+			if (create) {
 				err = fn(pte++, addr, data);
-				if (err)
-					break;
+				if (mm != &init_mm && !pte_none(*(pte-1)))
+					nr_get++;
+			} else if (!pte_none(*pte)) {
+				err = fn(pte++, addr, data);
+				if (mm != &init_mm && pte_none(*(pte-1)))
+					nr_put++;
 			}
+			if (err)
+				break;
 		} while (addr += PAGE_SIZE, addr != end);
 	}
 	*mask |= PGTBL_PTE_MODIFIED;
@@ -2556,8 +2580,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	if (mm != &init_mm) {
 		pte_unmap_unlock(mapped_pte, ptl);
-		if (create)
-			pte_put(mm, pmd, start);
+		pte_get_many(pmd, nr_get);
+		if (nr_put)
+			pte_put_many(mm, pmd, start, nr_put);
 	}
 	return err;
 }
@@ -3835,6 +3860,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	lru_cache_add_inactive_or_unevictable(page, vma);
 setpte:
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
+	pte_get(vmf->pmd);
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(vma, vmf->address, vmf->pte);
@@ -4086,10 +4112,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 				      vmf->address, &vmf->ptl);
 	ret = 0;
 	/* Re-check under ptl */
-	if (likely(pte_none(*vmf->pte)))
+	if (likely(pte_none(*vmf->pte))) {
 		do_set_pte(vmf, page, vmf->address);
-	else
+		pte_get(vmf->pmd);
+	} else {
 		ret = VM_FAULT_NOPAGE;
+	}
 
 	update_mmu_tlb(vma, vmf->address, vmf->pte);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
diff --git a/mm/migrate.c b/mm/migrate.c
index 26f16a4836d8..c03ac25f42a9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2807,6 +2807,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	} else {
 		/* No need to invalidate - it was non-present before */
 		set_pte_at(mm, addr, ptep, entry);
+		pte_get(pmdp);
 		update_mmu_cache(vma, addr, ptep);
 	}
 
diff --git a/mm/mremap.c b/mm/mremap.c
index f80c628db25d..088a7a75cb4b 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -141,6 +141,8 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
 	unsigned long len = old_end - old_addr;
+	unsigned int nr_put = 0, nr_get = 0;
+	unsigned long old_start = old_addr;
 
 	/*
 	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -181,6 +183,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 			continue;
 
 		pte = ptep_get_and_clear(mm, old_addr, old_pte);
+		nr_put++;
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -197,6 +200,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 		set_pte_at(mm, new_addr, new_pte, pte);
+		nr_get++;
 	}
 
 	arch_leave_lazy_mmu_mode();
@@ -206,6 +210,9 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		spin_unlock(new_ptl);
 	pte_unmap(new_pte - 1);
 	pte_unmap_unlock(old_pte - 1, old_ptl);
+	pte_get_many(new_pmd, nr_get);
+	if (nr_put)
+		pte_put_many(mm, old_pmd, old_start, nr_put);
 	if (need_rmap_locks)
 		drop_rmap_locks(vma);
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 2908d637bcad..630ce8a036b5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1404,6 +1404,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	bool ret = true;
 	struct mmu_notifier_range range;
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	unsigned int nr_put = 0;
 
 	/*
 	 * When racing against e.g. zap_pte_range() on another cpu,
@@ -1551,6 +1552,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
+			nr_put++;
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1564,6 +1566,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				/* We have to invalidate as we cleared the pte */
 				mmu_notifier_invalidate_range(mm, address,
 							address + PAGE_SIZE);
+				pte_put(mm, pvmw.pmd, address);
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
@@ -1575,6 +1578,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);
 					dec_mm_counter(mm, MM_ANONPAGES);
+					nr_put++;
 					goto discard;
 				}
 
@@ -1630,6 +1634,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * See Documentation/vm/mmu_notifier.rst
 			 */
 			dec_mm_counter(mm, mm_counter_file(page));
+			nr_put++;
 		}
 discard:
 		/*
@@ -1641,6 +1646,10 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		 */
 		page_remove_rmap(subpage, PageHuge(page));
 		put_page(page);
+		if (nr_put) {
+			pte_put_many(mm, pvmw.pmd, address, nr_put);
+			nr_put = 0;
+		}
 	}
 
 	mmu_notifier_invalidate_range_end(&range);
@@ -1871,6 +1880,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
 			/* We have to invalidate as we cleared the pte */
 			mmu_notifier_invalidate_range(mm, address,
 						      address + PAGE_SIZE);
+			pte_put(mm, pvmw.pmd, address);
 		} else {
 			swp_entry_t entry;
 			pte_t swp_pte;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 37df899a1b9d..b87c61b94065 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -110,6 +110,7 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		lru_cache_add_inactive_or_unevictable(page, dst_vma);
 
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	pte_get(dst_pmd);
 
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
@@ -204,6 +205,7 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	if (!pte_none(*dst_pte))
 		goto out_unlock;
 	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+	pte_get(dst_pmd);
 	/* No need to invalidate - it was non-present before */
 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
 	ret = 0;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 10/15] mm/pte_ref: add support for page fault path
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (8 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user " Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 11/15] mm/pte_ref: take a refcount before accessing the PTE page table page Qi Zheng
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

In the page fault path, we need to take a reference of
the PTE page table page if the pmd entry is not none,
which ensures the PTE page table page will not be released
by other threads.

And the mmap_lock may be unlocked in advance in some cases
in handle_pte_fault(), then the pmd entry will no longer
be stable:

	thread A		thread B
	page fault		collapse_huge_page
	==========		==================

	mmap_read_unlock()
				mmap_write_lock()
				pgtable_trans_huge_deposit()
				set_pmd_at()
	/* pmd entry is changed! */
	pte_put()

So we should call pte_put() before dropping the mmap_lock.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/userfaultfd.c |  1 +
 mm/filemap.c     |  2 ++
 mm/internal.h    |  1 +
 mm/khugepaged.c  |  8 +++++++-
 mm/memory.c      | 33 ++++++++++++++++++++++++---------
 5 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 22bf14ab2d16..ddbcefa7e0a6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -509,6 +509,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
 						       vmf->address,
 						       vmf->flags, reason);
+	pte_put_vmf(vmf);
 	mmap_read_unlock(mm);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released))) {
diff --git a/mm/filemap.c b/mm/filemap.c
index aa47ee11a3d8..4fdc74dc6736 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1708,6 +1708,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			return false;
 
+		pte_put_vmf(vmf);
 		mmap_read_unlock(mm);
 		if (flags & FAULT_FLAG_KILLABLE)
 			folio_wait_locked_killable(folio);
@@ -1720,6 +1721,7 @@ bool __folio_lock_or_retry(struct folio *folio, struct vm_fault *vmf)
 
 		ret = __folio_lock_killable(folio);
 		if (ret) {
+			pte_put_vmf(vmf);
 			mmap_read_unlock(mm);
 			return false;
 		}
diff --git a/mm/internal.h b/mm/internal.h
index 474d6e3443f8..460418828a76 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -488,6 +488,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 	if (fault_flag_allow_retry_first(flags) &&
 	    !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
 		fpin = get_file(vmf->vma->vm_file);
+		pte_put_vmf(vmf);
 		mmap_read_unlock(vmf->vma->vm_mm);
 	}
 	return fpin;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e99101162f1a..92b0494f4a00 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1019,10 +1019,13 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 			.pmd = pmd,
 		};
 
-		vmf.pte = pte_offset_map(pmd, address);
+		vmf.pte = pte_tryget_map(pmd, address);
+		if (!vmf.pte)
+			continue;
 		vmf.orig_pte = *vmf.pte;
 		if (!is_swap_pte(vmf.orig_pte)) {
 			pte_unmap(vmf.pte);
+			pte_put_vmf(&vmf);
 			continue;
 		}
 		swapped_in++;
@@ -1041,7 +1044,10 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
 				return false;
 			}
+		} else {
+			pte_put_vmf(&vmf);
 		}
+
 		if (ret & VM_FAULT_ERROR) {
 			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
 			return false;
diff --git a/mm/memory.c b/mm/memory.c
index ea4d651ac8c7..5cc4ce0af665 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4571,8 +4571,10 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 {
 	pte_t entry;
+	vm_fault_t ret;
 
-	if (unlikely(pmd_none(*vmf->pmd))) {
+retry:
+	if (unlikely(pmd_none(READ_ONCE(*vmf->pmd)))) {
 		/*
 		 * Leave __pte_alloc() until later: because vm_ops->fault may
 		 * want to allocate huge page, and if we expose page table
@@ -4595,13 +4597,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		 */
 		if (pmd_devmap_trans_unstable(vmf->pmd))
 			return 0;
+
 		/*
 		 * A regular pmd is established and it can't morph into a huge
 		 * pmd from under us anymore at this point because we hold the
 		 * mmap_lock read mode and khugepaged takes it in write mode.
 		 * So now it's safe to run pte_offset_map().
 		 */
-		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		vmf->pte = pte_tryget_map(vmf->pmd, vmf->address);
+		if (!vmf->pte)
+			goto retry;
 		vmf->orig_pte = *vmf->pte;
 
 		/*
@@ -4616,6 +4621,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		if (pte_none(vmf->orig_pte)) {
 			pte_unmap(vmf->pte);
 			vmf->pte = NULL;
+			pte_put_vmf(vmf);
 		}
 	}
 
@@ -4626,11 +4632,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			return do_fault(vmf);
 	}
 
-	if (!pte_present(vmf->orig_pte))
-		return do_swap_page(vmf);
+	if (!pte_present(vmf->orig_pte)) {
+		ret = do_swap_page(vmf);
+		goto put;
+	}
 
-	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
-		return do_numa_page(vmf);
+	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) {
+		ret = do_numa_page(vmf);
+		goto put;
+	}
 
 	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 	spin_lock(vmf->ptl);
@@ -4640,8 +4650,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		goto unlock;
 	}
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
-			return do_wp_page(vmf);
+		if (!pte_write(entry)) {
+			ret = do_wp_page(vmf);
+			goto put;
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
@@ -4663,7 +4675,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	}
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return 0;
+	ret = 0;
+put:
+	pte_put_vmf(vmf);
+	return ret;
 }
 
 /*
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 11/15] mm/pte_ref: take a refcount before accessing the PTE page table page
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (9 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 10/15] mm/pte_ref: add support for page fault path Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 12/15] mm/pte_ref: update the pmd entry in move_normal_pmd() Qi Zheng
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Now, the user page table will only be released by calling
free_pgtables() when the process exits or unmap_region() is called
(e.g. munmap() path). So other threads only need to ensure mutual
exclusion with these paths to ensure that the page table is not
released. For example::

	thread A			thread B
	page table walker		munmap
	=================		======

	mmap_read_lock()
	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
		pte_offset_map_lock()
		*walk page table*
		pte_unmap_unlock()
	}
	mmap_read_unlock()

					mmap_write_lock_killable()
					detach_vmas_to_be_unmapped()
					unmap_region()
					--> free_pgtables()

But after we introduce the reference count for the user PTE page table
page, these existing balances will be broken. The page can be released
at any time when its pte_refcount is reduced to 0. Therefore, the
following case may happen::

	thread A		thread B
	page table walker	madvise(MADV_DONTNEED)
	=================	======================

	mmap_read_lock()
	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {

				mmap_read_lock()
				unmap_page_range()
				--> zap_pte_range()
				    *the pte_refcount is reduced to 0*
				    --> *free PTE page table page*

		/* broken!! */
		pte_offset_map_lock()

As we can see, all of the thread A and B hold the read lock of
mmap_lock, so they can execute concurrently. When thread B releases the
PTE page table page, the value in the corresponding pmd entry will
become unstable, which may be none or huge pmd, or map a new PTE page
table page again. This will cause system chaos and even panic.

So we need to try to take a reference to the PTE page table page before
walking page table, then the system will become orderly again::

	thread A		thread B
	page table walker	madvise(MADV_DONTNEED)
	=================	======================

	mmap_read_lock()
	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
		pte_try_get()
		--> pte_get_unless_zero
		*if successfully, then:*

				mmap_read_lock()
				unmap_page_range()
				--> zap_pte_range()
				    *the pte_refcount is reduced to 1*

		pte_offset_map_lock()
		*walk page table*
		pte_unmap_unlock()
		pte_put()
		--> *the pte_refcount is reduced to 0*
		    --> *free PTE page table page*

There is also a lock-less scenario(such as fast GUP). Fortunately, we
don't need to do any additional operations to ensure that the system is
in order. Take fast GUP as an example::

	thread A		thread B
	fast GUP		madvise(MADV_DONTNEED)
	========		======================

	get_user_pages_fast_only()
	--> local_irq_save();
				*free PTE page table page*
				--> unhook page
				    /* The CPU where thread A is
				     * located closed the local
				     * interrupt and cannot respond to
				     * IPI, so it will block here
				     */
				    TLB invalidate page
	    gup_pgd_range();
	    local_irq_restore();
	    			    *free page*

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 fs/proc/task_mmu.c   | 24 ++++++++++++++---
 fs/userfaultfd.c     |  8 ++++--
 include/linux/rmap.h |  2 ++
 mm/damon/vaddr.c     | 12 +++++++--
 mm/gup.c             | 13 ++++++++--
 mm/hmm.c             |  5 +++-
 mm/khugepaged.c      | 13 ++++++++--
 mm/ksm.c             |  6 ++++-
 mm/madvise.c         | 16 +++++++++---
 mm/memcontrol.c      | 12 +++++++--
 mm/memory-failure.c  | 11 ++++++--
 mm/memory.c          | 73 +++++++++++++++++++++++++++++++++++++---------------
 mm/mempolicy.c       |  6 ++++-
 mm/migrate.c         | 27 ++++++++++++-------
 mm/mincore.c         |  7 ++++-
 mm/mprotect.c        | 11 +++++---
 mm/page_vma_mapped.c |  4 +++
 mm/pagewalk.c        | 15 ++++++++---
 mm/swapfile.c        |  3 +++
 19 files changed, 209 insertions(+), 59 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ad667dbc96f5..82dd5fd540ce 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -581,6 +581,7 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -596,10 +597,13 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	 * keeps khugepaged out of here and from collapsing things
 	 * in here.
 	 */
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		goto out;
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		smaps_pte_entry(pte, addr, walk);
 	pte_unmap_unlock(pte - 1, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 out:
 	cond_resched();
 	return 0;
@@ -1124,6 +1128,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	pte_t *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -1149,7 +1154,9 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
@@ -1171,6 +1178,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		ClearPageReferenced(page);
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 	cond_resched();
 	return 0;
 }
@@ -1410,6 +1418,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	spinlock_t *ptl;
 	pte_t *pte, *orig_pte;
 	int err = 0;
+	unsigned long start = addr;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmdp, vma);
@@ -1482,7 +1491,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 	 * We can assume that @vma always points to a valid one and @end never
 	 * goes beyond vma->vm_end.
 	 */
-	orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(walk->mm, pmdp, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		pagemap_entry_t pme;
 
@@ -1492,6 +1503,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			break;
 	}
 	pte_unmap_unlock(orig_pte, ptl);
+	pte_put(walk->mm, pmdp, start);
 
 	cond_resched();
 
@@ -1798,6 +1810,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *orig_pte;
 	pte_t *pte;
+	unsigned long start = addr;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmd, vma);
@@ -1815,7 +1828,9 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
-	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	do {
 		struct page *page = can_gather_numa_stats(*pte, vma, addr);
 		if (!page)
@@ -1824,6 +1839,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	pte_unmap_unlock(orig_pte, ptl);
+	pte_put(walk->mm, pmd, start);
 	cond_resched();
 	return 0;
 }
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index ddbcefa7e0a6..d1e18e5f3a13 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -297,6 +297,8 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	if (!pud_present(*pud))
 		goto out;
 	pmd = pmd_offset(pud, address);
+
+retry:
 	/*
 	 * READ_ONCE must function as a barrier with narrower scope
 	 * and it must be equivalent to:
@@ -323,7 +325,9 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	 * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
 	 * and use the standard pte_offset_map() instead of parsing _pmd.
 	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_tryget_map(pmd, address);
+	if (!pte)
+		goto retry;
 	/*
 	 * Lockless access: we're in a wait_event so it's ok if it
 	 * changes under us.
@@ -333,7 +337,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	if (!pte_write(*pte) && (reason & VM_UFFD_WP))
 		ret = true;
 	pte_unmap(pte);
-
+	pte_put(mm, pmd, address);
 out:
 	return ret;
 }
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 221c3c6438a7..5bd76fb8b93a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -222,6 +222,8 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 		pte_unmap(pvmw->pte);
 	if (pvmw->ptl)
 		spin_unlock(pvmw->ptl);
+	if (pvmw->pte && !PageHuge(pvmw->page))
+		pte_put(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address);
 }
 
 bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 35fe49080ee9..8b816f92a563 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -373,6 +373,7 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 	pte_t *pte;
 	spinlock_t *ptl;
 
+retry:
 	if (pmd_huge(*pmd)) {
 		ptl = pmd_lock(walk->mm, pmd);
 		if (pmd_huge(*pmd)) {
@@ -385,12 +386,15 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		return 0;
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		goto retry;
 	if (!pte_present(*pte))
 		goto out;
 	damon_ptep_mkold(pte, walk->mm, addr);
 out:
 	pte_unmap_unlock(pte, ptl);
+	pte_put(walk->mm, pmd, addr);
 	return 0;
 }
 
@@ -446,6 +450,7 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 	struct damon_young_walk_private *priv = walk->private;
 
+retry:
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (pmd_huge(*pmd)) {
 		ptl = pmd_lock(walk->mm, pmd);
@@ -473,7 +478,9 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		return -EINVAL;
-	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		goto retry;
 	if (!pte_present(*pte))
 		goto out;
 	page = damon_get_page(pte_pfn(*pte));
@@ -487,6 +494,7 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
 	put_page(page);
 out:
 	pte_unmap_unlock(pte, ptl);
+	pte_put(walk->mm, pmd, addr);
 	return 0;
 }
 
diff --git a/mm/gup.c b/mm/gup.c
index e084111103f0..7b6d024ad5c7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -488,7 +488,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
 
-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmd, address, &ptl);
+	if (!ptep)
+		return no_page_table(vma, flags);
 	pte = *ptep;
 	if (!pte_present(pte)) {
 		swp_entry_t entry;
@@ -505,6 +507,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		if (!is_migration_entry(entry))
 			goto no_page;
 		pte_unmap_unlock(ptep, ptl);
+		pte_put(mm, pmd, address);
 		migration_entry_wait(mm, pmd, address);
 		goto retry;
 	}
@@ -512,6 +515,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto no_page;
 	if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
 		pte_unmap_unlock(ptep, ptl);
+		pte_put(mm, pmd, address);
 		return NULL;
 	}
 
@@ -600,9 +604,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 out:
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(mm, pmd, address);
 	return page;
 no_page:
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(mm, pmd, address);
 	if (!pte_none(pte))
 		return NULL;
 	return no_page_table(vma, flags);
@@ -885,7 +891,9 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	if (!pmd_present(*pmd))
 		return -EFAULT;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
-	pte = pte_offset_map(pmd, address);
+	pte = pte_tryget_map(pmd, address);
+	if (!pte)
+		return -EFAULT;
 	if (pte_none(*pte))
 		goto unmap;
 	*vma = get_gate_vma(mm);
@@ -905,6 +913,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 	ret = 0;
 unmap:
 	pte_unmap(pte);
+	pte_put(mm, pmd, address);
 	return ret;
 }
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 842e26599238..b8917a5ae442 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -383,7 +383,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 	}
 
-	ptep = pte_offset_map(pmdp, addr);
+	ptep = pte_tryget_map(pmdp, addr);
+	if (!ptep)
+		goto again;
 	for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) {
 		int r;
 
@@ -394,6 +396,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		}
 	}
 	pte_unmap(ptep - 1);
+	pte_put(walk->mm, pmdp, start);
 	return 0;
 }
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 92b0494f4a00..5842c0774d70 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1249,7 +1249,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	}
 
 	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
-	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+	pte = pte_tryget_map_lock(mm, pmd, address, &ptl);
+	if (!pte) {
+		result = SCAN_PMD_NULL;
+		goto out;
+	}
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
@@ -1370,6 +1374,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	}
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+	pte_put(mm, pmd, address);
 	if (ret) {
 		node = khugepaged_find_target_node();
 		/* collapse_huge_page will return with the mmap_lock released */
@@ -1472,7 +1477,9 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	if (!pmd)
 		goto drop_hpage;
 
-	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+	start_pte = pte_tryget_map_lock(mm, pmd, haddr, &ptl);
+	if (!start_pte)
+		goto drop_hpage;
 
 	/* step 1: check all mapped PTEs are to the right huge page */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1510,6 +1517,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
+	pte_put(mm, pmd, haddr);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (count) {
@@ -1531,6 +1539,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 
 abort:
 	pte_unmap_unlock(start_pte, ptl);
+	pte_put(mm, pmd, haddr);
 	goto drop_hpage;
 }
 
diff --git a/mm/ksm.c b/mm/ksm.c
index 0662093237e4..94aeaed42c1f 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1140,9 +1140,12 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 				addr + PAGE_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 
-	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out_mn;
 	if (!pte_same(*ptep, orig_pte)) {
 		pte_unmap_unlock(ptep, ptl);
+		pte_put(mm, pmd, addr);
 		goto out_mn;
 	}
 
@@ -1182,6 +1185,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	put_page(page);
 
 	pte_unmap_unlock(ptep, ptl);
+	pte_put(mm, pmd, addr);
 	err = 0;
 out_mn:
 	mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/madvise.c b/mm/madvise.c
index 82fc40b6dcbf..5cf2832abb98 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -200,9 +200,12 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 		struct page *page;
 		spinlock_t *ptl;
 
-		orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+		orig_pte = pte_tryget_map_lock(vma->vm_mm, pmd, start, &ptl);
+		if (!orig_pte)
+			continue;
 		pte = *(orig_pte + ((index - start) / PAGE_SIZE));
 		pte_unmap_unlock(orig_pte, ptl);
+		pte_put(vma->vm_mm, pmd, start);
 
 		if (pte_present(pte) || pte_none(pte))
 			continue;
@@ -317,6 +320,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	pte_t *orig_pte, *pte, ptent;
 	spinlock_t *ptl;
 	struct page *page = NULL;
+	unsigned long start = addr;
 	LIST_HEAD(page_list);
 
 	if (fatal_signal_pending(current))
@@ -393,7 +397,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		return 0;
 #endif
 	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
@@ -471,6 +477,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 	if (pageout)
 		reclaim_pages(&page_list);
 	cond_resched();
@@ -592,7 +599,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 
 	tlb_change_page_size(tlb, PAGE_SIZE);
-	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	orig_pte = pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
+	nr_put++;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 781605e92015..7283044d4f64 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5773,6 +5773,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -5789,11 +5790,14 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 
 	if (pmd_trans_unstable(pmd))
 		return 0;
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; pte++, addr += PAGE_SIZE)
 		if (get_mctgt_type(vma, addr, *pte, NULL))
 			mc.precharge++;	/* increment precharge temporarily */
 	pte_unmap_unlock(pte - 1, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 	cond_resched();
 
 	return 0;
@@ -5973,6 +5977,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct page *page;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -6008,7 +6013,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 retry:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	pte = pte_tryget_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
 		bool device = false;
@@ -6058,6 +6065,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		}
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	pte_put(vma->vm_mm, pmd, start);
 	cond_resched();
 
 	if (addr != end) {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f64ebb6226cb..6f281e827c32 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -331,10 +331,13 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
 		return 0;
 	if (pmd_devmap(*pmd))
 		return PMD_SHIFT;
-	pte = pte_offset_map(pmd, address);
+	pte = pte_tryget_map(pmd, address);
+	if (!pte)
+		return 0;
 	if (pte_present(*pte) && pte_devmap(*pte))
 		ret = PAGE_SHIFT;
 	pte_unmap(pte);
+	pte_put(vma->vm_mm, pmd, address);
 	return ret;
 }
 
@@ -634,6 +637,7 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 	int ret = 0;
 	pte_t *ptep, *mapped_pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmdp, walk->vma);
 	if (ptl) {
@@ -645,8 +649,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 	if (pmd_trans_unstable(pmdp))
 		goto out;
 
-	mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
+	mapped_pte = ptep = pte_tryget_map_lock(walk->vma->vm_mm, pmdp,
 						addr, &ptl);
+	if (!ptep)
+		goto out;
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
 					     hwp->pfn, &hwp->tk);
@@ -654,6 +660,7 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
 			break;
 	}
 	pte_unmap_unlock(mapped_pte, ptl);
+	pte_put(walk->vma->vm_mm, pmdp, start);
 out:
 	cond_resched();
 	return ret;
diff --git a/mm/memory.c b/mm/memory.c
index 5cc4ce0af665..e360ecd37a71 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1165,7 +1165,8 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	src_pmd = pmd_offset(src_pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_huge_pmd(*src_pmd)) {
+retry:
+		if (is_huge_pmd(READ_ONCE(*src_pmd))) {
 			int err;
 			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
 			err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1178,9 +1179,14 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
+		if (pte_try_get(src_pmd))
+			goto retry;
 		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
-				   addr, next))
+				   addr, next)) {
+			pte_put(src_mm, src_pmd, addr);
 			return -ENOMEM;
+		}
+		pte_put(src_mm, src_pmd, addr);
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
@@ -1494,7 +1500,10 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		 */
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			goto next;
+		if (pte_try_get(pmd))
+			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
+		pte_put(tlb->mm, pmd, addr);
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
@@ -2606,18 +2615,26 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(pud, addr);
 	}
 	do {
+		pmd_t pmdval;
+
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(*pmd) && !create)
+retry:
+		pmdval = READ_ONCE(*pmd);
+		if (pmd_none(pmdval) && !create)
 			continue;
-		if (WARN_ON_ONCE(pmd_leaf(*pmd)))
+		if (WARN_ON_ONCE(pmd_leaf(pmdval)))
 			return -EINVAL;
-		if (!pmd_none(*pmd) && WARN_ON_ONCE(pmd_bad(*pmd))) {
+		if (!pmd_none(pmdval) && WARN_ON_ONCE(pmd_bad(pmdval))) {
 			if (!create)
 				continue;
 			pmd_clear_bad(pmd);
 		}
+		if (!create && pte_try_get(pmd))
+			goto retry;
 		err = apply_to_pte_range(mm, pmd, addr, next,
 					 fn, data, create, mask);
+		if (!create)
+			pte_put(mm, pmd, addr);
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
@@ -4343,26 +4360,31 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 		 * If we find a migration pmd entry or a none pmd entry, which
 		 * should never happen, return SIGBUS
 		 */
-		if (unlikely(!pmd_present(*vmf->pmd)))
+		if (unlikely(!pmd_present(*vmf->pmd))) {
 			ret = VM_FAULT_SIGBUS;
-		else {
-			vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
+		} else {
+			vmf->pte = pte_tryget_map_lock(vmf->vma->vm_mm,
 						       vmf->pmd,
 						       vmf->address,
 						       &vmf->ptl);
-			/*
-			 * Make sure this is not a temporary clearing of pte
-			 * by holding ptl and checking again. A R/M/W update
-			 * of pte involves: take ptl, clearing the pte so that
-			 * we don't have concurrent modification by hardware
-			 * followed by an update.
-			 */
-			if (unlikely(pte_none(*vmf->pte)))
+			if (vmf->pte) {
+				/*
+				 * Make sure this is not a temporary clearing of pte
+				 * by holding ptl and checking again. A R/M/W update
+				 * of pte involves: take ptl, clearing the pte so that
+				 * we don't have concurrent modification by hardware
+				 * followed by an update.
+				 */
+				if (unlikely(pte_none(*vmf->pte)))
+					ret = VM_FAULT_SIGBUS;
+				else
+					ret = VM_FAULT_NOPAGE;
+
+				pte_unmap_unlock(vmf->pte, vmf->ptl);
+				pte_put(vma->vm_mm, vmf->pmd, vmf->address);
+			} else {
 				ret = VM_FAULT_SIGBUS;
-			else
-				ret = VM_FAULT_NOPAGE;
-
-			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			}
 		}
 	} else if (!(vmf->flags & FAULT_FLAG_WRITE))
 		ret = do_read_fault(vmf);
@@ -5016,13 +5038,22 @@ int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
 					(address & PAGE_MASK) + PAGE_SIZE);
 		mmu_notifier_invalidate_range_start(range);
 	}
-	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+	ptep = pte_tryget_map_lock(mm, pmd, address, ptlp);
+	if (!ptep)
+		goto out;
 	if (!pte_present(*ptep))
 		goto unlock;
+	/*
+	 * when we reach here, it means that the refcount of the pte is at least
+	 * one and the contents of the PTE page table are stable until @ptlp is
+	 * released, so we can put pte safely.
+	 */
+	pte_put(mm, pmd, address);
 	*ptepp = ptep;
 	return 0;
 unlock:
 	pte_unmap_unlock(ptep, *ptlp);
+	pte_put(mm, pmd, address);
 	if (range)
 		mmu_notifier_invalidate_range_end(range);
 out:
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 774a3d3183a7..18e57ba515dc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -508,6 +508,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	bool has_unmovable = false;
 	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
+	unsigned long start = addr;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -520,7 +521,9 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
-	mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	mapped_pte = pte = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
@@ -553,6 +556,7 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 			break;
 	}
 	pte_unmap_unlock(mapped_pte, ptl);
+	pte_put(walk->mm, pmd, start);
 	cond_resched();
 
 	if (has_unmovable)
diff --git a/mm/migrate.c b/mm/migrate.c
index c03ac25f42a9..5a234ddf36b1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -323,8 +323,12 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 				unsigned long address)
 {
 	spinlock_t *ptl = pte_lockptr(mm, pmd);
-	pte_t *ptep = pte_offset_map(pmd, address);
-	__migration_entry_wait(mm, ptep, ptl);
+	pte_t *ptep = pte_tryget_map(pmd, address);
+
+	if (ptep) {
+		__migration_entry_wait(mm, ptep, ptl);
+		pte_put(mm, pmd, address);
+	}
 }
 
 void migration_entry_wait_huge(struct vm_area_struct *vma,
@@ -2249,21 +2253,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	unsigned long addr = start, unmapped = 0;
 	spinlock_t *ptl;
 	pte_t *ptep;
+	pmd_t pmdval;
 
 again:
-	if (pmd_none(*pmdp))
+	pmdval = READ_ONCE(*pmdp);
+	if (pmd_none(pmdvalp))
 		return migrate_vma_collect_hole(start, end, -1, walk);
 
-	if (pmd_trans_huge(*pmdp)) {
+	if (pmd_trans_huge(pmdval)) {
 		struct page *page;
 
 		ptl = pmd_lock(mm, pmdp);
-		if (unlikely(!pmd_trans_huge(*pmdp))) {
+		if (unlikely(!pmd_trans_huge(pmdval))) {
 			spin_unlock(ptl);
 			goto again;
 		}
 
-		page = pmd_page(*pmdp);
+		page = pmd_page(pmdval);
 		if (is_huge_zero_page(page)) {
 			spin_unlock(ptl);
 			split_huge_pmd(vma, pmdp, addr);
@@ -2284,16 +2290,18 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			if (ret)
 				return migrate_vma_collect_skip(start, end,
 								walk);
-			if (pmd_none(*pmdp))
+			if (pmd_none(pmdval))
 				return migrate_vma_collect_hole(start, end, -1,
 								walk);
 		}
 	}
 
-	if (unlikely(pmd_bad(*pmdp)))
+	if (unlikely(pmd_bad(pmdval)))
 		return migrate_vma_collect_skip(start, end, walk);
 
-	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+	ptep = pte_tryget_map_lock(mm, pmdp, addr, &ptl);
+	if (!ptep)
+		goto again;
 	arch_enter_lazy_mmu_mode();
 
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
@@ -2416,6 +2424,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	}
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(ptep - 1, ptl);
+	pte_put(mm, pmdp, start);
 
 	/* Only flush the TLB if we actually modified any entries */
 	if (unmapped)
diff --git a/mm/mincore.c b/mm/mincore.c
index 9122676b54d6..92e56cef2473 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -104,7 +104,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *ptep;
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
+	unsigned long start = addr;
 
+retry:
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		memset(vec, 1, nr);
@@ -117,7 +119,9 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		goto out;
 	}
 
-	ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	ptep = pte_tryget_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto retry;
 	for (; addr != end; ptep++, addr += PAGE_SIZE) {
 		pte_t pte = *ptep;
 
@@ -148,6 +152,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		vec++;
 	}
 	pte_unmap_unlock(ptep - 1, ptl);
+	pte_put(walk->mm, pmd, start);
 out:
 	walk->private += nr;
 	cond_resched();
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2d5064a4631c..5c663270b816 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -234,9 +234,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 	pmd = pmd_offset(pud, addr);
 	do {
 		unsigned long this_pages;
+		pmd_t pmdval;
 
 		next = pmd_addr_end(addr, end);
-
+retry:
+		pmdval = READ_ONCE(*pmd);
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -245,7 +247,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * Hence, it's necessary to atomically read the PMD value
 		 * for all the checks.
 		 */
-		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		if (!is_swap_pmd(pmdval) && !pmd_devmap(pmdval) &&
 		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;
 
@@ -257,7 +259,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			mmu_notifier_invalidate_range_start(&range);
 		}
 
-		if (is_huge_pmd(*pmd)) {
+		if (is_huge_pmd(pmdval)) {
 			if (next - addr != HPAGE_PMD_SIZE) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
 			} else {
@@ -276,8 +278,11 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 			}
 			/* fall through, the trans huge pmd just split */
 		}
+		if (pte_try_get(pmd))
+			goto retry;
 		this_pages = change_pte_range(vma, pmd, addr, next, newprot,
 					      cp_flags);
+		pte_put(vma->vm_mm, pmd, addr);
 		pages += this_pages;
 next:
 		cond_resched();
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f7b331081791..4725a2f78f09 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -211,6 +211,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 		}
 
 		pvmw->pmd = pmd_offset(pud, pvmw->address);
+retry:
 		/*
 		 * Make sure the pmd value isn't cached in a register by the
 		 * compiler and used as a stale value after we've observed a
@@ -258,6 +259,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			step_forward(pvmw, PMD_SIZE);
 			continue;
 		}
+		if (pte_try_get(pvmw->pmd))
+			goto retry;
 		if (!map_pte(pvmw))
 			goto next_pte;
 this_pte:
@@ -275,6 +278,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 					pvmw->ptl = NULL;
 				}
 				pte_unmap(pvmw->pte);
+				pte_put(pvmw->vma->vm_mm, pvmw->pmd, pvmw->address);
 				pvmw->pte = NULL;
 				goto restart;
 			}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9b3db11a4d1d..72074a34beea 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -110,6 +110,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	do {
 again:
 		next = pmd_addr_end(addr, end);
+retry:
 		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
@@ -147,10 +148,18 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 				goto again;
 		}
 
-		if (is_hugepd(__hugepd(pmd_val(*pmd))))
+		if (is_hugepd(__hugepd(pmd_val(*pmd)))) {
 			err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
-		else
-			err = walk_pte_range(pmd, addr, next, walk);
+		} else {
+			if (!walk->no_vma) {
+				if (pte_try_get(pmd))
+					goto retry;
+				err = walk_pte_range(pmd, addr, next, walk);
+				pte_put(walk->mm, pmd, addr);
+			} else {
+				err = walk_pte_range(pmd, addr, next, walk);
+			}
+		}
 		if (err)
 			break;
 	} while (pmd++, addr = next, addr != end);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e59e08ef46e1..175b35fec758 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2023,8 +2023,11 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
+		if (pte_try_get(pmd))
+			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, type,
 				      frontswap, fs_pages_to_unuse);
+		pte_put(vma->vm_mm, pmd, addr);
 		if (ret)
 			return ret;
 	} while (pmd++, addr = next, addr != end);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 12/15] mm/pte_ref: update the pmd entry in move_normal_pmd()
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (10 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 11/15] mm/pte_ref: take a refcount before accessing the PTE page table page Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 10:54 ` [PATCH v3 13/15] mm/pte_ref: free user PTE page table pages Qi Zheng
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

The ->pmd member record the pmd entry that maps the user PTE page
table page. When the pmd entry changes, ->pmd needs to be updated
synchronously.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 mm/mremap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 088a7a75cb4b..4661cdec79dc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -278,6 +278,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 	VM_BUG_ON(!pmd_none(*new_pmd));
 
 	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
+	pte_update_pmd(pmd, new_pmd);
 	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 13/15] mm/pte_ref: free user PTE page table pages
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (11 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 12/15] mm/pte_ref: update the pmd entry in move_normal_pmd() Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-14 14:43   ` [mm/pte_ref] afcc9fb874: kernel_BUG_at_include/linux/pte_ref.h kernel test robot
  2021-11-10 10:54 ` [PATCH v3 14/15] Documentation: add document for pte_ref Qi Zheng
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

This commit introduces CONFIG_FREE_USER_PTE option and
implements the actual logic to the dummy helpers about
pte_ref.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/mm.h      |  2 ++
 include/linux/pgtable.h |  3 +-
 include/linux/pte_ref.h | 53 ++++++++++++++++++++++++----
 mm/Kconfig              |  4 +++
 mm/debug_vm_pgtable.c   |  2 ++
 mm/memory.c             | 15 ++++++++
 mm/pte_ref.c            | 91 +++++++++++++++++++++++++++++++++++++++++++++----
 7 files changed, 156 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 753a9435e0d0..18fbf9e0996a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -437,6 +437,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PTE_GET: This means the refcount of the pte has been got.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -468,6 +469,7 @@ enum fault_flag {
 	FAULT_FLAG_REMOTE =		1 << 7,
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
+	FAULT_FLAG_PTE_GET =		1 << 10,
 };
 
 /*
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c8f045705c1e..6ac51d58f11a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -480,7 +480,6 @@ static inline pte_t ptep_get_lockless(pte_t *ptep)
 }
 #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR
 static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 					    unsigned long address,
@@ -491,6 +490,8 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 	return pmd;
 }
 #endif /* __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR */
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR
 static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
 					    unsigned long address,
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index b6d8335bdc59..8a26eaba83ef 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -8,6 +8,7 @@
 #define _LINUX_PTE_REF_H
 
 #include <linux/pgtable.h>
+#include <linux/page-flags.h>
 
 enum pte_tryget_type {
 	TRYGET_SUCCESSED,
@@ -16,12 +17,49 @@ enum pte_tryget_type {
 	TRYGET_FAILED_HUGE_PMD,
 };
 
-bool pte_get_unless_zero(pmd_t *pmd);
-enum pte_tryget_type pte_try_get(pmd_t *pmd);
 void pte_put_vmf(struct vm_fault *vmf);
+enum pte_tryget_type pte_try_get(pmd_t *pmd);
+bool pte_get_unless_zero(pmd_t *pmd);
+
+#ifdef CONFIG_FREE_USER_PTE
+void free_user_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr);
 
 static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
 {
+	pte->pmd = pmd;
+	atomic_set(&pte->pte_refcount, count);
+}
+
+static inline pmd_t *pte_to_pmd(pte_t *pte)
+{
+	return virt_to_page(pte)->pmd;
+}
+
+static inline void pte_update_pmd(pmd_t old_pmd, pmd_t *new_pmd)
+{
+	pmd_pgtable(old_pmd)->pmd = new_pmd;
+}
+
+static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
+{
+	pgtable_t pte = pmd_pgtable(*pmd);
+
+	VM_BUG_ON(!PageTable(pte));
+	atomic_add(nr, &pte->pte_refcount);
+}
+
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, unsigned int nr)
+{
+	pgtable_t pte = pmd_pgtable(*pmd);
+
+	VM_BUG_ON(!PageTable(pte));
+	if (atomic_sub_and_test(nr, &pte->pte_refcount))
+		free_user_pte_table(mm, pmd, addr & PMD_MASK);
+}
+#else
+static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
+{
 }
 
 static inline pmd_t *pte_to_pmd(pte_t *pte)
@@ -37,6 +75,12 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
 {
 }
 
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, unsigned int nr)
+{
+}
+#endif /* CONFIG_FREE_USER_PTE */
+
 /*
  * pte_get - Increment refcount for the PTE page table.
  * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
@@ -66,11 +110,6 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	return pte_offset_map_lock(mm, pmd, address, ptlp);
 }
 
-static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
-				unsigned long addr, unsigned int nr)
-{
-}
-
 /*
  * pte_put - Decrement refcount for the PTE page table.
  * @mm: the mm_struct of the target address space.
diff --git a/mm/Kconfig b/mm/Kconfig
index 5c5508fafcec..44549d287869 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -898,6 +898,10 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+config FREE_USER_PTE
+	def_bool y
+	depends on X86_64
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 52f006654664..757dd84ee254 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -1049,8 +1049,10 @@ static void __init destroy_args(struct pgtable_debug_args *args)
 	/* Free page table entries */
 	if (args->start_ptep) {
 		pte_put(args->mm, args->start_pmdp, args->vaddr);
+#ifndef CONFIG_FREE_USER_PTE
 		pte_free(args->mm, args->start_ptep);
 		mm_dec_nr_ptes(args->mm);
+#endif
 	}
 
 	if (args->start_pmdp) {
diff --git a/mm/memory.c b/mm/memory.c
index e360ecd37a71..4d1ede78d1b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -219,6 +219,17 @@ static void check_sync_rss_stat(struct task_struct *task)
 
 #endif /* SPLIT_RSS_COUNTING */
 
+#ifdef CONFIG_FREE_USER_PTE
+static inline void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
+				  unsigned long addr)
+{
+	/*
+	 * We should never reach here since the PTE page tables are
+	 * dynamically freed.
+	 */
+	BUG();
+}
+#else
 /*
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
@@ -231,6 +242,7 @@ static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
 	pte_free_tlb(tlb, token, addr);
 	mm_dec_nr_ptes(tlb->mm);
 }
+#endif /* CONFIG_FREE_USER_PTE */
 
 static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				unsigned long addr, unsigned long end,
@@ -4631,6 +4643,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			goto retry;
 		vmf->orig_pte = *vmf->pte;
 
+		if (IS_ENABLED(CONFIG_FREE_USER_PTE))
+			vmf->flags |= FAULT_FLAG_PTE_GET;
+
 		/*
 		 * some architectures can have larger ptes than wordsize,
 		 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index de109905bc8f..728e61cea25e 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -7,7 +7,10 @@
 
 #include <linux/pte_ref.h>
 #include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <asm/tlbflush.h>
 
+#ifdef CONFIG_FREE_USER_PTE
 /*
  * pte_get_unless_zero - Increment refcount for the PTE page table
  *			 unless it is zero.
@@ -15,7 +18,10 @@
  */
 bool pte_get_unless_zero(pmd_t *pmd)
 {
-	return true;
+	pgtable_t pte = pmd_pgtable(*pmd);
+
+	VM_BUG_ON(!PageTable(pte));
+	return atomic_inc_not_zero(&pte->pte_refcount);
 }
 
 /*
@@ -32,12 +38,20 @@ bool pte_get_unless_zero(pmd_t *pmd)
  */
 enum pte_tryget_type pte_try_get(pmd_t *pmd)
 {
-	if (unlikely(pmd_none(*pmd)))
-		return TRYGET_FAILED_NONE;
-	if (unlikely(is_huge_pmd(*pmd)))
-		return TRYGET_FAILED_HUGE_PMD;
+	int retval = TRYGET_SUCCESSED;
+	pmd_t pmdval;
 
-	return TRYGET_SUCCESSED;
+	rcu_read_lock();
+	pmdval = READ_ONCE(*pmd);
+	if (unlikely(pmd_none(pmdval)))
+		retval = TRYGET_FAILED_NONE;
+	else if (unlikely(is_huge_pmd(pmdval)))
+		retval = TRYGET_FAILED_HUGE_PMD;
+	else if (!pte_get_unless_zero(&pmdval))
+		retval = TRYGET_FAILED_ZERO;
+	rcu_read_unlock();
+
+	return retval;
 }
 
 /*
@@ -52,4 +66,69 @@ enum pte_tryget_type pte_try_get(pmd_t *pmd)
  */
 void pte_put_vmf(struct vm_fault *vmf)
 {
+	if (!(vmf->flags & FAULT_FLAG_PTE_GET))
+		return;
+	vmf->flags &= ~FAULT_FLAG_PTE_GET;
+
+	pte_put(vmf->vma->vm_mm, vmf->pmd, vmf->address);
+}
+#else
+bool pte_get_unless_zero(pmd_t *pmd)
+{
+	return true;
+}
+
+enum pte_tryget_type pte_try_get(pmd_t *pmd)
+{
+	if (unlikely(pmd_none(*pmd)))
+		return TRYGET_FAILED_NONE;
+
+	if (unlikely(is_huge_pmd(*pmd)))
+		return TRYGET_FAILED_HUGE_PMD;
+
+	return TRYGET_SUCCESSED;
+}
+
+void pte_put_vmf(struct vm_fault *vmf)
+{
+}
+#endif /* CONFIG_FREE_USER_PTE */
+
+#ifdef CONFIG_DEBUG_VM
+static void pte_free_debug(pmd_t pmd)
+{
+	pte_t *ptep = (pte_t *)pmd_page_vaddr(pmd);
+	int i = 0;
+
+	for (i = 0; i < PTRS_PER_PTE; i++)
+		BUG_ON(!pte_none(*ptep++));
+}
+#else
+static inline void pte_free_debug(pmd_t pmd)
+{
+}
+#endif
+
+static void pte_free_rcu(struct rcu_head *rcu)
+{
+	struct page *page = container_of(rcu, struct page, rcu_head);
+
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
+void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+{
+	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
+	spinlock_t *ptl;
+	pmd_t pmdval;
+
+	ptl = pmd_lock(mm, pmd);
+	pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
+	flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+	spin_unlock(ptl);
+
+	pte_free_debug(pmdval);
+	mm_dec_nr_ptes(mm);
+	call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 14/15] Documentation: add document for pte_ref
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (12 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 13/15] mm/pte_ref: free user PTE page table pages Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 14:39   ` Jonathan Corbet
  2021-11-10 10:54 ` [PATCH v3 15/15] mm/pte_ref: use mmu_gather to free PTE page table pages Qi Zheng
  2021-11-10 12:56 ` [PATCH v3 00/15] Free user " Jason Gunthorpe
  15 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

This commit adds document for pte_ref under `Documentation/vm/`.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 212 insertions(+)
 create mode 100644 Documentation/vm/pte_ref.rst

diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
new file mode 100644
index 000000000000..c5323a263464
--- /dev/null
+++ b/Documentation/vm/pte_ref.rst
@@ -0,0 +1,212 @@
+.. _pte_ref:
+
+============================================================================
+pte_ref: Tracking about how many references to each user PTE page table page
+============================================================================
+
+.. contents:: :local:
+
+1. Preface
+==========
+
+Now in order to pursue high performance, applications mostly use some
+high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
+These memory allocators use ``madvise(MADV_DONTNEED or MADV_FREE)`` to release
+physical memory for the following reasons::
+
+ First of all, we should hold as few write locks of mmap_lock as possible,since
+ the mmap_lock semaphore has long been a contention point in the memory
+ management subsystem. The mmap()/munmap() hold the write lock, and the
+ madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using madvise()
+ instead of munmap() to released physical memory can reduce the competition of
+ the mmap_lock.
+
+ Secondly, after using madvise() to release physical memory, there is no need to
+ build vma and allocate page tables again when accessing the same virtual
+ address again, which can also save some time.
+
+The following is the largest user PTE page table memory that can be allocated by
+a single user process in a 32-bit and a 64-bit system.
+
++---------------------------+--------+---------+
+|                           | 32-bit | 64-bit  |
++===========================+========+=========+
+| user PTE page table pages | 3 MiB  | 512 GiB |
++---------------------------+--------+---------+
+| user PMD page table pages | 3 KiB  | 1 GiB   |
++---------------------------+--------+---------+
+
+(for 32-bit, take 3G user address space, 4K page size as an example; for 64-bit,
+take 48-bit address width, 4K page size as an example.)
+
+After using ``madvise()``, everything looks good, but as can be seen from the
+above table, a single process can create a large number of PTE page tables on a
+64-bit system, since both of the ``MADV_DONTNEED`` and ``MADV_FREE`` will not
+release page table memory. And before the process exits or calls ``munmap()``,
+the kernel cannot reclaim these pages even if these PTE page tables do not map
+anything.
+
+Therefore, we decided to introduce reference count to manage the PTE page table
+life cycle, so that some free PTE page table memory in the system can be
+dynamically released.
+
+2. The reference count of user PTE page table pages
+===================================================
+
+We introduce two members for the ``struct page`` of the user PTE page table
+page::
+
+ union {
+	pgtable_t pmd_huge_pte; /* protected by page->ptl */
+	pmd_t *pmd;             /* PTE page only */
+ };
+ union {
+	struct mm_struct *pt_mm; /* x86 pgds only */
+	atomic_t pt_frag_refcount; /* powerpc */
+	atomic_t pte_refcount;  /* PTE page only */
+ };
+
+The ``pmd`` member record the pmd entry that maps the user PTE page table page,
+the ``pte_refcount`` member keep track of how many references to the user PTE
+page table page.
+
+The following people will hold a reference on the user PTE page table page::
+
+ The !pte_none() entry, such as regular page table entry that map physical
+ pages, or swap entry, or migrate entry, etc.
+
+ Visitor to the PTE page table entries, such as page table walker.
+
+Any ``!pte_none()`` entry and visitor can be regarded as the user of its PTE
+page table page. When the ``pte_refcount`` is reduced to 0, it means that no one
+is using the PTE page table page, then this free PTE page table page can be
+released back to the system at this time.
+
+3. Competitive relationship
+===========================
+
+Now, the user page table will only be released by calling ``free_pgtables()``
+when the process exits or ``unmap_region()`` is called (e.g. ``munmap()`` path).
+So other threads only need to ensure mutual exclusion with these paths to ensure
+that the page table is not released. For example::
+
+	thread A			thread B
+	page table walker		munmap
+	=================		======
+
+	mmap_read_lock()
+	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+		pte_offset_map_lock()
+		*walk page table*
+		pte_unmap_unlock()
+	}
+	mmap_read_unlock()
+
+					mmap_write_lock_killable()
+					detach_vmas_to_be_unmapped()
+					unmap_region()
+					--> free_pgtables()
+
+But after we introduce the reference count for the user PTE page table page,
+these existing balances will be broken. The page can be released at any time
+when its ``pte_refcount`` is reduced to 0. Therefore, the following case may
+happen::
+
+	thread A		thread B			thread C
+	page table walker	madvise(MADV_DONTNEED)		page fault
+	=================	======================		==========
+
+	mmap_read_lock()
+	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+
+				mmap_read_lock()
+				unmap_page_range()
+				--> zap_pte_range()
+				    *the pte_refcount is reduced to 0*
+				    --> *free PTE page table page*
+
+		/* broken!! */					mmap_read_lock()
+		pte_offset_map_lock()
+
+As we can see, all of the thread A, B and C hold the read lock of mmap_lock, so
+they can execute concurrently. When thread B releases the PTE page table page,
+the value in the corresponding pmd entry will become unstable, which may be
+none or huge pmd, or map a new PTE page table page again. This will cause system
+chaos and even panic.
+
+So as described in the section "The reference count of user PTE page table
+pages", we need to try to take a reference to the PTE page table page before
+walking page table, then the system will become orderly again::
+
+	thread A		thread B
+	page table walker	madvise(MADV_DONTNEED)
+	=================	======================
+
+	mmap_read_lock()
+	if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+		pte_try_get()
+		--> pte_get_unless_zero
+		*if successfully, then:*
+
+				mmap_read_lock()
+				unmap_page_range()
+				--> zap_pte_range()
+				    *the pte_refcount is reduced to 1*
+
+		pte_offset_map_lock()
+		*walk page table*
+		pte_unmap_unlock()
+		pte_put()
+		--> *the pte_refcount is reduced to 0*
+		    --> *free PTE page table page*
+
+There is also a lock-less scenario(such as fast GUP). Fortunately, we don't need
+to do any additional operations to ensure that the system is in order. Take fast
+GUP as an example::
+
+	thread A		thread B
+	fast GUP		madvise(MADV_DONTNEED)
+	========		======================
+
+	get_user_pages_fast_only()
+	--> local_irq_save();
+				*free PTE page table page*
+				--> unhook page
+				    /* The CPU where thread A is located closed
+				     * the local interrupt and cannot respond to
+				     * IPI, so it will block here */
+				    TLB invalidate page
+	    gup_pgd_range();
+	    local_irq_restore();
+	    			    *free page*
+
+4. Helpers
+==========
+
++---------------------+-------------------------------------------------+
+| pte_ref_init        | Initialize the pte_refcount and pmd             |
++---------------------+-------------------------------------------------+
+| pte_to_pmd          | Get the corresponding pmd                       |
++---------------------+-------------------------------------------------+
+| pte_update_pmd      | Update the corresponding pmd                    |
++---------------------+-------------------------------------------------+
+| pte_get             | Increment a pte_refcount                        |
++---------------------+-------------------------------------------------+
+| pte_get_many        | Add a value to a pte_refcount                   |
++---------------------+-------------------------------------------------+
+| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
++---------------------+-------------------------------------------------+
+| pte_try_get         | Try to increment a pte_refcount                 |
++---------------------+-------------------------------------------------+
+| pte_tryget_map      | Try to increment a pte_refcount before          |
+|                     | pte_offset_map()                                |
++---------------------+-------------------------------------------------+
+| pte_tryget_map_lock | Try to increment a pte_refcount before          |
+|                     | pte_offset_map_lock()                           |
++---------------------+-------------------------------------------------+
+| pte_put             | Decrement a pte_refcount                        |
++---------------------+-------------------------------------------------+
+| pte_put_many        | Sub a value to a pte_refcount                   |
++---------------------+-------------------------------------------------+
+| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
++---------------------+-------------------------------------------------+
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v3 15/15] mm/pte_ref: use mmu_gather to free PTE page table pages
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (13 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 14/15] Documentation: add document for pte_ref Qi Zheng
@ 2021-11-10 10:54 ` Qi Zheng
  2021-11-10 12:56 ` [PATCH v3 00/15] Free user " Jason Gunthorpe
  15 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 10:54 UTC (permalink / raw)
  To: akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

In unmap_region() and other paths, we can reuse @tlb to
free PTE page table, which can reduce the number of tlb
flush.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 Documentation/vm/pte_ref.rst | 58 +++++++++++++++++++++++---------------------
 arch/x86/Kconfig             |  2 +-
 include/linux/pte_ref.h      | 34 ++++++++++++++++++++------
 mm/madvise.c                 |  4 +--
 mm/memory.c                  |  4 +--
 mm/mmu_gather.c              | 40 +++++++++++++-----------------
 mm/pte_ref.c                 | 13 +++++++---
 7 files changed, 90 insertions(+), 65 deletions(-)

diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
index c5323a263464..d304c0bfaae1 100644
--- a/Documentation/vm/pte_ref.rst
+++ b/Documentation/vm/pte_ref.rst
@@ -183,30 +183,34 @@ GUP as an example::
 4. Helpers
 ==========
 
-+---------------------+-------------------------------------------------+
-| pte_ref_init        | Initialize the pte_refcount and pmd             |
-+---------------------+-------------------------------------------------+
-| pte_to_pmd          | Get the corresponding pmd                       |
-+---------------------+-------------------------------------------------+
-| pte_update_pmd      | Update the corresponding pmd                    |
-+---------------------+-------------------------------------------------+
-| pte_get             | Increment a pte_refcount                        |
-+---------------------+-------------------------------------------------+
-| pte_get_many        | Add a value to a pte_refcount                   |
-+---------------------+-------------------------------------------------+
-| pte_get_unless_zero | Increment a pte_refcount unless it is 0         |
-+---------------------+-------------------------------------------------+
-| pte_try_get         | Try to increment a pte_refcount                 |
-+---------------------+-------------------------------------------------+
-| pte_tryget_map      | Try to increment a pte_refcount before          |
-|                     | pte_offset_map()                                |
-+---------------------+-------------------------------------------------+
-| pte_tryget_map_lock | Try to increment a pte_refcount before          |
-|                     | pte_offset_map_lock()                           |
-+---------------------+-------------------------------------------------+
-| pte_put             | Decrement a pte_refcount                        |
-+---------------------+-------------------------------------------------+
-| pte_put_many        | Sub a value to a pte_refcount                   |
-+---------------------+-------------------------------------------------+
-| pte_put_vmf         | Decrement a pte_refcount in the page fault path |
-+---------------------+-------------------------------------------------+
++---------------------+------------------------------------------------------+
+| pte_ref_init        | Initialize the pte_refcount and pmd                  |
++---------------------+------------------------------------------------------+
+| pte_to_pmd          | Get the corresponding pmd                            |
++---------------------+------------------------------------------------------+
+| pte_update_pmd      | Update the corresponding pmd                         |
++---------------------+------------------------------------------------------+
+| pte_get             | Increment a pte_refcount                             |
++---------------------+------------------------------------------------------+
+| pte_get_many        | Add a value to a pte_refcount                        |
++---------------------+------------------------------------------------------+
+| pte_get_unless_zero | Increment a pte_refcount unless it is 0              |
++---------------------+------------------------------------------------------+
+| pte_try_get         | Try to increment a pte_refcount                      |
++---------------------+------------------------------------------------------+
+| pte_tryget_map      | Try to increment a pte_refcount before               |
+|                     | pte_offset_map()                                     |
++---------------------+------------------------------------------------------+
+| pte_tryget_map_lock | Try to increment a pte_refcount before               |
+|                     | pte_offset_map_lock()                                |
++---------------------+------------------------------------------------------+
+| __pte_put           | Decrement a pte_refcount                             |
++---------------------+------------------------------------------------------+
+| __pte_put_many      | Sub a value to a pte_refcount                        |
++---------------------+------------------------------------------------------+
+| pte_put             | Decrement a pte_refcount(without tlb parameter)      |
++---------------------+------------------------------------------------------+
+| pte_put_many        | Sub a value to a pte_refcount(without tlb parameter) |
++---------------------+------------------------------------------------------+
+| pte_put_vmf         | Decrement a pte_refcount in the page fault path      |
++---------------------+------------------------------------------------------+
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index ca5bfe83ec61..69ea13437947 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -233,7 +233,7 @@ config X86
 	select HAVE_PCI
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select MMU_GATHER_RCU_TABLE_FREE		if PARAVIRT
+	select MMU_GATHER_RCU_TABLE_FREE		if PARAVIRT || FREE_USER_PTE
 	select HAVE_POSIX_CPU_TIMERS_TASK_WORK
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE		if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
diff --git a/include/linux/pte_ref.h b/include/linux/pte_ref.h
index 8a26eaba83ef..dc3923bb38f6 100644
--- a/include/linux/pte_ref.h
+++ b/include/linux/pte_ref.h
@@ -22,7 +22,8 @@ enum pte_tryget_type pte_try_get(pmd_t *pmd);
 bool pte_get_unless_zero(pmd_t *pmd);
 
 #ifdef CONFIG_FREE_USER_PTE
-void free_user_pte_table(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr);
+void free_user_pte_table(struct mmu_gather *tlb, struct mm_struct *mm,
+			 pmd_t *pmd, unsigned long addr);
 
 static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
 {
@@ -48,14 +49,21 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
 	atomic_add(nr, &pte->pte_refcount);
 }
 
-static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
-				unsigned long addr, unsigned int nr)
+static inline void __pte_put_many(struct mmu_gather *tlb, struct mm_struct *mm,
+				  pmd_t *pmd, unsigned long addr,
+				  unsigned int nr)
 {
 	pgtable_t pte = pmd_pgtable(*pmd);
 
 	VM_BUG_ON(!PageTable(pte));
 	if (atomic_sub_and_test(nr, &pte->pte_refcount))
-		free_user_pte_table(mm, pmd, addr & PMD_MASK);
+		free_user_pte_table(tlb, mm, pmd, addr & PMD_MASK);
+}
+
+static inline void __pte_put(struct mmu_gather *tlb, struct mm_struct *mm,
+			     pmd_t *pmd, unsigned long addr)
+{
+	__pte_put_many(tlb, mm, pmd, addr, 1);
 }
 #else
 static inline void pte_ref_init(pgtable_t pte, pmd_t *pmd, int count)
@@ -75,8 +83,14 @@ static inline void pte_get_many(pmd_t *pmd, unsigned int nr)
 {
 }
 
-static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
-				unsigned long addr, unsigned int nr)
+static inline void __pte_put_many(struct mmu_gather *tlb, struct mm_struct *mm,
+				  pmd_t *pmd, unsigned long addr,
+				  unsigned int nr)
+{
+}
+
+static inline void __pte_put(struct mmu_gather *tlb, struct mm_struct *mm,
+			     pmd_t *pmd, unsigned long addr)
 {
 }
 #endif /* CONFIG_FREE_USER_PTE */
@@ -110,6 +124,12 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
 	return pte_offset_map_lock(mm, pmd, address, ptlp);
 }
 
+static inline void pte_put_many(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, unsigned int nr)
+{
+	__pte_put_many(NULL, mm, pmd, addr, nr);
+}
+
 /*
  * pte_put - Decrement refcount for the PTE page table.
  * @mm: the mm_struct of the target address space.
@@ -120,7 +140,7 @@ static inline pte_t *pte_tryget_map_lock(struct mm_struct *mm, pmd_t *pmd,
  */
 static inline void pte_put(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
 {
-	pte_put_many(mm, pmd, addr, 1);
+	__pte_put(NULL, mm, pmd, addr);
 }
 
 #endif
diff --git a/mm/madvise.c b/mm/madvise.c
index 5cf2832abb98..b51254305bb2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -477,7 +477,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
-	pte_put(vma->vm_mm, pmd, start);
+	__pte_put(tlb, vma->vm_mm, pmd, start);
 	if (pageout)
 		reclaim_pages(&page_list);
 	cond_resched();
@@ -710,7 +710,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_pte, ptl);
 	if (nr_put)
-		pte_put_many(mm, pmd, start, nr_put);
+		__pte_put_many(tlb, mm, pmd, start, nr_put);
 	cond_resched();
 next:
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 4d1ede78d1b0..1bdae3b0f877 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1469,7 +1469,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	}
 
 	if (nr_put)
-		pte_put_many(mm, pmd, start, nr_put);
+		__pte_put_many(tlb, mm, pmd, start, nr_put);
 
 	return addr;
 }
@@ -1515,7 +1515,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 		if (pte_try_get(pmd))
 			goto next;
 		next = zap_pte_range(tlb, vma, pmd, addr, next, details);
-		pte_put(tlb->mm, pmd, addr);
+		__pte_put(tlb, tlb->mm, pmd, addr);
 next:
 		cond_resched();
 	} while (pmd++, addr = next, addr != end);
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 1b9837419bf9..1bd9fa889421 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -134,42 +134,42 @@ static void __tlb_remove_table_free(struct mmu_table_batch *batch)
  *
  */
 
-static void tlb_remove_table_smp_sync(void *arg)
+static void tlb_remove_table_rcu(struct rcu_head *head)
 {
-	/* Simply deliver the interrupt */
+	__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
 }
 
-static void tlb_remove_table_sync_one(void)
+static void tlb_remove_table_free(struct mmu_table_batch *batch)
 {
-	/*
-	 * This isn't an RCU grace period and hence the page-tables cannot be
-	 * assumed to be actually RCU-freed.
-	 *
-	 * It is however sufficient for software page-table walkers that rely on
-	 * IRQ disabling.
-	 */
-	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
+	call_rcu(&batch->rcu, tlb_remove_table_rcu);
 }
 
-static void tlb_remove_table_rcu(struct rcu_head *head)
+static void tlb_remove_table_one_rcu(struct rcu_head *head)
 {
-	__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
+	struct page *page = container_of(head, struct page, rcu_head);
+
+	__tlb_remove_table(page);
 }
 
-static void tlb_remove_table_free(struct mmu_table_batch *batch)
+static void tlb_remove_table_one(void *table)
 {
-	call_rcu(&batch->rcu, tlb_remove_table_rcu);
+	pgtable_t page = (pgtable_t)table;
+
+	call_rcu(&page->rcu_head, tlb_remove_table_one_rcu);
 }
 
 #else /* !CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
-static void tlb_remove_table_sync_one(void) { }
-
 static void tlb_remove_table_free(struct mmu_table_batch *batch)
 {
 	__tlb_remove_table_free(batch);
 }
 
+static void tlb_remove_table_one(void *table)
+{
+	__tlb_remove_table(table);
+}
+
 #endif /* CONFIG_MMU_GATHER_RCU_TABLE_FREE */
 
 /*
@@ -187,12 +187,6 @@ static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 	}
 }
 
-static void tlb_remove_table_one(void *table)
-{
-	tlb_remove_table_sync_one();
-	__tlb_remove_table(table);
-}
-
 static void tlb_table_flush(struct mmu_gather *tlb)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
diff --git a/mm/pte_ref.c b/mm/pte_ref.c
index 728e61cea25e..f9650ad23c7c 100644
--- a/mm/pte_ref.c
+++ b/mm/pte_ref.c
@@ -8,6 +8,8 @@
 #include <linux/pte_ref.h>
 #include <linux/mm.h>
 #include <linux/hugetlb.h>
+#include <asm/pgalloc.h>
+#include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
 #ifdef CONFIG_FREE_USER_PTE
@@ -117,7 +119,8 @@ static void pte_free_rcu(struct rcu_head *rcu)
 	__free_page(page);
 }
 
-void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
+void free_user_pte_table(struct mmu_gather *tlb, struct mm_struct *mm,
+			 pmd_t *pmd, unsigned long addr)
 {
 	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
 	spinlock_t *ptl;
@@ -125,10 +128,14 @@ void free_user_pte_table(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
 
 	ptl = pmd_lock(mm, pmd);
 	pmdval = pmdp_huge_get_and_clear(mm, addr, pmd);
-	flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+	if (!tlb)
+		flush_tlb_range(&vma, addr, addr + PMD_SIZE);
+	else
+		pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
 	spin_unlock(ptl);
 
 	pte_free_debug(pmdval);
 	mm_dec_nr_ptes(mm);
-	call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
+	if (!tlb)
+		call_rcu(&pmd_pgtable(pmdval)->rcu_head, pte_free_rcu);
 }
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
                   ` (14 preceding siblings ...)
  2021-11-10 10:54 ` [PATCH v3 15/15] mm/pte_ref: use mmu_gather to free PTE page table pages Qi Zheng
@ 2021-11-10 12:56 ` Jason Gunthorpe
  2021-11-10 13:25   ` David Hildenbrand
  2021-11-10 13:54   ` Qi Zheng
  15 siblings, 2 replies; 43+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 12:56 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:

> In this patch series, we add a pte_refcount field to the struct page of page
> table to track how many users of PTE page table. Similar to the mechanism of
> page refcount, the user of PTE page table should hold a refcount to it before
> accessing. The PTE page table page will be freed when the last refcount is
> dropped.

So, this approach basically adds two atomics on every PTE map

If I have it right the reason that zap cannot clean the PTEs today is
because zap cannot obtain the mmap lock due to a lock ordering issue
with the inode lock vs mmap lock.

If it could obtain the mmap lock then it could do the zap using the
write side as unmapping a vma does.

Rather than adding a new "lock" to ever PTE I wonder if it would be
more efficient to break up the mmap lock and introduce a specific
rwsem for the page table itself, in addition to the PTL. Currently the
mmap lock is protecting both the vma list and the page table.

I think that would allow the lock ordering issue to be resolved and
zap could obtain a page table rwsem.

Compared to two atomics per PTE this would just be two atomic per
page table walk operation, it is conceptually a lot simpler, and would
allow freeing all the page table levels, not just PTEs.

?

Jason

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 12:56 ` [PATCH v3 00/15] Free user " Jason Gunthorpe
@ 2021-11-10 13:25   ` David Hildenbrand
  2021-11-10 13:59     ` Qi Zheng
  2021-11-10 14:38     ` Jason Gunthorpe
  2021-11-10 13:54   ` Qi Zheng
  1 sibling, 2 replies; 43+ messages in thread
From: David Hildenbrand @ 2021-11-10 13:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Qi Zheng
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 10.11.21 13:56, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> 
>> In this patch series, we add a pte_refcount field to the struct page of page
>> table to track how many users of PTE page table. Similar to the mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The PTE page table page will be freed when the last refcount is
>> dropped.
> 
> So, this approach basically adds two atomics on every PTE map
> 
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

There are different ways to zap: madvise(DONTNEED) vs
fallocate(PUNCH_HOLE). It depends on "from where" we're actually
comming: a process page table walker or the rmap.

The way locking currently works doesn't allow to remove a page table
just by holding the mmap lock, not even in write mode. You'll also need
to hold the respective rmap locks -- which implies that reclaiming apge
tables crossing VMAs is "problematic". Take a look at khugepaged which
has to play quite some tricks to remove a page table.

And there are other ways we can create empty page tables via the rmap,
like reclaim/writeback, although they are rather a secondary concern mostly.

> 
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.
> 
> Rather than adding a new "lock" to ever PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

There is the rmap side of things as well. At least the rmap won't
reclaim alloc/free page tables, but it will walk page tables while
holding the respective rmap lock.

> 
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
> 
> Compared to two atomics per PTE this would just be two atomic per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

Another alternative is to not do it in the kernel automatically, but
instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get
called by user space explicitly once it's reasonable. While this will
work for the obvious madvise(DONTNEED) users -- like memory allocators
-- that zap memory, it's a bit more complicated once shared memory is
involved and we're fallocate(PUNCH_HOLE) memory. But it would at least
work for many use cases that want to optimize memory consumption for
sparse memory mappings.

Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area
will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs
are most certainly the most important part piece.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 12:56 ` [PATCH v3 00/15] Free user " Jason Gunthorpe
  2021-11-10 13:25   ` David Hildenbrand
@ 2021-11-10 13:54   ` Qi Zheng
  1 sibling, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 13:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, david, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/10/21 8:56 PM, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> 
>> In this patch series, we add a pte_refcount field to the struct page of page
>> table to track how many users of PTE page table. Similar to the mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The PTE page table page will be freed when the last refcount is
>> dropped.
> 
> So, this approach basically adds two atomics on every PTE map
> 
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

Currently, both MADV_DONTNEED and MADV_FREE obtain the read side of
mmap_lock instead of write side, which is the reason that 
jemalloc/tcmalloc prefer to use madvise() to release physical memory.

> 
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.

Even if it obtains the write side of mmap_lock, how to make sure that
all the page table entries are empty? Traverse 512 entries every time?

> 
> Rather than adding a new "lock" to ever PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

Now each level of page table has its own spin lock. Can you explain the
working mechanism of this special rwsem more clearly?

If we can reduce the protection range of mmap_lock, it is indeed a great
thing, but I think it is very difficult, and it will not solve the
problem of how to check that all entries in the page table page are
empty.

> 
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
> 
> Compared to two atomics per PTE this would just be two atomic per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

The reason why only the PTE page is released now is that it is the
largest. This reference count can actually be used for other levels of
page tables.

> 
> ?
> 
> Jason
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 13:25   ` David Hildenbrand
@ 2021-11-10 13:59     ` Qi Zheng
  2021-11-10 14:38     ` Jason Gunthorpe
  1 sibling, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-10 13:59 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/10/21 9:25 PM, David Hildenbrand wrote:
> On 10.11.21 13:56, Jason Gunthorpe wrote:
>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>
>>> In this patch series, we add a pte_refcount field to the struct page of page
>>> table to track how many users of PTE page table. Similar to the mechanism of
>>> page refcount, the user of PTE page table should hold a refcount to it before
>>> accessing. The PTE page table page will be freed when the last refcount is
>>> dropped.
>>
>> So, this approach basically adds two atomics on every PTE map
>>
>> If I have it right the reason that zap cannot clean the PTEs today is
>> because zap cannot obtain the mmap lock due to a lock ordering issue
>> with the inode lock vs mmap lock.
> 
> There are different ways to zap: madvise(DONTNEED) vs
> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
> comming: a process page table walker or the rmap.
> 
> The way locking currently works doesn't allow to remove a page table
> just by holding the mmap lock, not even in write mode. You'll also need
> to hold the respective rmap locks -- which implies that reclaiming apge
> tables crossing VMAs is "problematic". Take a look at khugepaged which
> has to play quite some tricks to remove a page table.
> 
> And there are other ways we can create empty page tables via the rmap,
> like reclaim/writeback, although they are rather a secondary concern mostly.
> 
>>
>> If it could obtain the mmap lock then it could do the zap using the
>> write side as unmapping a vma does.
>>
>> Rather than adding a new "lock" to ever PTE I wonder if it would be
>> more efficient to break up the mmap lock and introduce a specific
>> rwsem for the page table itself, in addition to the PTL. Currently the
>> mmap lock is protecting both the vma list and the page table.
> 
> There is the rmap side of things as well. At least the rmap won't
> reclaim alloc/free page tables, but it will walk page tables while
> holding the respective rmap lock.
> 
>>
>> I think that would allow the lock ordering issue to be resolved and
>> zap could obtain a page table rwsem.
>>
>> Compared to two atomics per PTE this would just be two atomic per
>> page table walk operation, it is conceptually a lot simpler, and would
>> allow freeing all the page table levels, not just PTEs.
> 
> Another alternative is to not do it in the kernel automatically, but
> instead have a madvise(MADV_CLEANUP_PGTABLE) mechanism that will get
> called by user space explicitly once it's reasonable. While this will
> work for the obvious madvise(DONTNEED) users -- like memory allocators
> -- that zap memory, it's a bit more complicated once shared memory is
> involved and we're fallocate(PUNCH_HOLE) memory. But it would at least
> work for many use cases that want to optimize memory consumption for
> sparse memory mappings.
> 
> Note that PTEs are the biggest memory consumer. On x86-64, a 1 TiB area
> will consume 2 GiB of PTE tables and only 4 MiB of PMD tables. So PTEs
> are most certainly the most important part piece.
> 

total agree!

Thanks,
Qi

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 13:25   ` David Hildenbrand
  2021-11-10 13:59     ` Qi Zheng
@ 2021-11-10 14:38     ` Jason Gunthorpe
  2021-11-10 15:37       ` David Hildenbrand
  1 sibling, 1 reply; 43+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 14:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
> On 10.11.21 13:56, Jason Gunthorpe wrote:
> > On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
> > 
> >> In this patch series, we add a pte_refcount field to the struct page of page
> >> table to track how many users of PTE page table. Similar to the mechanism of
> >> page refcount, the user of PTE page table should hold a refcount to it before
> >> accessing. The PTE page table page will be freed when the last refcount is
> >> dropped.
> > 
> > So, this approach basically adds two atomics on every PTE map
> > 
> > If I have it right the reason that zap cannot clean the PTEs today is
> > because zap cannot obtain the mmap lock due to a lock ordering issue
> > with the inode lock vs mmap lock.
> 
> There are different ways to zap: madvise(DONTNEED) vs
> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
> comming: a process page table walker or the rmap.

AFAIK rmap is the same issue, it can't lock the mmap_sem

> The way locking currently works doesn't allow to remove a page table
> just by holding the mmap lock, not even in write mode. 

I'm not sure I understand this. If the goal is to free the PTE tables
then the main concern is use-after free on page table walkers (which
this series is addressing). Ignoring bugs, we have only three ways to
read the page table:

 - Fully locked. Under the PTLs (gup slow is an example)
 - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
 - hw-locked. Barriered with TLB flush (gup fast is an example)

#1 should be completely safe as the PTLs will protect eveything
#2 is safe so long as the write side is held during any layout changes
#3 interacts with the TLB flush, and is also safe with zap

rmap itself is a #1 page table walker, ie it gets the PTLs under
page_vma_mapped_walk().

The sin we have comitted here is that both the mmap lock and the PTLs
are being used to protect the page table itself with a very
complicated dual semantic.

Splitting the sleeping mmap lock into 'covers vma' and 'covers page
tables' lets us solve the lock ordering and semi-locked can become
more fully locked by the new lock, instead of by abusing mmap sem.

I'd suggest to make this new lock a special rwsem which allows either
concurrent read access OR concurrent PTL access, but not both. This
way we don't degrade performance of the split PTLs, *and* when
something needs to change the page table structure it has a way to
properly exclude all the #2 lockless readers.

So evey touch to the page table starts by obtaining this new lock,
depending on the access mode to be used. (PTL vs lockless read)

We can keep the existing THP logic where a leaf PMD can be transformed
to a non-leaf PMD in the semi-locked case, but the case where a
non-leaf PMD is transformed to a leaf PMD has to take the lock.

Jason

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 14/15] Documentation: add document for pte_ref
  2021-11-10 10:54 ` [PATCH v3 14/15] Documentation: add document for pte_ref Qi Zheng
@ 2021-11-10 14:39   ` Jonathan Corbet
  2021-11-11  5:40     ` Qi Zheng
  0 siblings, 1 reply; 43+ messages in thread
From: Jonathan Corbet @ 2021-11-10 14:39 UTC (permalink / raw)
  To: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming, Qi Zheng

Qi Zheng <zhengqi.arch@bytedance.com> writes:

> This commit adds document for pte_ref under `Documentation/vm/`.

Thanks for documenting this work!

> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
>  Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 212 insertions(+)
>  create mode 100644 Documentation/vm/pte_ref.rst

When you add a new RST file, you also need to add it to the associated
index.rst file or it won't be included in the docs build.  Instead,
you'll get the "not included in any toctree" warning that you surely saw
when you tested the docs build with this file :)

> diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
> new file mode 100644
> index 000000000000..c5323a263464
> --- /dev/null
> +++ b/Documentation/vm/pte_ref.rst
> @@ -0,0 +1,212 @@
> +.. _pte_ref:

Do you need this label anywhere?  If not, I'd leave it out.

Thanks,

jon

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 14:38     ` Jason Gunthorpe
@ 2021-11-10 15:37       ` David Hildenbrand
  2021-11-10 16:39         ` Jason Gunthorpe
  2021-11-10 16:49         ` Matthew Wilcox
  0 siblings, 2 replies; 43+ messages in thread
From: David Hildenbrand @ 2021-11-10 15:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 10.11.21 15:38, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 02:25:50PM +0100, David Hildenbrand wrote:
>> On 10.11.21 13:56, Jason Gunthorpe wrote:
>>> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>>>
>>>> In this patch series, we add a pte_refcount field to the struct page of page
>>>> table to track how many users of PTE page table. Similar to the mechanism of
>>>> page refcount, the user of PTE page table should hold a refcount to it before
>>>> accessing. The PTE page table page will be freed when the last refcount is
>>>> dropped.
>>>
>>> So, this approach basically adds two atomics on every PTE map
>>>
>>> If I have it right the reason that zap cannot clean the PTEs today is
>>> because zap cannot obtain the mmap lock due to a lock ordering issue
>>> with the inode lock vs mmap lock.
>>
>> There are different ways to zap: madvise(DONTNEED) vs
>> fallocate(PUNCH_HOLE). It depends on "from where" we're actually
>> comming: a process page table walker or the rmap.
> 
> AFAIK rmap is the same issue, it can't lock the mmap_sem
> 
>> The way locking currently works doesn't allow to remove a page table
>> just by holding the mmap lock, not even in write mode. 
> 
> I'm not sure I understand this. If the goal is to free the PTE tables
> then the main concern is use-after free on page table walkers (which
> this series is addressing). Ignoring bugs, we have only three ways to
> read the page table:

Yes, use-after-free and reuse-while-freeing are the two challenges AFAIKs.

> 
>  - Fully locked. Under the PTLs (gup slow is an example)
>  - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
>  - hw-locked. Barriered with TLB flush (gup fast is an example)

Three additions as far as I can tell:

1. Fully locked currently needs the read mmap lock OR the rmap lock in
   read. PTLs on their own are not sufficient AFAIKT.
2. #1 and #2 can currently only walk within VMA ranges.
3. We can theoretically walk page tables outside VMA ranges with the
write mmap lock, because page tables get removed with the mmap lock in
read mode and heavy-weight operations (VMA layout, khugepaged) are
performed under the write mmap lock.

The rmap locks protect from modifications where we want to exclude rmap
walkers similarly to how we grab the mmap lock in write, where the PTLs
are not sufficient.

See mm/mremap.c:move_ptes() as an example which performs VMA layout +
page table modifications. See khugepagd which doesn't perform VMA layout
modifications but page table modifications.

> 
> #1 should be completely safe as the PTLs will protect eveything
> #2 is safe so long as the write side is held during any layout changes
> #3 interacts with the TLB flush, and is also safe with zap
> 
> rmap itself is a #1 page table walker, ie it gets the PTLs under
> page_vma_mapped_walk().

When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?

Because the PMD-PTLs re usually not taken in case we know there is a
page table (nothing would currently change it without heavy locking).
And if they are taken, they are only held while allocating/checking a
PMDE, not while actually *using* the page table that's mapped in that entry.

For example, walk_page_range() requires the mmap lock in read and grabs
the PTE-PTLs.

> 
> The sin we have comitted here is that both the mmap lock and the PTLs
> are being used to protect the page table itself with a very
> complicated dual semantic.
> 
> Splitting the sleeping mmap lock into 'covers vma' and 'covers page
> tables' lets us solve the lock ordering and semi-locked can become
> more fully locked by the new lock, instead of by abusing mmap sem.

It would still be a fairly coarse-grained locking, I am not sure if that
is a step into the right direction. If you want to modify *some* page
table in your process you have exclude each and every page table walker.
Or did I mis-interpret what you were saying?

> 
> I'd suggest to make this new lock a special rwsem which allows either
> concurrent read access OR concurrent PTL access, but not both. This

I looked into such a lock recently in similar context and something like
that does not exist yet (and fairness will be challenging). You either
have a single reader or multiple writer. I'd be interested if someone
knows of something like that.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 15:37       ` David Hildenbrand
@ 2021-11-10 16:39         ` Jason Gunthorpe
  2021-11-10 17:37           ` David Hildenbrand
  2021-11-10 16:49         ` Matthew Wilcox
  1 sibling, 1 reply; 43+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 16:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:

> >  - Fully locked. Under the PTLs (gup slow is an example)
> >  - Semi-locked. Under the read mmap lock and no PTLs (hmm is an example)
> >  - hw-locked. Barriered with TLB flush (gup fast is an example)
> 
> Three additions as far as I can tell:
> 
> 1. Fully locked currently needs the read mmap lock OR the rmap lock in
>    read. PTLs on their own are not sufficient AFAIKT.

I think the reality is we don't have any fully locked walkers.. Even
gup slow is semi-lockless until it reaches lower levels, then it takes
the PTLs. (I forgot about that detail!)

The mmap lock is being used to protect the higher levels of the page
table. It is part of its complicated dual purpose.

> 2. #1 and #2 can currently only walk within VMA ranges.

AFICT this is an artifact of re-using the mmap lock to protect the
page table and not being able to obtain the mmap lock in all the
places the page table structure is manipulated.

> 3. We can theoretically walk page tables outside VMA ranges with the
> write mmap lock, because page tables get removed with the mmap lock in
> read mode and heavy-weight operations (VMA layout, khugepaged) are
> performed under the write mmap lock.

Yes, again, an artifact of the current locking.
 
> The rmap locks protect from modifications where we want to exclude rmap
> walkers similarly to how we grab the mmap lock in write, where the PTLs
> are not sufficient.

It is a similar dual purpose locking as the mmap sem :(

> > #1 should be completely safe as the PTLs will protect eveything
> > #2 is safe so long as the write side is held during any layout changes
> > #3 interacts with the TLB flush, and is also safe with zap
> > 
> > rmap itself is a #1 page table walker, ie it gets the PTLs under
> > page_vma_mapped_walk().
> 
> When you talk about PTLs, do you mean only PTE-PTLs or also PMD-PTLs?

Ah, here I was thinking about a lock that can protect all the
levels. Today we are abusing the mmap lock to act as the pud_lock, for
instance.

> Because the PMD-PTLs re usually not taken in case we know there is a
> page table (nothing would currently change it without heavy
> locking).

This only works with the lockless walkers, and relies on the read mmap
sem/etc to also mean 'a PTE table cannot become a leaf PMD'

> For example, walk_page_range() requires the mmap lock in read and grabs
> the PTE-PTLs.

Yes, that is a semi-locked reader.

> It would still be a fairly coarse-grained locking, I am not sure if that
> is a step into the right direction. If you want to modify *some* page
> table in your process you have exclude each and every page table walker.
> Or did I mis-interpret what you were saying?

That is one possible design, it favours fast walking and penalizes
mutation. We could also stick a lock in the PMD (instead of a
refcount) and still logically be using a lock instead of a refcount
scheme. Remember modify here is "want to change a table pointer into a
leaf pointer" so it isn't an every day activity..

There is some advantage with this thinking because it harmonizes well
with the other stuff that wants to convert tables into leafs, but has
to deal with complicated locking.

On the other hand, refcounts are a degenerate kind of rwsem and only
help with freeing pages. It also puts more atomics in normal fast
paths since we are refcounting each PTE, not read locking the PMD.

Perhaps the ideal thing would be to stick a rwsem in the PMD. read
means a table cannot be come a leaf. I don't know if there is space
for another atomic in the PMD level, and we'd have to use a hitching
post/hashed waitq scheme too since there surely isn't room for a waitq
too..

I wouldn't be so quick to say one is better than the other, but at
least let's have thought about a locking solution before merging
refcounts :)

Jason

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 15:37       ` David Hildenbrand
  2021-11-10 16:39         ` Jason Gunthorpe
@ 2021-11-10 16:49         ` Matthew Wilcox
  2021-11-10 16:53           ` David Hildenbrand
  1 sibling, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2021-11-10 16:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Qi Zheng, akpm, tglx, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming

On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
> > I'd suggest to make this new lock a special rwsem which allows either
> > concurrent read access OR concurrent PTL access, but not both. This
> 
> I looked into such a lock recently in similar context and something like
> that does not exist yet (and fairness will be challenging). You either
> have a single reader or multiple writer. I'd be interested if someone
> knows of something like that.

We've talked about having such a lock before for filesystems which want
to permit either many direct-IO accesses or many buffered-IO accesses, but
want to exclude a mixture of direct-IO and buffered-IO.  The name we came
up with for such a lock was the red-blue lock.  Either Team Red has the
lock, or Team Blue has the lock (or it's free).  Indicate free with velue
zero, Team Red with positive numbers and Team Blue with negative numbers.
If we need to indicate an opposing team is waiting for the semaphore,
we can use a high bit (1 << 30) to indicate no new team members can
acquire the lock.  Not sure whether anybody ever coded it up.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 16:49         ` Matthew Wilcox
@ 2021-11-10 16:53           ` David Hildenbrand
  2021-11-10 16:56             ` Jason Gunthorpe
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2021-11-10 16:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jason Gunthorpe, Qi Zheng, akpm, tglx, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming

On 10.11.21 17:49, Matthew Wilcox wrote:
> On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
>>> I'd suggest to make this new lock a special rwsem which allows either
>>> concurrent read access OR concurrent PTL access, but not both. This
>>
>> I looked into such a lock recently in similar context and something like
>> that does not exist yet (and fairness will be challenging). You either
>> have a single reader or multiple writer. I'd be interested if someone
>> knows of something like that.
> 
> We've talked about having such a lock before for filesystems which want
> to permit either many direct-IO accesses or many buffered-IO accesses, but
> want to exclude a mixture of direct-IO and buffered-IO.  The name we came
> up with for such a lock was the red-blue lock.  Either Team Red has the
> lock, or Team Blue has the lock (or it's free).  Indicate free with velue
> zero, Team Red with positive numbers and Team Blue with negative numbers.
> If we need to indicate an opposing team is waiting for the semaphore,
> we can use a high bit (1 << 30) to indicate no new team members can
> acquire the lock.  Not sure whether anybody ever coded it up.

Interesting, thanks for sharing!

My excessive google search didn't reveal anything back then (~3 months
ago) -- only questions on popular coding websites asking essentially for
the same thing without any helpful replies. But maybe I used the wrong
keywords (e.g., "multiple reader, multiple writer", I certainly didn't
search for "red-blue lock", but I do like the name :) ).

Fairness might still be the biggest issue, but I am certainly no locking
expert.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 16:53           ` David Hildenbrand
@ 2021-11-10 16:56             ` Jason Gunthorpe
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 16:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Matthew Wilcox, Qi Zheng, akpm, tglx, kirill.shutemov,
	mika.penttila, linux-doc, linux-kernel, linux-mm, songmuchun,
	zhouchengming

On Wed, Nov 10, 2021 at 05:53:57PM +0100, David Hildenbrand wrote:
> On 10.11.21 17:49, Matthew Wilcox wrote:
> > On Wed, Nov 10, 2021 at 04:37:14PM +0100, David Hildenbrand wrote:
> >>> I'd suggest to make this new lock a special rwsem which allows either
> >>> concurrent read access OR concurrent PTL access, but not both. This
> >>
> >> I looked into such a lock recently in similar context and something like
> >> that does not exist yet (and fairness will be challenging). You either
> >> have a single reader or multiple writer. I'd be interested if someone
> >> knows of something like that.
> > 
> > We've talked about having such a lock before for filesystems which want
> > to permit either many direct-IO accesses or many buffered-IO accesses, but
> > want to exclude a mixture of direct-IO and buffered-IO.  The name we came
> > up with for such a lock was the red-blue lock.  Either Team Red has the
> > lock, or Team Blue has the lock (or it's free).  Indicate free with velue
> > zero, Team Red with positive numbers and Team Blue with negative numbers.
> > If we need to indicate an opposing team is waiting for the semaphore,
> > we can use a high bit (1 << 30) to indicate no new team members can
> > acquire the lock.  Not sure whether anybody ever coded it up.
> 
> Interesting, thanks for sharing!
> 
> My excessive google search didn't reveal anything back then (~3 months
> ago) -- only questions on popular coding websites asking essentially for
> the same thing without any helpful replies. But maybe I used the wrong
> keywords (e.g., "multiple reader, multiple writer", I certainly didn't
> search for "red-blue lock", but I do like the name :) ).
> 
> Fairness might still be the biggest issue, but I am certainly no locking
> expert.

Fairness could use the same basic logic as the write prefered to read
in the rwsem.

The atomic implementation would be with atomic_dec_unless_positive()
and atomic_inc_unless_negative(), without fairness it looks
straightfoward.

Jason

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 16:39         ` Jason Gunthorpe
@ 2021-11-10 17:37           ` David Hildenbrand
  2021-11-10 17:49             ` Jason Gunthorpe
  2021-11-11  3:58             ` Qi Zheng
  0 siblings, 2 replies; 43+ messages in thread
From: David Hildenbrand @ 2021-11-10 17:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

>> It would still be a fairly coarse-grained locking, I am not sure if that
>> is a step into the right direction. If you want to modify *some* page
>> table in your process you have exclude each and every page table walker.
>> Or did I mis-interpret what you were saying?
> 
> That is one possible design, it favours fast walking and penalizes
> mutation. We could also stick a lock in the PMD (instead of a
> refcount) and still logically be using a lock instead of a refcount
> scheme. Remember modify here is "want to change a table pointer into a
> leaf pointer" so it isn't an every day activity..

It will be if we somewhat frequent when reclaim an empty PTE page table
as soon as it turns empty. This not only happens when zapping, but also
during writeback/swapping. So while writing back / swapping you might be
left with empty page tables to reclaim.

Of course, this is the current approach. Another approach that doesn't
require additional refcounts is scanning page tables for empty ones and
reclaiming them. This scanning can either be triggered manually from
user space or automatically from the kernel.

> 
> There is some advantage with this thinking because it harmonizes well
> with the other stuff that wants to convert tables into leafs, but has
> to deal with complicated locking.
> 
> On the other hand, refcounts are a degenerate kind of rwsem and only
> help with freeing pages. It also puts more atomics in normal fast
> paths since we are refcounting each PTE, not read locking the PMD.
> 
> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
> means a table cannot be come a leaf. I don't know if there is space
> for another atomic in the PMD level, and we'd have to use a hitching
> post/hashed waitq scheme too since there surely isn't room for a waitq
> too..
> 
> I wouldn't be so quick to say one is better than the other, but at
> least let's have thought about a locking solution before merging
> refcounts :)

Yes, absolutely. I can see the beauty in the current approach, because
it just reclaims "automatically" once possible -- page table empty and
nobody is walking it. The downside is that it doesn't always make sense
to reclaim an empty page table immediately once it turns empty.

Also, it adds complexity for something that is only a problem in some
corner cases -- sparse memory mappings, especially relevant for some
memory allocators after freeing a lot of memory or running VMs with
memory ballooning after inflating the balloon. Some of these use cases
might be good with just triggering page table reclaim manually from user
space.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 17:37           ` David Hildenbrand
@ 2021-11-10 17:49             ` Jason Gunthorpe
  2021-11-11  3:58             ` Qi Zheng
  1 sibling, 0 replies; 43+ messages in thread
From: Jason Gunthorpe @ 2021-11-10 17:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On Wed, Nov 10, 2021 at 06:37:46PM +0100, David Hildenbrand wrote:

> It will be if we somewhat frequent when reclaim an empty PTE page table
> as soon as it turns empty.

What do you think is the frequency of unlocked page table walks that
this would have to block on?

> Also, it adds complexity for something that is only a problem in some
> corner cases -- sparse memory mappings, especially relevant for some
> memory allocators after freeing a lot of memory or running VMs with
> memory ballooning after inflating the balloon. Some of these use cases
> might be good with just triggering page table reclaim manually from user
> space.

Right, this is why it would be nice if the complexity could address
more than one problem, like the existing complex locking around the
thp stuff..

Jason

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 06/15] mm: introduce refcount for user PTE page table page
  2021-11-10 10:54 ` [PATCH v3 06/15] mm: introduce refcount for user PTE page table page Qi Zheng
@ 2021-11-11  0:37   ` kernel test robot
  0 siblings, 0 replies; 43+ messages in thread
From: kernel test robot @ 2021-11-11  0:37 UTC (permalink / raw)
  To: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: kbuild-all, linux-doc, linux-kernel, linux-mm, songmuchun

[-- Attachment #1: Type: text/plain, Size: 11092 bytes --]

Hi Qi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on linus/master next-20211110]
[cannot apply to tip/perf/core tip/x86/core v5.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
base:   https://github.com/hnaz/linux-mm master
config: mips-allyesconfig (attached as .config)
compiler: mips-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/e03404013f81d7b11aa6f5c3fef3816320b2baf0
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
        git checkout e03404013f81d7b11aa6f5c3fef3816320b2baf0
        # save the attached .config to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=mips prepare

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   error: no override and no default toolchain set
   init/Kconfig:70:warning: 'RUSTC_VERSION': number is invalid
   In file included from include/linux/mmzone.h:21,
                    from include/linux/gfp.h:6,
                    from include/linux/radix-tree.h:12,
                    from include/linux/fs.h:15,
                    from include/linux/compat.h:17,
                    from arch/mips/kernel/asm-offsets.c:12:
>> include/linux/mm_types.h:154:33: error: unknown type name 'pmd_t'
     154 |                                 pmd_t *pmd;             /* PTE page only */
         |                                 ^~~~~
   In file included from include/asm-generic/div64.h:27,
                    from arch/mips/include/asm/div64.h:89,
                    from include/linux/math.h:5,
                    from include/linux/math64.h:6,
                    from include/linux/time.h:6,
                    from include/linux/compat.h:10,
                    from arch/mips/kernel/asm-offsets.c:12:
   include/linux/mm.h: In function 'pte_alloc':
   include/linux/mm.h:2318:22: error: implicit declaration of function 'is_huge_pmd'; did you mean 'zap_huge_pmd'? [-Werror=implicit-function-declaration]
    2318 |         if (unlikely(is_huge_pmd(*pmd)))
         |                      ^~~~~~~~~~~
   include/linux/compiler.h:78:45: note: in definition of macro 'unlikely'
      78 | # define unlikely(x)    __builtin_expect(!!(x), 0)
         |                                             ^
   arch/mips/kernel/asm-offsets.c: At top level:
   arch/mips/kernel/asm-offsets.c:26:6: error: no previous prototype for 'output_ptreg_defines' [-Werror=missing-prototypes]
      26 | void output_ptreg_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:78:6: error: no previous prototype for 'output_task_defines' [-Werror=missing-prototypes]
      78 | void output_task_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:92:6: error: no previous prototype for 'output_thread_info_defines' [-Werror=missing-prototypes]
      92 | void output_thread_info_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:108:6: error: no previous prototype for 'output_thread_defines' [-Werror=missing-prototypes]
     108 | void output_thread_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:136:6: error: no previous prototype for 'output_thread_fpu_defines' [-Werror=missing-prototypes]
     136 | void output_thread_fpu_defines(void)
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:179:6: error: no previous prototype for 'output_mm_defines' [-Werror=missing-prototypes]
     179 | void output_mm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:218:6: error: no previous prototype for 'output_sc_defines' [-Werror=missing-prototypes]
     218 | void output_sc_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:253:6: error: no previous prototype for 'output_signal_defined' [-Werror=missing-prototypes]
     253 | void output_signal_defined(void)
         |      ^~~~~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:320:6: error: no previous prototype for 'output_pbe_defines' [-Werror=missing-prototypes]
     320 | void output_pbe_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:332:6: error: no previous prototype for 'output_pm_defines' [-Werror=missing-prototypes]
     332 | void output_pm_defines(void)
         |      ^~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:346:6: error: no previous prototype for 'output_kvm_defines' [-Werror=missing-prototypes]
     346 | void output_kvm_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   arch/mips/kernel/asm-offsets.c:390:6: error: no previous prototype for 'output_cps_defines' [-Werror=missing-prototypes]
     390 | void output_cps_defines(void)
         |      ^~~~~~~~~~~~~~~~~~
   cc1: all warnings being treated as errors
   make[2]: *** [scripts/Makefile.build:122: arch/mips/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [Makefile:1288: prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:226: __sub-make] Error 2
   make: Target 'prepare' not remade because of errors.


vim +/pmd_t +154 include/linux/mm_types.h

    70	
    71	struct page {
    72		unsigned long flags;		/* Atomic flags, some possibly
    73						 * updated asynchronously */
    74		/*
    75		 * Five words (20/40 bytes) are available in this union.
    76		 * WARNING: bit 0 of the first word is used for PageTail(). That
    77		 * means the other users of this union MUST NOT use the bit to
    78		 * avoid collision and false-positive PageTail().
    79		 */
    80		union {
    81			struct {	/* Page cache and anonymous pages */
    82				/**
    83				 * @lru: Pageout list, eg. active_list protected by
    84				 * lruvec->lru_lock.  Sometimes used as a generic list
    85				 * by the page owner.
    86				 */
    87				struct list_head lru;
    88				/* See page-flags.h for PAGE_MAPPING_FLAGS */
    89				struct address_space *mapping;
    90				pgoff_t index;		/* Our offset within mapping. */
    91				/**
    92				 * @private: Mapping-private opaque data.
    93				 * Usually used for buffer_heads if PagePrivate.
    94				 * Used for swp_entry_t if PageSwapCache.
    95				 * Indicates order in the buddy system if PageBuddy.
    96				 */
    97				unsigned long private;
    98			};
    99			struct {	/* page_pool used by netstack */
   100				/**
   101				 * @pp_magic: magic value to avoid recycling non
   102				 * page_pool allocated pages.
   103				 */
   104				unsigned long pp_magic;
   105				struct page_pool *pp;
   106				unsigned long _pp_mapping_pad;
   107				unsigned long dma_addr;
   108				atomic_long_t pp_frag_count;
   109			};
   110			struct {	/* slab, slob and slub */
   111				union {
   112					struct list_head slab_list;
   113					struct {	/* Partial pages */
   114						struct page *next;
   115	#ifdef CONFIG_64BIT
   116						int pages;	/* Nr of pages left */
   117	#else
   118						short int pages;
   119	#endif
   120					};
   121				};
   122				struct kmem_cache *slab_cache; /* not slob */
   123				/* Double-word boundary */
   124				void *freelist;		/* first free object */
   125				union {
   126					void *s_mem;	/* slab: first object */
   127					unsigned long counters;		/* SLUB */
   128					struct {			/* SLUB */
   129						unsigned inuse:16;
   130						unsigned objects:15;
   131						unsigned frozen:1;
   132					};
   133				};
   134			};
   135			struct {	/* Tail pages of compound page */
   136				unsigned long compound_head;	/* Bit zero is set */
   137	
   138				/* First tail page only */
   139				unsigned char compound_dtor;
   140				unsigned char compound_order;
   141				atomic_t compound_mapcount;
   142				unsigned int compound_nr; /* 1 << compound_order */
   143			};
   144			struct {	/* Second tail page of compound page */
   145				unsigned long _compound_pad_1;	/* compound_head */
   146				atomic_t hpage_pinned_refcount;
   147				/* For both global and memcg */
   148				struct list_head deferred_list;
   149			};
   150			struct {	/* Page table pages */
   151				unsigned long _pt_pad_1;	/* compound_head */
   152				union {
   153					pgtable_t pmd_huge_pte; /* protected by page->ptl */
 > 154					pmd_t *pmd;             /* PTE page only */
   155				};
   156				unsigned long _pt_pad_2;	/* mapping */
   157				union {
   158					struct mm_struct *pt_mm; /* x86 pgds only */
   159					atomic_t pt_frag_refcount; /* powerpc */
   160					atomic_t pte_refcount;  /* PTE page only */
   161				};
   162	#if ALLOC_SPLIT_PTLOCKS
   163				spinlock_t *ptl;
   164	#else
   165				spinlock_t ptl;
   166	#endif
   167			};
   168			struct {	/* ZONE_DEVICE pages */
   169				/** @pgmap: Points to the hosting device page map. */
   170				struct dev_pagemap *pgmap;
   171				void *zone_device_data;
   172				/*
   173				 * ZONE_DEVICE private pages are counted as being
   174				 * mapped so the next 3 words hold the mapping, index,
   175				 * and private fields from the source anonymous or
   176				 * page cache page while the page is migrated to device
   177				 * private memory.
   178				 * ZONE_DEVICE MEMORY_DEVICE_FS_DAX pages also
   179				 * use the mapping, index, and private fields when
   180				 * pmem backed DAX files are mapped.
   181				 */
   182			};
   183	
   184			/** @rcu_head: You can use this to free a page by RCU. */
   185			struct rcu_head rcu_head;
   186		};
   187	
   188		union {		/* This union is 4 bytes in size. */
   189			/*
   190			 * If the page can be mapped to userspace, encodes the number
   191			 * of times this page is referenced by a page table.
   192			 */
   193			atomic_t _mapcount;
   194	
   195			/*
   196			 * If the page is neither PageSlab nor mappable to userspace,
   197			 * the value stored here may help determine what this page
   198			 * is used for.  See page-flags.h for a list of page types
   199			 * which are currently stored here.
   200			 */
   201			unsigned int page_type;
   202	
   203			unsigned int active;		/* SLAB */
   204			int units;			/* SLOB */
   205		};
   206	
   207		/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
   208		atomic_t _refcount;
   209	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 72455 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-10 17:37           ` David Hildenbrand
  2021-11-10 17:49             ` Jason Gunthorpe
@ 2021-11-11  3:58             ` Qi Zheng
  2021-11-11  9:22               ` David Hildenbrand
  1 sibling, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-11  3:58 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>> is a step into the right direction. If you want to modify *some* page
>>> table in your process you have exclude each and every page table walker.
>>> Or did I mis-interpret what you were saying?
>>
>> That is one possible design, it favours fast walking and penalizes
>> mutation. We could also stick a lock in the PMD (instead of a
>> refcount) and still logically be using a lock instead of a refcount
>> scheme. Remember modify here is "want to change a table pointer into a
>> leaf pointer" so it isn't an every day activity..
> 
> It will be if we somewhat frequent when reclaim an empty PTE page table
> as soon as it turns empty. This not only happens when zapping, but also
> during writeback/swapping. So while writing back / swapping you might be
> left with empty page tables to reclaim.
> 
> Of course, this is the current approach. Another approach that doesn't
> require additional refcounts is scanning page tables for empty ones and
> reclaiming them. This scanning can either be triggered manually from
> user space or automatically from the kernel.

Whether it is introducing a special rwsem or scanning an empty page
table, there are two problems as follows:

	#1. When to trigger the scanning or releasing?
	#2. Every time to release a 4K page table page, 512 page table
	    entries need to be scanned.

For #1, if the scanning is triggered manually from user space, the
kernel is relatively passive, and the user does not fully know the best
timing to scan. If the scanning is triggered automatically from the
kernel, that is great. But the timing is not easy to confirm, is it
scanned and reclaimed every time zap or try_to_unmap?

For #2, refcount has advantages.

> 
>>
>> There is some advantage with this thinking because it harmonizes well
>> with the other stuff that wants to convert tables into leafs, but has
>> to deal with complicated locking.
>>
>> On the other hand, refcounts are a degenerate kind of rwsem and only
>> help with freeing pages. It also puts more atomics in normal fast
>> paths since we are refcounting each PTE, not read locking the PMD.
>>
>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>> means a table cannot be come a leaf. I don't know if there is space
>> for another atomic in the PMD level, and we'd have to use a hitching
>> post/hashed waitq scheme too since there surely isn't room for a waitq
>> too..
>>
>> I wouldn't be so quick to say one is better than the other, but at
>> least let's have thought about a locking solution before merging
>> refcounts :)
> 
> Yes, absolutely. I can see the beauty in the current approach, because
> it just reclaims "automatically" once possible -- page table empty and
> nobody is walking it. The downside is that it doesn't always make sense
> to reclaim an empty page table immediately once it turns empty.
> 
> Also, it adds complexity for something that is only a problem in some
> corner cases -- sparse memory mappings, especially relevant for some
> memory allocators after freeing a lot of memory or running VMs with
> memory ballooning after inflating the balloon. Some of these use cases
> might be good with just triggering page table reclaim manually from user
> space.
> 

Yes, this is indeed a problem. Perhaps some flags can be introduced so
that the release of page table pages can be delayed in some cases.
Similar to the lazyfree mechanism in MADV_FREE?

Thanks,
Qi

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 14/15] Documentation: add document for pte_ref
  2021-11-10 14:39   ` Jonathan Corbet
@ 2021-11-11  5:40     ` Qi Zheng
  0 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-11  5:40 UTC (permalink / raw)
  To: Jonathan Corbet, akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/10/21 10:39 PM, Jonathan Corbet wrote:
> Qi Zheng <zhengqi.arch@bytedance.com> writes:
> 
>> This commit adds document for pte_ref under `Documentation/vm/`.
> 
> Thanks for documenting this work!
> 
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>>   Documentation/vm/pte_ref.rst | 212 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 212 insertions(+)
>>   create mode 100644 Documentation/vm/pte_ref.rst
> 
> When you add a new RST file, you also need to add it to the associated
> index.rst file or it won't be included in the docs build.  Instead,
> you'll get the "not included in any toctree" warning that you surely saw
> when you tested the docs build with this file :)

OK, I will add it to the associated index.rst in the next version.

> 
>> diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
>> new file mode 100644
>> index 000000000000..c5323a263464
>> --- /dev/null
>> +++ b/Documentation/vm/pte_ref.rst
>> @@ -0,0 +1,212 @@
>> +.. _pte_ref:
> 
> Do you need this label anywhere?  If not, I'd leave it out.

I will remove this label in the next version.

Thanks,
Qi

> 
> Thanks,
> 
> jon
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11  3:58             ` Qi Zheng
@ 2021-11-11  9:22               ` David Hildenbrand
  2021-11-11 11:08                 ` Qi Zheng
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2021-11-11  9:22 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 11.11.21 04:58, Qi Zheng wrote:
> 
> 
> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>> is a step into the right direction. If you want to modify *some* page
>>>> table in your process you have exclude each and every page table walker.
>>>> Or did I mis-interpret what you were saying?
>>>
>>> That is one possible design, it favours fast walking and penalizes
>>> mutation. We could also stick a lock in the PMD (instead of a
>>> refcount) and still logically be using a lock instead of a refcount
>>> scheme. Remember modify here is "want to change a table pointer into a
>>> leaf pointer" so it isn't an every day activity..
>>
>> It will be if we somewhat frequent when reclaim an empty PTE page table
>> as soon as it turns empty. This not only happens when zapping, but also
>> during writeback/swapping. So while writing back / swapping you might be
>> left with empty page tables to reclaim.
>>
>> Of course, this is the current approach. Another approach that doesn't
>> require additional refcounts is scanning page tables for empty ones and
>> reclaiming them. This scanning can either be triggered manually from
>> user space or automatically from the kernel.
> 
> Whether it is introducing a special rwsem or scanning an empty page
> table, there are two problems as follows:
> 
> 	#1. When to trigger the scanning or releasing?

For example when reclaiming memory, when scanning page tables in
khugepaged, or triggered by user space (note that this is the approach I
originally looked into). But it certainly requires more locking thought
to avoid stopping essentially any page table walker.

> 	#2. Every time to release a 4K page table page, 512 page table
> 	    entries need to be scanned.

It would happen only when actually trigger reclaim of page tables
(again, someone has to trigger it), so it's barely an issue.

For example, khugepaged already scans the page tables either way.

> 
> For #1, if the scanning is triggered manually from user space, the
> kernel is relatively passive, and the user does not fully know the best
> timing to scan. If the scanning is triggered automatically from the
> kernel, that is great. But the timing is not easy to confirm, is it
> scanned and reclaimed every time zap or try_to_unmap?
> 
> For #2, refcount has advantages.
> 
>>
>>>
>>> There is some advantage with this thinking because it harmonizes well
>>> with the other stuff that wants to convert tables into leafs, but has
>>> to deal with complicated locking.
>>>
>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>> help with freeing pages. It also puts more atomics in normal fast
>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>
>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>> means a table cannot be come a leaf. I don't know if there is space
>>> for another atomic in the PMD level, and we'd have to use a hitching
>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>> too..
>>>
>>> I wouldn't be so quick to say one is better than the other, but at
>>> least let's have thought about a locking solution before merging
>>> refcounts :)
>>
>> Yes, absolutely. I can see the beauty in the current approach, because
>> it just reclaims "automatically" once possible -- page table empty and
>> nobody is walking it. The downside is that it doesn't always make sense
>> to reclaim an empty page table immediately once it turns empty.
>>
>> Also, it adds complexity for something that is only a problem in some
>> corner cases -- sparse memory mappings, especially relevant for some
>> memory allocators after freeing a lot of memory or running VMs with
>> memory ballooning after inflating the balloon. Some of these use cases
>> might be good with just triggering page table reclaim manually from user
>> space.
>>
> 
> Yes, this is indeed a problem. Perhaps some flags can be introduced so
> that the release of page table pages can be delayed in some cases.
> Similar to the lazyfree mechanism in MADV_FREE?

The issue AFAIU is that once your refcount hits 0 (no more references,
no more entries), the longer you wait with reclaim, the longer others
have to wait for populating a fresh page table because the "page table
to be reclaimed" is still stuck around. You'd have to keep the refcount
increased for a while, and only drop it after a while. But when? And
how? IMHO it's not trivial, but maybe there is an easy way to achieve it.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11  9:22               ` David Hildenbrand
@ 2021-11-11 11:08                 ` Qi Zheng
  2021-11-11 11:19                   ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-11 11:08 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 5:22 PM, David Hildenbrand wrote:
> On 11.11.21 04:58, Qi Zheng wrote:
>>
>>
>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>> is a step into the right direction. If you want to modify *some* page
>>>>> table in your process you have exclude each and every page table walker.
>>>>> Or did I mis-interpret what you were saying?
>>>>
>>>> That is one possible design, it favours fast walking and penalizes
>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>> refcount) and still logically be using a lock instead of a refcount
>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>> leaf pointer" so it isn't an every day activity..
>>>
>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>> as soon as it turns empty. This not only happens when zapping, but also
>>> during writeback/swapping. So while writing back / swapping you might be
>>> left with empty page tables to reclaim.
>>>
>>> Of course, this is the current approach. Another approach that doesn't
>>> require additional refcounts is scanning page tables for empty ones and
>>> reclaiming them. This scanning can either be triggered manually from
>>> user space or automatically from the kernel.
>>
>> Whether it is introducing a special rwsem or scanning an empty page
>> table, there are two problems as follows:
>>
>> 	#1. When to trigger the scanning or releasing?
> 
> For example when reclaiming memory, when scanning page tables in
> khugepaged, or triggered by user space (note that this is the approach I
> originally looked into). But it certainly requires more locking thought
> to avoid stopping essentially any page table walker.
> 
>> 	#2. Every time to release a 4K page table page, 512 page table
>> 	    entries need to be scanned.
> 
> It would happen only when actually trigger reclaim of page tables
> (again, someone has to trigger it), so it's barely an issue.
> 
> For example, khugepaged already scans the page tables either way.
> 
>>
>> For #1, if the scanning is triggered manually from user space, the
>> kernel is relatively passive, and the user does not fully know the best
>> timing to scan. If the scanning is triggered automatically from the
>> kernel, that is great. But the timing is not easy to confirm, is it
>> scanned and reclaimed every time zap or try_to_unmap?
>>
>> For #2, refcount has advantages.
>>
>>>
>>>>
>>>> There is some advantage with this thinking because it harmonizes well
>>>> with the other stuff that wants to convert tables into leafs, but has
>>>> to deal with complicated locking.
>>>>
>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>> help with freeing pages. It also puts more atomics in normal fast
>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>
>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>> means a table cannot be come a leaf. I don't know if there is space
>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>> too..
>>>>
>>>> I wouldn't be so quick to say one is better than the other, but at
>>>> least let's have thought about a locking solution before merging
>>>> refcounts :)
>>>
>>> Yes, absolutely. I can see the beauty in the current approach, because
>>> it just reclaims "automatically" once possible -- page table empty and
>>> nobody is walking it. The downside is that it doesn't always make sense
>>> to reclaim an empty page table immediately once it turns empty.
>>>
>>> Also, it adds complexity for something that is only a problem in some
>>> corner cases -- sparse memory mappings, especially relevant for some
>>> memory allocators after freeing a lot of memory or running VMs with
>>> memory ballooning after inflating the balloon. Some of these use cases
>>> might be good with just triggering page table reclaim manually from user
>>> space.
>>>
>>
>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>> that the release of page table pages can be delayed in some cases.
>> Similar to the lazyfree mechanism in MADV_FREE?
> 
> The issue AFAIU is that once your refcount hits 0 (no more references,
> no more entries), the longer you wait with reclaim, the longer others
> have to wait for populating a fresh page table because the "page table
> to be reclaimed" is still stuck around. You'd have to keep the refcount
> increased for a while, and only drop it after a while. But when? And
> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
> 

For running VMs with memory ballooning after inflating the balloon, is
this a hot behavior? Even if it is, it is already facing the release and
reallocation of physical pages. The overhead after introducing
pte_refcount is that we need to release and re-allocate page table page.
But 2MB physical pages only corresponds to 4KiB of PTE page table page.
So maybe the overhead is not big.

In fact, the performance test shown on the cover letter is this case:

	test program: 
https://lore.kernel.org/lkml/20100106160614.ff756f82.kamezawa.hiroyu@jp.fujitsu.com/2-multi-fault-all.c

Thanks,
Qi

> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 11:08                 ` Qi Zheng
@ 2021-11-11 11:19                   ` David Hildenbrand
  2021-11-11 12:00                     ` Qi Zheng
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2021-11-11 11:19 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 11.11.21 12:08, Qi Zheng wrote:
> 
> 
> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>> On 11.11.21 04:58, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>> table in your process you have exclude each and every page table walker.
>>>>>> Or did I mis-interpret what you were saying?
>>>>>
>>>>> That is one possible design, it favours fast walking and penalizes
>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>> leaf pointer" so it isn't an every day activity..
>>>>
>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>> during writeback/swapping. So while writing back / swapping you might be
>>>> left with empty page tables to reclaim.
>>>>
>>>> Of course, this is the current approach. Another approach that doesn't
>>>> require additional refcounts is scanning page tables for empty ones and
>>>> reclaiming them. This scanning can either be triggered manually from
>>>> user space or automatically from the kernel.
>>>
>>> Whether it is introducing a special rwsem or scanning an empty page
>>> table, there are two problems as follows:
>>>
>>> 	#1. When to trigger the scanning or releasing?
>>
>> For example when reclaiming memory, when scanning page tables in
>> khugepaged, or triggered by user space (note that this is the approach I
>> originally looked into). But it certainly requires more locking thought
>> to avoid stopping essentially any page table walker.
>>
>>> 	#2. Every time to release a 4K page table page, 512 page table
>>> 	    entries need to be scanned.
>>
>> It would happen only when actually trigger reclaim of page tables
>> (again, someone has to trigger it), so it's barely an issue.
>>
>> For example, khugepaged already scans the page tables either way.
>>
>>>
>>> For #1, if the scanning is triggered manually from user space, the
>>> kernel is relatively passive, and the user does not fully know the best
>>> timing to scan. If the scanning is triggered automatically from the
>>> kernel, that is great. But the timing is not easy to confirm, is it
>>> scanned and reclaimed every time zap or try_to_unmap?
>>>
>>> For #2, refcount has advantages.
>>>
>>>>
>>>>>
>>>>> There is some advantage with this thinking because it harmonizes well
>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>> to deal with complicated locking.
>>>>>
>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>
>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>> too..
>>>>>
>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>> least let's have thought about a locking solution before merging
>>>>> refcounts :)
>>>>
>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>> it just reclaims "automatically" once possible -- page table empty and
>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>> to reclaim an empty page table immediately once it turns empty.
>>>>
>>>> Also, it adds complexity for something that is only a problem in some
>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>> memory allocators after freeing a lot of memory or running VMs with
>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>> might be good with just triggering page table reclaim manually from user
>>>> space.
>>>>
>>>
>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>> that the release of page table pages can be delayed in some cases.
>>> Similar to the lazyfree mechanism in MADV_FREE?
>>
>> The issue AFAIU is that once your refcount hits 0 (no more references,
>> no more entries), the longer you wait with reclaim, the longer others
>> have to wait for populating a fresh page table because the "page table
>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>> increased for a while, and only drop it after a while. But when? And
>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>
> 
> For running VMs with memory ballooning after inflating the balloon, is
> this a hot behavior? Even if it is, it is already facing the release and
> reallocation of physical pages. The overhead after introducing
> pte_refcount is that we need to release and re-allocate page table page.
> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
> So maybe the overhead is not big.

The cases that come to my mind are

a) Swapping on shared memory with concurrent access
b) Reclaim on file-backed memory with concurrent access
c) Free page reporting as implemented by virtio-balloon

In all of these cases, you can have someone immediately re-access the
page table and re-populate it.

For something mostly static (balloon inflation, memory allocator), it's
not that big of a deal I guess.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 11:19                   ` David Hildenbrand
@ 2021-11-11 12:00                     ` Qi Zheng
  2021-11-11 12:20                       ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-11 12:00 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 7:19 PM, David Hildenbrand wrote:
> On 11.11.21 12:08, Qi Zheng wrote:
>>
>>
>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>
>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>
>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>> left with empty page tables to reclaim.
>>>>>
>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>> user space or automatically from the kernel.
>>>>
>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>> table, there are two problems as follows:
>>>>
>>>> 	#1. When to trigger the scanning or releasing?
>>>
>>> For example when reclaiming memory, when scanning page tables in
>>> khugepaged, or triggered by user space (note that this is the approach I
>>> originally looked into). But it certainly requires more locking thought
>>> to avoid stopping essentially any page table walker.
>>>
>>>> 	#2. Every time to release a 4K page table page, 512 page table
>>>> 	    entries need to be scanned.
>>>
>>> It would happen only when actually trigger reclaim of page tables
>>> (again, someone has to trigger it), so it's barely an issue.
>>>
>>> For example, khugepaged already scans the page tables either way.
>>>
>>>>
>>>> For #1, if the scanning is triggered manually from user space, the
>>>> kernel is relatively passive, and the user does not fully know the best
>>>> timing to scan. If the scanning is triggered automatically from the
>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>
>>>> For #2, refcount has advantages.
>>>>
>>>>>
>>>>>>
>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>> to deal with complicated locking.
>>>>>>
>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>
>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>> too..
>>>>>>
>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>> least let's have thought about a locking solution before merging
>>>>>> refcounts :)
>>>>>
>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>
>>>>> Also, it adds complexity for something that is only a problem in some
>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>> might be good with just triggering page table reclaim manually from user
>>>>> space.
>>>>>
>>>>
>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>> that the release of page table pages can be delayed in some cases.
>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>
>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>> no more entries), the longer you wait with reclaim, the longer others
>>> have to wait for populating a fresh page table because the "page table
>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>> increased for a while, and only drop it after a while. But when? And
>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>
>>
>> For running VMs with memory ballooning after inflating the balloon, is
>> this a hot behavior? Even if it is, it is already facing the release and
>> reallocation of physical pages. The overhead after introducing
>> pte_refcount is that we need to release and re-allocate page table page.
>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>> So maybe the overhead is not big.
> 
> The cases that come to my mind are
> 
> a) Swapping on shared memory with concurrent access
> b) Reclaim on file-backed memory with concurrent access
> c) Free page reporting as implemented by virtio-balloon
> 
> In all of these cases, you can have someone immediately re-access the
> page table and re-populate it.

In the performance test shown on the cover, we repeatedly performed
touch and madvise(MADV_DONTNEED) actions, which simulated the case
you said above.

We did find a small amount of performance regression, but I think it is
acceptable, and no new perf hotspots have been added.

> 
> For something mostly static (balloon inflation, memory allocator), it's
> not that big of a deal I guess.
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 12:00                     ` Qi Zheng
@ 2021-11-11 12:20                       ` David Hildenbrand
  2021-11-11 12:32                         ` Qi Zheng
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2021-11-11 12:20 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 11.11.21 13:00, Qi Zheng wrote:
> 
> 
> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>> On 11.11.21 12:08, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>
>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>
>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>> left with empty page tables to reclaim.
>>>>>>
>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>> user space or automatically from the kernel.
>>>>>
>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>> table, there are two problems as follows:
>>>>>
>>>>> 	#1. When to trigger the scanning or releasing?
>>>>
>>>> For example when reclaiming memory, when scanning page tables in
>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>> originally looked into). But it certainly requires more locking thought
>>>> to avoid stopping essentially any page table walker.
>>>>
>>>>> 	#2. Every time to release a 4K page table page, 512 page table
>>>>> 	    entries need to be scanned.
>>>>
>>>> It would happen only when actually trigger reclaim of page tables
>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>
>>>> For example, khugepaged already scans the page tables either way.
>>>>
>>>>>
>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>
>>>>> For #2, refcount has advantages.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>> to deal with complicated locking.
>>>>>>>
>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>
>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>> too..
>>>>>>>
>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>> least let's have thought about a locking solution before merging
>>>>>>> refcounts :)
>>>>>>
>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>
>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>> space.
>>>>>>
>>>>>
>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>> that the release of page table pages can be delayed in some cases.
>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>
>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>> no more entries), the longer you wait with reclaim, the longer others
>>>> have to wait for populating a fresh page table because the "page table
>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>> increased for a while, and only drop it after a while. But when? And
>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>
>>>
>>> For running VMs with memory ballooning after inflating the balloon, is
>>> this a hot behavior? Even if it is, it is already facing the release and
>>> reallocation of physical pages. The overhead after introducing
>>> pte_refcount is that we need to release and re-allocate page table page.
>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>> So maybe the overhead is not big.
>>
>> The cases that come to my mind are
>>
>> a) Swapping on shared memory with concurrent access
>> b) Reclaim on file-backed memory with concurrent access
>> c) Free page reporting as implemented by virtio-balloon
>>
>> In all of these cases, you can have someone immediately re-access the
>> page table and re-populate it.
> 
> In the performance test shown on the cover, we repeatedly performed
> touch and madvise(MADV_DONTNEED) actions, which simulated the case
> you said above.
> 
> We did find a small amount of performance regression, but I think it is
> acceptable, and no new perf hotspots have been added.

That test always accesses 2MiB and does it from a single thread. Things
might (IMHO will) look different when only accessing individual pages
and doing the access from one/multiple separate threads (that's what
a),b) and c) essentially do, they don't do it in the pattern you
measured. what you measured matches rather a typical memory allocator).


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 12:20                       ` David Hildenbrand
@ 2021-11-11 12:32                         ` Qi Zheng
  2021-11-11 12:51                           ` David Hildenbrand
  0 siblings, 1 reply; 43+ messages in thread
From: Qi Zheng @ 2021-11-11 12:32 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 8:20 PM, David Hildenbrand wrote:
> On 11.11.21 13:00, Qi Zheng wrote:
>>
>>
>> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>>> On 11.11.21 12:08, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>>
>>>>>>
>>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>>
>>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>>
>>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>>> left with empty page tables to reclaim.
>>>>>>>
>>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>>> user space or automatically from the kernel.
>>>>>>
>>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>>> table, there are two problems as follows:
>>>>>>
>>>>>> 	#1. When to trigger the scanning or releasing?
>>>>>
>>>>> For example when reclaiming memory, when scanning page tables in
>>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>>> originally looked into). But it certainly requires more locking thought
>>>>> to avoid stopping essentially any page table walker.
>>>>>
>>>>>> 	#2. Every time to release a 4K page table page, 512 page table
>>>>>> 	    entries need to be scanned.
>>>>>
>>>>> It would happen only when actually trigger reclaim of page tables
>>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>>
>>>>> For example, khugepaged already scans the page tables either way.
>>>>>
>>>>>>
>>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>>
>>>>>> For #2, refcount has advantages.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>>> to deal with complicated locking.
>>>>>>>>
>>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>>
>>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>>> too..
>>>>>>>>
>>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>>> least let's have thought about a locking solution before merging
>>>>>>>> refcounts :)
>>>>>>>
>>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>>
>>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>>> space.
>>>>>>>
>>>>>>
>>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>>> that the release of page table pages can be delayed in some cases.
>>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>>
>>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>>> no more entries), the longer you wait with reclaim, the longer others
>>>>> have to wait for populating a fresh page table because the "page table
>>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>>> increased for a while, and only drop it after a while. But when? And
>>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>>
>>>>
>>>> For running VMs with memory ballooning after inflating the balloon, is
>>>> this a hot behavior? Even if it is, it is already facing the release and
>>>> reallocation of physical pages. The overhead after introducing
>>>> pte_refcount is that we need to release and re-allocate page table page.
>>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>>> So maybe the overhead is not big.
>>>
>>> The cases that come to my mind are
>>>
>>> a) Swapping on shared memory with concurrent access
>>> b) Reclaim on file-backed memory with concurrent access
>>> c) Free page reporting as implemented by virtio-balloon
>>>
>>> In all of these cases, you can have someone immediately re-access the
>>> page table and re-populate it.
>>
>> In the performance test shown on the cover, we repeatedly performed
>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>> you said above.
>>
>> We did find a small amount of performance regression, but I think it is
>> acceptable, and no new perf hotspots have been added.
> 
> That test always accesses 2MiB and does it from a single thread. Things
> might (IMHO will) look different when only accessing individual pages
> and doing the access from one/multiple separate threads (that's what

No, it includes multi-threading:

	while (1) {
		char *c;
		char *start = mmap_area[cpu];
		char *end = mmap_area[cpu] + FAULT_LENGTH;
		pthread_barrier_wait(&barrier);
		//printf("fault into %p-%p\n",start, end);

		for (c = start; c < end; c += PAGE_SIZE)
			*c = 0;

		pthread_barrier_wait(&barrier);
		for (i = 0; cpu==0 && i < num; i++)
			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
		pthread_barrier_wait(&barrier);
	}

Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
memory of threads on other cpu.

> a),b) and c) essentially do, they don't do it in the pattern you
> measured. what you measured matches rather a typical memory allocator).
> 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 12:32                         ` Qi Zheng
@ 2021-11-11 12:51                           ` David Hildenbrand
  2021-11-11 13:01                             ` Qi Zheng
  0 siblings, 1 reply; 43+ messages in thread
From: David Hildenbrand @ 2021-11-11 12:51 UTC (permalink / raw)
  To: Qi Zheng, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming

On 11.11.21 13:32, Qi Zheng wrote:
> 
> 
> On 11/11/21 8:20 PM, David Hildenbrand wrote:
>> On 11.11.21 13:00, Qi Zheng wrote:
>>>
>>>
>>> On 11/11/21 7:19 PM, David Hildenbrand wrote:
>>>> On 11.11.21 12:08, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/11/21 5:22 PM, David Hildenbrand wrote:
>>>>>> On 11.11.21 04:58, Qi Zheng wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/11/21 1:37 AM, David Hildenbrand wrote:
>>>>>>>>>> It would still be a fairly coarse-grained locking, I am not sure if that
>>>>>>>>>> is a step into the right direction. If you want to modify *some* page
>>>>>>>>>> table in your process you have exclude each and every page table walker.
>>>>>>>>>> Or did I mis-interpret what you were saying?
>>>>>>>>>
>>>>>>>>> That is one possible design, it favours fast walking and penalizes
>>>>>>>>> mutation. We could also stick a lock in the PMD (instead of a
>>>>>>>>> refcount) and still logically be using a lock instead of a refcount
>>>>>>>>> scheme. Remember modify here is "want to change a table pointer into a
>>>>>>>>> leaf pointer" so it isn't an every day activity..
>>>>>>>>
>>>>>>>> It will be if we somewhat frequent when reclaim an empty PTE page table
>>>>>>>> as soon as it turns empty. This not only happens when zapping, but also
>>>>>>>> during writeback/swapping. So while writing back / swapping you might be
>>>>>>>> left with empty page tables to reclaim.
>>>>>>>>
>>>>>>>> Of course, this is the current approach. Another approach that doesn't
>>>>>>>> require additional refcounts is scanning page tables for empty ones and
>>>>>>>> reclaiming them. This scanning can either be triggered manually from
>>>>>>>> user space or automatically from the kernel.
>>>>>>>
>>>>>>> Whether it is introducing a special rwsem or scanning an empty page
>>>>>>> table, there are two problems as follows:
>>>>>>>
>>>>>>> 	#1. When to trigger the scanning or releasing?
>>>>>>
>>>>>> For example when reclaiming memory, when scanning page tables in
>>>>>> khugepaged, or triggered by user space (note that this is the approach I
>>>>>> originally looked into). But it certainly requires more locking thought
>>>>>> to avoid stopping essentially any page table walker.
>>>>>>
>>>>>>> 	#2. Every time to release a 4K page table page, 512 page table
>>>>>>> 	    entries need to be scanned.
>>>>>>
>>>>>> It would happen only when actually trigger reclaim of page tables
>>>>>> (again, someone has to trigger it), so it's barely an issue.
>>>>>>
>>>>>> For example, khugepaged already scans the page tables either way.
>>>>>>
>>>>>>>
>>>>>>> For #1, if the scanning is triggered manually from user space, the
>>>>>>> kernel is relatively passive, and the user does not fully know the best
>>>>>>> timing to scan. If the scanning is triggered automatically from the
>>>>>>> kernel, that is great. But the timing is not easy to confirm, is it
>>>>>>> scanned and reclaimed every time zap or try_to_unmap?
>>>>>>>
>>>>>>> For #2, refcount has advantages.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> There is some advantage with this thinking because it harmonizes well
>>>>>>>>> with the other stuff that wants to convert tables into leafs, but has
>>>>>>>>> to deal with complicated locking.
>>>>>>>>>
>>>>>>>>> On the other hand, refcounts are a degenerate kind of rwsem and only
>>>>>>>>> help with freeing pages. It also puts more atomics in normal fast
>>>>>>>>> paths since we are refcounting each PTE, not read locking the PMD.
>>>>>>>>>
>>>>>>>>> Perhaps the ideal thing would be to stick a rwsem in the PMD. read
>>>>>>>>> means a table cannot be come a leaf. I don't know if there is space
>>>>>>>>> for another atomic in the PMD level, and we'd have to use a hitching
>>>>>>>>> post/hashed waitq scheme too since there surely isn't room for a waitq
>>>>>>>>> too..
>>>>>>>>>
>>>>>>>>> I wouldn't be so quick to say one is better than the other, but at
>>>>>>>>> least let's have thought about a locking solution before merging
>>>>>>>>> refcounts :)
>>>>>>>>
>>>>>>>> Yes, absolutely. I can see the beauty in the current approach, because
>>>>>>>> it just reclaims "automatically" once possible -- page table empty and
>>>>>>>> nobody is walking it. The downside is that it doesn't always make sense
>>>>>>>> to reclaim an empty page table immediately once it turns empty.
>>>>>>>>
>>>>>>>> Also, it adds complexity for something that is only a problem in some
>>>>>>>> corner cases -- sparse memory mappings, especially relevant for some
>>>>>>>> memory allocators after freeing a lot of memory or running VMs with
>>>>>>>> memory ballooning after inflating the balloon. Some of these use cases
>>>>>>>> might be good with just triggering page table reclaim manually from user
>>>>>>>> space.
>>>>>>>>
>>>>>>>
>>>>>>> Yes, this is indeed a problem. Perhaps some flags can be introduced so
>>>>>>> that the release of page table pages can be delayed in some cases.
>>>>>>> Similar to the lazyfree mechanism in MADV_FREE?
>>>>>>
>>>>>> The issue AFAIU is that once your refcount hits 0 (no more references,
>>>>>> no more entries), the longer you wait with reclaim, the longer others
>>>>>> have to wait for populating a fresh page table because the "page table
>>>>>> to be reclaimed" is still stuck around. You'd have to keep the refcount
>>>>>> increased for a while, and only drop it after a while. But when? And
>>>>>> how? IMHO it's not trivial, but maybe there is an easy way to achieve it.
>>>>>>
>>>>>
>>>>> For running VMs with memory ballooning after inflating the balloon, is
>>>>> this a hot behavior? Even if it is, it is already facing the release and
>>>>> reallocation of physical pages. The overhead after introducing
>>>>> pte_refcount is that we need to release and re-allocate page table page.
>>>>> But 2MB physical pages only corresponds to 4KiB of PTE page table page.
>>>>> So maybe the overhead is not big.
>>>>
>>>> The cases that come to my mind are
>>>>
>>>> a) Swapping on shared memory with concurrent access
>>>> b) Reclaim on file-backed memory with concurrent access
>>>> c) Free page reporting as implemented by virtio-balloon
>>>>
>>>> In all of these cases, you can have someone immediately re-access the
>>>> page table and re-populate it.
>>>
>>> In the performance test shown on the cover, we repeatedly performed
>>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>>> you said above.
>>>
>>> We did find a small amount of performance regression, but I think it is
>>> acceptable, and no new perf hotspots have been added.
>>
>> That test always accesses 2MiB and does it from a single thread. Things
>> might (IMHO will) look different when only accessing individual pages
>> and doing the access from one/multiple separate threads (that's what
> 
> No, it includes multi-threading:
> 

Oh sorry, I totally skipped [2].

> 	while (1) {
> 		char *c;
> 		char *start = mmap_area[cpu];
> 		char *end = mmap_area[cpu] + FAULT_LENGTH;
> 		pthread_barrier_wait(&barrier);
> 		//printf("fault into %p-%p\n",start, end);
> 
> 		for (c = start; c < end; c += PAGE_SIZE)
> 			*c = 0;
> 
> 		pthread_barrier_wait(&barrier);
> 		for (i = 0; cpu==0 && i < num; i++)
> 			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
> 		pthread_barrier_wait(&barrier);
> 	}
> 
> Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
> memory of threads on other cpu.
> 

I'll have a more detailed look at the benchmark. On a quick glimpse,
looks like the threads are also accessing a full 2MiB range, one page at
a time, and one thread is zapping the whole 2MiB range. A single CPU
only accesses memory within one 2MiB range IIRC.

Having multiple threads just access individual pages within a single 2
MiB region, and having one thread zap that memory (e.g., simulate
swapout) could be another benchmark.

We have to make sure to run with THP disabled (e.g., using
madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
eventually), because otherwise you might just be populating+zapping THPs
if they would otherwise be allowed in the environment.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 00/15] Free user PTE page table pages
  2021-11-11 12:51                           ` David Hildenbrand
@ 2021-11-11 13:01                             ` Qi Zheng
  0 siblings, 0 replies; 43+ messages in thread
From: Qi Zheng @ 2021-11-11 13:01 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe
  Cc: akpm, tglx, kirill.shutemov, mika.penttila, linux-doc,
	linux-kernel, linux-mm, songmuchun, zhouchengming



On 11/11/21 8:51 PM, David Hildenbrand wrote:

>>>>
>>>> In the performance test shown on the cover, we repeatedly performed
>>>> touch and madvise(MADV_DONTNEED) actions, which simulated the case
>>>> you said above.
>>>>
>>>> We did find a small amount of performance regression, but I think it is
>>>> acceptable, and no new perf hotspots have been added.
>>>
>>> That test always accesses 2MiB and does it from a single thread. Things
>>> might (IMHO will) look different when only accessing individual pages
>>> and doing the access from one/multiple separate threads (that's what
>>
>> No, it includes multi-threading:
>>
> 
> Oh sorry, I totally skipped [2].
> 
>> 	while (1) {
>> 		char *c;
>> 		char *start = mmap_area[cpu];
>> 		char *end = mmap_area[cpu] + FAULT_LENGTH;
>> 		pthread_barrier_wait(&barrier);
>> 		//printf("fault into %p-%p\n",start, end);
>>
>> 		for (c = start; c < end; c += PAGE_SIZE)
>> 			*c = 0;
>>
>> 		pthread_barrier_wait(&barrier);
>> 		for (i = 0; cpu==0 && i < num; i++)
>> 			madvise(mmap_area[i], FAULT_LENGTH, MADV_DONTNEED);
>> 		pthread_barrier_wait(&barrier);
>> 	}
>>
>> Thread on cpu0 will use madvise(MADV_DONTNEED) to release the physical
>> memory of threads on other cpu.
>>
> 
> I'll have a more detailed look at the benchmark. On a quick glimpse,

Thank you for your time :)

> looks like the threads are also accessing a full 2MiB range, one page at
> a time, and one thread is zapping the whole 2MiB range. A single CPU
> only accesses memory within one 2MiB range IIRC.
> 
> Having multiple threads just access individual pages within a single 2
> MiB region, and having one thread zap that memory (e.g., simulate
> swapout) could be another benchmark.

LGTM, I will simulate more scenarios for testing.

> 
> We have to make sure to run with THP disabled (e.g., using
> madvise(MADV_NOHUGEPAGE) on the complete mapping in the benchmark
> eventually), because otherwise you might just be populating+zapping THPs
> if they would otherwise be allowed in the environment.

Yes, I turned off THP during testing:

root@~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

> 

-- 
Thanks,
Qi

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 02/15] mm: introduce is_huge_pmd() helper
  2021-11-10 10:54 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
@ 2021-11-11 13:46   ` kernel test robot
  0 siblings, 0 replies; 43+ messages in thread
From: kernel test robot @ 2021-11-11 13:46 UTC (permalink / raw)
  To: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: kbuild-all, linux-doc, linux-kernel, linux-mm, songmuchun

[-- Attachment #1: Type: text/plain, Size: 9105 bytes --]

Hi Qi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on tip/perf/core tip/x86/core linus/master v5.15 next-20211111]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
base:   https://github.com/hnaz/linux-mm master
config: ia64-defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/ce86336fbabb116520ad01162faf5c8d4a1ce124
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
        git checkout ce86336fbabb116520ad01162faf5c8d4a1ce124
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=ia64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/memory.c: In function 'copy_pmd_range':
>> mm/memory.c:1149:21: error: implicit declaration of function 'is_huge_pmd'; did you mean 'is_hugepd'? [-Werror=implicit-function-declaration]
    1149 |                 if (is_huge_pmd(*src_pmd)) {
         |                     ^~~~~~~~~~~
         |                     is_hugepd
   In file included from <command-line>:
   In function 'zap_pmd_range',
       inlined from 'zap_pud_range' at mm/memory.c:1499:10,
       inlined from 'zap_p4d_range' at mm/memory.c:1520:10,
       inlined from 'unmap_page_range' at mm/memory.c:1541:10:
   include/linux/compiler_types.h:335:45: error: call to '__compiletime_assert_304' declared with attribute error: BUILD_BUG failed
     335 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:316:25: note: in definition of macro '__compiletime_assert'
     316 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:335:9: note: in expansion of macro '_compiletime_assert'
     335 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
         |                     ^~~~~~~~~~~~~~~~
   include/linux/huge_mm.h:328:27: note: in expansion of macro 'BUILD_BUG'
     328 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
         |                           ^~~~~~~~~
   mm/memory.c:1444:44: note: in expansion of macro 'HPAGE_PMD_SIZE'
    1444 |                         if (next - addr != HPAGE_PMD_SIZE)
         |                                            ^~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   mm/mprotect.c: In function 'change_pmd_range':
>> mm/mprotect.c:260:21: error: implicit declaration of function 'is_huge_pmd'; did you mean 'is_hugepd'? [-Werror=implicit-function-declaration]
     260 |                 if (is_huge_pmd(*pmd)) {
         |                     ^~~~~~~~~~~
         |                     is_hugepd
   In file included from <command-line>:
   In function 'change_pmd_range',
       inlined from 'change_pud_range' at mm/mprotect.c:307:12,
       inlined from 'change_p4d_range' at mm/mprotect.c:327:12,
       inlined from 'change_protection_range' at mm/mprotect.c:352:12:
   include/linux/compiler_types.h:335:45: error: call to '__compiletime_assert_298' declared with attribute error: BUILD_BUG failed
     335 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:316:25: note: in definition of macro '__compiletime_assert'
     316 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:335:9: note: in expansion of macro '_compiletime_assert'
     335 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
         |                     ^~~~~~~~~~~~~~~~
   include/linux/huge_mm.h:328:27: note: in expansion of macro 'BUILD_BUG'
     328 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
         |                           ^~~~~~~~~
   mm/mprotect.c:261:44: note: in expansion of macro 'HPAGE_PMD_SIZE'
     261 |                         if (next - addr != HPAGE_PMD_SIZE) {
         |                                            ^~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
--
   mm/mremap.c: In function 'move_page_tables':
>> mm/mremap.c:535:21: error: implicit declaration of function 'is_huge_pmd'; did you mean 'is_hugepd'? [-Werror=implicit-function-declaration]
     535 |                 if (is_huge_pmd(*old_pmd)) {
         |                     ^~~~~~~~~~~
         |                     is_hugepd
   In file included from <command-line>:
   include/linux/compiler_types.h:335:45: error: call to '__compiletime_assert_304' declared with attribute error: BUILD_BUG failed
     335 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:316:25: note: in definition of macro '__compiletime_assert'
     316 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:335:9: note: in expansion of macro '_compiletime_assert'
     335 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
         |                     ^~~~~~~~~~~~~~~~
   include/linux/huge_mm.h:328:27: note: in expansion of macro 'BUILD_BUG'
     328 | #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
         |                           ^~~~~~~~~
   mm/mremap.c:536:39: note: in expansion of macro 'HPAGE_PMD_SIZE'
     536 |                         if (extent == HPAGE_PMD_SIZE &&
         |                                       ^~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +1149 mm/memory.c

  1132	
  1133	static inline int
  1134	copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
  1135		       pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
  1136		       unsigned long end)
  1137	{
  1138		struct mm_struct *dst_mm = dst_vma->vm_mm;
  1139		struct mm_struct *src_mm = src_vma->vm_mm;
  1140		pmd_t *src_pmd, *dst_pmd;
  1141		unsigned long next;
  1142	
  1143		dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
  1144		if (!dst_pmd)
  1145			return -ENOMEM;
  1146		src_pmd = pmd_offset(src_pud, addr);
  1147		do {
  1148			next = pmd_addr_end(addr, end);
> 1149			if (is_huge_pmd(*src_pmd)) {
  1150				int err;
  1151				VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
  1152				err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
  1153						    addr, dst_vma, src_vma);
  1154				if (err == -ENOMEM)
  1155					return -ENOMEM;
  1156				if (!err)
  1157					continue;
  1158				/* fall through */
  1159			}
  1160			if (pmd_none_or_clear_bad(src_pmd))
  1161				continue;
  1162			if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
  1163					   addr, next))
  1164				return -ENOMEM;
  1165		} while (dst_pmd++, src_pmd++, addr = next, addr != end);
  1166		return 0;
  1167	}
  1168	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 20057 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation
  2021-11-10 10:54 ` [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation Qi Zheng
@ 2021-11-11 15:17   ` kernel test robot
  0 siblings, 0 replies; 43+ messages in thread
From: kernel test robot @ 2021-11-11 15:17 UTC (permalink / raw)
  To: Qi Zheng, akpm, tglx, kirill.shutemov, mika.penttila, david, jgg
  Cc: kbuild-all, linux-doc, linux-kernel, linux-mm, songmuchun

[-- Attachment #1: Type: text/plain, Size: 4248 bytes --]

Hi Qi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on hnaz-mm/master]
[also build test ERROR on linus/master next-20211111]
[cannot apply to tip/perf/core tip/x86/core v5.15]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
base:   https://github.com/hnaz/linux-mm master
config: ia64-defconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/6e3cc5bb722cbd2fc4170d2f5371e52792d17d2e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
        git checkout 6e3cc5bb722cbd2fc4170d2f5371e52792d17d2e
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=ia64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from arch/ia64/include/uapi/asm/gcc_intrin.h:11,
                    from arch/ia64/include/asm/gcc_intrin.h:10,
                    from arch/ia64/include/uapi/asm/intrinsics.h:20,
                    from arch/ia64/include/asm/intrinsics.h:11,
                    from arch/ia64/include/asm/page.h:11,
                    from arch/ia64/include/asm/pgtable.h:18,
                    from include/linux/pgtable.h:6,
                    from include/linux/pte_ref.h:10,
                    from mm/pte_ref.c:8:
   mm/pte_ref.c: In function 'pte_try_get':
>> mm/pte_ref.c:37:22: error: implicit declaration of function 'is_huge_pmd'; did you mean 'zap_huge_pmd'? [-Werror=implicit-function-declaration]
      37 |         if (unlikely(is_huge_pmd(*pmd)))
         |                      ^~~~~~~~~~~
   include/linux/compiler.h:78:45: note: in definition of macro 'unlikely'
      78 | # define unlikely(x)    __builtin_expect(!!(x), 0)
         |                                             ^
   cc1: some warnings being treated as errors


vim +37 mm/pte_ref.c

e03404013f81d7 Qi Zheng 2021-11-10  20  
e03404013f81d7 Qi Zheng 2021-11-10  21  /*
e03404013f81d7 Qi Zheng 2021-11-10  22   * pte_try_get - Try to increment refcount for the PTE page table.
e03404013f81d7 Qi Zheng 2021-11-10  23   * @pmd: a pointer to the pmd entry corresponding to the PTE page table.
e03404013f81d7 Qi Zheng 2021-11-10  24   *
e03404013f81d7 Qi Zheng 2021-11-10  25   * Return true if the increment succeeded. Otherwise return false.
e03404013f81d7 Qi Zheng 2021-11-10  26   *
e03404013f81d7 Qi Zheng 2021-11-10  27   * Before Operating the PTE page table, we need to hold a refcount
e03404013f81d7 Qi Zheng 2021-11-10  28   * to protect against the concurrent release of the PTE page table.
e03404013f81d7 Qi Zheng 2021-11-10  29   * But we will fail in the following case:
e03404013f81d7 Qi Zheng 2021-11-10  30   * 	- The content mapped in @pmd is not a PTE page
e03404013f81d7 Qi Zheng 2021-11-10  31   * 	- The refcount of the PTE page table is zero, it will be freed
e03404013f81d7 Qi Zheng 2021-11-10  32   */
e03404013f81d7 Qi Zheng 2021-11-10  33  enum pte_tryget_type pte_try_get(pmd_t *pmd)
e03404013f81d7 Qi Zheng 2021-11-10  34  {
e03404013f81d7 Qi Zheng 2021-11-10  35  	if (unlikely(pmd_none(*pmd)))
e03404013f81d7 Qi Zheng 2021-11-10  36  		return TRYGET_FAILED_NONE;
e03404013f81d7 Qi Zheng 2021-11-10 @37  	if (unlikely(is_huge_pmd(*pmd)))
e03404013f81d7 Qi Zheng 2021-11-10  38  		return TRYGET_FAILED_HUGE_PMD;
e03404013f81d7 Qi Zheng 2021-11-10  39  
e03404013f81d7 Qi Zheng 2021-11-10  40  	return TRYGET_SUCCESSED;
e03404013f81d7 Qi Zheng 2021-11-10  41  }
e03404013f81d7 Qi Zheng 2021-11-10  42  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 20057 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [mm/pte_ref]  afcc9fb874: kernel_BUG_at_include/linux/pte_ref.h
  2021-11-10 10:54 ` [PATCH v3 13/15] mm/pte_ref: free user PTE page table pages Qi Zheng
@ 2021-11-14 14:43   ` kernel test robot
  0 siblings, 0 replies; 43+ messages in thread
From: kernel test robot @ 2021-11-14 14:43 UTC (permalink / raw)
  To: Qi Zheng
  Cc: lkp, lkp, akpm, tglx, kirill.shutemov, mika.penttila, david, jgg,
	linux-doc, linux-kernel, linux-mm, songmuchun, zhouchengming,
	Qi Zheng

[-- Attachment #1: Type: text/plain, Size: 7397 bytes --]



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: afcc9fb8741f26773a381ac1e159e0172344b7d5 ("[PATCH v3 13/15] mm/pte_ref: free user PTE page table pages")
url: https://github.com/0day-ci/linux/commits/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
base: https://github.com/hnaz/linux-mm master
patch link: https://lore.kernel.org/linux-doc/20211110105428.32458-14-zhengqi.arch@bytedance.com

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+------------------------------------------+------------+------------+
|                                          | e249f0fa9a | afcc9fb874 |
+------------------------------------------+------------+------------+
| boot_successes                           | 16         | 0          |
| boot_failures                            | 0          | 14         |
| kernel_BUG_at_include/linux/pte_ref.h    | 0          | 14         |
| invalid_opcode:#[##]                     | 0          | 14         |
| RIP:destroy_args                         | 0          | 14         |
| Kernel_panic-not_syncing:Fatal_exception | 0          | 14         |
+------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>


[    7.245922][    T1] kernel BUG at include/linux/pte_ref.h:56!
[    7.269161][    T1] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[    7.271019][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.0-rc7-mm1-00448-gafcc9fb8741f #1
[    7.273761][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 7.276418][ T1] RIP: 0010:destroy_args (include/linux/pte_ref.h:56 include/linux/pte_ref.h:123 mm/debug_vm_pgtable.c:1051) 
[ 7.277992][ T1] Code: 6b 58 4c 8b 2b 49 8b 3c 24 e8 c6 38 b4 fe 48 c1 e0 06 48 03 05 aa eb 4c ff 8b 50 30 81 e2 00 02 00 f0 81 fa 00 00 00 f0 74 02 <0f> 0b f0 83 68 20 01 75 15 48 89 ea 4c 89 e6 4c 89 ef 48 81 e2 00
All code
========
   0:	6b 58 4c 8b          	imul   $0xffffff8b,0x4c(%rax),%ebx
   4:	2b 49 8b             	sub    -0x75(%rcx),%ecx
   7:	3c 24                	cmp    $0x24,%al
   9:	e8 c6 38 b4 fe       	callq  0xfffffffffeb438d4
   e:	48 c1 e0 06          	shl    $0x6,%rax
  12:	48 03 05 aa eb 4c ff 	add    -0xb31456(%rip),%rax        # 0xffffffffff4cebc3
  19:	8b 50 30             	mov    0x30(%rax),%edx
  1c:	81 e2 00 02 00 f0    	and    $0xf0000200,%edx
  22:	81 fa 00 00 00 f0    	cmp    $0xf0000000,%edx
  28:	74 02                	je     0x2c
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	f0 83 68 20 01       	lock subl $0x1,0x20(%rax)
  31:	75 15                	jne    0x48
  33:	48 89 ea             	mov    %rbp,%rdx
  36:	4c 89 e6             	mov    %r12,%rsi
  39:	4c 89 ef             	mov    %r13,%rdi
  3c:	48                   	rex.W
  3d:	81                   	.byte 0x81
  3e:	e2 00                	loop   0x40

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	f0 83 68 20 01       	lock subl $0x1,0x20(%rax)
   7:	75 15                	jne    0x1e
   9:	48 89 ea             	mov    %rbp,%rdx
   c:	4c 89 e6             	mov    %r12,%rsi
   f:	4c 89 ef             	mov    %r13,%rdi
  12:	48                   	rex.W
  13:	81                   	.byte 0x81
  14:	e2 00                	loop   0x16
[    7.283473][    T1] RSP: 0000:ffffc90000013da0 EFLAGS: 00010206
[    7.285295][    T1] RAX: ffffea0000000000 RBX: ffffc90000013dc8 RCX: 0000000000000000
[    7.287675][    T1] RDX: 00000000f0000200 RSI: ffffffff823848b5 RDI: 0000000000000000
[    7.290056][    T1] RBP: 000024b4af3bd000 R08: 0000000000000001 R09: 0000000000000040
[    7.292449][    T1] R10: ffff88842fc2fb60 R11: ffffc90000013d00 R12: ffff88812da63000
[    7.294926][    T1] R13: ffff88810ca08c00 R14: 0000000140000067 R15: 0000000000000027
[    7.297349][    T1] FS:  0000000000000000(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
[    7.300020][    T1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    7.301949][    T1] CR2: 0000000000000000 CR3: 0000000002612000 CR4: 00000000000006f0
[    7.304153][    T1] Call Trace:
[    7.306975][    T1]  <TASK>
[ 7.307966][ T1] debug_vm_pgtable (mm/debug_vm_pgtable.c:1334) 
[ 7.309435][ T1] ? init_args (mm/debug_vm_pgtable.c:1241) 
[ 7.310773][ T1] do_one_initcall (init/main.c:1303) 
[ 7.312212][ T1] kernel_init_freeable (init/main.c:1377 init/main.c:1394 init/main.c:1413 init/main.c:1618) 
[ 7.313728][ T1] ? rest_init (init/main.c:1499) 
[ 7.315002][ T1] kernel_init (init/main.c:1509) 
[ 7.316368][ T1] ret_from_fork (arch/x86/entry/entry_64.S:301) 
[    7.317692][    T1]  </TASK>
[    7.318697][    T1] Modules linked in:
[    7.320060][    T1] ---[ end trace 1f2bbe378e842286 ]---
[ 7.321766][ T1] RIP: 0010:destroy_args (include/linux/pte_ref.h:56 include/linux/pte_ref.h:123 mm/debug_vm_pgtable.c:1051) 
[ 7.323325][ T1] Code: 6b 58 4c 8b 2b 49 8b 3c 24 e8 c6 38 b4 fe 48 c1 e0 06 48 03 05 aa eb 4c ff 8b 50 30 81 e2 00 02 00 f0 81 fa 00 00 00 f0 74 02 <0f> 0b f0 83 68 20 01 75 15 48 89 ea 4c 89 e6 4c 89 ef 48 81 e2 00
All code
========
   0:	6b 58 4c 8b          	imul   $0xffffff8b,0x4c(%rax),%ebx
   4:	2b 49 8b             	sub    -0x75(%rcx),%ecx
   7:	3c 24                	cmp    $0x24,%al
   9:	e8 c6 38 b4 fe       	callq  0xfffffffffeb438d4
   e:	48 c1 e0 06          	shl    $0x6,%rax
  12:	48 03 05 aa eb 4c ff 	add    -0xb31456(%rip),%rax        # 0xffffffffff4cebc3
  19:	8b 50 30             	mov    0x30(%rax),%edx
  1c:	81 e2 00 02 00 f0    	and    $0xf0000200,%edx
  22:	81 fa 00 00 00 f0    	cmp    $0xf0000000,%edx
  28:	74 02                	je     0x2c
  2a:*	0f 0b                	ud2    		<-- trapping instruction
  2c:	f0 83 68 20 01       	lock subl $0x1,0x20(%rax)
  31:	75 15                	jne    0x48
  33:	48 89 ea             	mov    %rbp,%rdx
  36:	4c 89 e6             	mov    %r12,%rsi
  39:	4c 89 ef             	mov    %r13,%rdi
  3c:	48                   	rex.W
  3d:	81                   	.byte 0x81
  3e:	e2 00                	loop   0x40

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2    
   2:	f0 83 68 20 01       	lock subl $0x1,0x20(%rax)
   7:	75 15                	jne    0x1e
   9:	48 89 ea             	mov    %rbp,%rdx
   c:	4c 89 e6             	mov    %r12,%rsi
   f:	4c 89 ef             	mov    %r13,%rdi
  12:	48                   	rex.W
  13:	81                   	.byte 0x81
  14:	e2 00                	loop   0x16


To reproduce:

        # build kernel
	cd linux
	cp config-5.15.0-rc7-mm1-00448-gafcc9fb8741f .config
	make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.



---
0DAY/LKP+ Test Infrastructure                   Open Source Technology Center
https://lists.01.org/hyperkitty/list/lkp@lists.01.org       Intel Corporation

Thanks,
Oliver Sang


[-- Attachment #2: config-5.15.0-rc7-mm1-00448-gafcc9fb8741f --]
[-- Type: text/plain, Size: 121995 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 5.15.0-rc7-mm1 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc-9 (Debian 9.3.0-22) 9.3.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=90300
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=23502
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=23502
CONFIG_LLD_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
CONFIG_BPF_SYSCALL=y
# CONFIG_BPF_JIT is not set
# CONFIG_BPF_UNPRIV_DEFAULT_OFF is not set
# CONFIG_BPF_PRELOAD is not set
# end of BPF subsystem

CONFIG_PREEMPT_NONE_BEHAVIOUR=y
# CONFIG_PREEMPT_VOLUNTARY_BEHAVIOUR is not set
# CONFIG_PREEMPT_BEHAVIOUR is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
# CONFIG_SCHED_CORE is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=20
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
# CONFIG_PRINTK_INDEX is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_MISC is not set
CONFIG_CGROUP_DEBUG=y
CONFIG_SOCK_CGROUP_DATA=y
# CONFIG_NAMESPACES is not set
CONFIG_CHECKPOINT_RESTORE=y
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
# CONFIG_RD_XZ is not set
# CONFIG_RD_LZO is not set
# CONFIG_RD_LZ4 is not set
CONFIG_RD_ZSTD=y
# CONFIG_BOOT_CONFIG is not set
# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_LD_ORPHAN_WARN=y
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_HAVE_ARCH_USERFAULTFD_WP=y
CONFIG_HAVE_ARCH_USERFAULTFD_MINOR=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_USERFAULTFD=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_KCMP=y
CONFIG_RSEQ=y
# CONFIG_DEBUG_RSEQ is not set
CONFIG_EMBEDDED=y
CONFIG_HAVE_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
# end of Kernel Performance Events And Counters

CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
CONFIG_COMPAT_BRK=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set
# CONFIG_SHUFFLE_PAGE_ALLOCATOR is not set
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SYSTEM_DATA_VERIFICATION=y
# CONFIG_PROFILING is not set
CONFIG_TRACEPOINTS=y
# end of General setup

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_FILTER_PGPROT=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_NR_GPIO=1024
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_AUDIT_ARCH=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=5
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_FEATURE_NAMES=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
CONFIG_RETPOLINE=y
# CONFIG_X86_CPU_RESCTRL is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
CONFIG_IOSF_MBI=y
# CONFIG_IOSF_MBI_DEBUG is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_DEBUG is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_X86_HV_CALLBACK_VECTOR=y
# CONFIG_XEN is not set
CONFIG_KVM_GUEST=y
CONFIG_ARCH_CPUIDLE_HALTPOLL=y
# CONFIG_PVH is not set
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_JAILHOUSE_GUEST is not set
# CONFIG_ACRN_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
CONFIG_X86_VMX_FEATURE_NAMES=y
CONFIG_PROCESSOR_SELECT=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_HYGON=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_CPU_SUP_ZHAOXIN=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS_RANGE_BEGIN=2
CONFIG_NR_CPUS_RANGE_END=512
CONFIG_NR_CPUS_DEFAULT=64
CONFIG_NR_CPUS=512
CONFIG_SCHED_CLUSTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCELOG_LEGACY=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
# CONFIG_PERF_EVENTS_INTEL_RAPL is not set
# CONFIG_PERF_EVENTS_INTEL_CSTATE is not set
# CONFIG_PERF_EVENTS_AMD_POWER is not set
CONFIG_PERF_EVENTS_AMD_UNCORE=y
# end of Performance monitoring

CONFIG_X86_VSYSCALL_EMULATION=y
CONFIG_X86_IOPL_IOPERM=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_5LEVEL=y
CONFIG_X86_DIRECT_GBPAGES=y
# CONFIG_X86_CPA_STATISTICS is not set
# CONFIG_AMD_MEM_ENCRYPT is not set
CONFIG_NUMA=y
# CONFIG_AMD_NUMA is not set
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
# CONFIG_ARCH_MEMORY_PROBE is not set
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_X86_PMEM_LEGACY_DEVICE=y
CONFIG_X86_PMEM_LEGACY=m
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_X86_UMIP=y
# CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not set
CONFIG_X86_INTEL_TSX_MODE_OFF=y
# CONFIG_X86_INTEL_TSX_MODE_ON is not set
# CONFIG_X86_INTEL_TSX_MODE_AUTO is not set
# CONFIG_X86_SGX is not set
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_MIXED=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
# CONFIG_KEXEC_SIG is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_DYNAMIC_MEMORY_LAYOUT=y
CONFIG_HOTPLUG_CPU=y
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
CONFIG_COMPAT_VDSO=y
CONFIG_LEGACY_VSYSCALL_EMULATE=y
# CONFIG_LEGACY_VSYSCALL_XONLY is not set
# CONFIG_LEGACY_VSYSCALL_NONE is not set
# CONFIG_CMDLINE_BOOL is not set
# CONFIG_MODIFY_LDT_SYSCALL is not set
CONFIG_HAVE_LIVEPATCH=y
# end of Processor type and features

CONFIG_ARCH_HAS_ADD_PAGES=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_SUSPEND_SKIP_SYNC=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_HIBERNATION_SNAPSHOT_DEV=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
# CONFIG_DPM_WATCHDOG is not set
# CONFIG_PM_TRACE_RTC is not set
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
# CONFIG_ENERGY_MODEL is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SPCR_TABLE=y
# CONFIG_ACPI_FPDT is not set
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_REV_OVERRIDE_POSSIBLE is not set
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_TAD is not set
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
# CONFIG_ACPI_IPMI is not set
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_CUSTOM_DSDT_FILE=""
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_TABLE_UPGRADE is not set
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_HOTPLUG_MEMORY is not set
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
CONFIG_ACPI_NFIT=m
# CONFIG_NFIT_SECURITY_DEBUG is not set
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_HMAT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
CONFIG_ACPI_APEI_EINJ=m
CONFIG_ACPI_APEI_ERST_DEBUG=y
# CONFIG_ACPI_DPTF is not set
# CONFIG_ACPI_CONFIGFS is not set
# CONFIG_PMIC_OPREGION is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_PRMT=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y

#
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
# CONFIG_X86_PCC_CPUFREQ is not set
# CONFIG_X86_ACPI_CPUFREQ is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=y
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# end of CPU Frequency scaling

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_CPU_IDLE_GOV_TEO is not set
# CONFIG_CPU_IDLE_GOV_HALTPOLL is not set
CONFIG_HALTPOLL_CPUIDLE=y
# end of CPU Idle

# CONFIG_INTEL_IDLE is not set
# end of Power management and ACPI options

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_MMCONF_FAM10H=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
# CONFIG_ISA_BUS is not set
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# end of Bus options (PCI etc.)

#
# Binary Emulations
#
CONFIG_IA32_EMULATION=y
# CONFIG_X86_X32 is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
# end of Binary Emulations

CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_KVM_COMPAT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_HAVE_KVM_NO_POLL=y
CONFIG_KVM_XFER_TO_GUEST_WORK=y
CONFIG_HAVE_KVM_PM_NOTIFIER=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
# CONFIG_KVM_WERROR is not set
CONFIG_KVM_INTEL=m
# CONFIG_KVM_AMD is not set
# CONFIG_KVM_XEN is not set
# CONFIG_KVM_MMU_AUDIT is not set
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y
CONFIG_AS_TPAUSE=y

#
# General architecture-dependent options
#
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_HOTPLUG_SMT=y
CONFIG_GENERIC_ENTRY=y
CONFIG_KPROBES=y
# CONFIG_JUMP_LABEL is not set
# CONFIG_STATIC_CALL_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_TABLE_FREE=y
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
CONFIG_MMU_LAZY_TLB_REFCOUNT=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
# CONFIG_SECCOMP_CACHE_DEBUG is not set
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_LTO_NONE=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PUD=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_HAVE_RELIABLE_STACKTRACE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
# CONFIG_LOCK_EVENT_COUNTS is not set
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_HAVE_STATIC_CALL=y
CONFIG_HAVE_STATIC_CALL_INLINE=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_HAS_ELFCORE_COMPAT=y
CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
# end of General architecture-dependent options

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
CONFIG_MODULE_COMPRESS_NONE=y
# CONFIG_MODULE_COMPRESS_GZIP is not set
# CONFIG_MODULE_COMPRESS_XZ is not set
# CONFIG_MODULE_COMPRESS_ZSTD is not set
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
CONFIG_MODPROBE_PATH="/sbin/modprobe"
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG_COMMON=y
CONFIG_BLK_DEV_BSGLIB=y
# CONFIG_BLK_DEV_INTEGRITY is not set
# CONFIG_BLK_DEV_ZONED is not set
# CONFIG_BLK_DEV_THROTTLING is not set
# CONFIG_BLK_WBT is not set
# CONFIG_BLK_CGROUP_IOLATENCY is not set
# CONFIG_BLK_CGROUP_IOCOST is not set
# CONFIG_BLK_CGROUP_IOPRIO is not set
CONFIG_BLK_DEBUG_FS=y
# CONFIG_BLK_SED_OPAL is not set
# CONFIG_BLK_INLINE_ENCRYPTION is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_AIX_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
# CONFIG_BSD_DISKLABEL is not set
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
# end of Partition Types

CONFIG_BLOCK_COMPAT=y
CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_PM=y
CONFIG_BLOCK_HOLDER_DEPRECATED=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
# end of IO Schedulers

CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_ASN1=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_BINFMT_SCRIPT=y
# CONFIG_BINFMT_MISC is not set
CONFIG_COREDUMP=y
# end of Executable file formats

#
# Memory Management options
#
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_HAVE_BOOTMEM_INFO_NODE=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG=y
# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_MHP_MEMMAP_ON_MEMORY=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_COMPACTION=y
# CONFIG_PAGE_REPORTING is not set
CONFIG_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_HWPOISON_INJECT=m
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_THP_SWAP=y
# CONFIG_CLEANCACHE is not set
# CONFIG_FRONTSWAP is not set
CONFIG_CMA=y
# CONFIG_CMA_DEBUG is not set
# CONFIG_CMA_DEBUGFS is not set
# CONFIG_CMA_SYSFS is not set
CONFIG_CMA_AREAS=7
# CONFIG_MEM_SOFT_DIRTY is not set
# CONFIG_ZPOOL is not set
# CONFIG_ZSMALLOC is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
# CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ARCH_HAS_ZONE_DMA_SET=y
CONFIG_ZONE_DMA=y
CONFIG_ZONE_DMA32=y
CONFIG_ZONE_DEVICE=y
CONFIG_DEV_PAGEMAP_OPS=y
# CONFIG_DEVICE_PRIVATE is not set
# CONFIG_PERCPU_STATS is not set
# CONFIG_GUP_TEST is not set
# CONFIG_READ_ONLY_THP_FOR_FS is not set
CONFIG_ARCH_HAS_PTE_SPECIAL=y
CONFIG_FREE_USER_PTE=y

#
# Data Access Monitoring
#
# CONFIG_DAMON is not set
# end of Data Access Monitoring
# end of Memory Management options

CONFIG_NET=y
CONFIG_COMPAT_NETLINK_MESSAGES=y
CONFIG_SKB_EXTENSIONS=y

#
# Networking options
#
# CONFIG_PACKET is not set
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
# CONFIG_NET_KEY is not set
# CONFIG_XDP_SOCKETS is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_NET_IP_TUNNEL=m
# CONFIG_IP_MROUTE is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_NET_IPVTI is not set
CONFIG_NET_UDP_TUNNEL=m
CONFIG_NET_FOU=m
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_INET_UDP_DIAG is not set
# CONFIG_INET_RAW_DIAG is not set
# CONFIG_INET_DIAG_DESTROY is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IPV6 is not set
# CONFIG_MPTCP is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
# CONFIG_NETFILTER is not set
# CONFIG_BPFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
# CONFIG_NET_SCHED is not set
# CONFIG_DCB is not set
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
CONFIG_NET_L3_MASTER_DEV=y
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
CONFIG_PCPU_DEV_REFCNT=y
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_SOCK_RX_QUEUE_MAPPING=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_DROP_MONITOR is not set
# end of Network testing
# end of Networking options

# CONFIG_HAMRADIO is not set
CONFIG_CAN=m
CONFIG_CAN_RAW=m
CONFIG_CAN_BCM=m
CONFIG_CAN_GW=m
# CONFIG_CAN_J1939 is not set
# CONFIG_CAN_ISOTP is not set

#
# CAN Device Drivers
#
CONFIG_CAN_VCAN=m
# CONFIG_CAN_VXCAN is not set
# CONFIG_CAN_SLCAN is not set
CONFIG_CAN_DEV=m
CONFIG_CAN_CALC_BITTIMING=y
# CONFIG_CAN_KVASER_PCIEFD is not set
# CONFIG_CAN_C_CAN is not set
# CONFIG_CAN_CC770 is not set
# CONFIG_CAN_IFI_CANFD is not set
# CONFIG_CAN_M_CAN is not set
# CONFIG_CAN_PEAK_PCIEFD is not set
# CONFIG_CAN_SJA1000 is not set
# CONFIG_CAN_SOFTING is not set

#
# CAN USB interfaces
#
# CONFIG_CAN_8DEV_USB is not set
# CONFIG_CAN_EMS_USB is not set
# CONFIG_CAN_ESD_USB2 is not set
# CONFIG_CAN_ETAS_ES58X is not set
# CONFIG_CAN_GS_USB is not set
# CONFIG_CAN_KVASER_USB is not set
# CONFIG_CAN_MCBA_USB is not set
# CONFIG_CAN_PEAK_USB is not set
# CONFIG_CAN_UCAN is not set
# end of CAN USB interfaces

# CONFIG_CAN_DEBUG_DEVICES is not set
# end of CAN Device Drivers

# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
CONFIG_STREAM_PARSER=y
# CONFIG_MCTP is not set
CONFIG_WIRELESS=y
CONFIG_WEXT_CORE=y
CONFIG_WEXT_PROC=y
CONFIG_CFG80211=m
# CONFIG_NL80211_TESTMODE is not set
# CONFIG_CFG80211_DEVELOPER_WARNINGS is not set
# CONFIG_CFG80211_CERTIFICATION_ONUS is not set
CONFIG_CFG80211_REQUIRE_SIGNED_REGDB=y
CONFIG_CFG80211_USE_KERNEL_REGDB_KEYS=y
CONFIG_CFG80211_DEFAULT_PS=y
# CONFIG_CFG80211_DEBUGFS is not set
CONFIG_CFG80211_CRDA_SUPPORT=y
CONFIG_CFG80211_WEXT=y
CONFIG_MAC80211=m
CONFIG_MAC80211_HAS_RC=y
CONFIG_MAC80211_RC_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel_ht"
CONFIG_MAC80211_MESH=y
CONFIG_MAC80211_LEDS=y
CONFIG_MAC80211_DEBUGFS=y
# CONFIG_MAC80211_MESSAGE_TRACING is not set
# CONFIG_MAC80211_DEBUG_MENU is not set
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
# CONFIG_RFKILL is not set
CONFIG_NET_9P=y
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
CONFIG_LWTUNNEL=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_DST_CACHE=y
CONFIG_GRO_CELLS=y
CONFIG_NET_SELFTESTS=y
CONFIG_NET_SOCK_MSG=y
CONFIG_NET_DEVLINK=y
CONFIG_PAGE_POOL=y
# CONFIG_FAILOVER is not set
CONFIG_ETHTOOL_NETLINK=y

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_PCIEAER_INJECT is not set
# CONFIG_PCIE_ECRC is not set
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_DPC is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_MSI=y
CONFIG_PCI_MSI_IRQ_DOMAIN=y
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
# CONFIG_PCI_STUB is not set
# CONFIG_PCI_PF_STUB is not set
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
CONFIG_PCI_IOV=y
# CONFIG_PCI_PRI is not set
# CONFIG_PCI_PASID is not set
# CONFIG_PCI_P2PDMA is not set
CONFIG_PCI_LABEL=y
# CONFIG_PCIE_BUS_TUNE_OFF is not set
CONFIG_PCIE_BUS_DEFAULT=y
# CONFIG_PCIE_BUS_SAFE is not set
# CONFIG_PCIE_BUS_PERFORMANCE is not set
# CONFIG_PCIE_BUS_PEER2PEER is not set
# CONFIG_HOTPLUG_PCI is not set

#
# PCI controller drivers
#
# CONFIG_VMD is not set

#
# DesignWare PCI Core Support
#
# CONFIG_PCIE_DW_PLAT_HOST is not set
# CONFIG_PCI_MESON is not set
# end of DesignWare PCI Core Support

#
# Mobiveil PCIe Core Support
#
# end of Mobiveil PCIe Core Support

#
# Cadence PCIe controllers support
#
# end of Cadence PCIe controllers support
# end of PCI controller drivers

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set
# end of PCI Endpoint

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# end of PCI switch controller drivers

# CONFIG_CXL_BUS is not set
# CONFIG_PCCARD is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
CONFIG_AUXILIARY_BUS=y
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
# CONFIG_DEVTMPFS_MOUNT is not set
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_PAGED_BUF=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y
CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y
# CONFIG_FW_LOADER_COMPRESS is not set
CONFIG_FW_CACHE=y
# end of Firmware loader

CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_REGMAP=y
CONFIG_REGMAP_MMIO=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set
# end of Generic Driver Options

#
# Bus devices
#
# CONFIG_MHI_BUS is not set
# end of Bus devices

CONFIG_CONNECTOR=m

#
# Firmware Drivers
#

#
# ARM System Control and Management Interface Protocol
#
# end of ARM System Control and Management Interface Protocol

# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_FW_CFG_SYSFS is not set
CONFIG_SYSFB=y
# CONFIG_SYSFB_SIMPLEFB is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
# CONFIG_EFI_VARS is not set
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=y
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
CONFIG_EFI_GENERIC_STUB_INITRD_CMDLINE_LOADER=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
# CONFIG_RESET_ATTACK_MITIGATION is not set
# CONFIG_EFI_RCI2_TABLE is not set
# CONFIG_EFI_DISABLE_PCI_DMA is not set
# end of EFI (Extensible Firmware Interface) Support

CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y
CONFIG_EFI_EARLYCON=y
# CONFIG_EFI_CUSTOM_SSDT_OVERLAYS is not set

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

# CONFIG_GNSS is not set
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
CONFIG_CDROM=m
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_DRBD is not set
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_SX8 is not set
CONFIG_BLK_DEV_RAM=m
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=65536
# CONFIG_CDROM_PKTCDVD is not set
CONFIG_ATA_OVER_ETH=y
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_RSXX is not set

#
# NVME Support
#
CONFIG_NVME_CORE=m
CONFIG_BLK_DEV_NVME=m
CONFIG_NVME_MULTIPATH=y
# CONFIG_NVME_HWMON is not set
CONFIG_NVME_FABRICS=m
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TCP is not set
CONFIG_NVME_TARGET=m
# CONFIG_NVME_TARGET_PASSTHRU is not set
CONFIG_NVME_TARGET_LOOP=m
# CONFIG_NVME_TARGET_FC is not set
# CONFIG_NVME_TARGET_TCP is not set
# end of NVME Support

#
# Misc devices
#
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_SRAM is not set
# CONFIG_DW_XDATA_PCIE is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_XILINX_SDFEC is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
CONFIG_EEPROM_93CX6=y
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_EEPROM_EE1004 is not set
# end of EEPROM support

# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# end of Texas Instruments shared transport line discipline

# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_BCM_VK is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_MISC_RTSX_USB is not set
# CONFIG_HABANA_AI is not set
# CONFIG_UACCE is not set
# CONFIG_PVPANIC is not set
# end of Misc devices

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI_COMMON=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_BLK_DEV_SR=m
CONFIG_CHR_DEV_SG=m
CONFIG_BLK_DEV_BSG=y
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
CONFIG_SCSI_SRP_ATTRS=m
# end of SCSI Transports

CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_ISCSI_BOOT_SYSFS=m
CONFIG_SCSI_CXGB3_ISCSI=m
CONFIG_SCSI_CXGB4_ISCSI=m
CONFIG_SCSI_BNX2_ISCSI=m
CONFIG_SCSI_BNX2X_FCOE=m
CONFIG_BE2ISCSI=m
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
CONFIG_SCSI_HPSA=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_3W_SAS=m
# CONFIG_SCSI_ACARD is not set
CONFIG_SCSI_AACRAID=m
# CONFIG_SCSI_AIC7XXX is not set
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_BUILD_FIRMWARE is not set
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
# CONFIG_SCSI_AIC94XX is not set
CONFIG_SCSI_MVSAS=m
# CONFIG_SCSI_MVSAS_DEBUG is not set
CONFIG_SCSI_MVSAS_TASKLET=y
CONFIG_SCSI_MVUMI=m
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
CONFIG_SCSI_ARCMSR=m
# CONFIG_SCSI_ESAS2R is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_MPT3SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT3SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS=m
# CONFIG_SCSI_MPI3MR is not set
# CONFIG_SCSI_SMARTPQI is not set
CONFIG_SCSI_UFSHCD=m
CONFIG_SCSI_UFSHCD_PCI=m
# CONFIG_SCSI_UFS_DWC_TC_PCI is not set
# CONFIG_SCSI_UFSHCD_PLATFORM is not set
# CONFIG_SCSI_UFS_BSG is not set
# CONFIG_SCSI_UFS_HPB is not set
# CONFIG_SCSI_UFS_FAULT_INJECTION is not set
# CONFIG_SCSI_UFS_HWMON is not set
CONFIG_SCSI_HPTIOP=m
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_MYRB is not set
# CONFIG_SCSI_MYRS is not set
CONFIG_VMWARE_PVSCSI=m
CONFIG_LIBFC=m
CONFIG_LIBFCOE=m
CONFIG_FCOE=m
CONFIG_FCOE_FNIC=m
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_FDOMAIN_PCI is not set
CONFIG_SCSI_ISCI=m
# CONFIG_SCSI_IPS is not set
CONFIG_SCSI_INITIO=m
# CONFIG_SCSI_INIA100 is not set
CONFIG_SCSI_STEX=m
# CONFIG_SCSI_SYM53C8XX_2 is not set
CONFIG_SCSI_IPR=m
CONFIG_SCSI_IPR_TRACE=y
CONFIG_SCSI_IPR_DUMP=y
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA_FC=m
CONFIG_SCSI_QLA_ISCSI=m
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
CONFIG_SCSI_DEBUG=m
CONFIG_SCSI_PMCRAID=m
CONFIG_SCSI_PM8001=m
# CONFIG_SCSI_BFA_FC is not set
CONFIG_SCSI_CHELSIO_FCOE=m
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=y
CONFIG_SCSI_DH_HP_SW=y
CONFIG_SCSI_DH_EMC=y
CONFIG_SCSI_DH_ALUA=y
# end of SCSI device support

CONFIG_ATA=y
CONFIG_SATA_HOST=y
CONFIG_PATA_TIMINGS=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_FORCE=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_MOBILE_LPM_POLICY=0
CONFIG_SATA_AHCI_PLATFORM=y
# CONFIG_SATA_INIC162X is not set
CONFIG_SATA_ACARD_AHCI=m
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_SX4=m
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
CONFIG_SATA_MV=m
CONFIG_SATA_NV=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIS=m
CONFIG_SATA_SVW=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
CONFIG_PATA_SIS=m
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PLATFORM is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
# CONFIG_PATA_ACPI is not set
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=m
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_DEBUG is not set
CONFIG_DM_BUFIO=m
# CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING is not set
CONFIG_DM_BIO_PRISON=m
CONFIG_DM_PERSISTENT_DATA=m
# CONFIG_DM_UNSTRIPED is not set
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_THIN_PROVISIONING=m
CONFIG_DM_CACHE=m
CONFIG_DM_CACHE_SMQ=m
# CONFIG_DM_WRITECACHE is not set
# CONFIG_DM_EBS is not set
# CONFIG_DM_ERA is not set
# CONFIG_DM_CLONE is not set
CONFIG_DM_MIRROR=m
CONFIG_DM_LOG_USERSPACE=m
CONFIG_DM_RAID=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_QL=m
CONFIG_DM_MULTIPATH_ST=m
# CONFIG_DM_MULTIPATH_HST is not set
# CONFIG_DM_MULTIPATH_IOA is not set
CONFIG_DM_DELAY=m
# CONFIG_DM_DUST is not set
# CONFIG_DM_UEVENT is not set
CONFIG_DM_FLAKEY=m
CONFIG_DM_VERITY=m
# CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG is not set
# CONFIG_DM_VERITY_FEC is not set
CONFIG_DM_SWITCH=m
CONFIG_DM_LOG_WRITES=m
# CONFIG_DM_INTEGRITY is not set
# CONFIG_DM_AUDIT is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# end of IEEE 1394 (FireWire) support

# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_MII=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
CONFIG_DUMMY=y
# CONFIG_WIREGUARD is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_IPVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_GENEVE is not set
# CONFIG_BAREUDP is not set
# CONFIG_GTP is not set
CONFIG_MACSEC=y
CONFIG_NETCONSOLE=y
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_TUN=m
# CONFIG_TUN_VNET_CROSS_LE is not set
CONFIG_VETH=y
# CONFIG_NLMON is not set
# CONFIG_ARCNET is not set
CONFIG_ETHERNET=y
CONFIG_MDIO=y
# CONFIG_NET_VENDOR_3COM is not set
CONFIG_NET_VENDOR_ADAPTEC=y
CONFIG_ADAPTEC_STARFIRE=y
CONFIG_NET_VENDOR_AGERE=y
# CONFIG_ET131X is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_AMD_XGBE is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ASIX=y
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
CONFIG_NET_VENDOR_BROADCOM=y
CONFIG_B44=y
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
# CONFIG_BCMGENET is not set
CONFIG_BNX2=y
CONFIG_CNIC=y
# CONFIG_TIGON3 is not set
CONFIG_BNX2X=y
CONFIG_BNX2X_SRIOV=y
# CONFIG_SYSTEMPORT is not set
# CONFIG_BNXT is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
CONFIG_NET_VENDOR_CADENCE=y
# CONFIG_MACB is not set
# CONFIG_NET_VENDOR_CAVIUM is not set
CONFIG_NET_VENDOR_CHELSIO=y
CONFIG_CHELSIO_T1=y
# CONFIG_CHELSIO_T1_1G is not set
CONFIG_CHELSIO_T3=y
CONFIG_CHELSIO_T4=y
# CONFIG_CHELSIO_T4VF is not set
CONFIG_CHELSIO_LIB=m
CONFIG_CHELSIO_INLINE_CRYPTO=y
CONFIG_NET_VENDOR_CISCO=y
CONFIG_ENIC=y
CONFIG_NET_VENDOR_CORTINA=y
# CONFIG_CX_ECAT is not set
CONFIG_DNET=y
CONFIG_NET_VENDOR_DEC=y
# CONFIG_NET_TULIP is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
CONFIG_SUNDANCE=y
CONFIG_SUNDANCE_MMIO=y
CONFIG_NET_VENDOR_EMULEX=y
CONFIG_BE2NET=y
# CONFIG_BE2NET_HWMON is not set
CONFIG_BE2NET_BE2=y
CONFIG_BE2NET_BE3=y
CONFIG_BE2NET_LANCER=y
CONFIG_BE2NET_SKYHAWK=y
# CONFIG_NET_VENDOR_EZCHIP is not set
CONFIG_NET_VENDOR_GOOGLE=y
# CONFIG_GVE is not set
CONFIG_NET_VENDOR_HUAWEI=y
# CONFIG_HINIC is not set
CONFIG_NET_VENDOR_I825XX=y
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E100=y
CONFIG_E1000=y
CONFIG_E1000E=y
# CONFIG_E1000E_HWTS is not set
CONFIG_IGB=y
CONFIG_IGB_HWMON=y
CONFIG_IGBVF=y
# CONFIG_IXGB is not set
CONFIG_IXGBE=y
CONFIG_IXGBE_HWMON=y
CONFIG_IXGBEVF=m
CONFIG_I40E=y
# CONFIG_I40EVF is not set
# CONFIG_ICE is not set
# CONFIG_FM10K is not set
CONFIG_IGC=y
CONFIG_NET_VENDOR_MICROSOFT=y
CONFIG_JME=y
CONFIG_NET_VENDOR_LITEX=y
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX5_CORE is not set
# CONFIG_MLXSW_CORE is not set
# CONFIG_MLXFW is not set
CONFIG_NET_VENDOR_MICREL=y
CONFIG_KS8851_MLL=y
CONFIG_KSZ884X_PCI=y
CONFIG_NET_VENDOR_MICROCHIP=y
# CONFIG_LAN743X is not set
CONFIG_NET_VENDOR_MICROSEMI=y
CONFIG_NET_VENDOR_MYRI=y
CONFIG_MYRI10GE=y
CONFIG_FEALNX=y
CONFIG_NET_VENDOR_NATSEMI=y
CONFIG_NATSEMI=y
CONFIG_NS83820=y
CONFIG_NET_VENDOR_NETERION=y
CONFIG_S2IO=y
CONFIG_VXGE=y
# CONFIG_VXGE_DEBUG_TRACE_ALL is not set
# CONFIG_NET_VENDOR_NETRONOME is not set
CONFIG_NET_VENDOR_NI=y
# CONFIG_NI_XGE_MANAGEMENT_ENET is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_NE2K_PCI is not set
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_FORCEDETH is not set
CONFIG_NET_VENDOR_OKI=y
CONFIG_ETHOC=y
CONFIG_NET_VENDOR_PACKET_ENGINES=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_PENSANDO=y
# CONFIG_IONIC is not set
CONFIG_NET_VENDOR_QLOGIC=y
CONFIG_QLA3XXX=y
CONFIG_QLCNIC=y
CONFIG_QLCNIC_SRIOV=y
CONFIG_QLCNIC_HWMON=y
CONFIG_NETXEN_NIC=y
# CONFIG_QED is not set
CONFIG_NET_VENDOR_QUALCOMM=y
# CONFIG_QCOM_EMAC is not set
# CONFIG_RMNET is not set
CONFIG_NET_VENDOR_RDC=y
CONFIG_R6040=y
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
CONFIG_R8169=y
# CONFIG_NET_VENDOR_RENESAS is not set
CONFIG_NET_VENDOR_ROCKER=y
CONFIG_NET_VENDOR_SAMSUNG=y
# CONFIG_SXGBE_ETH is not set
CONFIG_NET_VENDOR_SEEQ=y
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
CONFIG_NET_VENDOR_SILAN=y
CONFIG_SC92031=y
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
CONFIG_SIS190=y
CONFIG_NET_VENDOR_SMSC=y
CONFIG_EPIC100=y
# CONFIG_SMSC911X is not set
CONFIG_SMSC9420=y
CONFIG_NET_VENDOR_SOCIONEXT=y
CONFIG_NET_VENDOR_STMICRO=y
CONFIG_STMMAC_ETH=y
# CONFIG_STMMAC_SELFTESTS is not set
CONFIG_STMMAC_PLATFORM=y
# CONFIG_DWMAC_GENERIC is not set
CONFIG_DWMAC_INTEL=y
# CONFIG_DWMAC_LOONGSON is not set
# CONFIG_STMMAC_PCI is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
CONFIG_NIU=y
# CONFIG_NET_VENDOR_SYNOPSYS is not set
CONFIG_NET_VENDOR_TEHUTI=y
CONFIG_TEHUTI=y
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_PHY_SEL is not set
CONFIG_TLAN=y
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XILINX=y
# CONFIG_XILINX_EMACLITE is not set
# CONFIG_XILINX_AXI_EMAC is not set
# CONFIG_XILINX_LL_TEMAC is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLINK=y
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
# CONFIG_LED_TRIGGER_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_SFP is not set

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_ADIN_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
CONFIG_AX88796B_PHY=y
CONFIG_BROADCOM_PHY=y
# CONFIG_BCM54140_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM84881_PHY is not set
# CONFIG_BCM87XX_PHY is not set
CONFIG_BCM_NET_PHYLIB=y
CONFIG_CICADA_PHY=y
# CONFIG_CORTINA_PHY is not set
CONFIG_DAVICOM_PHY=y
CONFIG_ICPLUS_PHY=y
CONFIG_LXT_PHY=y
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
CONFIG_MARVELL_PHY=y
# CONFIG_MARVELL_10G_PHY is not set
# CONFIG_MARVELL_88X2222_PHY is not set
# CONFIG_MAXLINEAR_GPHY is not set
# CONFIG_MEDIATEK_GE_PHY is not set
CONFIG_MICREL_PHY=y
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROCHIP_T1_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_MOTORCOMM_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_NXP_C45_TJA11XX_PHY is not set
# CONFIG_NXP_TJA11XX_PHY is not set
CONFIG_QSEMI_PHY=y
CONFIG_REALTEK_PHY=y
# CONFIG_RENESAS_PHY is not set
# CONFIG_ROCKCHIP_PHY is not set
CONFIG_SMSC_PHY=y
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_DP83822_PHY is not set
# CONFIG_DP83TC811_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
# CONFIG_DP83869_PHY is not set
CONFIG_VITESSE_PHY=y
# CONFIG_XILINX_GMII2RGMII is not set
CONFIG_MDIO_DEVICE=y
CONFIG_MDIO_BUS=y
CONFIG_FWNODE_MDIO=y
CONFIG_ACPI_MDIO=y
CONFIG_MDIO_DEVRES=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_BCM_UNIMAC is not set
# CONFIG_MDIO_MVUSB is not set
# CONFIG_MDIO_MSCC_MIIM is not set
# CONFIG_MDIO_THUNDER is not set

#
# MDIO Multiplexers
#

#
# PCS device drivers
#
CONFIG_PCS_XPCS=y
# end of PCS device drivers

CONFIG_PPP=y
CONFIG_PPP_BSDCOMP=y
CONFIG_PPP_DEFLATE=y
# CONFIG_PPP_FILTER is not set
CONFIG_PPP_MPPE=y
CONFIG_PPP_MULTILINK=y
CONFIG_PPPOE=y
CONFIG_PPP_ASYNC=y
CONFIG_PPP_SYNC_TTY=y
CONFIG_SLIP=y
CONFIG_SLHC=y
# CONFIG_SLIP_COMPRESSED is not set
CONFIG_SLIP_SMART=y
CONFIG_SLIP_MODE_SLIP6=y
CONFIG_USB_NET_DRIVERS=y
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
CONFIG_USB_RTL8152=y
# CONFIG_USB_LAN78XX is not set
CONFIG_USB_USBNET=y
CONFIG_USB_NET_AX8817X=y
CONFIG_USB_NET_AX88179_178A=y
CONFIG_USB_NET_CDCETHER=m
# CONFIG_USB_NET_CDC_EEM is not set
# CONFIG_USB_NET_CDC_NCM is not set
# CONFIG_USB_NET_HUAWEI_CDC_NCM is not set
# CONFIG_USB_NET_CDC_MBIM is not set
# CONFIG_USB_NET_DM9601 is not set
# CONFIG_USB_NET_SR9700 is not set
# CONFIG_USB_NET_SR9800 is not set
# CONFIG_USB_NET_SMSC75XX is not set
# CONFIG_USB_NET_SMSC95XX is not set
# CONFIG_USB_NET_GL620A is not set
# CONFIG_USB_NET_NET1080 is not set
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
CONFIG_USB_NET_RNDIS_HOST=m
# CONFIG_USB_NET_CDC_SUBSET is not set
# CONFIG_USB_NET_ZAURUS is not set
CONFIG_USB_NET_CX82310_ETH=y
CONFIG_USB_NET_KALMIA=y
# CONFIG_USB_NET_QMI_WWAN is not set
# CONFIG_USB_NET_INT51X1 is not set
# CONFIG_USB_IPHETH is not set
# CONFIG_USB_SIERRA_NET is not set
CONFIG_USB_VL600=m
# CONFIG_USB_NET_CH9200 is not set
# CONFIG_USB_NET_AQC111 is not set
# CONFIG_USB_RTL8153_ECM is not set
CONFIG_WLAN=y
# CONFIG_WLAN_VENDOR_ADMTEK is not set
# CONFIG_WLAN_VENDOR_ATH is not set
# CONFIG_WLAN_VENDOR_ATMEL is not set
# CONFIG_WLAN_VENDOR_BROADCOM is not set
# CONFIG_WLAN_VENDOR_CISCO is not set
# CONFIG_WLAN_VENDOR_INTEL is not set
# CONFIG_WLAN_VENDOR_INTERSIL is not set
# CONFIG_WLAN_VENDOR_MARVELL is not set
# CONFIG_WLAN_VENDOR_MEDIATEK is not set
CONFIG_WLAN_VENDOR_MICROCHIP=y
# CONFIG_WLAN_VENDOR_RALINK is not set
# CONFIG_WLAN_VENDOR_REALTEK is not set
# CONFIG_WLAN_VENDOR_RSI is not set
# CONFIG_WLAN_VENDOR_ST is not set
# CONFIG_WLAN_VENDOR_TI is not set
# CONFIG_WLAN_VENDOR_ZYDAS is not set
CONFIG_WLAN_VENDOR_QUANTENNA=y
# CONFIG_QTNFMAC_PCIE is not set
CONFIG_MAC80211_HWSIM=m
CONFIG_USB_NET_RNDIS_WLAN=m
# CONFIG_VIRT_WIFI is not set
# CONFIG_WAN is not set

#
# Wireless WAN
#
# CONFIG_WWAN is not set
# end of Wireless WAN

# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
CONFIG_NETDEVSIM=m
# CONFIG_NET_FAILOVER is not set
# CONFIG_ISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_LEDS is not set
# CONFIG_INPUT_FF_MEMLESS is not set
CONFIG_INPUT_SPARSEKMAP=y
CONFIG_INPUT_MATRIXKMAP=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ADP5588=y
CONFIG_KEYBOARD_ADP5589=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1050 is not set
CONFIG_KEYBOARD_QT1070=y
CONFIG_KEYBOARD_QT2160=y
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
CONFIG_KEYBOARD_LKKBD=y
CONFIG_KEYBOARD_TCA6416=y
# CONFIG_KEYBOARD_TCA8418 is not set
CONFIG_KEYBOARD_LM8323=y
# CONFIG_KEYBOARD_LM8333 is not set
CONFIG_KEYBOARD_MAX7359=y
CONFIG_KEYBOARD_MCS=y
CONFIG_KEYBOARD_MPR121=y
CONFIG_KEYBOARD_NEWTON=y
CONFIG_KEYBOARD_OPENCORES=y
# CONFIG_KEYBOARD_SAMSUNG is not set
CONFIG_KEYBOARD_STOWAWAY=y
CONFIG_KEYBOARD_SUNKBD=y
# CONFIG_KEYBOARD_TM2_TOUCHKEY is not set
CONFIG_KEYBOARD_XTKBD=y
# CONFIG_KEYBOARD_CYPRESS_SF is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
# CONFIG_MOUSE_PS2_BYD is not set
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_SENTELIC is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_PS2_FOCALTECH=y
# CONFIG_MOUSE_PS2_VMMOUSE is not set
CONFIG_MOUSE_PS2_SMBUS=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_ELAN_I2C is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=y
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_DA7280_HAPTICS is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_IQS269A is not set
# CONFIG_INPUT_IQS626A is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_LDISC_AUTOLOAD=y

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_16550A_VARIANTS is not set
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
CONFIG_SERIAL_8250_NR_UARTS=16
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_DWLIB=y
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
# CONFIG_SERIAL_8250_MID is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_LANTIQ is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_BCM63XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_FSL_LINFLEXUART is not set
# CONFIG_SERIAL_SPRD is not set
# end of Serial drivers

# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_N_GSM is not set
CONFIG_NOZOMI=y
# CONFIG_NULL_TTY is not set
# CONFIG_SERIAL_DEV_BUS is not set
# CONFIG_TTY_PRINTK is not set
# CONFIG_VIRTIO_CONSOLE is not set
CONFIG_IPMI_HANDLER=m
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
# CONFIG_IPMI_PANIC_EVENT is not set
# CONFIG_IPMI_DEVICE_INTERFACE is not set
CONFIG_IPMI_SI=m
# CONFIG_IPMI_SSIF is not set
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_INTEL=y
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_BA431 is not set
CONFIG_HW_RANDOM_VIA=y
# CONFIG_HW_RANDOM_XIPHERA is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
CONFIG_DEVMEM=y
CONFIG_NVRAM=y
CONFIG_DEVPORT=y
# CONFIG_HPET is not set
CONFIG_HANGCHECK_TIMER=y
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
# CONFIG_XILLYBUS is not set
# CONFIG_XILLYUSB is not set
# CONFIG_RANDOM_TRUST_CPU is not set
# CONFIG_RANDOM_TRUST_BOOTLOADER is not set
# end of Character devices

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_AMD_MP2 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_NVIDIA_GPU is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EMEV2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_CP2615 is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_VIRTIO is not set
# end of I2C Hardware Bus support

# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# end of I2C support

# CONFIG_I3C is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y
CONFIG_PTP_1588_CLOCK_OPTIONAL=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
CONFIG_PTP_1588_CLOCK_KVM=y
# CONFIG_PTP_1588_CLOCK_IDT82P33 is not set
# CONFIG_PTP_1588_CLOCK_IDTCM is not set
# CONFIG_PTP_1588_CLOCK_VMW is not set
# end of PTP clock support

# CONFIG_PINCTRL is not set
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_POWER_SUPPLY_HWMON=y
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_CHARGER_ADP5061 is not set
# CONFIG_BATTERY_CW2015 is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_LTC4162L is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
# CONFIG_BATTERY_GOLDFISH is not set
# CONFIG_BATTERY_RT5033 is not set
# CONFIG_CHARGER_BD99954 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM1177 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_AHT10 is not set
# CONFIG_SENSORS_AQUACOMPUTER_D5NEXT is not set
# CONFIG_SENSORS_AS370 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_AXI_FAN_CONTROL is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ASPEED is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_CORSAIR_CPRO is not set
# CONFIG_SENSORS_CORSAIR_PSU is not set
# CONFIG_SENSORS_DRIVETEMP is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
# CONFIG_SENSORS_I5500 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2947_I2C is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX127 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX31730 is not set
# CONFIG_SENSORS_MAX6620 is not set
# CONFIG_SENSORS_MAX6621 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_TPS23861 is not set
# CONFIG_SENSORS_MR75203 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NPCM7XX is not set
# CONFIG_SENSORS_NZXT_KRAKEN2 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SBTSI is not set
# CONFIG_SENSORS_SBRMI is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHT4x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_TMP513 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83773G is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_XGENE is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_NETLINK is not set
# CONFIG_THERMAL_STATISTICS is not set
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_DEVFREQ_THERMAL is not set
# CONFIG_THERMAL_EMULATION is not set

#
# Intel thermal drivers
#
CONFIG_INTEL_POWERCLAMP=m
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_X86_PKG_TEMP_THERMAL=m
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# end of ACPI INT340X thermal drivers

CONFIG_INTEL_PCH_THERMAL=m
# CONFIG_INTEL_TCC_COOLING is not set
# CONFIG_INTEL_MENLOW is not set
# end of Intel thermal drivers

# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
CONFIG_SSB=y
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_AS3711 is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_BD9571MWV is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_MADERA is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_MFD_MP2629 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_INTEL_PMT is not set
# CONFIG_MFD_IQS62X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6360 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RT4831 is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
CONFIG_MFD_SYSCON=y
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TQMX86 is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_ATC260X_I2C is not set
# end of Multifunction device drivers

# CONFIG_REGULATOR is not set
# CONFIG_RC_CORE is not set

#
# CEC support
#
# CONFIG_MEDIA_CEC_SUPPORT is not set
# end of CEC support

# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
# CONFIG_AGP is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
# CONFIG_VGA_SWITCHEROO is not set
# CONFIG_DRM is not set
# CONFIG_DRM_DEBUG_MODESET_LOCK is not set

#
# ARM devices
#
# end of ARM devices

CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y

#
# Frame buffer Devices
#
CONFIG_FB_CMDLINE=y
CONFIG_FB_NOTIFY=y
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_MODE_HELPERS is not set
# CONFIG_FB_TILEBLITTING is not set

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SM712 is not set
# end of Frame buffer Devices

#
# Backlight & LCD device support
#
CONFIG_LCD_CLASS_DEVICE=y
# CONFIG_LCD_PLATFORM is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_QCOM_WLED is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LV5207LP is not set
# CONFIG_BACKLIGHT_BD6107 is not set
# CONFIG_BACKLIGHT_ARCXCNN is not set
# end of Backlight & LCD device support

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER is not set
# end of Console display driver support

# CONFIG_LOGO is not set
# end of Graphics support

CONFIG_SOUND=y
CONFIG_SND=y
CONFIG_SND_PCM=y
CONFIG_SND_HWDEP=y
CONFIG_SND_RAWMIDI=y
CONFIG_SND_JACK=y
CONFIG_SND_JACK_INPUT_DEV=y
# CONFIG_SND_OSSEMUL is not set
# CONFIG_SND_PCM_TIMER is not set
# CONFIG_SND_HRTIMER is not set
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_MAX_CARDS=32
CONFIG_SND_SUPPORT_OLD_API=y
# CONFIG_SND_PROC_FS is not set
CONFIG_SND_VERBOSE_PRINTK=y
CONFIG_SND_DEBUG=y
CONFIG_SND_DEBUG_VERBOSE=y
# CONFIG_SND_CTL_VALIDATION is not set
# CONFIG_SND_JACK_INJECTION_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_CTL_LED=y
# CONFIG_SND_SEQUENCER is not set
CONFIG_SND_MPU401_UART=y
CONFIG_SND_AC97_CODEC=y
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_ALOOP is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
# CONFIG_SND_AC97_POWER_SAVE is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ASIHPI is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CTXFI is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=y
CONFIG_SND_INTEL8X0M=y
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_LOLA is not set
# CONFIG_SND_LX6464ES is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SE6X is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
CONFIG_SND_VIA82XX=y
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set

#
# HD-Audio
#
CONFIG_SND_HDA=y
CONFIG_SND_HDA_GENERIC_LEDS=y
CONFIG_SND_HDA_INTEL=y
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_RECONFIG=y
CONFIG_SND_HDA_INPUT_BEEP=y
CONFIG_SND_HDA_INPUT_BEEP_MODE=1
CONFIG_SND_HDA_PATCH_LOADER=y
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_HDMI=y
CONFIG_SND_HDA_CODEC_CIRRUS=y
# CONFIG_SND_HDA_CODEC_CS8409 is not set
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CA0110=y
CONFIG_SND_HDA_CODEC_CA0132=y
# CONFIG_SND_HDA_CODEC_CA0132_DSP is not set
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
# CONFIG_SND_HDA_INTEL_HDMI_SILENT_STREAM is not set
# end of HD-Audio

CONFIG_SND_HDA_CORE=y
CONFIG_SND_HDA_PREALLOC_SIZE=0
CONFIG_SND_INTEL_NHLT=y
CONFIG_SND_INTEL_DSP_CONFIG=y
CONFIG_SND_INTEL_SOUNDWIRE_ACPI=y
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_UA101 is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_USB_6FIRE is not set
# CONFIG_SND_USB_HIFACE is not set
# CONFIG_SND_BCD2000 is not set
# CONFIG_SND_USB_POD is not set
# CONFIG_SND_USB_PODHD is not set
# CONFIG_SND_USB_TONEPORT is not set
# CONFIG_SND_USB_VARIAX is not set
# CONFIG_SND_SOC is not set
CONFIG_SND_X86=y
CONFIG_AC97_BUS=y

#
# HID support
#
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
# CONFIG_HID_A4TECH is not set
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
# CONFIG_HID_APPLE is not set
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_ASUS is not set
# CONFIG_HID_AUREAL is not set
# CONFIG_HID_BELKIN is not set
# CONFIG_HID_BETOP_FF is not set
# CONFIG_HID_BIGBEN_FF is not set
# CONFIG_HID_CHERRY is not set
# CONFIG_HID_CHICONY is not set
# CONFIG_HID_CORSAIR is not set
# CONFIG_HID_COUGAR is not set
# CONFIG_HID_MACALLY is not set
# CONFIG_HID_PRODIKEYS is not set
# CONFIG_HID_CMEDIA is not set
# CONFIG_HID_CREATIVE_SB0540 is not set
# CONFIG_HID_CYPRESS is not set
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELAN is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
# CONFIG_HID_EZKEY is not set
# CONFIG_HID_FT260 is not set
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_GLORIOUS is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_VIVALDI is not set
# CONFIG_HID_GT683R is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_VIEWSONIC is not set
# CONFIG_HID_XIAOMI is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
# CONFIG_HID_ITE is not set
# CONFIG_HID_JABRA is not set
# CONFIG_HID_TWINHAN is not set
# CONFIG_HID_KENSINGTON is not set
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LED is not set
# CONFIG_HID_LENOVO is not set
# CONFIG_HID_LOGITECH is not set
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MALTRON is not set
# CONFIG_HID_MAYFLASH is not set
# CONFIG_HID_REDRAGON is not set
# CONFIG_HID_MICROSOFT is not set
# CONFIG_HID_MONTEREY is not set
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTI is not set
# CONFIG_HID_NTRIG is not set
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PENMOUNT is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_PLAYSTATION is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SEMITEK is not set
# CONFIG_HID_SONY is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEAM is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_THINGM is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_U2FZERO is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_WIIMOTE is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set
# end of Special HID drivers

#
# USB HID support
#
CONFIG_USB_HID=y
# CONFIG_HID_PID is not set
CONFIG_USB_HIDDEV=y
# end of USB HID support

#
# I2C HID support
#
# CONFIG_I2C_HID_ACPI is not set
# end of I2C HID support

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
# end of Intel ISH HID support

#
# AMD SFH HID Support
#
# CONFIG_AMD_SFH_HID is not set
# end of AMD SFH HID Support
# end of HID support

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
# CONFIG_USB_LED_TRIG is not set
# CONFIG_USB_ULPI_BUS is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_FEW_INIT_RETRIES is not set
CONFIG_USB_DYNAMIC_MINORS=y
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_PRODUCTLIST is not set
# CONFIG_USB_OTG_DISABLE_EXTERNAL_HUB is not set
# CONFIG_USB_LEDS_TRIGGER_USBPORT is not set
CONFIG_USB_AUTOSUSPEND_DELAY=2
# CONFIG_USB_MON is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
# CONFIG_USB_XHCI_DBGCAP is not set
CONFIG_USB_XHCI_PCI=y
# CONFIG_USB_XHCI_PCI_RENESAS is not set
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_FSL is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_FOTG210_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_SSB is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=y
# CONFIG_USB_PRINTER is not set
CONFIG_USB_WDM=y
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_REALTEK=y
CONFIG_REALTEK_AUTOPM=y
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
CONFIG_USB_STORAGE_ONETOUCH=y
CONFIG_USB_STORAGE_KARMA=y
CONFIG_USB_STORAGE_CYPRESS_ATACB=y
CONFIG_USB_STORAGE_ENE_UB6250=y
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_USB_CDNS_SUPPORT is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_APPLE_MFI_FASTCHARGE is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set

#
# USB Physical Layer drivers
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_ISP1301 is not set
# end of USB Physical Layer drivers

# CONFIG_USB_GADGET is not set
# CONFIG_TYPEC is not set
# CONFIG_USB_ROLE_SWITCH is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y
# CONFIG_LEDS_CLASS_FLASH is not set
# CONFIG_LEDS_CLASS_MULTICOLOR is not set
# CONFIG_LEDS_BRIGHTNESS_HW_CHANGED is not set

#
# LED drivers
#
# CONFIG_LEDS_APU is not set
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3532 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA963X is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_TLC591XX is not set
# CONFIG_LEDS_LM355x is not set

#
# LED driver for blink(1) USB RGB LED is under Special HID drivers (HID_THINGM)
#
# CONFIG_LEDS_BLINKM is not set
# CONFIG_LEDS_MLXCPLD is not set
# CONFIG_LEDS_MLXREG is not set
# CONFIG_LEDS_USER is not set
# CONFIG_LEDS_NIC78BX is not set
# CONFIG_LEDS_TI_LMU_COMMON is not set

#
# Flash and Torch LED drivers
#

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_DISK is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_ACTIVITY is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_LEDS_TRIGGER_CAMERA is not set
# CONFIG_LEDS_TRIGGER_PANIC is not set
# CONFIG_LEDS_TRIGGER_NETDEV is not set
# CONFIG_LEDS_TRIGGER_PATTERN is not set
CONFIG_LEDS_TRIGGER_AUDIO=y
# CONFIG_LEDS_TRIGGER_TTY is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="n"
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABEOZ9 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF85363 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3028 is not set
# CONFIG_RTC_DRV_RV3032 is not set
# CONFIG_RTC_DRV_RV8803 is not set
# CONFIG_RTC_DRV_SD3078 is not set

#
# SPI RTC drivers
#
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set
# CONFIG_RTC_DRV_RX6110 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_GOLDFISH is not set
# CONFIG_DMADEVICES is not set

#
# DMABUF options
#
CONFIG_SYNC_FILE=y
CONFIG_SW_SYNC=y
# CONFIG_UDMABUF is not set
# CONFIG_DMABUF_MOVE_NOTIFY is not set
# CONFIG_DMABUF_DEBUG is not set
# CONFIG_DMABUF_SELFTESTS is not set
# CONFIG_DMABUF_HEAPS is not set
# CONFIG_DMABUF_SYSFS_STATS is not set
# end of DMABUF options

# CONFIG_AUXDISPLAY is not set
CONFIG_UIO=y
# CONFIG_UIO_CIF is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
# CONFIG_UIO_DMEM_GENIRQ is not set
# CONFIG_UIO_AEC is not set
# CONFIG_UIO_SERCOS3 is not set
# CONFIG_UIO_PCI_GENERIC is not set
# CONFIG_UIO_NETX is not set
# CONFIG_UIO_PRUSS is not set
# CONFIG_UIO_MF624 is not set
# CONFIG_VFIO is not set
CONFIG_IRQ_BYPASS_MANAGER=m
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO_MENU=y
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_MMIO is not set
# CONFIG_VDPA is not set
CONFIG_VHOST_IOTLB=m
CONFIG_VHOST=m
CONFIG_VHOST_MENU=y
CONFIG_VHOST_NET=m
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set
# end of Microsoft Hyper-V guest support

# CONFIG_GREYBUS is not set
# CONFIG_COMEDI is not set
# CONFIG_STAGING is not set
# CONFIG_X86_PLATFORM_DEVICES is not set
CONFIG_PMC_ATOM=y
# CONFIG_CHROME_PLATFORMS is not set
# CONFIG_MELLANOX_PLATFORM is not set
CONFIG_SURFACE_PLATFORMS=y
# CONFIG_SURFACE_3_POWER_OPREGION is not set
# CONFIG_SURFACE_GPE is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
CONFIG_HAVE_CLK=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y
# CONFIG_COMMON_CLK_MAX9485 is not set
# CONFIG_COMMON_CLK_SI5341 is not set
# CONFIG_COMMON_CLK_SI5351 is not set
# CONFIG_COMMON_CLK_SI544 is not set
# CONFIG_COMMON_CLK_CDCE706 is not set
# CONFIG_COMMON_CLK_CS2000_CP is not set
# CONFIG_XILINX_VCU is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# end of Clock Source drivers

CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOASID=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
# end of Generic IOMMU Pagetable Support

# CONFIG_IOMMU_DEBUGFS is not set
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
# CONFIG_AMD_IOMMU is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON=y
CONFIG_IRQ_REMAP=y

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set
# end of Remoteproc drivers

#
# Rpmsg drivers
#
# CONFIG_RPMSG_QCOM_GLINK_RPM is not set
# CONFIG_RPMSG_VIRTIO is not set
# end of Rpmsg drivers

# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Enable LiteX SoC Builder specific drivers
#
# end of Enable LiteX SoC Builder specific drivers

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

CONFIG_PM_DEVFREQ=y

#
# DEVFREQ Governors
#
CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND=m
# CONFIG_DEVFREQ_GOV_PERFORMANCE is not set
# CONFIG_DEVFREQ_GOV_POWERSAVE is not set
# CONFIG_DEVFREQ_GOV_USERSPACE is not set
# CONFIG_DEVFREQ_GOV_PASSIVE is not set

#
# DEVFREQ Drivers
#
# CONFIG_PM_DEVFREQ_EVENT is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
# end of IRQ chip support

# CONFIG_IPACK_BUS is not set
CONFIG_RESET_CONTROLLER=y
# CONFIG_RESET_TI_SYSCON is not set

#
# PHY Subsystem
#
CONFIG_GENERIC_PHY=y
# CONFIG_USB_LGM_PHY is not set
# CONFIG_PHY_CAN_TRANSCEIVER is not set

#
# PHY drivers for Broadcom platforms
#
# CONFIG_BCM_KONA_USB2_PHY is not set
# end of PHY drivers for Broadcom platforms

# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_PHY_INTEL_LGM_EMMC is not set
# end of PHY Subsystem

# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# end of Performance monitor support

CONFIG_RAS=y
# CONFIG_RAS_CEC is not set
# CONFIG_USB4 is not set

#
# Android
#
# CONFIG_ANDROID is not set
# end of Android

CONFIG_LIBNVDIMM=m
CONFIG_BLK_DEV_PMEM=m
CONFIG_ND_BLK=m
CONFIG_ND_CLAIM=y
CONFIG_ND_BTT=m
CONFIG_BTT=y
CONFIG_ND_PFN=m
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_DAX_DRIVER=y
CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_KMEM=m
CONFIG_DEV_DAX_PMEM_COMPAT=m
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y
# CONFIG_NVMEM_RMEM is not set

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# end of HW tracing support

# CONFIG_FPGA is not set
# CONFIG_TEE is not set
CONFIG_PM_OPP=y
# CONFIG_UNISYS_VISORBUS is not set
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set
# CONFIG_INTERCONNECT is not set
# CONFIG_COUNTER is not set
# CONFIG_MOST is not set
# end of Device Drivers

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_VALIDATE_FS_PARSER is not set
CONFIG_FS_IOMAP=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
# CONFIG_EXT4_USE_FOR_EXT2 is not set
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_XFS_FS=m
CONFIG_XFS_SUPPORT_V4=y
# CONFIG_XFS_QUOTA is not set
# CONFIG_XFS_POSIX_ACL is not set
CONFIG_XFS_RT=y
CONFIG_XFS_ONLINE_SCRUB=y
# CONFIG_XFS_ONLINE_REPAIR is not set
CONFIG_XFS_DEBUG=y
CONFIG_XFS_ASSERT_FATAL=y
# CONFIG_GFS2_FS is not set
CONFIG_OCFS2_FS=m
CONFIG_OCFS2_FS_O2CB=m
CONFIG_OCFS2_FS_STATS=y
CONFIG_OCFS2_DEBUG_MASKLOG=y
# CONFIG_OCFS2_DEBUG_FS is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=y
# CONFIG_FS_VERITY is not set
CONFIG_FSNOTIFY=y
# CONFIG_DNOTIFY is not set
CONFIG_INOTIFY_USER=y
# CONFIG_FANOTIFY is not set
CONFIG_QUOTA=y
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=m
# CONFIG_QFMT_V1 is not set
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
# CONFIG_AUTOFS4_FS is not set
# CONFIG_AUTOFS_FS is not set
CONFIG_FUSE_FS=m
# CONFIG_CUSE is not set
# CONFIG_VIRTIO_FS is not set
# CONFIG_OVERLAY_FS is not set

#
# Caches
#
# CONFIG_FSCACHE is not set
# end of Caches

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
# CONFIG_UDF_FS is not set
# end of CD-ROM/DVD Filesystems

#
# DOS/FAT/EXFAT/NT Filesystems
#
CONFIG_FAT_FS=y
# CONFIG_MSDOS_FS is not set
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_FAT_DEFAULT_UTF8 is not set
# CONFIG_EXFAT_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_NTFS3_FS is not set
# end of DOS/FAT/EXFAT/NT Filesystems

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_PROC_CHILDREN=y
CONFIG_PROC_PID_ARCH_STATUS=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_TMPFS_XATTR is not set
# CONFIG_TMPFS_INODE64 is not set
# CONFIG_HUGETLBFS is not set
CONFIG_MEMFD_CREATE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=y
# end of Pseudo filesystems

CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=y
CONFIG_CRAMFS_BLOCKDEV=y
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_DEFLATE_COMPRESS=y
# CONFIG_PSTORE_LZO_COMPRESS is not set
# CONFIG_PSTORE_LZ4_COMPRESS is not set
# CONFIG_PSTORE_LZ4HC_COMPRESS is not set
# CONFIG_PSTORE_842_COMPRESS is not set
# CONFIG_PSTORE_ZSTD_COMPRESS is not set
CONFIG_PSTORE_COMPRESS=y
CONFIG_PSTORE_DEFLATE_COMPRESS_DEFAULT=y
CONFIG_PSTORE_COMPRESS_DEFAULT="deflate"
CONFIG_PSTORE_CONSOLE=y
CONFIG_PSTORE_PMSG=y
# CONFIG_PSTORE_FTRACE is not set
CONFIG_PSTORE_RAM=m
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_EROFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V2=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_SWAP is not set
CONFIG_NFS_V4_1=y
CONFIG_NFS_V4_2=y
CONFIG_PNFS_FILE_LAYOUT=y
CONFIG_PNFS_BLOCK=m
CONFIG_PNFS_FLEXFILE_LAYOUT=y
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
# CONFIG_NFS_V4_1_MIGRATION is not set
CONFIG_ROOT_NFS=y
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DISABLE_UDP_SUPPORT=y
# CONFIG_NFS_V4_2_READ_PLUS is not set
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
# CONFIG_NFSD_BLOCKLAYOUT is not set
# CONFIG_NFSD_SCSILAYOUT is not set
# CONFIG_NFSD_FLEXFILELAYOUT is not set
# CONFIG_NFSD_V4_2_INTER_SSC is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_NFS_V4_2_SSC_HELPER=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_SUNRPC_BACKCHANNEL=y
CONFIG_RPCSEC_GSS_KRB5=y
# CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES is not set
# CONFIG_SUNRPC_DEBUG is not set
# CONFIG_CEPH_FS is not set
CONFIG_CIFS=y
CONFIG_CIFS_STATS2=y
CONFIG_CIFS_ALLOW_INSECURE_LEGACY=y
# CONFIG_CIFS_UPCALL is not set
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CIFS_DEBUG=y
CONFIG_CIFS_DEBUG2=y
# CONFIG_CIFS_DEBUG_DUMP_KEYS is not set
# CONFIG_CIFS_DFS_UPCALL is not set
# CONFIG_CIFS_SWN_UPCALL is not set
# CONFIG_CIFS_ROOT is not set
# CONFIG_SMB_SERVER is not set
CONFIG_SMBFS_COMMON=y
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
CONFIG_NLS_CODEPAGE_936=y
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set
# CONFIG_UNICODE is not set
CONFIG_IO_WQ=y
# end of File systems

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_REQUEST_CACHE is not set
# CONFIG_PERSISTENT_KEYRINGS is not set
# CONFIG_ENCRYPTED_KEYS is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITYFS is not set
CONFIG_PAGE_TABLE_ISOLATION=y
# CONFIG_INTEL_TXT is not set
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
# CONFIG_HARDENED_USERCOPY is not set
CONFIG_FORTIFY_SOURCE=y
# CONFIG_STATIC_USERMODEHELPER is not set
# CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,bpf"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_INIT_STACK_NONE=y
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
# end of Memory initialization
# end of Kernel hardening options
# end of Security options

CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_ASYNC_PQ=m
CONFIG_ASYNC_RAID6_RECOV=m
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=y
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
# CONFIG_CRYPTO_ECDSA is not set
# CONFIG_CRYPTO_ECRDSA is not set
# CONFIG_CRYPTO_SM2 is not set
# CONFIG_CRYPTO_CURVE25519 is not set
# CONFIG_CRYPTO_CURVE25519_X86 is not set

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=y
CONFIG_CRYPTO_GCM=y
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
CONFIG_CRYPTO_SEQIV=y
# CONFIG_CRYPTO_ECHAINIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CFB is not set
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_OFB is not set
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# CONFIG_CRYPTO_KEYWRAP is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_ADIANTUM is not set
CONFIG_CRYPTO_ESSIV=m

#
# Hash modes
#
CONFIG_CRYPTO_CMAC=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
# CONFIG_CRYPTO_XXHASH is not set
# CONFIG_CRYPTO_BLAKE2B is not set
# CONFIG_CRYPTO_BLAKE2S is not set
# CONFIG_CRYPTO_BLAKE2S_X86 is not set
CONFIG_CRYPTO_CRCT10DIF=m
# CONFIG_CRYPTO_CRCT10DIF_PCLMUL is not set
CONFIG_CRYPTO_GHASH=y
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
CONFIG_CRYPTO_MD4=y
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_RMD160 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
# CONFIG_CRYPTO_SHA3 is not set
# CONFIG_CRYPTO_SM3 is not set
# CONFIG_CRYPTO_STREEBOG is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_CHACHA20 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
# CONFIG_CRYPTO_LZO is not set
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
# CONFIG_CRYPTO_ZSTD is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
# CONFIG_CRYPTO_DRBG_HASH is not set
# CONFIG_CRYPTO_DRBG_CTR is not set
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_HASH_INFO=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_ARC4=y
# CONFIG_CRYPTO_LIB_BLAKE2S is not set
# CONFIG_CRYPTO_LIB_CHACHA is not set
# CONFIG_CRYPTO_LIB_CURVE25519 is not set
CONFIG_CRYPTO_LIB_DES=y
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
# CONFIG_CRYPTO_LIB_POLY1305 is not set
# CONFIG_CRYPTO_LIB_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_LIB_SHA256=y
# CONFIG_CRYPTO_HW is not set
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
CONFIG_X509_CERTIFICATE_PARSER=y
# CONFIG_PKCS8_PRIVATE_KEY_PARSER is not set
CONFIG_PKCS7_MESSAGE_PARSER=y
# CONFIG_PKCS7_TEST_KEY is not set
# CONFIG_SIGNED_PE_FILE_VERIFICATION is not set

#
# Certificates for signature checking
#
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_SYSTEM_TRUSTED_KEYS=""
# CONFIG_SYSTEM_EXTRA_CERTIFICATE is not set
# CONFIG_SECONDARY_TRUSTED_KEYRING is not set
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
# end of Certificates for signature checking

CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=m
CONFIG_RAID6_PQ_BENCHMARK=y
# CONFIG_PACKING is not set
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
# CONFIG_CORDIC is not set
# CONFIG_PRIME_NUMBERS is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=m
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC64 is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set
CONFIG_XXHASH=y
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_ZSTD_DECOMPRESS=y
# CONFIG_XZ_DEC is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_ZSTD=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_REED_SOLOMON=m
CONFIG_REED_SOLOMON_ENC8=y
CONFIG_REED_SOLOMON_DEC8=y
CONFIG_BTREE=y
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_SWIOTLB=y
CONFIG_DMA_CMA=y
# CONFIG_DMA_PERNUMA_CMA is not set

#
# Default contiguous memory area size:
#
CONFIG_CMA_SIZE_MBYTES=200
CONFIG_CMA_SIZE_SEL_MBYTES=y
# CONFIG_CMA_SIZE_SEL_PERCENTAGE is not set
# CONFIG_CMA_SIZE_SEL_MIN is not set
# CONFIG_CMA_SIZE_SEL_MAX is not set
CONFIG_CMA_ALIGNMENT=8
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_DMA_MAP_BENCHMARK is not set
CONFIG_SGL_ALLOC=y
CONFIG_IOMMU_HELPER=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
CONFIG_CLZ_TAB=y
CONFIG_IRQ_POLL=y
CONFIG_MPILIB=y
CONFIG_OID_REGISTRY=y
CONFIG_UCS2_STRING=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_MEMREGION=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_COPY_MC=y
CONFIG_ARCH_STACKWALK=y
CONFIG_SBITMAP=y
# end of Library routines

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
CONFIG_PRINTK_CALLER=y
# CONFIG_STACKTRACE_BUILD_ID is not set
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
# CONFIG_DEBUG_SYNCHRO_TEST is not set
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DYNAMIC_DEBUG_CORE is not set
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y
# end of printk and dmesg options

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_REDUCED=y
# CONFIG_DEBUG_INFO_COMPRESSED is not set
# CONFIG_DEBUG_INFO_SPLIT is not set
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
# CONFIG_DEBUG_INFO_DWARF4 is not set
# CONFIG_DEBUG_INFO_DWARF5 is not set
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
# CONFIG_GDB_SCRIPTS is not set
CONFIG_FRAME_WARN=2048
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
# CONFIG_HEADERS_INSTALL is not set
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
# CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B is not set
CONFIG_STACK_VALIDATION=y
# CONFIG_VMLINUX_MAP is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
# CONFIG_DEBUG_FS_DISALLOW_MOUNT is not set
# CONFIG_DEBUG_FS_ALLOW_NONE is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set
CONFIG_HAVE_ARCH_KCSAN=y
# end of Generic Kernel Debugging Instruments

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
CONFIG_DEBUG_RODATA_TEST=y
CONFIG_ARCH_HAS_DEBUG_WX=y
# CONFIG_DEBUG_WX is not set
CONFIG_GENERIC_PTDUMP=y
# CONFIG_PTDUMP_DEBUGFS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_VMACACHE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
CONFIG_DEBUG_VM_PGTABLE=y
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_MEMORY_NOTIFIER_ERROR_INJECT=m
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
# CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
# CONFIG_KASAN is not set
CONFIG_HAVE_ARCH_KFENCE=y
# CONFIG_KFENCE is not set
# end of Memory Debugging

# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Oops, Lockups and Hangs
#
CONFIG_PANIC_ON_OOPS=y
CONFIG_PANIC_ON_OOPS_VALUE=1
CONFIG_PANIC_TIMEOUT=0
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_HARDLOCKUP_DETECTOR is not set
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=480
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_WQ_WATCHDOG=y
# CONFIG_TEST_LOCKUP is not set
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
# CONFIG_SCHEDSTATS is not set
# end of Scheduler Debugging

# CONFIG_DEBUG_TIMEKEEPING is not set
CONFIG_DEBUG_PREEMPT=y

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

# CONFIG_DEBUG_IRQFLAGS is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set

#
# Debug kernel data structures
#
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PLIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# end of Debug kernel data structures

# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_LATENCYTOP is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_OBJTOOL_MCOUNT=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_BOOTTIME_TRACING is not set
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
# CONFIG_FUNCTION_PROFILER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_HWLAT_TRACER is not set
# CONFIG_OSNOISE_TRACER is not set
# CONFIG_TIMERLAT_TRACER is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_FTRACE_SYSCALLS is not set
# CONFIG_TRACER_SNAPSHOT is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_KPROBE_EVENTS=y
# CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
CONFIG_UPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
# CONFIG_BPF_KPROBE_OVERRIDE is not set
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_CC=y
CONFIG_TRACING_MAP=y
CONFIG_SYNTH_EVENTS=y
CONFIG_HIST_TRIGGERS=y
# CONFIG_TRACE_EVENT_INJECT is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
# CONFIG_FTRACE_RECORD_RECURSION is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
# CONFIG_SYNTH_EVENT_GEN_TEST is not set
# CONFIG_KPROBE_EVENT_GEN_TEST is not set
# CONFIG_HIST_TRIGGERS_DEBUG is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_MULTI_DIRECT=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
# CONFIG_STRICT_DEVMEM is not set

#
# x86 Debugging
#
# CONFIG_DEBUG_AID_FOR_SYZBOT is not set
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_EARLY_PRINTK_USB=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_EARLY_PRINTK_USB_XDBC=y
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_DEBUG is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_PUNIT_ATOM_DEBUG is not set
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set
# CONFIG_UNWINDER_GUESS is not set
# end of x86 Debugging

#
# Kernel Testing and Coverage
#
# CONFIG_KUNIT is not set
CONFIG_NOTIFIER_ERROR_INJECTION=m
CONFIG_PM_NOTIFIER_ERROR_INJECT=m
# CONFIG_NETDEV_NOTIFIER_ERROR_INJECT is not set
CONFIG_FUNCTION_ERROR_INJECTION=y
CONFIG_FAULT_INJECTION=y
# CONFIG_FAILSLAB is not set
# CONFIG_FAIL_PAGE_ALLOC is not set
# CONFIG_FAULT_INJECTION_USERCOPY is not set
CONFIG_FAIL_MAKE_REQUEST=y
# CONFIG_FAIL_IO_TIMEOUT is not set
# CONFIG_FAIL_FUTEX is not set
CONFIG_FAULT_INJECTION_DEBUG_FS=y
# CONFIG_FAIL_FUNCTION is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
CONFIG_RUNTIME_TESTING_MENU=y
# CONFIG_LKDTM is not set
# CONFIG_TEST_MIN_HEAP is not set
# CONFIG_TEST_DIV64 is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_REED_SOLOMON_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_ASYNC_RAID6_TEST is not set
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_STRING_SELFTEST is not set
# CONFIG_TEST_STRING_HELPERS is not set
# CONFIG_TEST_STRSCPY is not set
# CONFIG_TEST_KSTRTOX is not set
CONFIG_TEST_PRINTF=m
# CONFIG_TEST_SCANF is not set
CONFIG_TEST_BITMAP=m
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_XARRAY is not set
# CONFIG_TEST_OVERFLOW is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_HASH is not set
# CONFIG_TEST_IDA is not set
CONFIG_TEST_LKM=m
# CONFIG_TEST_BITOPS is not set
# CONFIG_TEST_VMALLOC is not set
CONFIG_TEST_USER_COPY=m
CONFIG_TEST_BPF=m
# CONFIG_TEST_BLACKHOLE_DEV is not set
# CONFIG_FIND_BIT_BENCHMARK is not set
# CONFIG_TEST_FIRMWARE is not set
CONFIG_TEST_SYSCTL=m
# CONFIG_TEST_UDELAY is not set
CONFIG_TEST_STATIC_KEYS=m
# CONFIG_TEST_MEMCAT_P is not set
# CONFIG_TEST_STACKINIT is not set
# CONFIG_TEST_MEMINIT is not set
# CONFIG_TEST_FREE_PAGES is not set
# CONFIG_TEST_FPU is not set
# CONFIG_TEST_CLOCKSOURCE_WATCHDOG is not set
CONFIG_ARCH_USE_MEMTEST=y
CONFIG_MEMTEST=y
# end of Kernel Testing and Coverage

#
# Rust hacking
#
# end of Rust hacking
# end of Kernel hacking

[-- Attachment #3: job-script --]
[-- Type: text/plain, Size: 4680 bytes --]

#!/bin/sh

export_top_env()
{
	export suite='boot'
	export testcase='boot'
	export category='functional'
	export timeout='10m'
	export job_origin='boot.yaml'
	export queue_cmdline_keys='branch
commit
queue_at_least_once'
	export queue='validate'
	export testbox='vm-snb-194'
	export tbox_group='vm-snb'
	export branch='linux-review/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837'
	export commit='afcc9fb8741f26773a381ac1e159e0172344b7d5'
	export kconfig='x86_64-kexec'
	export repeat_to=6
	export nr_vm=160
	export submit_id='618e50ed63472456d854936e'
	export job_file='/lkp/jobs/scheduled/vm-snb-194/boot-1-debian-i386-20191205.cgz-afcc9fb8741f26773a381ac1e159e0172344b7d5-20211112-22232-7nwhob-4.yaml'
	export id='179a281320a6b68fdccca4791d37f467fc0d6ecb'
	export queuer_version='/lkp-src'
	export model='qemu-system-x86_64 -enable-kvm -cpu SandyBridge'
	export nr_cpu=2
	export memory='16G'
	export need_kconfig=\{\"KVM_GUEST\"\=\>\"y\"\}
	export ssh_base_port=23032
	export kernel_cmdline='vmalloc=128M initramfs_async=0 page_owner=on'
	export rootfs='debian-i386-20191205.cgz'
	export compiler='gcc-9'
	export enqueue_time='2021-11-12 19:33:01 +0800'
	export _id='618e50fd63472456d854936f'
	export _rt='/result/boot/1/vm-snb/debian-i386-20191205.cgz/x86_64-kexec/gcc-9/afcc9fb8741f26773a381ac1e159e0172344b7d5'
	export user='lkp'
	export LKP_SERVER='internal-lkp-server'
	export result_root='/result/boot/1/vm-snb/debian-i386-20191205.cgz/x86_64-kexec/gcc-9/afcc9fb8741f26773a381ac1e159e0172344b7d5/3'
	export scheduler_version='/lkp/lkp/.src-20211112-114915'
	export arch='i386'
	export max_uptime=600
	export initrd='/osimage/debian/debian-i386-20191205.cgz'
	export bootloader_append='root=/dev/ram0
user=lkp
job=/lkp/jobs/scheduled/vm-snb-194/boot-1-debian-i386-20191205.cgz-afcc9fb8741f26773a381ac1e159e0172344b7d5-20211112-22232-7nwhob-4.yaml
ARCH=x86_64
kconfig=x86_64-kexec
branch=linux-review/Qi-Zheng/Free-user-PTE-page-table-pages/20211110-185837
commit=afcc9fb8741f26773a381ac1e159e0172344b7d5
BOOT_IMAGE=/pkg/linux/x86_64-kexec/gcc-9/afcc9fb8741f26773a381ac1e159e0172344b7d5/vmlinuz-5.15.0-rc7-mm1-00448-gafcc9fb8741f
vmalloc=128M initramfs_async=0 page_owner=on
max_uptime=600
RESULT_ROOT=/result/boot/1/vm-snb/debian-i386-20191205.cgz/x86_64-kexec/gcc-9/afcc9fb8741f26773a381ac1e159e0172344b7d5/3
LKP_SERVER=internal-lkp-server
selinux=0
debug
apic=debug
sysrq_always_enabled
rcupdate.rcu_cpu_stall_timeout=100
net.ifnames=0
printk.devkmsg=on
panic=-1
softlockup_panic=1
nmi_watchdog=panic
oops=panic
load_ramdisk=2
prompt_ramdisk=0
drbd.minor_count=8
systemd.log_level=err
ignore_loglevel
console=tty0
earlyprintk=ttyS0,115200
console=ttyS0,115200
vga=normal
rw'
	export modules_initrd='/pkg/linux/x86_64-kexec/gcc-9/afcc9fb8741f26773a381ac1e159e0172344b7d5/modules.cgz'
	export bm_initrd='/osimage/deps/debian-i386-20191205.cgz/lkp_20201203.cgz'
	export lkp_initrd='/osimage/user/lkp/lkp-i386.cgz'
	export site='inn'
	export LKP_CGI_PORT=80
	export LKP_CIFS_PORT=139
	export schedule_notify_address=
	export queue_at_least_once=1
	export kernel='/pkg/linux/x86_64-kexec/gcc-9/afcc9fb8741f26773a381ac1e159e0172344b7d5/vmlinuz-5.15.0-rc7-mm1-00448-gafcc9fb8741f'
	export dequeue_time='2021-11-12 19:34:47 +0800'
	export job_initrd='/lkp/jobs/scheduled/vm-snb-194/boot-1-debian-i386-20191205.cgz-afcc9fb8741f26773a381ac1e159e0172344b7d5-20211112-22232-7nwhob-4.cgz'

	[ -n "$LKP_SRC" ] ||
	export LKP_SRC=/lkp/${user:-lkp}/src
}

run_job()
{
	echo $$ > $TMP/run-job.pid

	. $LKP_SRC/lib/http.sh
	. $LKP_SRC/lib/job.sh
	. $LKP_SRC/lib/env.sh

	export_top_env

	run_monitor $LKP_SRC/monitors/one-shot/wrapper boot-slabinfo
	run_monitor $LKP_SRC/monitors/one-shot/wrapper boot-meminfo
	run_monitor $LKP_SRC/monitors/one-shot/wrapper memmap
	run_monitor $LKP_SRC/monitors/no-stdout/wrapper boot-time
	run_monitor $LKP_SRC/monitors/wrapper kmsg
	run_monitor $LKP_SRC/monitors/wrapper heartbeat
	run_monitor $LKP_SRC/monitors/wrapper meminfo
	run_monitor $LKP_SRC/monitors/wrapper oom-killer
	run_monitor $LKP_SRC/monitors/plain/watchdog

	run_test $LKP_SRC/tests/wrapper sleep 1
}

extract_stats()
{
	export stats_part_begin=
	export stats_part_end=

	$LKP_SRC/stats/wrapper boot-slabinfo
	$LKP_SRC/stats/wrapper boot-meminfo
	$LKP_SRC/stats/wrapper memmap
	$LKP_SRC/stats/wrapper boot-memory
	$LKP_SRC/stats/wrapper boot-time
	$LKP_SRC/stats/wrapper kernel-size
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper sleep
	$LKP_SRC/stats/wrapper meminfo

	$LKP_SRC/stats/wrapper time sleep.time
	$LKP_SRC/stats/wrapper dmesg
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper last_state
	$LKP_SRC/stats/wrapper stderr
	$LKP_SRC/stats/wrapper time
}

"$@"

[-- Attachment #4: dmesg.xz --]
[-- Type: application/x-xz, Size: 11632 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2021-11-14 14:43 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-10 10:54 [PATCH v3 00/15] Free user PTE page table pages Qi Zheng
2021-11-10 10:54 ` [PATCH v3 01/15] mm: do code cleanups to filemap_map_pmd() Qi Zheng
2021-11-10 10:54 ` [PATCH v3 02/15] mm: introduce is_huge_pmd() helper Qi Zheng
2021-11-11 13:46   ` kernel test robot
2021-11-10 10:54 ` [PATCH v3 03/15] mm: move pte_offset_map_lock() to pgtable.h Qi Zheng
2021-11-10 10:54 ` [PATCH v3 04/15] mm: rework the parameter of lock_page_or_retry() Qi Zheng
2021-11-10 10:54 ` [PATCH v3 05/15] mm: add pmd_installed_type return for __pte_alloc() and other friends Qi Zheng
2021-11-10 10:54 ` [PATCH v3 06/15] mm: introduce refcount for user PTE page table page Qi Zheng
2021-11-11  0:37   ` kernel test robot
2021-11-10 10:54 ` [PATCH v3 07/15] mm/pte_ref: add support for user PTE page table page allocation Qi Zheng
2021-11-11 15:17   ` kernel test robot
2021-11-10 10:54 ` [PATCH v3 08/15] mm/pte_ref: initialize the refcount of the withdrawn PTE page table page Qi Zheng
2021-11-10 10:54 ` [PATCH v3 09/15] mm/pte_ref: add support for the map/unmap of user " Qi Zheng
2021-11-10 10:54 ` [PATCH v3 10/15] mm/pte_ref: add support for page fault path Qi Zheng
2021-11-10 10:54 ` [PATCH v3 11/15] mm/pte_ref: take a refcount before accessing the PTE page table page Qi Zheng
2021-11-10 10:54 ` [PATCH v3 12/15] mm/pte_ref: update the pmd entry in move_normal_pmd() Qi Zheng
2021-11-10 10:54 ` [PATCH v3 13/15] mm/pte_ref: free user PTE page table pages Qi Zheng
2021-11-14 14:43   ` [mm/pte_ref] afcc9fb874: kernel_BUG_at_include/linux/pte_ref.h kernel test robot
2021-11-10 10:54 ` [PATCH v3 14/15] Documentation: add document for pte_ref Qi Zheng
2021-11-10 14:39   ` Jonathan Corbet
2021-11-11  5:40     ` Qi Zheng
2021-11-10 10:54 ` [PATCH v3 15/15] mm/pte_ref: use mmu_gather to free PTE page table pages Qi Zheng
2021-11-10 12:56 ` [PATCH v3 00/15] Free user " Jason Gunthorpe
2021-11-10 13:25   ` David Hildenbrand
2021-11-10 13:59     ` Qi Zheng
2021-11-10 14:38     ` Jason Gunthorpe
2021-11-10 15:37       ` David Hildenbrand
2021-11-10 16:39         ` Jason Gunthorpe
2021-11-10 17:37           ` David Hildenbrand
2021-11-10 17:49             ` Jason Gunthorpe
2021-11-11  3:58             ` Qi Zheng
2021-11-11  9:22               ` David Hildenbrand
2021-11-11 11:08                 ` Qi Zheng
2021-11-11 11:19                   ` David Hildenbrand
2021-11-11 12:00                     ` Qi Zheng
2021-11-11 12:20                       ` David Hildenbrand
2021-11-11 12:32                         ` Qi Zheng
2021-11-11 12:51                           ` David Hildenbrand
2021-11-11 13:01                             ` Qi Zheng
2021-11-10 16:49         ` Matthew Wilcox
2021-11-10 16:53           ` David Hildenbrand
2021-11-10 16:56             ` Jason Gunthorpe
2021-11-10 13:54   ` Qi Zheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).