From: Hugh Dickins <hughd@google.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@kernel.org>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	David Hildenbrand <david@redhat.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Yang Shi <shy828301@gmail.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Peter Xu <peterx@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Will Deacon <will@kernel.org>, Yu Zhao <yuzhao@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Steven Price <steven.price@arm.com>,
	SeongJae Park <sj@kernel.org>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Huang Ying <ying.huang@intel.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	Zack Rusin <zackr@vmware.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Pasha Tatashin <pasha.tatashin@soleen.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Minchan Kim <minchan@kernel.org>,
	Christoph Hellwig <hch@infradead.org>, Song Liu <song@kernel.org>,
	Thomas Hellstrom <thomas.hellstrom@linux.intel.com>,
	Ryan Roberts <ryan.roberts@arm.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 04/32] mm/pgtable: allow pte_offset_map[_lock]() to fail
Date: Thu, 8 Jun 2023 18:10:32 -0700 (PDT)	[thread overview]
Message-ID: <2929bfd-9893-a374-e463-4c3127ff9b9d@google.com> (raw)
In-Reply-To: <c1c9a74a-bc5b-15ea-e5d2-8ec34bc921d@google.com>

Make pte_offset_map() a wrapper for __pte_offset_map() (optionally
outputting pmdval), and pte_offset_map_lock() a sparse __cond_lock()
wrapper for __pte_offset_map_lock(): those __funcs added in
mm/pgtable-generic.c.

__pte_offset_map() does the pmdval validation (including pmd_clear_bad()
when pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verifies that pmdval is unchanged after getting
the lock, trying again if it changed.
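The validation order can be pictured with a simplified userspace model
(the bit flags, names and pte_offset_map_model() below are stand-ins for
illustration, not the kernel's actual pmd encodings or functions):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of __pte_offset_map()'s validation order; all
 * flags and names here are stand-ins, not real kernel pmd bits. */
#define PMD_NONE       0x0
#define PMD_TABLE      0x1   /* points to a pte page table */
#define PMD_HUGE       0x2   /* pmd_trans_huge()/pmd_devmap() */
#define PMD_MIGRATION  0x4   /* is_pmd_migration_entry() */
#define PMD_BAD        0x8   /* pmd_bad() */

static int cleared_bad;      /* counts pmd_clear_bad() calls */

/* Returns true ("pte mapped") only for a sane page-table pmd;
 * every other state returns false, like the NULL return above. */
static bool pte_offset_map_model(unsigned pmdval)
{
	if (pmdval == PMD_NONE || (pmdval & PMD_MIGRATION))
		return false;            /* nothing to map here */
	if (pmdval & PMD_HUGE)
		return false;            /* huge entry, not a page table */
	if (pmdval & PMD_BAD) {
		cleared_bad++;           /* pmd_clear_bad() */
		return false;
	}
	return true;                     /* pmdval is for a page table */
}
```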

No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
to cover the imminent case, but we expect to generalize it later, and
it makes a mess of where to do the pmd_bad() clearing.

Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock.  This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
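Why deriving the lock from the same snapshot matters can be sketched in
a userspace model (struct ptdesc, map_nolock() and lockptr_now() are
invented stand-ins, not kernel structures):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of why pte_offset_map_nolock() beats open-coded pte_lockptr():
 * it derives the lock pointer from the same pmd snapshot as the pte
 * pointer.  All structures here are userspace stand-ins. */
struct ptdesc { int ptl; int pte[4]; };      /* split-ptlock model */

static struct ptdesc table_a, table_b;
static struct ptdesc *pmd = &table_a;        /* the pmd entry */

/* pte_offset_map_nolock() model: one read of *pmd yields both the
 * pte pointer and the matching lock pointer. */
static int *map_nolock(int idx, int **ptlp)
{
	struct ptdesc *pmdval = pmd;         /* lockless pmd snapshot */
	*ptlp = &pmdval->ptl;                /* lock of the table we return */
	return &pmdval->pte[idx];
}

/* Open-coded pte_lockptr(mm, pmd) model: re-reads *pmd later, so it
 * can name the wrong lock if *pmd has changed in the meantime. */
static int *lockptr_now(void)
{
	return &pmd->ptl;
}
```

After a racy pmd change, map_nolock()'s lock pointer still matches the
pte it returned, while a fresh lockptr_now() would not.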

Update corresponding Documentation.

Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier).  But comment
where they will go, so that it's easy to add them for experiments.  And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
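The lock-then-revalidate loop of __pte_offset_map_lock() can be modelled
in userspace (pmd_slot, the simulated racing update inside spin_lock(),
and map_lock_model() are all stand-ins for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of the revalidation loop in __pte_offset_map_lock():
 * take the pte lock, then re-check that the pmd did not change racily
 * in between; if it did, drop the lock and start over. */
typedef uint64_t pmd_t;

static pmd_t pmd_slot;     /* the pmd entry, updated racily */
static int lock_held;      /* stands in for the pte spinlock */
static int retries;        /* how many times revalidation failed */

static pmd_t pmdp_get_lockless(void) { return pmd_slot; }

static void spin_lock(void)
{
	lock_held = 1;
	/* Simulate one racing update landing before we got the lock. */
	if (retries == 0 && pmd_slot == 42)
		pmd_slot = 7;
}

static void spin_unlock(void) { lock_held = 0; }

/* Returns true once a pte table is "mapped" with its lock held and
 * the pmd verified unchanged; false if there is no page table. */
static bool map_lock_model(void)
{
	pmd_t pmdval;
again:
	pmdval = pmdp_get_lockless();
	if (pmdval == 0)
		return false;              /* no page table: NULL case */
	spin_lock();
	if (pmdval == pmdp_get_lockless())
		return true;               /* pmd unchanged: success */
	spin_unlock();
	retries++;
	goto again;                        /* changed racily: try again */
}
```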

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 Documentation/mm/split_page_table_lock.rst | 17 ++++---
 include/linux/mm.h                         | 27 +++++++----
 include/linux/pgtable.h                    | 22 ++++++---
 mm/pgtable-generic.c                       | 56 ++++++++++++++++++++++
 4 files changed, 101 insertions(+), 21 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 50ee0dfc95be..a834fad9de12 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -14,15 +14,20 @@ tables. Access to higher level tables protected by mm->page_table_lock.
 There are helpers to lock/unlock a table and other accessor functions:
 
  - pte_offset_map_lock()
-	maps pte and takes PTE table lock, returns pointer to the taken
-	lock;
+	maps PTE and takes PTE table lock, returns pointer to PTE with
+	pointer to its PTE table lock, or returns NULL if no PTE table;
+ - pte_offset_map_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map()
+	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
+ - pte_unmap()
+	unmaps PTE table;
  - pte_unmap_unlock()
 	unlocks and unmaps PTE table;
  - pte_alloc_map_lock()
-	allocates PTE table if needed and take the lock, returns pointer
-	to taken lock or NULL if allocation failed;
- - pte_lockptr()
-	returns pointer to PTE table lock;
+	allocates PTE table if needed and takes its lock, returns pointer to
+	PTE with pointer to its lock, or returns NULL if allocation failed;
  - pmd_lock()
 	takes PMD table lock, returns pointer to taken lock;
  - pmd_lockptr()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..3c2e56980853 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2787,14 +2787,25 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 	dec_lruvec_page_state(page, NR_PAGETABLE);
 }
 
-#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
-({							\
-	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
-	pte_t *__pte = pte_offset_map(pmd, address);	\
-	*(ptlp) = __ptl;				\
-	spin_lock(__ptl);				\
-	__pte;						\
-})
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
+static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
+{
+	return __pte_offset_map(pmd, addr, NULL);
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, spinlock_t **ptlp);
+static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, spinlock_t **ptlp)
+{
+	pte_t *pte;
+
+	__cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp));
+	return pte;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94235ff2706e..3fabbb018557 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -94,14 +94,22 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 #define pte_offset_kernel pte_offset_kernel
 #endif
 
-#if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address)				\
-	((pte_t *)kmap_local_page(pmd_page(*(dir))) +		\
-	 pte_index((address)))
-#define pte_unmap(pte) kunmap_local((pte))
+#ifdef CONFIG_HIGHPTE
+#define __pte_map(pmd, address) \
+	((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
+#define pte_unmap(pte)	do {	\
+	kunmap_local((pte));	\
+	/* rcu_read_unlock() to be added later */	\
+} while (0)
 #else
-#define pte_offset_map(dir, address)	pte_offset_kernel((dir), (address))
-#define pte_unmap(pte) ((void)(pte))	/* NOP */
+static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
+{
+	return pte_offset_kernel(pmd, address);
+}
+static inline void pte_unmap(pte_t *pte)
+{
+	/* rcu_read_unlock() to be added later */
+}
 #endif
 
 /* Find an entry in the second-level page table.. */
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d2fc52bffafc..c7ab18a5fb77 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -10,6 +10,8 @@
 #include <linux/pagemap.h>
 #include <linux/hugetlb.h>
 #include <linux/pgtable.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 #include <linux/mm_inline.h>
 #include <asm/tlb.h>
 
@@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
+{
+	pmd_t pmdval;
+
+	/* rcu_read_lock() to be added later */
+	pmdval = pmdp_get_lockless(pmd);
+	if (pmdvalp)
+		*pmdvalp = pmdval;
+	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
+		goto nomap;
+	if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+		goto nomap;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		goto nomap;
+	}
+	return __pte_map(&pmdval, addr);
+nomap:
+	/* rcu_read_unlock() to be added later */
+	return NULL;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+			     unsigned long addr, spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	return pte;
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+			     unsigned long addr, spinlock_t **ptlp)
+{
+	spinlock_t *ptl;
+	pmd_t pmdval;
+	pte_t *pte;
+again:
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (unlikely(!pte))
+		return pte;
+	ptl = pte_lockptr(mm, &pmdval);
+	spin_lock(ptl);
+	if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+		*ptlp = ptl;
+		return pte;
+	}
+	pte_unmap_unlock(pte, ptl);
+	goto again;
+}
-- 
2.35.3


Thread overview: 82+ messages
2023-06-09  0:54 [PATCH v2 00/32] mm: allow pte_offset_map[_lock]() to fail Hugh Dickins
2023-06-09  1:06 ` [PATCH v2 01/32] mm: use pmdp_get_lockless() without surplus barrier() Hugh Dickins
2023-06-09  1:08 ` [PATCH v2 02/32] mm/migrate: remove cruft from migration_entry_wait()s Hugh Dickins
2023-06-09  1:09 ` [PATCH v2 03/32] mm/pgtable: kmap_local_page() instead of kmap_atomic() Hugh Dickins
2023-06-09  1:10 ` Hugh Dickins [this message]
2023-07-11  1:23   ` [PATCH v2 04/32] mm/pgtable: allow pte_offset_map[_lock]() to fail Zi Yan
2023-07-28 13:53   ` Yongqin Liu
2023-07-28 14:05     ` Matthew Wilcox
2023-07-28 16:58       ` Hugh Dickins
2023-08-05 16:06         ` Yongqin Liu
2023-08-05 17:07           ` Matthew Wilcox
2023-08-08  0:29             ` John Hubbard
2023-06-09  1:11 ` [PATCH v2 05/32] mm/filemap: allow pte_offset_map_lock() " Hugh Dickins
2023-07-11  1:34   ` Zi Yan
2023-07-11  5:21     ` Hugh Dickins
2023-06-09  1:12 ` [PATCH v2 06/32] mm/page_vma_mapped: delete bogosity in page_vma_mapped_walk() Hugh Dickins
2023-07-11  1:47   ` Zi Yan
2023-06-09  1:14 ` [PATCH v2 07/32] mm/page_vma_mapped: reformat map_pte() with less indentation Hugh Dickins
2023-07-11  1:56   ` Zi Yan
2023-06-09  1:15 ` [PATCH v2 08/32] mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr() Hugh Dickins
2023-06-09  1:17 ` [PATCH v2 09/32] mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails Hugh Dickins
2023-06-09  1:18 ` [PATCH v2 10/32] mm/pagewalk: walk_pte_range() allow for pte_offset_map() Hugh Dickins
2023-06-09  1:20 ` [PATCH v2 11/32] mm/vmwgfx: simplify pmd & pud mapping dirty helpers Hugh Dickins
2023-06-09  1:21 ` [PATCH v2 12/32] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel() Hugh Dickins
2023-07-10 14:42   ` Mark Brown
2023-07-10 14:42     ` Mark Brown
2023-07-10 17:18     ` Lorenzo Stoakes
2023-07-10 17:18       ` Lorenzo Stoakes
2023-07-10 17:33       ` Mark Brown
2023-07-10 17:33         ` Mark Brown
2023-07-11  4:34         ` Hugh Dickins
2023-07-11  4:34           ` Hugh Dickins
2023-07-11 15:34           ` Mark Brown
2023-07-11 15:34             ` Mark Brown
2023-07-11 16:13             ` Hugh Dickins
2023-07-11 16:13               ` Hugh Dickins
2023-07-11 16:34               ` Mark Brown
2023-07-11 16:34                 ` Mark Brown
2023-07-11 17:57               ` Mark Brown
2023-07-11 17:57                 ` Mark Brown
2023-07-13 11:19                 ` Linux regression tracking #update (Thorsten Leemhuis)
2023-07-13 11:19                   ` Linux regression tracking #update (Thorsten Leemhuis)
2023-07-20 10:32                 ` Will Deacon
2023-07-20 10:32                   ` Will Deacon
2023-07-20 12:06                   ` Mark Brown
2023-07-20 12:06                     ` Mark Brown
2023-08-08  5:52                     ` Linux regression tracking (Thorsten Leemhuis)
2023-08-08  5:52                       ` Linux regression tracking (Thorsten Leemhuis)
2023-08-08 11:09                       ` Mark Brown
2023-08-08 11:09                         ` Mark Brown
2023-08-11  8:00                         ` Linux regression tracking #update (Thorsten Leemhuis)
2023-08-11  8:00                           ` Linux regression tracking #update (Thorsten Leemhuis)
2023-07-11 14:48     ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-07-11 14:48       ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-06-09  1:23 ` [PATCH v2 13/32] mm/hmm: retry if pte_offset_map() fails Hugh Dickins
2023-06-09  1:24 ` [PATCH v2 14/32] mm/userfaultfd: " Hugh Dickins
2023-06-09  1:26 ` [PATCH v2 15/32] mm/userfaultfd: allow pte_offset_map_lock() to fail Hugh Dickins
2023-06-09  1:27 ` [PATCH v2 16/32] mm/debug_vm_pgtable,page_table_check: warn pte map fails Hugh Dickins
2023-06-09  1:29 ` [PATCH v2 17/32] mm/various: give up if pte_offset_map[_lock]() fails Hugh Dickins
2023-06-09  1:30 ` [PATCH v2 18/32] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge() Hugh Dickins
2023-06-09  1:32 ` [PATCH v2 19/32] mm/mremap: retry if either pte_offset_map_*lock() fails Hugh Dickins
2023-06-09  1:34 ` [PATCH v2 20/32] mm/madvise: clean up pte_offset_map_lock() scans Hugh Dickins
2023-06-09  1:35 ` [PATCH v2 21/32] mm/madvise: clean up force_shm_swapin_readahead() Hugh Dickins
2023-06-09  1:36 ` [PATCH v2 22/32] mm/swapoff: allow pte_offset_map[_lock]() to fail Hugh Dickins
2023-06-09  1:37 ` [PATCH v2 23/32] mm/mglru: allow pte_offset_map_nolock() " Hugh Dickins
2023-06-09  1:38 ` [PATCH v2 24/32] mm/migrate_device: allow pte_offset_map_lock() " Hugh Dickins
2023-06-09  1:39 ` [PATCH v2 25/32] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable() Hugh Dickins
2023-06-09 18:24   ` Yang Shi
2023-06-09  1:41 ` [PATCH v2 26/32] mm/huge_memory: split huge pmd under one pte_offset_map() Hugh Dickins
2023-06-09  1:42 ` [PATCH v2 27/32] mm/khugepaged: allow pte_offset_map[_lock]() to fail Hugh Dickins
2023-06-09  1:43 ` [PATCH v2 28/32] mm/memory: " Hugh Dickins
2023-06-09 20:06   ` Andrew Morton
2023-06-09 20:11     ` Hugh Dickins
2023-06-12  9:10       ` Ryan Roberts
2023-06-15 23:10   ` [PATCH v2 28/32 fix] mm/memory: allow pte_offset_map[_lock]() to fail: fix Hugh Dickins
2023-06-09  1:45 ` [PATCH v2 29/32] mm/memory: handle_pte_fault() use pte_offset_map_nolock() Hugh Dickins
2023-06-09  1:50 ` [PATCH v2 30/32] mm/pgtable: delete pmd_trans_unstable() and friends Hugh Dickins
2023-06-09  1:52 ` [PATCH v2 31/32] mm/swap: swap_vma_readahead() do the pte_offset_map() Hugh Dickins
2023-06-12  8:03   ` Huang, Ying
2023-06-14  3:58     ` Hugh Dickins
2023-06-09  1:53 ` [PATCH v2 32/32] perf/core: Allow pte_offset_map() to fail Hugh Dickins
2023-06-20  6:50 ` [PATCH] mm/swapfile: delete outdated pte_offset_map() comment Hugh Dickins
