* [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes
@ 2017-09-27 15:49 Will Deacon
  2017-09-27 15:49 ` [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables Will Deacon
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Will Deacon @ 2017-09-27 15:49 UTC (permalink / raw)
  To: peterz, paulmck, kirill.shutemov
  Cc: linux-kernel, ynorov, rruigrok, linux-arch, akpm,
	catalin.marinas, Will Deacon

Hi,

We recently had a crash report[1] on arm64 that involved a bad dereference
in the page_vma_mapped code during ext4 writeback with THP active. I can
reproduce this on -rc2:

[  254.032812] PC is at check_pte+0x20/0x170
[  254.032948] LR is at page_vma_mapped_walk+0x2e0/0x540
[...]
[  254.036114] Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
[  254.036361] Call trace:
[  254.038977] [<ffff000008233328>] check_pte+0x20/0x170
[  254.039137] [<ffff000008233758>] page_vma_mapped_walk+0x2e0/0x540
[  254.039332] [<ffff000008234adc>] page_mkclean_one+0xac/0x278
[  254.039489] [<ffff000008234d98>] rmap_walk_file+0xf0/0x238
[  254.039642] [<ffff000008236e74>] rmap_walk+0x64/0xa0
[  254.039784] [<ffff0000082370c8>] page_mkclean+0x90/0xa8
[  254.040029] [<ffff0000081f3c64>] clear_page_dirty_for_io+0x84/0x2a8
[  254.040311] [<ffff00000832f984>] mpage_submit_page+0x34/0x98
[  254.040518] [<ffff00000832fb4c>] mpage_process_page_bufs+0x164/0x170
[  254.040743] [<ffff00000832fc8c>] mpage_prepare_extent_to_map+0x134/0x2b8
[  254.040969] [<ffff00000833530c>] ext4_writepages+0x484/0xe30
[  254.041175] [<ffff0000081f6ab4>] do_writepages+0x44/0xe8
[  254.041372] [<ffff0000081e5bd4>] __filemap_fdatawrite_range+0xbc/0x110
[  254.041568] [<ffff0000081e5e68>] file_write_and_wait_range+0x48/0xd8
[  254.041739] [<ffff000008324310>] ext4_sync_file+0x80/0x4b8
[  254.041907] [<ffff0000082bd434>] vfs_fsync_range+0x64/0xc0
[  254.042106] [<ffff0000082332b4>] SyS_msync+0x194/0x1e8

After digging into the issue, I found that we appear to be racing with
a concurrent pmd update in page_vma_mapped_walk, presumably due to a THP
splitting operation. Looking at the code there:

	pvmw->pmd = pmd_offset(pud, pvmw->address);
	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
		[...]
	} else {
		if (!check_pmd(pvmw))
			return false;
	}
	if (!map_pte(pvmw))
		goto next_pte;

what happens in the crashing scenario is that we see all zeroes for the
PMD in pmd_trans_huge(*pvmw->pmd), and so go to the 'else' case (migration
isn't enabled, so the test is removed at compile-time). check_pmd then does:

	pmde = READ_ONCE(*pvmw->pmd);
	return pmd_present(pmde) && !pmd_trans_huge(pmde);

and reads a valid table entry for the PMD because the splitting has completed
(i.e. the first dereference reads from the pmdp_invalidate in the splitting
code, whereas the second dereference reads from the following pmd_populate).
check_pmd therefore returns true, indicating that we should descend to the
PTE level via map_pte. map_pte does:

	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);

which on arm64 (and this appears to be the same on x86) ends up doing:

	(pmd_page_paddr((*(pvmw->pmd))) + pte_index(pvmw->address) * sizeof(pte_t))

as part of its calculation. However, this is horribly broken because GCC
inlines everything and reuses the register it loaded for the initial
pmd_trans_huge check (where we loaded the value of zero) here, so we end up
calculating a junk pointer and crashing when we dereference it. The
disassembly is at the end of this mail[2] for those who are curious.

The moral of the story is that read-after-read (same address) ordering *only*
applies if READ_ONCE is used consistently. This means we need to fix page
table dereferences in the core code as well as the arch code to avoid this
problem. The two RFC patches in this series fix arm64 (which is a bigger fix
than necessary, since I clean things up too) and page_vma_mapped_walk.
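
To illustrate the point outside the kernel, here is a hypothetical,
stand-alone sketch of the broken and fixed patterns (not code from the
tree; READ_ONCE is approximated with a volatile cast and the flag/mask
values are made up):

        #include <stddef.h>

        #define READ_ONCE(x)    (*(const volatile typeof(x) *)&(x))

        #define FLAG_HUGE       0x2UL                   /* made-up "huge" bit */
        #define FLAG_VALID      0x1UL                   /* made-up "valid" bit */
        #define TABLE_MASK      0xfffffffff000UL        /* made-up table address mask */

        unsigned long *broken_walk(unsigned long *pmdp)
        {
                /* Plain load #1: may observe a transient value (e.g. zero). */
                if (*pmdp & FLAG_HUGE)
                        return NULL;                    /* huge path, not shown */

                /* READ_ONCE forces a fresh load, which may now see a valid table. */
                if (!(READ_ONCE(*pmdp) & FLAG_VALID))
                        return NULL;

                /*
                 * Plain load #2: the compiler may legitimately reuse the value
                 * from load #1 (i.e. zero) here, yielding a junk table pointer.
                 */
                return (unsigned long *)(*pmdp & TABLE_MASK);
        }

        unsigned long *fixed_walk(unsigned long *pmdp)
        {
                unsigned long pmd = READ_ONCE(*pmdp);   /* one snapshot... */

                if (pmd & FLAG_HUGE)
                        return NULL;
                if (!(pmd & FLAG_VALID))
                        return NULL;

                return (unsigned long *)(pmd & TABLE_MASK);     /* ...used throughout */
        }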

Comments welcome.

Will

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/532786.html
[2]

// page_vma_mapped_walk
// pvmw->pmd = pmd_offset(pud, pvmw->address);
ldr     x0, [x19, #24]		// pvmw->pmd

// if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
ldr     x1, [x0]		// *pvmw->pmd
cbz     x1, ffff0000082336a0 <page_vma_mapped_walk+0x228>
tbz     w1, #1, ffff000008233788 <page_vma_mapped_walk+0x310>	// pmd_trans_huge?

// else if (!check_pmd(pvmw))
ldr     x0, [x0]		// READ_ONCE in check_pmd
tst     x0, x24			// pmd_present?
b.eq    ffff000008233538 <page_vma_mapped_walk+0xc0>  // b.none
tbz     w0, #1, ffff000008233538 <page_vma_mapped_walk+0xc0>	// pmd_trans_huge?

// if (!map_pte(pvmw))
ldr     x0, [x19, #16]		// pvmw->address

// pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
and     x1, x1, #0xfffffffff000	// Reusing the old value of *pvmw->pmd!!!
[...]

--->8

Will Deacon (2):
  arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  mm: page_vma_mapped: Ensure pmd is loaded with READ_ONCE outside of
    lock

 arch/arm64/include/asm/hugetlb.h     |   2 +-
 arch/arm64/include/asm/kvm_mmu.h     |  18 +--
 arch/arm64/include/asm/mmu_context.h |   4 +-
 arch/arm64/include/asm/pgalloc.h     |  42 +++---
 arch/arm64/include/asm/pgtable.h     |  29 ++--
 arch/arm64/kernel/hibernate.c        | 148 +++++++++---------
 arch/arm64/mm/dump.c                 |  54 ++++---
 arch/arm64/mm/fault.c                |  44 +++---
 arch/arm64/mm/hugetlbpage.c          |  94 ++++++------
 arch/arm64/mm/kasan_init.c           |  62 ++++----
 arch/arm64/mm/mmu.c                  | 281 ++++++++++++++++++-----------------
 arch/arm64/mm/pageattr.c             |  30 ++--
 mm/page_vma_mapped.c                 |  25 ++--
 13 files changed, 427 insertions(+), 406 deletions(-)

-- 
2.1.4


* [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-27 15:49 [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Will Deacon
@ 2017-09-27 15:49 ` Will Deacon
  2017-09-28  8:38   ` Peter Zijlstra
  2017-09-28 19:18   ` Timur Tabi
  2017-09-27 15:49 ` [RFC PATCH 2/2] mm: page_vma_mapped: Ensure pmd is loaded with READ_ONCE outside of lock Will Deacon
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 27+ messages in thread
From: Will Deacon @ 2017-09-27 15:49 UTC (permalink / raw)
  To: peterz, paulmck, kirill.shutemov
  Cc: linux-kernel, ynorov, rruigrok, linux-arch, akpm,
	catalin.marinas, Will Deacon

In many cases, page tables can be accessed concurrently by either another
CPU (due to things like fast gup) or by the hardware page table walker
itself, which may set access/dirty bits. In such cases, it is important
to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
entries cannot be torn, merged or subject to apparent loss of coherence.

Whilst there are some scenarios where this cannot happen (e.g. pinned
kernel mappings for the linear region), the overhead of using
READ_ONCE/WRITE_ONCE everywhere is minimal and makes the code an awful lot
easier to reason about. This patch uses these macros consistently in the
arch code, as well as explicitly distinguishing pointers to page table
entries from the entries themselves by adopting a 'p' suffix for the former
(as is sometimes done elsewhere in the kernel source).
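
As an illustration of the idiom (a sketch of the convention rather than any
particular hunk below; example_walk is a made-up name):

        static void example_walk(pud_t *pudp, unsigned long addr)
        {
                pmd_t *pmdp = pmd_offset(pudp, addr);   /* 'p' suffix: pointer to the entry */
                pmd_t pmd = READ_ONCE(*pmdp);           /* entry value, loaded exactly once */

                if (pmd_none(pmd) || pmd_bad(pmd))
                        return;

                /* ... operate on the snapshot 'pmd'; write back with set_pmd(pmdp, ...) ... */
        }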

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/hugetlb.h     |   2 +-
 arch/arm64/include/asm/kvm_mmu.h     |  18 +--
 arch/arm64/include/asm/mmu_context.h |   4 +-
 arch/arm64/include/asm/pgalloc.h     |  42 +++---
 arch/arm64/include/asm/pgtable.h     |  29 ++--
 arch/arm64/kernel/hibernate.c        | 148 +++++++++---------
 arch/arm64/mm/dump.c                 |  54 ++++---
 arch/arm64/mm/fault.c                |  44 +++---
 arch/arm64/mm/hugetlbpage.c          |  94 ++++++------
 arch/arm64/mm/kasan_init.c           |  62 ++++----
 arch/arm64/mm/mmu.c                  | 281 ++++++++++++++++++-----------------
 arch/arm64/mm/pageattr.c             |  30 ++--
 12 files changed, 417 insertions(+), 391 deletions(-)

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 1dca41bea16a..e73f68569624 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -22,7 +22,7 @@
 
 static inline pte_t huge_ptep_get(pte_t *ptep)
 {
-	return *ptep;
+	return READ_ONCE(*ptep);
 }
 
 
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index 672c8684d5c2..670d20fc80d9 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -173,32 +173,32 @@ static inline pmd_t kvm_s2pmd_mkwrite(pmd_t pmd)
 	return pmd;
 }
 
-static inline void kvm_set_s2pte_readonly(pte_t *pte)
+static inline void kvm_set_s2pte_readonly(pte_t *ptep)
 {
 	pteval_t old_pteval, pteval;
 
-	pteval = READ_ONCE(pte_val(*pte));
+	pteval = READ_ONCE(pte_val(*ptep));
 	do {
 		old_pteval = pteval;
 		pteval &= ~PTE_S2_RDWR;
 		pteval |= PTE_S2_RDONLY;
-		pteval = cmpxchg_relaxed(&pte_val(*pte), old_pteval, pteval);
+		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
 	} while (pteval != old_pteval);
 }
 
-static inline bool kvm_s2pte_readonly(pte_t *pte)
+static inline bool kvm_s2pte_readonly(pte_t *ptep)
 {
-	return (pte_val(*pte) & PTE_S2_RDWR) == PTE_S2_RDONLY;
+	return (READ_ONCE(pte_val(*ptep)) & PTE_S2_RDWR) == PTE_S2_RDONLY;
 }
 
-static inline void kvm_set_s2pmd_readonly(pmd_t *pmd)
+static inline void kvm_set_s2pmd_readonly(pmd_t *pmdp)
 {
-	kvm_set_s2pte_readonly((pte_t *)pmd);
+	kvm_set_s2pte_readonly((pte_t *)pmdp);
 }
 
-static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
+static inline bool kvm_s2pmd_readonly(pmd_t *pmdp)
 {
-	return kvm_s2pte_readonly((pte_t *)pmd);
+	return kvm_s2pte_readonly((pte_t *)pmdp);
 }
 
 static inline bool kvm_page_empty(void *ptr)
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 3257895a9b5e..b90c6e9582b4 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -127,13 +127,13 @@ static inline void cpu_install_idmap(void)
  * Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
  * avoiding the possibility of conflicting TLB entries being allocated.
  */
-static inline void cpu_replace_ttbr1(pgd_t *pgd)
+static inline void cpu_replace_ttbr1(pgd_t *pgdp)
 {
 	typedef void (ttbr_replace_func)(phys_addr_t);
 	extern ttbr_replace_func idmap_cpu_replace_ttbr1;
 	ttbr_replace_func *replace_phys;
 
-	phys_addr_t pgd_phys = virt_to_phys(pgd);
+	phys_addr_t pgd_phys = virt_to_phys(pgdp);
 
 	replace_phys = (void *)__pa_symbol(idmap_cpu_replace_ttbr1);
 
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index d25f4f137c2a..91cad40b1ddf 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -36,23 +36,23 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 	return (pmd_t *)__get_free_page(PGALLOC_GFP);
 }
 
-static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
+static inline void pmd_free(struct mm_struct *mm, pmd_t *pmdp)
 {
-	BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-	free_page((unsigned long)pmd);
+	BUG_ON((unsigned long)pmdp & (PAGE_SIZE-1));
+	free_page((unsigned long)pmdp);
 }
 
-static inline void __pud_populate(pud_t *pud, phys_addr_t pmd, pudval_t prot)
+static inline void __pud_populate(pud_t *pudp, phys_addr_t pmdp, pudval_t prot)
 {
-	set_pud(pud, __pud(pmd | prot));
+	set_pud(pudp, __pud(pmdp | prot));
 }
 
-static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp)
 {
-	__pud_populate(pud, __pa(pmd), PMD_TYPE_TABLE);
+	__pud_populate(pudp, __pa(pmdp), PMD_TYPE_TABLE);
 }
 #else
-static inline void __pud_populate(pud_t *pud, phys_addr_t pmd, pudval_t prot)
+static inline void __pud_populate(pud_t *pudp, phys_addr_t pmd, pudval_t prot)
 {
 	BUILD_BUG();
 }
@@ -65,20 +65,20 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 	return (pud_t *)__get_free_page(PGALLOC_GFP);
 }
 
-static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+static inline void pud_free(struct mm_struct *mm, pud_t *pudp)
 {
-	BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
-	free_page((unsigned long)pud);
+	BUG_ON((unsigned long)pudp & (PAGE_SIZE-1));
+	free_page((unsigned long)pudp);
 }
 
-static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t pud, pgdval_t prot)
+static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t pudp, pgdval_t prot)
 {
-	set_pgd(pgdp, __pgd(pud | prot));
+	set_pgd(pgdp, __pgd(pudp | prot));
 }
 
-static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
+static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgdp, pud_t *pudp)
 {
-	__pgd_populate(pgd, __pa(pud), PUD_TYPE_TABLE);
+	__pgd_populate(pgdp, __pa(pudp), PUD_TYPE_TABLE);
 }
 #else
 static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t pud, pgdval_t prot)
@@ -88,7 +88,7 @@ static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t pud, pgdval_t prot)
 #endif	/* CONFIG_PGTABLE_LEVELS > 3 */
 
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
-extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
+extern void pgd_free(struct mm_struct *mm, pgd_t *pgdp);
 
 static inline pte_t *
 pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr)
@@ -114,10 +114,10 @@ pte_alloc_one(struct mm_struct *mm, unsigned long addr)
 /*
  * Free a PTE table.
  */
-static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *ptep)
 {
-	if (pte)
-		free_page((unsigned long)pte);
+	if (ptep)
+		free_page((unsigned long)ptep);
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t pte)
@@ -126,10 +126,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t pte)
 	__free_page(pte);
 }
 
-static inline void __pmd_populate(pmd_t *pmdp, phys_addr_t pte,
+static inline void __pmd_populate(pmd_t *pmdp, phys_addr_t ptep,
 				  pmdval_t prot)
 {
-	set_pmd(pmdp, __pmd(pte | prot));
+	set_pmd(pmdp, __pmd(ptep | prot));
 }
 
 /*
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bc4e92337d16..c36571bbfe92 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -181,7 +181,7 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
 
 static inline void set_pte(pte_t *ptep, pte_t pte)
 {
-	*ptep = pte;
+	WRITE_ONCE(*ptep, pte);
 
 	/*
 	 * Only if the new pte is valid and kernel, otherwise TLB maintenance
@@ -216,6 +216,8 @@ extern void __sync_icache_dcache(pte_t pteval, unsigned long addr);
 static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep, pte_t pte)
 {
+	pte_t old_pte;
+
 	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
 		__sync_icache_dcache(pte, addr);
 
@@ -224,13 +226,14 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 	 * hardware updates of the pte (ptep_set_access_flags safely changes
 	 * valid ptes without going through an invalid entry).
 	 */
-	if (pte_valid(*ptep) && pte_valid(pte)) {
+	old_pte = READ_ONCE(*ptep);
+	if (pte_valid(old_pte) && pte_valid(pte)) {
 		VM_WARN_ONCE(!pte_young(pte),
 			     "%s: racy access flag clearing: 0x%016llx -> 0x%016llx",
-			     __func__, pte_val(*ptep), pte_val(pte));
-		VM_WARN_ONCE(pte_write(*ptep) && !pte_dirty(pte),
+			     __func__, pte_val(old_pte), pte_val(pte));
+		VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
 			     "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
-			     __func__, pte_val(*ptep), pte_val(pte));
+			     __func__, pte_val(old_pte), pte_val(pte));
 	}
 
 	set_pte(ptep, pte);
@@ -383,7 +386,7 @@ extern pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
 
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
-	*pmdp = pmd;
+	WRITE_ONCE(*pmdp, pmd);
 	dsb(ishst);
 	isb();
 }
@@ -401,7 +404,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
 /* Find an entry in the third-level page table. */
 #define pte_index(addr)		(((addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1))
 
-#define pte_offset_phys(dir,addr)	(pmd_page_paddr(*(dir)) + pte_index(addr) * sizeof(pte_t))
+#define pte_offset_phys(dir,addr)	(pmd_page_paddr(READ_ONCE(*(dir))) + pte_index(addr) * sizeof(pte_t))
 #define pte_offset_kernel(dir,addr)	((pte_t *)__va(pte_offset_phys((dir), (addr))))
 
 #define pte_offset_map(dir,addr)	pte_offset_kernel((dir), (addr))
@@ -434,7 +437,7 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd)
 
 static inline void set_pud(pud_t *pudp, pud_t pud)
 {
-	*pudp = pud;
+	WRITE_ONCE(*pudp, pud);
 	dsb(ishst);
 	isb();
 }
@@ -452,7 +455,7 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
 /* Find an entry in the second-level page table. */
 #define pmd_index(addr)		(((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
 
-#define pmd_offset_phys(dir, addr)	(pud_page_paddr(*(dir)) + pmd_index(addr) * sizeof(pmd_t))
+#define pmd_offset_phys(dir, addr)	(pud_page_paddr(READ_ONCE(*(dir))) + pmd_index(addr) * sizeof(pmd_t))
 #define pmd_offset(dir, addr)		((pmd_t *)__va(pmd_offset_phys((dir), (addr))))
 
 #define pmd_set_fixmap(addr)		((pmd_t *)set_fixmap_offset(FIX_PMD, addr))
@@ -487,7 +490,7 @@ static inline phys_addr_t pud_page_paddr(pud_t pud)
 
 static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
-	*pgdp = pgd;
+	WRITE_ONCE(*pgdp, pgd);
 	dsb(ishst);
 }
 
@@ -504,7 +507,7 @@ static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
 /* Find an entry in the frst-level page table. */
 #define pud_index(addr)		(((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
 
-#define pud_offset_phys(dir, addr)	(pgd_page_paddr(*(dir)) + pud_index(addr) * sizeof(pud_t))
+#define pud_offset_phys(dir, addr)	(pgd_page_paddr(READ_ONCE(*(dir))) + pud_index(addr) * sizeof(pud_t))
 #define pud_offset(dir, addr)		((pud_t *)__va(pud_offset_phys((dir), (addr))))
 
 #define pud_set_fixmap(addr)		((pud_t *)set_fixmap_offset(FIX_PUD, addr))
@@ -645,9 +648,9 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 	 * dirty state: clearing of PTE_RDONLY when PTE_WRITE (a.k.a. PTE_DBM)
 	 * is set.
 	 */
-	VM_WARN_ONCE(pte_write(*ptep) && !pte_dirty(*ptep),
-		     "%s: potential race with hardware DBM", __func__);
 	pte = READ_ONCE(*ptep);
+	VM_WARN_ONCE(pte_write(pte) && !pte_dirty(pte),
+		     "%s: potential race with hardware DBM", __func__);
 	do {
 		old_pte = pte;
 		pte = pte_wrprotect(pte);
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 095d3c170f5d..04c56d1e0b70 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -201,10 +201,10 @@ static int create_safe_exec_page(void *src_start, size_t length,
 				 gfp_t mask)
 {
 	int rc = 0;
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep;
 	unsigned long dst = (unsigned long)allocator(mask);
 
 	if (!dst) {
@@ -215,38 +215,38 @@ static int create_safe_exec_page(void *src_start, size_t length,
 	memcpy((void *)dst, src_start, length);
 	flush_icache_range(dst, dst + length);
 
-	pgd = pgd_offset_raw(allocator(mask), dst_addr);
-	if (pgd_none(*pgd)) {
-		pud = allocator(mask);
-		if (!pud) {
+	pgdp = pgd_offset_raw(allocator(mask), dst_addr);
+	if (pgd_none(READ_ONCE(*pgdp))) {
+		pudp = allocator(mask);
+		if (!pudp) {
 			rc = -ENOMEM;
 			goto out;
 		}
-		pgd_populate(&init_mm, pgd, pud);
+		pgd_populate(&init_mm, pgdp, pudp);
 	}
 
-	pud = pud_offset(pgd, dst_addr);
-	if (pud_none(*pud)) {
-		pmd = allocator(mask);
-		if (!pmd) {
+	pudp = pud_offset(pgdp, dst_addr);
+	if (pud_none(READ_ONCE(*pudp))) {
+		pmdp = allocator(mask);
+		if (!pmdp) {
 			rc = -ENOMEM;
 			goto out;
 		}
-		pud_populate(&init_mm, pud, pmd);
+		pud_populate(&init_mm, pudp, pmdp);
 	}
 
-	pmd = pmd_offset(pud, dst_addr);
-	if (pmd_none(*pmd)) {
-		pte = allocator(mask);
-		if (!pte) {
+	pmdp = pmd_offset(pudp, dst_addr);
+	if (pmd_none(READ_ONCE(*pmdp))) {
+		ptep = allocator(mask);
+		if (!ptep) {
 			rc = -ENOMEM;
 			goto out;
 		}
-		pmd_populate_kernel(&init_mm, pmd, pte);
+		pmd_populate_kernel(&init_mm, pmdp, ptep);
 	}
 
-	pte = pte_offset_kernel(pmd, dst_addr);
-	set_pte(pte, __pte(virt_to_phys((void *)dst) |
+	ptep = pte_offset_kernel(pmdp, dst_addr);
+	set_pte(ptep, __pte(virt_to_phys((void *)dst) |
 			 pgprot_val(PAGE_KERNEL_EXEC)));
 
 	/*
@@ -263,7 +263,7 @@ static int create_safe_exec_page(void *src_start, size_t length,
 	 */
 	cpu_set_reserved_ttbr0();
 	local_flush_tlb_all();
-	write_sysreg(virt_to_phys(pgd), ttbr0_el1);
+	write_sysreg(virt_to_phys(pgdp), ttbr0_el1);
 	isb();
 
 	*phys_dst_addr = virt_to_phys((void *)dst);
@@ -320,9 +320,9 @@ int swsusp_arch_suspend(void)
 	return ret;
 }
 
-static void _copy_pte(pte_t *dst_pte, pte_t *src_pte, unsigned long addr)
+static void _copy_pte(pte_t *dst_ptep, pte_t *src_ptep, unsigned long addr)
 {
-	pte_t pte = *src_pte;
+	pte_t pte = READ_ONCE(*src_ptep);
 
 	if (pte_valid(pte)) {
 		/*
@@ -330,7 +330,7 @@ static void _copy_pte(pte_t *dst_pte, pte_t *src_pte, unsigned long addr)
 		 * read only (code, rodata). Clear the RDONLY bit from
 		 * the temporary mappings we use during restore.
 		 */
-		set_pte(dst_pte, pte_mkwrite(pte));
+		set_pte(dst_ptep, pte_mkwrite(pte));
 	} else if (debug_pagealloc_enabled() && !pte_none(pte)) {
 		/*
 		 * debug_pagealloc will removed the PTE_VALID bit if
@@ -343,112 +343,116 @@ static void _copy_pte(pte_t *dst_pte, pte_t *src_pte, unsigned long addr)
 		 */
 		BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-		set_pte(dst_pte, pte_mkpresent(pte_mkwrite(pte)));
+		set_pte(dst_ptep, pte_mkpresent(pte_mkwrite(pte)));
 	}
 }
 
-static int copy_pte(pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long start,
+static int copy_pte(pmd_t *dst_pmdp, pmd_t *src_pmdp, unsigned long start,
 		    unsigned long end)
 {
-	pte_t *src_pte;
-	pte_t *dst_pte;
+	pte_t *src_ptep;
+	pte_t *dst_ptep;
 	unsigned long addr = start;
 
-	dst_pte = (pte_t *)get_safe_page(GFP_ATOMIC);
-	if (!dst_pte)
+	dst_ptep = (pte_t *)get_safe_page(GFP_ATOMIC);
+	if (!dst_ptep)
 		return -ENOMEM;
-	pmd_populate_kernel(&init_mm, dst_pmd, dst_pte);
-	dst_pte = pte_offset_kernel(dst_pmd, start);
+	pmd_populate_kernel(&init_mm, dst_pmdp, dst_ptep);
+	dst_ptep = pte_offset_kernel(dst_pmdp, start);
 
-	src_pte = pte_offset_kernel(src_pmd, start);
+	src_ptep = pte_offset_kernel(src_pmdp, start);
 	do {
-		_copy_pte(dst_pte, src_pte, addr);
-	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+		_copy_pte(dst_ptep, src_ptep, addr);
+	} while (dst_ptep++, src_ptep++, addr += PAGE_SIZE, addr != end);
 
 	return 0;
 }
 
-static int copy_pmd(pud_t *dst_pud, pud_t *src_pud, unsigned long start,
+static int copy_pmd(pud_t *dst_pudp, pud_t *src_pudp, unsigned long start,
 		    unsigned long end)
 {
-	pmd_t *src_pmd;
-	pmd_t *dst_pmd;
+	pmd_t *src_pmdp;
+	pmd_t *dst_pmdp;
 	unsigned long next;
 	unsigned long addr = start;
 
-	if (pud_none(*dst_pud)) {
-		dst_pmd = (pmd_t *)get_safe_page(GFP_ATOMIC);
-		if (!dst_pmd)
+	if (pud_none(READ_ONCE(*dst_pudp))) {
+		dst_pmdp = (pmd_t *)get_safe_page(GFP_ATOMIC);
+		if (!dst_pmdp)
 			return -ENOMEM;
-		pud_populate(&init_mm, dst_pud, dst_pmd);
+		pud_populate(&init_mm, dst_pudp, dst_pmdp);
 	}
-	dst_pmd = pmd_offset(dst_pud, start);
+	dst_pmdp = pmd_offset(dst_pudp, start);
 
-	src_pmd = pmd_offset(src_pud, start);
+	src_pmdp = pmd_offset(src_pudp, start);
 	do {
+		pmd_t pmd = READ_ONCE(*src_pmdp);
+
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(*src_pmd))
+		if (pmd_none(pmd))
 			continue;
-		if (pmd_table(*src_pmd)) {
-			if (copy_pte(dst_pmd, src_pmd, addr, next))
+		if (pmd_table(pmd)) {
+			if (copy_pte(dst_pmdp, src_pmdp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pmd(dst_pmd,
-				__pmd(pmd_val(*src_pmd) & ~PMD_SECT_RDONLY));
+			set_pmd(dst_pmdp,
+				__pmd(pmd_val(pmd) & ~PMD_SECT_RDONLY));
 		}
-	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
+	} while (dst_pmdp++, src_pmdp++, addr = next, addr != end);
 
 	return 0;
 }
 
-static int copy_pud(pgd_t *dst_pgd, pgd_t *src_pgd, unsigned long start,
+static int copy_pud(pgd_t *dst_pgdp, pgd_t *src_pgdp, unsigned long start,
 		    unsigned long end)
 {
-	pud_t *dst_pud;
-	pud_t *src_pud;
+	pud_t *dst_pudp;
+	pud_t *src_pudp;
 	unsigned long next;
 	unsigned long addr = start;
 
-	if (pgd_none(*dst_pgd)) {
-		dst_pud = (pud_t *)get_safe_page(GFP_ATOMIC);
-		if (!dst_pud)
+	if (pgd_none(READ_ONCE(*dst_pgdp))) {
+		dst_pudp = (pud_t *)get_safe_page(GFP_ATOMIC);
+		if (!dst_pudp)
 			return -ENOMEM;
-		pgd_populate(&init_mm, dst_pgd, dst_pud);
+		pgd_populate(&init_mm, dst_pgdp, dst_pudp);
 	}
-	dst_pud = pud_offset(dst_pgd, start);
+	dst_pudp = pud_offset(dst_pgdp, start);
 
-	src_pud = pud_offset(src_pgd, start);
+	src_pudp = pud_offset(src_pgdp, start);
 	do {
+		pud_t pud = READ_ONCE(*src_pudp);
+
 		next = pud_addr_end(addr, end);
-		if (pud_none(*src_pud))
+		if (pud_none(pud))
 			continue;
-		if (pud_table(*(src_pud))) {
-			if (copy_pmd(dst_pud, src_pud, addr, next))
+		if (pud_table(pud)) {
+			if (copy_pmd(dst_pudp, src_pudp, addr, next))
 				return -ENOMEM;
 		} else {
-			set_pud(dst_pud,
-				__pud(pud_val(*src_pud) & ~PMD_SECT_RDONLY));
+			set_pud(dst_pudp,
+				__pud(pud_val(pud) & ~PMD_SECT_RDONLY));
 		}
-	} while (dst_pud++, src_pud++, addr = next, addr != end);
+	} while (dst_pudp++, src_pudp++, addr = next, addr != end);
 
 	return 0;
 }
 
-static int copy_page_tables(pgd_t *dst_pgd, unsigned long start,
+static int copy_page_tables(pgd_t *dst_pgdp, unsigned long start,
 			    unsigned long end)
 {
 	unsigned long next;
 	unsigned long addr = start;
-	pgd_t *src_pgd = pgd_offset_k(start);
+	pgd_t *src_pgdp = pgd_offset_k(start);
 
-	dst_pgd = pgd_offset_raw(dst_pgd, start);
+	dst_pgdp = pgd_offset_raw(dst_pgdp, start);
 	do {
 		next = pgd_addr_end(addr, end);
-		if (pgd_none(*src_pgd))
+		if (pgd_none(READ_ONCE(*src_pgdp)))
 			continue;
-		if (copy_pud(dst_pgd, src_pgd, addr, next))
+		if (copy_pud(dst_pgdp, src_pgdp, addr, next))
 			return -ENOMEM;
-	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
+	} while (dst_pgdp++, src_pgdp++, addr = next, addr != end);
 
 	return 0;
 }
diff --git a/arch/arm64/mm/dump.c b/arch/arm64/mm/dump.c
index ca74a2aace42..691a8b2f51fd 100644
--- a/arch/arm64/mm/dump.c
+++ b/arch/arm64/mm/dump.c
@@ -286,48 +286,52 @@ static void note_page(struct pg_state *st, unsigned long addr, unsigned level,
 
 }
 
-static void walk_pte(struct pg_state *st, pmd_t *pmd, unsigned long start)
+static void walk_pte(struct pg_state *st, pmd_t *pmdp, unsigned long start)
 {
-	pte_t *pte = pte_offset_kernel(pmd, 0UL);
+	pte_t *ptep = pte_offset_kernel(pmdp, 0UL);
 	unsigned long addr;
 	unsigned i;
 
-	for (i = 0; i < PTRS_PER_PTE; i++, pte++) {
+	for (i = 0; i < PTRS_PER_PTE; i++, ptep++) {
 		addr = start + i * PAGE_SIZE;
-		note_page(st, addr, 4, pte_val(*pte));
+		note_page(st, addr, 4, READ_ONCE(pte_val(*ptep)));
 	}
 }
 
-static void walk_pmd(struct pg_state *st, pud_t *pud, unsigned long start)
+static void walk_pmd(struct pg_state *st, pud_t *pudp, unsigned long start)
 {
-	pmd_t *pmd = pmd_offset(pud, 0UL);
+	pmd_t *pmdp = pmd_offset(pudp, 0UL);
 	unsigned long addr;
 	unsigned i;
 
-	for (i = 0; i < PTRS_PER_PMD; i++, pmd++) {
+	for (i = 0; i < PTRS_PER_PMD; i++, pmdp++) {
+		pmd_t pmd = READ_ONCE(*pmdp);
+
 		addr = start + i * PMD_SIZE;
-		if (pmd_none(*pmd) || pmd_sect(*pmd)) {
-			note_page(st, addr, 3, pmd_val(*pmd));
+		if (pmd_none(pmd) || pmd_sect(pmd)) {
+			note_page(st, addr, 3, pmd_val(pmd));
 		} else {
-			BUG_ON(pmd_bad(*pmd));
-			walk_pte(st, pmd, addr);
+			BUG_ON(pmd_bad(pmd));
+			walk_pte(st, pmdp, addr);
 		}
 	}
 }
 
-static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start)
+static void walk_pud(struct pg_state *st, pgd_t *pgdp, unsigned long start)
 {
-	pud_t *pud = pud_offset(pgd, 0UL);
+	pud_t *pudp = pud_offset(pgdp, 0UL);
 	unsigned long addr;
 	unsigned i;
 
-	for (i = 0; i < PTRS_PER_PUD; i++, pud++) {
+	for (i = 0; i < PTRS_PER_PUD; i++, pudp++) {
+		pud_t pud = READ_ONCE(*pudp);
+
 		addr = start + i * PUD_SIZE;
-		if (pud_none(*pud) || pud_sect(*pud)) {
-			note_page(st, addr, 2, pud_val(*pud));
+		if (pud_none(pud) || pud_sect(pud)) {
+			note_page(st, addr, 2, pud_val(pud));
 		} else {
-			BUG_ON(pud_bad(*pud));
-			walk_pmd(st, pud, addr);
+			BUG_ON(pud_bad(pud));
+			walk_pmd(st, pudp, addr);
 		}
 	}
 }
@@ -335,17 +339,19 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start)
 static void walk_pgd(struct pg_state *st, struct mm_struct *mm,
 		     unsigned long start)
 {
-	pgd_t *pgd = pgd_offset(mm, 0UL);
+	pgd_t *pgdp = pgd_offset(mm, 0UL);
 	unsigned i;
 	unsigned long addr;
 
-	for (i = 0; i < PTRS_PER_PGD; i++, pgd++) {
+	for (i = 0; i < PTRS_PER_PGD; i++, pgdp++) {
+		pgd_t pgd = READ_ONCE(*pgdp);
+
 		addr = start + i * PGDIR_SIZE;
-		if (pgd_none(*pgd)) {
-			note_page(st, addr, 1, pgd_val(*pgd));
+		if (pgd_none(pgd)) {
+			note_page(st, addr, 1, pgd_val(pgd));
 		} else {
-			BUG_ON(pgd_bad(*pgd));
-			walk_pud(st, pgd, addr);
+			BUG_ON(pgd_bad(pgd));
+			walk_pud(st, pgdp, addr);
 		}
 	}
 }
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 89993c4be1be..85e5be801b2f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -132,7 +132,8 @@ static void mem_abort_decode(unsigned int esr)
 void show_pte(unsigned long addr)
 {
 	struct mm_struct *mm;
-	pgd_t *pgd;
+	pgd_t *pgdp;
+	pgd_t pgd;
 
 	if (addr < TASK_SIZE) {
 		/* TTBR0 */
@@ -151,33 +152,37 @@ void show_pte(unsigned long addr)
 		return;
 	}
 
-	pr_alert("%s pgtable: %luk pages, %u-bit VAs, pgd = %p\n",
+	pr_alert("%s pgtable: %luk pages, %u-bit VAs, pgdp = %p\n",
 		 mm == &init_mm ? "swapper" : "user", PAGE_SIZE / SZ_1K,
 		 VA_BITS, mm->pgd);
-	pgd = pgd_offset(mm, addr);
-	pr_alert("[%016lx] *pgd=%016llx", addr, pgd_val(*pgd));
+	pgdp = pgd_offset(mm, addr);
+	pgd = READ_ONCE(*pgdp);
+	pr_alert("[%016lx] pgd=%016llx", addr, pgd_val(pgd));
 
 	do {
-		pud_t *pud;
-		pmd_t *pmd;
-		pte_t *pte;
+		pud_t *pudp, pud;
+		pmd_t *pmdp, pmd;
+		pte_t *ptep, pte;
 
-		if (pgd_none(*pgd) || pgd_bad(*pgd))
+		if (pgd_none(pgd) || pgd_bad(pgd))
 			break;
 
-		pud = pud_offset(pgd, addr);
-		pr_cont(", *pud=%016llx", pud_val(*pud));
-		if (pud_none(*pud) || pud_bad(*pud))
+		pudp = pud_offset(pgdp, addr);
+		pud = READ_ONCE(*pudp);
+		pr_cont(", pud=%016llx", pud_val(pud));
+		if (pud_none(pud) || pud_bad(pud))
 			break;
 
-		pmd = pmd_offset(pud, addr);
-		pr_cont(", *pmd=%016llx", pmd_val(*pmd));
-		if (pmd_none(*pmd) || pmd_bad(*pmd))
+		pmdp = pmd_offset(pudp, addr);
+		pmd = READ_ONCE(*pmdp);
+		pr_cont(", pmd=%016llx", pmd_val(pmd));
+		if (pmd_none(pmd) || pmd_bad(pmd))
 			break;
 
-		pte = pte_offset_map(pmd, addr);
-		pr_cont(", *pte=%016llx", pte_val(*pte));
-		pte_unmap(pte);
+		ptep = pte_offset_map(pmdp, addr);
+		pte = READ_ONCE(*ptep);
+		pr_cont(", pte=%016llx", pte_val(pte));
+		pte_unmap(ptep);
 	} while(0);
 
 	pr_cont("\n");
@@ -198,8 +203,9 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 			  pte_t entry, int dirty)
 {
 	pteval_t old_pteval, pteval;
+	pte_t pte = READ_ONCE(*ptep);
 
-	if (pte_same(*ptep, entry))
+	if (pte_same(pte, entry))
 		return 0;
 
 	/* only preserve the access flags and write permission */
@@ -212,7 +218,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	 * (calculated as: a & b == ~(~a | ~b)).
 	 */
 	pte_val(entry) ^= PTE_RDONLY;
-	pteval = READ_ONCE(pte_val(*ptep));
+	pteval = pte_val(pte);
 	do {
 		old_pteval = pteval;
 		pteval ^= PTE_RDONLY;
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 6cb0fa92a651..ecc6818191df 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -54,14 +54,14 @@ static inline pgprot_t pte_pgprot(pte_t pte)
 static int find_num_contig(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, size_t *pgsize)
 {
-	pgd_t *pgd = pgd_offset(mm, addr);
-	pud_t *pud;
-	pmd_t *pmd;
+	pgd_t *pgdp = pgd_offset(mm, addr);
+	pud_t *pudp;
+	pmd_t *pmdp;
 
 	*pgsize = PAGE_SIZE;
-	pud = pud_offset(pgd, addr);
-	pmd = pmd_offset(pud, addr);
-	if ((pte_t *)pmd == ptep) {
+	pudp = pud_offset(pgdp, addr);
+	pmdp = pmd_offset(pudp, addr);
+	if ((pte_t *)pmdp == ptep) {
 		*pgsize = PMD_SIZE;
 		return CONT_PMDS;
 	}
@@ -181,11 +181,8 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 
 	clear_flush(mm, addr, ptep, pgsize, ncontig);
 
-	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn) {
-		pr_debug("%s: set pte %p to 0x%llx\n", __func__, ptep,
-			 pte_val(pfn_pte(pfn, hugeprot)));
+	for (i = 0; i < ncontig; i++, ptep++, addr += pgsize, pfn += dpfn)
 		set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
-	}
 }
 
 void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
@@ -203,20 +200,20 @@ void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *huge_pte_alloc(struct mm_struct *mm,
 		      unsigned long addr, unsigned long sz)
 {
-	pgd_t *pgd;
-	pud_t *pud;
-	pte_t *pte = NULL;
-
-	pr_debug("%s: addr:0x%lx sz:0x%lx\n", __func__, addr, sz);
-	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
-	if (!pud)
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
+	pte_t *ptep = NULL;
+
+	pgdp = pgd_offset(mm, addr);
+	pudp = pud_alloc(mm, pgdp, addr);
+	if (!pudp)
 		return NULL;
 
 	if (sz == PUD_SIZE) {
-		pte = (pte_t *)pud;
+		ptep = (pte_t *)pudp;
 	} else if (sz == (PAGE_SIZE * CONT_PTES)) {
-		pmd_t *pmd = pmd_alloc(mm, pud, addr);
+		pmdp = pmd_alloc(mm, pudp, addr);
 
 		WARN_ON(addr & (sz - 1));
 		/*
@@ -226,60 +223,55 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		 * will be no pte_unmap() to correspond with this
 		 * pte_alloc_map().
 		 */
-		pte = pte_alloc_map(mm, pmd, addr);
+		ptep = pte_alloc_map(mm, pmdp, addr);
 	} else if (sz == PMD_SIZE) {
 		if (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
-		    pud_none(*pud))
-			pte = huge_pmd_share(mm, addr, pud);
+		    pud_none(READ_ONCE(*pudp)))
+			ptep = huge_pmd_share(mm, addr, pudp);
 		else
-			pte = (pte_t *)pmd_alloc(mm, pud, addr);
+			ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
 	} else if (sz == (PMD_SIZE * CONT_PMDS)) {
-		pmd_t *pmd;
-
-		pmd = pmd_alloc(mm, pud, addr);
+		pmdp = pmd_alloc(mm, pudp, addr);
 		WARN_ON(addr & (sz - 1));
-		return (pte_t *)pmd;
+		return (pte_t *)pmdp;
 	}
 
-	pr_debug("%s: addr:0x%lx sz:0x%lx ret pte=%p/0x%llx\n", __func__, addr,
-	       sz, pte, pte_val(*pte));
-	return pte;
+	return ptep;
 }
 
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz)
 {
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
+	pgd_t *pgdp;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
 
-	pgd = pgd_offset(mm, addr);
-	pr_debug("%s: addr:0x%lx pgd:%p\n", __func__, addr, pgd);
-	if (!pgd_present(*pgd))
+	pgdp = pgd_offset(mm, addr);
+	if (!pgd_present(READ_ONCE(*pgdp)))
 		return NULL;
 
-	pud = pud_offset(pgd, addr);
-	if (sz != PUD_SIZE && pud_none(*pud))
+	pudp = pud_offset(pgdp, addr);
+	pud = READ_ONCE(*pudp);
+	if (sz != PUD_SIZE && pud_none(pud))
 		return NULL;
 	/* hugepage or swap? */
-	if (pud_huge(*pud) || !pud_present(*pud))
-		return (pte_t *)pud;
+	if (pud_huge(pud) || !pud_present(pud))
+		return (pte_t *)pudp;
 	/* table; check the next level */
 
 	if (sz == CONT_PMD_SIZE)
 		addr &= CONT_PMD_MASK;
 
-	pmd = pmd_offset(pud, addr);
+	pmdp = pmd_offset(pudp, addr);
+	pmd = READ_ONCE(*pmdp);
 	if (!(sz == PMD_SIZE || sz == CONT_PMD_SIZE) &&
-	    pmd_none(*pmd))
+	    pmd_none(pmd))
 		return NULL;
-	if (pmd_huge(*pmd) || !pmd_present(*pmd))
-		return (pte_t *)pmd;
+	if (pmd_huge(pmd) || !pmd_present(pmd))
+		return (pte_t *)pmdp;
 
-	if (sz == CONT_PTE_SIZE) {
-		pte_t *pte = pte_offset_kernel(pmd, (addr & CONT_PTE_MASK));
-		return pte;
-	}
+	if (sz == CONT_PTE_SIZE)
+		return pte_offset_kernel(pmdp, (addr & CONT_PTE_MASK));
 
 	return NULL;
 }
@@ -367,7 +359,7 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	size_t pgsize;
 	pte_t pte;
 
-	if (!pte_cont(*ptep)) {
+	if (!pte_cont(READ_ONCE(*ptep))) {
 		ptep_set_wrprotect(mm, addr, ptep);
 		return;
 	}
@@ -391,7 +383,7 @@ void huge_ptep_clear_flush(struct vm_area_struct *vma,
 	size_t pgsize;
 	int ncontig;
 
-	if (!pte_cont(*ptep)) {
+	if (!pte_cont(READ_ONCE(*ptep))) {
 		ptep_clear_flush(vma, addr, ptep);
 		return;
 	}
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..7c0c1bb1bc4b 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -35,55 +35,55 @@ static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
  * with the physical address from __pa_symbol.
  */
 
-static void __init kasan_early_pte_populate(pmd_t *pmd, unsigned long addr,
+static void __init kasan_early_pte_populate(pmd_t *pmdp, unsigned long addr,
 					unsigned long end)
 {
-	pte_t *pte;
+	pte_t *ptep;
 	unsigned long next;
 
-	if (pmd_none(*pmd))
-		__pmd_populate(pmd, __pa_symbol(kasan_zero_pte), PMD_TYPE_TABLE);
+	if (pmd_none(READ_ONCE(*pmdp)))
+		__pmd_populate(pmdp, __pa_symbol(kasan_zero_pte), PMD_TYPE_TABLE);
 
-	pte = pte_offset_kimg(pmd, addr);
+	ptep = pte_offset_kimg(pmdp, addr);
 	do {
 		next = addr + PAGE_SIZE;
-		set_pte(pte, pfn_pte(sym_to_pfn(kasan_zero_page),
+		set_pte(ptep, pfn_pte(sym_to_pfn(kasan_zero_page),
 					PAGE_KERNEL));
-	} while (pte++, addr = next, addr != end && pte_none(*pte));
+	} while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)));
 }
 
-static void __init kasan_early_pmd_populate(pud_t *pud,
+static void __init kasan_early_pmd_populate(pud_t *pudp,
 					unsigned long addr,
 					unsigned long end)
 {
-	pmd_t *pmd;
+	pmd_t *pmdp;
 	unsigned long next;
 
-	if (pud_none(*pud))
-		__pud_populate(pud, __pa_symbol(kasan_zero_pmd), PMD_TYPE_TABLE);
+	if (pud_none(READ_ONCE(*pudp)))
+		__pud_populate(pudp, __pa_symbol(kasan_zero_pmd), PMD_TYPE_TABLE);
 
-	pmd = pmd_offset_kimg(pud, addr);
+	pmdp = pmd_offset_kimg(pudp, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		kasan_early_pte_populate(pmd, addr, next);
-	} while (pmd++, addr = next, addr != end && pmd_none(*pmd));
+		kasan_early_pte_populate(pmdp, addr, next);
+	} while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)));
 }
 
-static void __init kasan_early_pud_populate(pgd_t *pgd,
+static void __init kasan_early_pud_populate(pgd_t *pgdp,
 					unsigned long addr,
 					unsigned long end)
 {
-	pud_t *pud;
+	pud_t *pudp;
 	unsigned long next;
 
-	if (pgd_none(*pgd))
-		__pgd_populate(pgd, __pa_symbol(kasan_zero_pud), PUD_TYPE_TABLE);
+	if (pgd_none(READ_ONCE(*pgdp)))
+		__pgd_populate(pgdp, __pa_symbol(kasan_zero_pud), PUD_TYPE_TABLE);
 
-	pud = pud_offset_kimg(pgd, addr);
+	pudp = pud_offset_kimg(pgdp, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		kasan_early_pmd_populate(pud, addr, next);
-	} while (pud++, addr = next, addr != end && pud_none(*pud));
+		kasan_early_pmd_populate(pudp, addr, next);
+	} while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)));
 }
 
 static void __init kasan_map_early_shadow(void)
@@ -91,13 +91,13 @@ static void __init kasan_map_early_shadow(void)
 	unsigned long addr = KASAN_SHADOW_START;
 	unsigned long end = KASAN_SHADOW_END;
 	unsigned long next;
-	pgd_t *pgd;
+	pgd_t *pgdp;
 
-	pgd = pgd_offset_k(addr);
+	pgdp = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		kasan_early_pud_populate(pgd, addr, next);
-	} while (pgd++, addr = next, addr != end);
+		kasan_early_pud_populate(pgdp, addr, next);
+	} while (pgdp++, addr = next, addr != end);
 }
 
 asmlinkage void __init kasan_early_init(void)
@@ -113,14 +113,14 @@ asmlinkage void __init kasan_early_init(void)
  */
 void __init kasan_copy_shadow(pgd_t *pgdir)
 {
-	pgd_t *pgd, *pgd_new, *pgd_end;
+	pgd_t *pgdp, *pgd_newp, *pgd_endp;
 
-	pgd = pgd_offset_k(KASAN_SHADOW_START);
-	pgd_end = pgd_offset_k(KASAN_SHADOW_END);
-	pgd_new = pgd_offset_raw(pgdir, KASAN_SHADOW_START);
+	pgdp = pgd_offset_k(KASAN_SHADOW_START);
+	pgd_endp = pgd_offset_k(KASAN_SHADOW_END);
+	pgd_newp = pgd_offset_raw(pgdir, KASAN_SHADOW_START);
 	do {
-		set_pgd(pgd_new, *pgd);
-	} while (pgd++, pgd_new++, pgd != pgd_end);
+		set_pgd(pgd_newp, READ_ONCE(*pgdp));
+	} while (pgdp++, pgd_newp++, pgdp != pgd_endp);
 }
 
 static void __init clear_pgds(unsigned long start,
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e0e864..937a059f6fc2 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -120,45 +120,48 @@ static bool pgattr_change_is_safe(u64 old, u64 new)
 	return ((old ^ new) & ~mask) == 0;
 }
 
-static void init_pte(pmd_t *pmd, unsigned long addr, unsigned long end,
+static void init_pte(pmd_t *pmdp, unsigned long addr, unsigned long end,
 		     phys_addr_t phys, pgprot_t prot)
 {
-	pte_t *pte;
+	pte_t *ptep;
 
-	pte = pte_set_fixmap_offset(pmd, addr);
+	ptep = pte_set_fixmap_offset(pmdp, addr);
 	do {
-		pte_t old_pte = *pte;
+		pte_t old_pte = READ_ONCE(*ptep);
 
-		set_pte(pte, pfn_pte(__phys_to_pfn(phys), prot));
+		set_pte(ptep, pfn_pte(__phys_to_pfn(phys), prot));
 
 		/*
 		 * After the PTE entry has been populated once, we
 		 * only allow updates to the permission attributes.
 		 */
-		BUG_ON(!pgattr_change_is_safe(pte_val(old_pte), pte_val(*pte)));
+		BUG_ON(!pgattr_change_is_safe(pte_val(old_pte),
+					      READ_ONCE(pte_val(*ptep))));
 
 		phys += PAGE_SIZE;
-	} while (pte++, addr += PAGE_SIZE, addr != end);
+	} while (ptep++, addr += PAGE_SIZE, addr != end);
 
 	pte_clear_fixmap();
 }
 
-static void alloc_init_cont_pte(pmd_t *pmd, unsigned long addr,
+static void alloc_init_cont_pte(pmd_t *pmdp, unsigned long addr,
 				unsigned long end, phys_addr_t phys,
 				pgprot_t prot,
 				phys_addr_t (*pgtable_alloc)(void),
 				int flags)
 {
 	unsigned long next;
+	pmd_t pmd = READ_ONCE(*pmdp);
 
-	BUG_ON(pmd_sect(*pmd));
-	if (pmd_none(*pmd)) {
+	BUG_ON(pmd_sect(pmd));
+	if (pmd_none(pmd)) {
 		phys_addr_t pte_phys;
 		BUG_ON(!pgtable_alloc);
 		pte_phys = pgtable_alloc();
-		__pmd_populate(pmd, pte_phys, PMD_TYPE_TABLE);
+		__pmd_populate(pmdp, pte_phys, PMD_TYPE_TABLE);
+		pmd = READ_ONCE(*pmdp);
 	}
-	BUG_ON(pmd_bad(*pmd));
+	BUG_ON(pmd_bad(pmd));
 
 	do {
 		pgprot_t __prot = prot;
@@ -170,67 +173,69 @@ static void alloc_init_cont_pte(pmd_t *pmd, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		init_pte(pmd, addr, next, phys, __prot);
+		init_pte(pmdp, addr, next, phys, __prot);
 
 		phys += next - addr;
 	} while (addr = next, addr != end);
 }
 
-static void init_pmd(pud_t *pud, unsigned long addr, unsigned long end,
+static void init_pmd(pud_t *pudp, unsigned long addr, unsigned long end,
 		     phys_addr_t phys, pgprot_t prot,
 		     phys_addr_t (*pgtable_alloc)(void), int flags)
 {
 	unsigned long next;
-	pmd_t *pmd;
+	pmd_t *pmdp;
 
-	pmd = pmd_set_fixmap_offset(pud, addr);
+	pmdp = pmd_set_fixmap_offset(pudp, addr);
 	do {
-		pmd_t old_pmd = *pmd;
+		pmd_t old_pmd = READ_ONCE(*pmdp);
 
 		next = pmd_addr_end(addr, end);
 
 		/* try section mapping first */
 		if (((addr | next | phys) & ~SECTION_MASK) == 0 &&
 		    (flags & NO_BLOCK_MAPPINGS) == 0) {
-			pmd_set_huge(pmd, phys, prot);
+			pmd_set_huge(pmdp, phys, prot);
 
 			/*
 			 * After the PMD entry has been populated once, we
 			 * only allow updates to the permission attributes.
 			 */
 			BUG_ON(!pgattr_change_is_safe(pmd_val(old_pmd),
-						      pmd_val(*pmd)));
+						      READ_ONCE(pmd_val(*pmdp))));
 		} else {
-			alloc_init_cont_pte(pmd, addr, next, phys, prot,
+			alloc_init_cont_pte(pmdp, addr, next, phys, prot,
 					    pgtable_alloc, flags);
 
 			BUG_ON(pmd_val(old_pmd) != 0 &&
-			       pmd_val(old_pmd) != pmd_val(*pmd));
+			       pmd_val(old_pmd) != READ_ONCE(pmd_val(*pmdp)));
 		}
 		phys += next - addr;
-	} while (pmd++, addr = next, addr != end);
+	} while (pmdp++, addr = next, addr != end);
 
 	pmd_clear_fixmap();
 }
 
-static void alloc_init_cont_pmd(pud_t *pud, unsigned long addr,
+static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
 				unsigned long end, phys_addr_t phys,
 				pgprot_t prot,
 				phys_addr_t (*pgtable_alloc)(void), int flags)
 {
 	unsigned long next;
+	pud_t pud = READ_ONCE(*pudp);
 
 	/*
 	 * Check for initial section mappings in the pgd/pud.
 	 */
-	BUG_ON(pud_sect(*pud));
-	if (pud_none(*pud)) {
+	BUG_ON(pud_sect(pud));
+	if (pud_none(pud)) {
 		phys_addr_t pmd_phys;
 		BUG_ON(!pgtable_alloc);
 		pmd_phys = pgtable_alloc();
-		__pud_populate(pud, pmd_phys, PUD_TYPE_TABLE);
+		__pud_populate(pudp, pmd_phys, PUD_TYPE_TABLE);
+		pud = READ_ONCE(*pudp);
 	}
-	BUG_ON(pud_bad(*pud));
+	BUG_ON(pud_bad(pud));
 
 	do {
 		pgprot_t __prot = prot;
@@ -242,7 +247,7 @@ static void alloc_init_cont_pmd(pud_t *pud, unsigned long addr,
 		    (flags & NO_CONT_MAPPINGS) == 0)
 			__prot = __pgprot(pgprot_val(prot) | PTE_CONT);
 
-		init_pmd(pud, addr, next, phys, __prot, pgtable_alloc, flags);
+		init_pmd(pudp, addr, next, phys, __prot, pgtable_alloc, flags);
 
 		phys += next - addr;
 	} while (addr = next, addr != end);
@@ -260,25 +265,27 @@ static inline bool use_1G_block(unsigned long addr, unsigned long next,
 	return true;
 }
 
-static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
-				  phys_addr_t phys, pgprot_t prot,
-				  phys_addr_t (*pgtable_alloc)(void),
-				  int flags)
+static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
+			   phys_addr_t phys, pgprot_t prot,
+			   phys_addr_t (*pgtable_alloc)(void),
+			   int flags)
 {
-	pud_t *pud;
 	unsigned long next;
+	pud_t *pudp;
+	pgd_t pgd = READ_ONCE(*pgdp);
 
-	if (pgd_none(*pgd)) {
+	if (pgd_none(pgd)) {
 		phys_addr_t pud_phys;
 		BUG_ON(!pgtable_alloc);
 		pud_phys = pgtable_alloc();
-		__pgd_populate(pgd, pud_phys, PUD_TYPE_TABLE);
+		__pgd_populate(pgdp, pud_phys, PUD_TYPE_TABLE);
+		pgd = READ_ONCE(*pgdp);
 	}
-	BUG_ON(pgd_bad(*pgd));
+	BUG_ON(pgd_bad(pgd));
 
-	pud = pud_set_fixmap_offset(pgd, addr);
+	pudp = pud_set_fixmap_offset(pgdp, addr);
 	do {
-		pud_t old_pud = *pud;
+		pud_t old_pud = READ_ONCE(*pudp);
 
 		next = pud_addr_end(addr, end);
 
@@ -287,23 +294,23 @@ static void alloc_init_pud(pgd_t *pgd, unsigned long addr, unsigned long end,
 		 */
 		if (use_1G_block(addr, next, phys) &&
 		    (flags & NO_BLOCK_MAPPINGS) == 0) {
-			pud_set_huge(pud, phys, prot);
+			pud_set_huge(pudp, phys, prot);
 
 			/*
 			 * After the PUD entry has been populated once, we
 			 * only allow updates to the permission attributes.
 			 */
 			BUG_ON(!pgattr_change_is_safe(pud_val(old_pud),
-						      pud_val(*pud)));
+						      READ_ONCE(pud_val(*pudp))));
 		} else {
-			alloc_init_cont_pmd(pud, addr, next, phys, prot,
+			alloc_init_cont_pmd(pudp, addr, next, phys, prot,
 					    pgtable_alloc, flags);
 
 			BUG_ON(pud_val(old_pud) != 0 &&
-			       pud_val(old_pud) != pud_val(*pud));
+			       pud_val(old_pud) != READ_ONCE(pud_val(*pudp)));
 		}
 		phys += next - addr;
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	pud_clear_fixmap();
 }
@@ -315,7 +322,7 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 				 int flags)
 {
 	unsigned long addr, length, end, next;
-	pgd_t *pgd = pgd_offset_raw(pgdir, virt);
+	pgd_t *pgdp = pgd_offset_raw(pgdir, virt);
 
 	/*
 	 * If the virtual and physical address don't have the same offset
@@ -331,10 +338,10 @@ static void __create_pgd_mapping(pgd_t *pgdir, phys_addr_t phys,
 	end = addr + length;
 	do {
 		next = pgd_addr_end(addr, end);
-		alloc_init_pud(pgd, addr, next, phys, prot, pgtable_alloc,
+		alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
 			       flags);
 		phys += next - addr;
-	} while (pgd++, addr = next, addr != end);
+	} while (pgdp++, addr = next, addr != end);
 }
 
 static phys_addr_t pgd_pgtable_alloc(void)
@@ -396,10 +403,10 @@ static void update_mapping_prot(phys_addr_t phys, unsigned long virt,
 	flush_tlb_kernel_range(virt, virt + size);
 }
 
-static void __init __map_memblock(pgd_t *pgd, phys_addr_t start,
+static void __init __map_memblock(pgd_t *pgdp, phys_addr_t start,
 				  phys_addr_t end, pgprot_t prot, int flags)
 {
-	__create_pgd_mapping(pgd, start, __phys_to_virt(start), end - start,
+	__create_pgd_mapping(pgdp, start, __phys_to_virt(start), end - start,
 			     prot, early_pgtable_alloc, flags);
 }
 
@@ -413,7 +420,7 @@ void __init mark_linear_text_alias_ro(void)
 			    PAGE_KERNEL_RO);
 }
 
-static void __init map_mem(pgd_t *pgd)
+static void __init map_mem(pgd_t *pgdp)
 {
 	phys_addr_t kernel_start = __pa_symbol(_text);
 	phys_addr_t kernel_end = __pa_symbol(__init_begin);
@@ -446,7 +453,7 @@ static void __init map_mem(pgd_t *pgd)
 		if (memblock_is_nomap(reg))
 			continue;
 
-		__map_memblock(pgd, start, end, PAGE_KERNEL, flags);
+		__map_memblock(pgdp, start, end, PAGE_KERNEL, flags);
 	}
 
 	/*
@@ -459,7 +466,7 @@ static void __init map_mem(pgd_t *pgd)
 	 * Note that contiguous mappings cannot be remapped in this way,
 	 * so we should avoid them here.
 	 */
-	__map_memblock(pgd, kernel_start, kernel_end,
+	__map_memblock(pgdp, kernel_start, kernel_end,
 		       PAGE_KERNEL, NO_CONT_MAPPINGS);
 	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
 
@@ -470,7 +477,7 @@ static void __init map_mem(pgd_t *pgd)
 	 * through /sys/kernel/kexec_crash_size interface.
 	 */
 	if (crashk_res.end) {
-		__map_memblock(pgd, crashk_res.start, crashk_res.end + 1,
+		__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
 			       PAGE_KERNEL,
 			       NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
 		memblock_clear_nomap(crashk_res.start,
@@ -494,7 +501,7 @@ void mark_rodata_ro(void)
 	debug_checkwx();
 }
 
-static void __init map_kernel_segment(pgd_t *pgd, void *va_start, void *va_end,
+static void __init map_kernel_segment(pgd_t *pgdp, void *va_start, void *va_end,
 				      pgprot_t prot, struct vm_struct *vma,
 				      int flags, unsigned long vm_flags)
 {
@@ -504,7 +511,7 @@ static void __init map_kernel_segment(pgd_t *pgd, void *va_start, void *va_end,
 	BUG_ON(!PAGE_ALIGNED(pa_start));
 	BUG_ON(!PAGE_ALIGNED(size));
 
-	__create_pgd_mapping(pgd, pa_start, (unsigned long)va_start, size, prot,
+	__create_pgd_mapping(pgdp, pa_start, (unsigned long)va_start, size, prot,
 			     early_pgtable_alloc, flags);
 
 	if (!(vm_flags & VM_NO_GUARD))
@@ -528,7 +535,7 @@ early_param("rodata", parse_rodata);
 /*
  * Create fine-grained mappings for the kernel.
  */
-static void __init map_kernel(pgd_t *pgd)
+static void __init map_kernel(pgd_t *pgdp)
 {
 	static struct vm_struct vmlinux_text, vmlinux_rodata, vmlinux_inittext,
 				vmlinux_initdata, vmlinux_data;
@@ -544,24 +551,24 @@ static void __init map_kernel(pgd_t *pgd)
 	 * Only rodata will be remapped with different permissions later on,
 	 * all other segments are allowed to use contiguous mappings.
 	 */
-	map_kernel_segment(pgd, _text, _etext, text_prot, &vmlinux_text, 0,
+	map_kernel_segment(pgdp, _text, _etext, text_prot, &vmlinux_text, 0,
 			   VM_NO_GUARD);
-	map_kernel_segment(pgd, __start_rodata, __inittext_begin, PAGE_KERNEL,
+	map_kernel_segment(pgdp, __start_rodata, __inittext_begin, PAGE_KERNEL,
 			   &vmlinux_rodata, NO_CONT_MAPPINGS, VM_NO_GUARD);
-	map_kernel_segment(pgd, __inittext_begin, __inittext_end, text_prot,
+	map_kernel_segment(pgdp, __inittext_begin, __inittext_end, text_prot,
 			   &vmlinux_inittext, 0, VM_NO_GUARD);
-	map_kernel_segment(pgd, __initdata_begin, __initdata_end, PAGE_KERNEL,
+	map_kernel_segment(pgdp, __initdata_begin, __initdata_end, PAGE_KERNEL,
 			   &vmlinux_initdata, 0, VM_NO_GUARD);
-	map_kernel_segment(pgd, _data, _end, PAGE_KERNEL, &vmlinux_data, 0, 0);
+	map_kernel_segment(pgdp, _data, _end, PAGE_KERNEL, &vmlinux_data, 0, 0);
 
-	if (!pgd_val(*pgd_offset_raw(pgd, FIXADDR_START))) {
+	if (!READ_ONCE(pgd_val(*pgd_offset_raw(pgdp, FIXADDR_START)))) {
 		/*
 		 * The fixmap falls in a separate pgd to the kernel, and doesn't
 		 * live in the carveout for the swapper_pg_dir. We can simply
 		 * re-use the existing dir for the fixmap.
 		 */
-		set_pgd(pgd_offset_raw(pgd, FIXADDR_START),
-			*pgd_offset_k(FIXADDR_START));
+		set_pgd(pgd_offset_raw(pgdp, FIXADDR_START),
+			READ_ONCE(*pgd_offset_k(FIXADDR_START)));
 	} else if (CONFIG_PGTABLE_LEVELS > 3) {
 		/*
 		 * The fixmap shares its top level pgd entry with the kernel
@@ -570,14 +577,14 @@ static void __init map_kernel(pgd_t *pgd)
 		 * entry instead.
 		 */
 		BUG_ON(!IS_ENABLED(CONFIG_ARM64_16K_PAGES));
-		set_pud(pud_set_fixmap_offset(pgd, FIXADDR_START),
+		set_pud(pud_set_fixmap_offset(pgdp, FIXADDR_START),
 			__pud(__pa_symbol(bm_pmd) | PUD_TYPE_TABLE));
 		pud_clear_fixmap();
 	} else {
 		BUG();
 	}
 
-	kasan_copy_shadow(pgd);
+	kasan_copy_shadow(pgdp);
 }
 
 /*
@@ -587,10 +594,10 @@ static void __init map_kernel(pgd_t *pgd)
 void __init paging_init(void)
 {
 	phys_addr_t pgd_phys = early_pgtable_alloc();
-	pgd_t *pgd = pgd_set_fixmap(pgd_phys);
+	pgd_t *pgdp = pgd_set_fixmap(pgd_phys);
 
-	map_kernel(pgd);
-	map_mem(pgd);
+	map_kernel(pgdp);
+	map_mem(pgdp);
 
 	/*
 	 * We want to reuse the original swapper_pg_dir so we don't have to
@@ -601,7 +608,7 @@ void __init paging_init(void)
 	 * To do this we need to go via a temporary pgd.
 	 */
 	cpu_replace_ttbr1(__va(pgd_phys));
-	memcpy(swapper_pg_dir, pgd, PGD_SIZE);
+	memcpy(swapper_pg_dir, pgdp, PGD_SIZE);
 	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
 
 	pgd_clear_fixmap();
@@ -620,37 +627,40 @@ void __init paging_init(void)
  */
 int kern_addr_valid(unsigned long addr)
 {
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
+	pgd_t *pgdp;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
 
 	if ((((long)addr) >> VA_BITS) != -1UL)
 		return 0;
 
-	pgd = pgd_offset_k(addr);
-	if (pgd_none(*pgd))
+	pgdp = pgd_offset_k(addr);
+	if (pgd_none(READ_ONCE(*pgdp)))
 		return 0;
 
-	pud = pud_offset(pgd, addr);
-	if (pud_none(*pud))
+	pudp = pud_offset(pgdp, addr);
+	pud = READ_ONCE(*pudp);
+	if (pud_none(pud))
 		return 0;
 
-	if (pud_sect(*pud))
-		return pfn_valid(pud_pfn(*pud));
+	if (pud_sect(pud))
+		return pfn_valid(pud_pfn(pud));
 
-	pmd = pmd_offset(pud, addr);
-	if (pmd_none(*pmd))
+	pmdp = pmd_offset(pudp, addr);
+	pmd = READ_ONCE(*pmdp);
+	if (pmd_none(pmd))
 		return 0;
 
-	if (pmd_sect(*pmd))
-		return pfn_valid(pmd_pfn(*pmd));
+	if (pmd_sect(pmd))
+		return pfn_valid(pmd_pfn(pmd));
 
-	pte = pte_offset_kernel(pmd, addr);
-	if (pte_none(*pte))
+	ptep = pte_offset_kernel(pmdp, addr);
+	pte = READ_ONCE(*ptep);
+	if (pte_none(pte))
 		return 0;
 
-	return pfn_valid(pte_pfn(*pte));
+	return pfn_valid(pte_pfn(pte));
 }
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 #if !ARM64_SWAPPER_USES_SECTION_MAPS
@@ -663,32 +673,32 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node)
 {
 	unsigned long addr = start;
 	unsigned long next;
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
+	pgd_t *pgdp;
+	pud_t *pudp;
+	pmd_t *pmdp;
 
 	do {
 		next = pmd_addr_end(addr, end);
 
-		pgd = vmemmap_pgd_populate(addr, node);
-		if (!pgd)
+		pgdp = vmemmap_pgd_populate(addr, node);
+		if (!pgdp)
 			return -ENOMEM;
 
-		pud = vmemmap_pud_populate(pgd, addr, node);
-		if (!pud)
+		pudp = vmemmap_pud_populate(pgdp, addr, node);
+		if (!pudp)
 			return -ENOMEM;
 
-		pmd = pmd_offset(pud, addr);
-		if (pmd_none(*pmd)) {
+		pmdp = pmd_offset(pudp, addr);
+		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
 			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
 			if (!p)
 				return -ENOMEM;
 
-			set_pmd(pmd, __pmd(__pa(p) | PROT_SECT_NORMAL));
+			set_pmd(pmdp, __pmd(__pa(p) | PROT_SECT_NORMAL));
 		} else
-			vmemmap_verify((pte_t *)pmd, node, addr, next);
+			vmemmap_verify((pte_t *)pmdp, node, addr, next);
 	} while (addr = next, addr != end);
 
 	return 0;
@@ -701,20 +711,22 @@ void vmemmap_free(unsigned long start, unsigned long end)
 
 static inline pud_t * fixmap_pud(unsigned long addr)
 {
-	pgd_t *pgd = pgd_offset_k(addr);
+	pgd_t *pgdp = pgd_offset_k(addr);
+	pgd_t pgd = READ_ONCE(*pgdp);
 
-	BUG_ON(pgd_none(*pgd) || pgd_bad(*pgd));
+	BUG_ON(pgd_none(pgd) || pgd_bad(pgd));
 
-	return pud_offset_kimg(pgd, addr);
+	return pud_offset_kimg(pgdp, addr);
 }
 
 static inline pmd_t * fixmap_pmd(unsigned long addr)
 {
-	pud_t *pud = fixmap_pud(addr);
+	pud_t *pudp = fixmap_pud(addr);
+	pud_t pud = READ_ONCE(*pudp);
 
-	BUG_ON(pud_none(*pud) || pud_bad(*pud));
+	BUG_ON(pud_none(pud) || pud_bad(pud));
 
-	return pmd_offset_kimg(pud, addr);
+	return pmd_offset_kimg(pudp, addr);
 }
 
 static inline pte_t * fixmap_pte(unsigned long addr)
@@ -730,30 +742,31 @@ static inline pte_t * fixmap_pte(unsigned long addr)
  */
 void __init early_fixmap_init(void)
 {
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
+	pgd_t *pgdp, pgd;
+	pud_t *pudp;
+	pmd_t *pmdp;
 	unsigned long addr = FIXADDR_START;
 
-	pgd = pgd_offset_k(addr);
+	pgdp = pgd_offset_k(addr);
+	pgd = READ_ONCE(*pgdp);
 	if (CONFIG_PGTABLE_LEVELS > 3 &&
-	    !(pgd_none(*pgd) || pgd_page_paddr(*pgd) == __pa_symbol(bm_pud))) {
+	    !(pgd_none(pgd) || pgd_page_paddr(pgd) == __pa_symbol(bm_pud))) {
 		/*
 		 * We only end up here if the kernel mapping and the fixmap
 		 * share the top level pgd entry, which should only happen on
 		 * 16k/4 levels configurations.
 		 */
 		BUG_ON(!IS_ENABLED(CONFIG_ARM64_16K_PAGES));
-		pud = pud_offset_kimg(pgd, addr);
+		pudp = pud_offset_kimg(pgdp, addr);
 	} else {
-		if (pgd_none(*pgd))
-			__pgd_populate(pgd, __pa_symbol(bm_pud), PUD_TYPE_TABLE);
-		pud = fixmap_pud(addr);
+		if (pgd_none(pgd))
+			__pgd_populate(pgdp, __pa_symbol(bm_pud), PUD_TYPE_TABLE);
+		pudp = fixmap_pud(addr);
 	}
-	if (pud_none(*pud))
-		__pud_populate(pud, __pa_symbol(bm_pmd), PMD_TYPE_TABLE);
-	pmd = fixmap_pmd(addr);
-	__pmd_populate(pmd, __pa_symbol(bm_pte), PMD_TYPE_TABLE);
+	if (pud_none(READ_ONCE(*pudp)))
+		__pud_populate(pudp, __pa_symbol(bm_pmd), PMD_TYPE_TABLE);
+	pmdp = fixmap_pmd(addr);
+	__pmd_populate(pmdp, __pa_symbol(bm_pte), PMD_TYPE_TABLE);
 
 	/*
 	 * The boot-ioremap range spans multiple pmds, for which
@@ -762,11 +775,11 @@ void __init early_fixmap_init(void)
 	BUILD_BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
 		     != (__fix_to_virt(FIX_BTMAP_END) >> PMD_SHIFT));
 
-	if ((pmd != fixmap_pmd(fix_to_virt(FIX_BTMAP_BEGIN)))
-	     || pmd != fixmap_pmd(fix_to_virt(FIX_BTMAP_END))) {
+	if ((pmdp != fixmap_pmd(fix_to_virt(FIX_BTMAP_BEGIN)))
+	     || pmdp != fixmap_pmd(fix_to_virt(FIX_BTMAP_END))) {
 		WARN_ON(1);
-		pr_warn("pmd %p != %p, %p\n",
-			pmd, fixmap_pmd(fix_to_virt(FIX_BTMAP_BEGIN)),
+		pr_warn("pmdp %p != %p, %p\n",
+			pmdp, fixmap_pmd(fix_to_virt(FIX_BTMAP_BEGIN)),
 			fixmap_pmd(fix_to_virt(FIX_BTMAP_END)));
 		pr_warn("fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
 			fix_to_virt(FIX_BTMAP_BEGIN));
@@ -782,16 +795,16 @@ void __set_fixmap(enum fixed_addresses idx,
 			       phys_addr_t phys, pgprot_t flags)
 {
 	unsigned long addr = __fix_to_virt(idx);
-	pte_t *pte;
+	pte_t *ptep;
 
 	BUG_ON(idx <= FIX_HOLE || idx >= __end_of_fixed_addresses);
 
-	pte = fixmap_pte(addr);
+	ptep = fixmap_pte(addr);
 
 	if (pgprot_val(flags)) {
-		set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
+		set_pte(ptep, pfn_pte(phys >> PAGE_SHIFT, flags));
 	} else {
-		pte_clear(&init_mm, addr, pte);
+		pte_clear(&init_mm, addr, ptep);
 		flush_tlb_kernel_range(addr, addr+PAGE_SIZE);
 	}
 }
@@ -873,32 +886,32 @@ int __init arch_ioremap_pmd_supported(void)
 	return 1;
 }
 
-int pud_set_huge(pud_t *pud, phys_addr_t phys, pgprot_t prot)
+int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot)
 {
 	BUG_ON(phys & ~PUD_MASK);
-	set_pud(pud, __pud(phys | PUD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+	set_pud(pudp, __pud(phys | PUD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
 	return 1;
 }
 
-int pmd_set_huge(pmd_t *pmd, phys_addr_t phys, pgprot_t prot)
+int pmd_set_huge(pmd_t *pmdp, phys_addr_t phys, pgprot_t prot)
 {
 	BUG_ON(phys & ~PMD_MASK);
-	set_pmd(pmd, __pmd(phys | PMD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
+	set_pmd(pmdp, __pmd(phys | PMD_TYPE_SECT | pgprot_val(mk_sect_prot(prot))));
 	return 1;
 }
 
-int pud_clear_huge(pud_t *pud)
+int pud_clear_huge(pud_t *pudp)
 {
-	if (!pud_sect(*pud))
+	if (!pud_sect(READ_ONCE(*pudp)))
 		return 0;
-	pud_clear(pud);
+	pud_clear(pudp);
 	return 1;
 }
 
-int pmd_clear_huge(pmd_t *pmd)
+int pmd_clear_huge(pmd_t *pmdp)
 {
-	if (!pmd_sect(*pmd))
+	if (!pmd_sect(READ_ONCE(*pmdp)))
 		return 0;
-	pmd_clear(pmd);
+	pmd_clear(pmdp);
 	return 1;
 }
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index a682a0a2a0fa..4b969dd2287a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -156,30 +156,32 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
  */
 bool kernel_page_present(struct page *page)
 {
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-	pte_t *pte;
+	pgd_t *pgdp;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep;
 	unsigned long addr = (unsigned long)page_address(page);
 
-	pgd = pgd_offset_k(addr);
-	if (pgd_none(*pgd))
+	pgdp = pgd_offset_k(addr);
+	if (pgd_none(READ_ONCE(*pgdp)))
 		return false;
 
-	pud = pud_offset(pgd, addr);
-	if (pud_none(*pud))
+	pudp = pud_offset(pgdp, addr);
+	pud = READ_ONCE(*pudp);
+	if (pud_none(pud))
 		return false;
-	if (pud_sect(*pud))
+	if (pud_sect(pud))
 		return true;
 
-	pmd = pmd_offset(pud, addr);
-	if (pmd_none(*pmd))
+	pmdp = pmd_offset(pudp, addr);
+	pmd = READ_ONCE(*pmdp);
+	if (pmd_none(pmd))
 		return false;
-	if (pmd_sect(*pmd))
+	if (pmd_sect(pmd))
 		return true;
 
-	pte = pte_offset_kernel(pmd, addr);
-	return pte_valid(*pte);
+	ptep = pte_offset_kernel(pmdp, addr);
+	return pte_valid(READ_ONCE(*ptep));
 }
 #endif /* CONFIG_HIBERNATION */
 #endif /* CONFIG_DEBUG_PAGEALLOC */
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 2/2] mm: page_vma_mapped: Ensure pmd is loaded with READ_ONCE outside of lock
  2017-09-27 15:49 [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Will Deacon
  2017-09-27 15:49 ` [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables Will Deacon
@ 2017-09-27 15:49 ` Will Deacon
  2017-09-27 22:01 ` [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Yury Norov
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 27+ messages in thread
From: Will Deacon @ 2017-09-27 15:49 UTC (permalink / raw)
  To: peterz, paulmck, kirill.shutemov
  Cc: linux-kernel, ynorov, rruigrok, linux-arch, akpm,
	catalin.marinas, Will Deacon

Loading the pmd without holding the pmd_lock exposes us to races with
concurrent updaters of the page tables but, worse still, it also allows
the compiler to cache the pmd value in a register and reuse it later on,
even if we've performed a READ_ONCE in between and seen a more recent
value.

In the case of page_vma_mapped_walk, this leads to the following crash
when the pmd loaded for the initial pmd_trans_huge check is all zeroes
and a subsequent valid table entry is loaded by check_pmd. We then
proceed into map_pte, but the compiler re-uses the zero entry inside
pte_offset_map, resulting in a junk pointer being installed in pvmw->pte:

[  254.032812] PC is at check_pte+0x20/0x170
[  254.032948] LR is at page_vma_mapped_walk+0x2e0/0x540
[...]
[  254.036114] Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
[  254.036361] Call trace:
[  254.038977] [<ffff000008233328>] check_pte+0x20/0x170
[  254.039137] [<ffff000008233758>] page_vma_mapped_walk+0x2e0/0x540
[  254.039332] [<ffff000008234adc>] page_mkclean_one+0xac/0x278
[  254.039489] [<ffff000008234d98>] rmap_walk_file+0xf0/0x238
[  254.039642] [<ffff000008236e74>] rmap_walk+0x64/0xa0
[  254.039784] [<ffff0000082370c8>] page_mkclean+0x90/0xa8
[  254.040029] [<ffff0000081f3c64>] clear_page_dirty_for_io+0x84/0x2a8
[  254.040311] [<ffff00000832f984>] mpage_submit_page+0x34/0x98
[  254.040518] [<ffff00000832fb4c>] mpage_process_page_bufs+0x164/0x170
[  254.040743] [<ffff00000832fc8c>] mpage_prepare_extent_to_map+0x134/0x2b8
[  254.040969] [<ffff00000833530c>] ext4_writepages+0x484/0xe30
[  254.041175] [<ffff0000081f6ab4>] do_writepages+0x44/0xe8
[  254.041372] [<ffff0000081e5bd4>] __filemap_fdatawrite_range+0xbc/0x110
[  254.041568] [<ffff0000081e5e68>] file_write_and_wait_range+0x48/0xd8
[  254.041739] [<ffff000008324310>] ext4_sync_file+0x80/0x4b8
[  254.041907] [<ffff0000082bd434>] vfs_fsync_range+0x64/0xc0
[  254.042106] [<ffff0000082332b4>] SyS_msync+0x194/0x1e8

This patch fixes the problem by ensuring that READ_ONCE is used before
the initial checks on the pmd, and this value is subsequently used when
checking whether or not the pmd is present. pmd_check is removed and the
pmd_present check is inlined directly.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 mm/page_vma_mapped.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 6a03946469a9..6b85f5464246 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -6,17 +6,6 @@
 
 #include "internal.h"
 
-static inline bool check_pmd(struct page_vma_mapped_walk *pvmw)
-{
-	pmd_t pmde;
-	/*
-	 * Make sure we don't re-load pmd between present and !trans_huge check.
-	 * We need a consistent view.
-	 */
-	pmde = READ_ONCE(*pvmw->pmd);
-	return pmd_present(pmde) && !pmd_trans_huge(pmde);
-}
-
 static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 {
 	page_vma_mapped_walk_done(pvmw);
@@ -116,6 +105,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
+	pmd_t pmde;
 
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
@@ -148,7 +138,13 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	if (!pud_present(*pud))
 		return false;
 	pvmw->pmd = pmd_offset(pud, pvmw->address);
-	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
+	/*
+	 * Make sure the pmd value isn't cached in a register by the
+	 * compiler and used as a stale value after we've observed a
+	 * subsequent update.
+	 */
+	pmde = READ_ONCE(*pvmw->pmd);
+	if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
 		pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 		if (likely(pmd_trans_huge(*pvmw->pmd))) {
 			if (pvmw->flags & PVMW_MIGRATION)
@@ -175,9 +171,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			spin_unlock(pvmw->ptl);
 			pvmw->ptl = NULL;
 		}
-	} else {
-		if (!check_pmd(pvmw))
-			return false;
+	} else if (!pmd_present(pmde)) {
+		return false;
 	}
 	if (!map_pte(pvmw))
 		goto next_pte;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes
  2017-09-27 15:49 [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Will Deacon
  2017-09-27 15:49 ` [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables Will Deacon
  2017-09-27 15:49 ` [RFC PATCH 2/2] mm: page_vma_mapped: Ensure pmd is loaded with READ_ONCE outside of lock Will Deacon
@ 2017-09-27 22:01 ` Yury Norov
  2017-09-28 17:30 ` Richard Ruigrok
  2017-09-28 19:38 ` Jon Masters
  4 siblings, 0 replies; 27+ messages in thread
From: Yury Norov @ 2017-09-27 22:01 UTC (permalink / raw)
  To: Will Deacon
  Cc: peterz, paulmck, kirill.shutemov, linux-kernel, rruigrok,
	linux-arch, akpm, catalin.marinas

On Wed, Sep 27, 2017 at 04:49:27PM +0100, Will Deacon wrote:
> Hi,
> 
> We recently had a crash report[1] on arm64 that involved a bad dereference
> in the page_vma_mapped code during ext4 writeback with THP active. I can
> reproduce this on -rc2:
> 
> [  254.032812] PC is at check_pte+0x20/0x170
> [  254.032948] LR is at page_vma_mapped_walk+0x2e0/0x540
> [...]
> [  254.036114] Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
> [  254.036361] Call trace:
> [  254.038977] [<ffff000008233328>] check_pte+0x20/0x170
> [  254.039137] [<ffff000008233758>] page_vma_mapped_walk+0x2e0/0x540
> [  254.039332] [<ffff000008234adc>] page_mkclean_one+0xac/0x278
> [  254.039489] [<ffff000008234d98>] rmap_walk_file+0xf0/0x238
> [  254.039642] [<ffff000008236e74>] rmap_walk+0x64/0xa0
> [  254.039784] [<ffff0000082370c8>] page_mkclean+0x90/0xa8
> [  254.040029] [<ffff0000081f3c64>] clear_page_dirty_for_io+0x84/0x2a8
> [  254.040311] [<ffff00000832f984>] mpage_submit_page+0x34/0x98
> [  254.040518] [<ffff00000832fb4c>] mpage_process_page_bufs+0x164/0x170
> [  254.040743] [<ffff00000832fc8c>] mpage_prepare_extent_to_map+0x134/0x2b8
> [  254.040969] [<ffff00000833530c>] ext4_writepages+0x484/0xe30
> [  254.041175] [<ffff0000081f6ab4>] do_writepages+0x44/0xe8
> [  254.041372] [<ffff0000081e5bd4>] __filemap_fdatawrite_range+0xbc/0x110
> [  254.041568] [<ffff0000081e5e68>] file_write_and_wait_range+0x48/0xd8
> [  254.041739] [<ffff000008324310>] ext4_sync_file+0x80/0x4b8
> [  254.041907] [<ffff0000082bd434>] vfs_fsync_range+0x64/0xc0
> [  254.042106] [<ffff0000082332b4>] SyS_msync+0x194/0x1e8
> 
> After digging into the issue, I found that we appear to be racing with
> a concurrent pmd update in page_vma_mapped_walk, assumedly due a THP
> splitting operation. Looking at the code there:
> 
> 	pvmw->pmd = pmd_offset(pud, pvmw->address);
> 	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
> 		[...]
> 	} else {
> 		if (!check_pmd(pvmw))
> 			return false;
> 	}
> 	if (!map_pte(pvmw))
> 		goto next_pte;
> 
> what happens in the crashing scenario is that we see all zeroes for the
> PMD in pmd_trans_huge(*pvmw->pmd), and so go to the 'else' case (migration
> isn't enabled, so the test is removed at compile-time). check_pmd then does:
> 
> 	pmde = READ_ONCE(*pvmw->pmd);
> 	return pmd_present(pmde) && !pmd_trans_huge(pmde);
> 
> and reads a valid table entry for the PMD because the splitting has completed
> (i.e. the first dereference reads from the pmdp_invalidate in the splitting
> code, whereas the second dereferenced reads from the following pmd_populate).
> It returns true because we should descend to the PTE level in map_pte. map_pte
> does:
> 
> 	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
> 
> which on arm64 (and this appears to be the same on x86) ends up doing:
> 
> 	(pmd_page_paddr((*(pvmw->pmd))) + pte_index(pvmw->address) * sizeof(pte_t))
> 
> as part of its calculation. However, this is horribly broken because GCC
> inlines everything and reuses the register it loaded for the initial
> pmd_trans_huge check (when we loaded the value of zero) here, so we end up
> calculating a junk pointer and crashing when we dereference it. Disassembly
> at the end of the mail[2] for those who are curious.
> 
> The moral of the story is that read-after-read (same address) ordering *only*
> applies if READ_ONCE is used consistently. This means we need to fix page
> table dereferences in the core code as well as the arch code to avoid this
> problem. The two RFC patches in this series fix arm64 (which is a bigger fix
> that necessary since I clean things up too) and page_vma_mapped_walk.
> 
> Comments welcome.
> 
> Will
> 
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/532786.html
> [2]
 
Hi Will, 

The fix works for me. Thanks.
My cross-compiler is:
$ /home/yury/work/thunderx-tools-28/bin/aarch64-thunderx-linux-gnu-gcc --version
aarch64-thunderx-linux-gnu-gcc (Cavium Inc. build 28) 7.1.0
Copyright (C) 2017 Free Software Foundation, Inc.

Tested-by: Yury Norov <ynorov@caviumnetworks.com>

Yury

> // page_vma_mapped_walk
> // pvmw->pmd = pmd_offset(pud, pvmw->address);
> ldr     x0, [x19, #24]		// pvmw->pmd
> 
> // if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
> ldr     x1, [x0]		// *pvmw->pmd
> cbz     x1, ffff0000082336a0 <page_vma_mapped_walk+0x228>
> tbz     w1, #1, ffff000008233788 <page_vma_mapped_walk+0x310>	// pmd_trans_huge?
> 
> // else if (!check_pmd(pvmw))
> ldr     x0, [x0]		// READ_ONCE in check_pmd
> tst     x0, x24			// pmd_present?
> b.eq    ffff000008233538 <page_vma_mapped_walk+0xc0>  // b.none
> tbz     w0, #1, ffff000008233538 <page_vma_mapped_walk+0xc0>	// pmd_trans_huge?
> 
> // if (!map_pte(pvmw))
> ldr     x0, [x19, #16]		// pvmw->address
> 
> // pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
> and     x1, x1, #0xfffffffff000	// Reusing the old value of *pvmw->pmd!!!
> [...]
> 
> --->8
> 
> Will Deacon (2):
>   arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
>   mm: page_vma_mapped: Ensure pmd is loaded with READ_ONCE outside of
>     lock
> 
>  arch/arm64/include/asm/hugetlb.h     |   2 +-
>  arch/arm64/include/asm/kvm_mmu.h     |  18 +--
>  arch/arm64/include/asm/mmu_context.h |   4 +-
>  arch/arm64/include/asm/pgalloc.h     |  42 +++---
>  arch/arm64/include/asm/pgtable.h     |  29 ++--
>  arch/arm64/kernel/hibernate.c        | 148 +++++++++---------
>  arch/arm64/mm/dump.c                 |  54 ++++---
>  arch/arm64/mm/fault.c                |  44 +++---
>  arch/arm64/mm/hugetlbpage.c          |  94 ++++++------
>  arch/arm64/mm/kasan_init.c           |  62 ++++----
>  arch/arm64/mm/mmu.c                  | 281 ++++++++++++++++++-----------------
>  arch/arm64/mm/pageattr.c             |  30 ++--
>  mm/page_vma_mapped.c                 |  25 ++--
>  13 files changed, 427 insertions(+), 406 deletions(-)
> 
> -- 
> 2.1.4

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-27 15:49 ` [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables Will Deacon
@ 2017-09-28  8:38   ` Peter Zijlstra
  2017-09-28  8:45     ` Will Deacon
  2017-09-28 19:18   ` Timur Tabi
  1 sibling, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2017-09-28  8:38 UTC (permalink / raw)
  To: Will Deacon
  Cc: paulmck, kirill.shutemov, linux-kernel, ynorov, rruigrok,
	linux-arch, akpm, catalin.marinas

On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> In many cases, page tables can be accessed concurrently by either another
> CPU (due to things like fast gup) or by the hardware page table walker
> itself, which may set access/dirty bits. In such cases, it is important
> to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> entries cannot be torn, merged or subject to apparent loss of coherence.

In fact, we should use lockless_dereference() for many of them. Yes
Alpha is the only one that cares about the difference between that and
READ_ONCE() and they do have the extra barrier, but if we're going to do
this, we might as well do it 'right' :-)

Also, a very long standing item on my TODO list is to see how much of it
we can unify across the various architectures, because there's a giant
amount of boiler plate involved with all this.
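
To make the distinction concrete, here is a simplified sketch of the two
primitives as I understand them (not the exact definitions; those live in
include/linux/compiler.h, and the real lockless_dereference() body shows
up verbatim in a diff further down this thread):

/*
 * Sketch only: READ_ONCE() boils down to a volatile access, so the
 * compiler cannot tear, merge or silently re-load the value.
 */
#define READ_ONCE(x)	(*(const volatile typeof(x) *)&(x))

/*
 * Sketch only: lockless_dereference() is READ_ONCE() followed by
 * smp_read_barrier_depends(), which is a no-op everywhere except Alpha,
 * where it orders subsequent dependent accesses against the pointer load.
 */
#define lockless_dereference(p)						\
({									\
	typeof(p) __p = READ_ONCE(p);					\
	smp_read_barrier_depends();	/* Dependency order vs. p above. */ \
	(__p);								\
})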

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-28  8:38   ` Peter Zijlstra
@ 2017-09-28  8:45     ` Will Deacon
  2017-09-28 15:43       ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2017-09-28  8:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, kirill.shutemov, linux-kernel, ynorov, rruigrok,
	linux-arch, akpm, catalin.marinas

On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > In many cases, page tables can be accessed concurrently by either another
> > CPU (due to things like fast gup) or by the hardware page table walker
> > itself, which may set access/dirty bits. In such cases, it is important
> > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > entries cannot be torn, merged or subject to apparent loss of coherence.
> 
> In fact, we should use lockless_dereference() for many of them. Yes
> Alpha is the only one that cares about the difference between that and
> READ_ONCE() and they do have the extra barrier, but if we're going to do
> this, we might as well do it 'right' :-)

I know this sounds daft, but I think one of the big reasons why
lockless_dereference() doesn't get an awful lot of use is because it's
such a mouthful! Why don't we just move the smp_read_barrier_depends()
into READ_ONCE? Would anybody actually care about the potential impact on
Alpha (which, frankly, is treading on thin ice given the low adoption of
lockless_dereference())?

> Also, a very long standing item on my TODO list is to see how much of it
> we can unify across the various architectures, because there's a giant
> amount of boiler plate involved with all this.

Yeah, I'd be happy to help with that as a separate series. I already tripped
over 5 or 6 page table walkers in arch/arm64/ alone :(

Will

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-28  8:45     ` Will Deacon
@ 2017-09-28 15:43       ` Paul E. McKenney
  2017-09-28 15:49         ` Will Deacon
  2017-09-28 18:59         ` Michael Cree
  0 siblings, 2 replies; 27+ messages in thread
From: Paul E. McKenney @ 2017-09-28 15:43 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, kirill.shutemov, linux-kernel, ynorov, rruigrok,
	linux-arch, akpm, catalin.marinas, rth, ink, mattst88,
	linux-alpha

On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > In many cases, page tables can be accessed concurrently by either another
> > > CPU (due to things like fast gup) or by the hardware page table walker
> > > itself, which may set access/dirty bits. In such cases, it is important
> > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > 
> > In fact, we should use lockless_dereference() for many of them. Yes
> > Alpha is the only one that cares about the difference between that and
> > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > this, we might as well do it 'right' :-)
> 
> I know this sounds daft, but I think one of the big reasons why
> lockless_dereference() doesn't get an awful lot of use is because it's
> such a mouthful! Why don't we just move the smp_read_barrier_depends()
> into READ_ONCE? Would anybody actually care about the potential impact on
> Alpha (which, frankly, is treading on thin ice given the low adoption of
> lockless_dereference())?

This is my cue to ask my usual question...  ;-)

Are people still running mainline kernels on Alpha?  (Added Alpha folks.)

As always, if anyone is, we must continue to support Alpha, but sounds
like time to check again.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-28 15:43       ` Paul E. McKenney
@ 2017-09-28 15:49         ` Will Deacon
  2017-09-28 16:07           ` Paul E. McKenney
  2017-09-28 18:59         ` Michael Cree
  1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2017-09-28 15:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, kirill.shutemov, linux-kernel, ynorov, rruigrok,
	linux-arch, akpm, catalin.marinas, rth, ink, mattst88,
	linux-alpha

On Thu, Sep 28, 2017 at 08:43:54AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> > On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > > In many cases, page tables can be accessed concurrently by either another
> > > > CPU (due to things like fast gup) or by the hardware page table walker
> > > > itself, which may set access/dirty bits. In such cases, it is important
> > > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > > 
> > > In fact, we should use lockless_dereference() for many of them. Yes
> > > Alpha is the only one that cares about the difference between that and
> > > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > > this, we might as well do it 'right' :-)
> > 
> > I know this sounds daft, but I think one of the big reasons why
> > lockless_dereference() doesn't get an awful lot of use is because it's
> > such a mouthful! Why don't we just move the smp_read_barrier_depends()
> > into READ_ONCE? Would anybody actually care about the potential impact on
> > Alpha (which, frankly, is treading on thin ice given the low adoption of
> > lockless_dereference())?
> 
> This is my cue to ask my usual question...  ;-)
> 
> Are people still running mainline kernels on Alpha?  (Added Alpha folks.)
> 
> As always, if anyone is, we must continue to support Alpha, but sounds
> like time to check again.

I'll be honest and say that I haven't updated mine for a while, but I do
have a soft spot for those machines :(

Will

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-28 15:49         ` Will Deacon
@ 2017-09-28 16:07           ` Paul E. McKenney
  0 siblings, 0 replies; 27+ messages in thread
From: Paul E. McKenney @ 2017-09-28 16:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, kirill.shutemov, linux-kernel, ynorov, rruigrok,
	linux-arch, akpm, catalin.marinas, rth, ink, mattst88,
	linux-alpha

On Thu, Sep 28, 2017 at 04:49:54PM +0100, Will Deacon wrote:
> On Thu, Sep 28, 2017 at 08:43:54AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> > > On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > > > In many cases, page tables can be accessed concurrently by either another
> > > > > CPU (due to things like fast gup) or by the hardware page table walker
> > > > > itself, which may set access/dirty bits. In such cases, it is important
> > > > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > > > 
> > > > In fact, we should use lockless_dereference() for many of them. Yes
> > > > Alpha is the only one that cares about the difference between that and
> > > > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > > > this, we might as well do it 'right' :-)
> > > 
> > > I know this sounds daft, but I think one of the big reasons why
> > > lockless_dereference() doesn't get an awful lot of use is because it's
> > > such a mouthful! Why don't we just move the smp_read_barrier_depends()
> > > into READ_ONCE? Would anybody actually care about the potential impact on
> > > Alpha (which, frankly, is treading on thin ice given the low adoption of
> > > lockless_dereference())?
> > 
> > This is my cue to ask my usual question...  ;-)
> > 
> > Are people still running mainline kernels on Alpha?  (Added Alpha folks.)
> > 
> > As always, if anyone is, we must continue to support Alpha, but sounds
> > like time to check again.
> 
> I'll be honest and say that I haven't updated mine for a while, but I do
> have a soft spot for those machines :(

Let's see what the Alpha folks say.  I myself have had a close
relationship with Alpha for almost 20 years, but I suspect that in
my case it is more a hard spot on my head rather than a soft spot in
my heart.  ;-)

								Thanx,
								Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes
  2017-09-27 15:49 [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Will Deacon
                   ` (2 preceding siblings ...)
  2017-09-27 22:01 ` [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Yury Norov
@ 2017-09-28 17:30 ` Richard Ruigrok
  2017-09-28 19:38 ` Jon Masters
  4 siblings, 0 replies; 27+ messages in thread
From: Richard Ruigrok @ 2017-09-28 17:30 UTC (permalink / raw)
  To: Will Deacon, peterz, paulmck, kirill.shutemov
  Cc: linux-kernel, ynorov, linux-arch, akpm, catalin.marinas



On 9/27/2017 9:49 AM, Will Deacon wrote:
> Hi,
>
> We recently had a crash report[1] on arm64 that involved a bad dereference
> in the page_vma_mapped code during ext4 writeback with THP active. I can
> reproduce this on -rc2:
>
> [  254.032812] PC is at check_pte+0x20/0x170
> [  254.032948] LR is at page_vma_mapped_walk+0x2e0/0x540
> [...]
> [  254.036114] Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
> [  254.036361] Call trace:
> [  254.038977] [<ffff000008233328>] check_pte+0x20/0x170
> [  254.039137] [<ffff000008233758>] page_vma_mapped_walk+0x2e0/0x540
> [  254.039332] [<ffff000008234adc>] page_mkclean_one+0xac/0x278
> [  254.039489] [<ffff000008234d98>] rmap_walk_file+0xf0/0x238
> [  254.039642] [<ffff000008236e74>] rmap_walk+0x64/0xa0
> [  254.039784] [<ffff0000082370c8>] page_mkclean+0x90/0xa8
> [  254.040029] [<ffff0000081f3c64>] clear_page_dirty_for_io+0x84/0x2a8
> [  254.040311] [<ffff00000832f984>] mpage_submit_page+0x34/0x98
> [  254.040518] [<ffff00000832fb4c>] mpage_process_page_bufs+0x164/0x170
> [  254.040743] [<ffff00000832fc8c>] mpage_prepare_extent_to_map+0x134/0x2b8
> [  254.040969] [<ffff00000833530c>] ext4_writepages+0x484/0xe30
> [  254.041175] [<ffff0000081f6ab4>] do_writepages+0x44/0xe8
> [  254.041372] [<ffff0000081e5bd4>] __filemap_fdatawrite_range+0xbc/0x110
> [  254.041568] [<ffff0000081e5e68>] file_write_and_wait_range+0x48/0xd8
> [  254.041739] [<ffff000008324310>] ext4_sync_file+0x80/0x4b8
> [  254.041907] [<ffff0000082bd434>] vfs_fsync_range+0x64/0xc0
> [  254.042106] [<ffff0000082332b4>] SyS_msync+0x194/0x1e8
>
> After digging into the issue, I found that we appear to be racing with
> a concurrent pmd update in page_vma_mapped_walk, assumedly due a THP
> splitting operation. Looking at the code there:
>
> 	pvmw->pmd = pmd_offset(pud, pvmw->address);
> 	if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
> 		[...]
> 	} else {
> 		if (!check_pmd(pvmw))
> 			return false;
> 	}
> 	if (!map_pte(pvmw))
> 		goto next_pte;
>
> what happens in the crashing scenario is that we see all zeroes for the
> PMD in pmd_trans_huge(*pvmw->pmd), and so go to the 'else' case (migration
> isn't enabled, so the test is removed at compile-time). check_pmd then does:
>
> 	pmde = READ_ONCE(*pvmw->pmd);
> 	return pmd_present(pmde) && !pmd_trans_huge(pmde);
>
> and reads a valid table entry for the PMD because the splitting has completed
> (i.e. the first dereference reads from the pmdp_invalidate in the splitting
> code, whereas the second dereferenced reads from the following pmd_populate).
> It returns true because we should descend to the PTE level in map_pte. map_pte
> does:
>
> 	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
>
> which on arm64 (and this appears to be the same on x86) ends up doing:
>
> 	(pmd_page_paddr((*(pvmw->pmd))) + pte_index(pvmw->address) * sizeof(pte_t))
>
> as part of its calculation. However, this is horribly broken because GCC
> inlines everything and reuses the register it loaded for the initial
> pmd_trans_huge check (when we loaded the value of zero) here, so we end up
> calculating a junk pointer and crashing when we dereference it. Disassembly
> at the end of the mail[2] for those who are curious.
>
> The moral of the story is that read-after-read (same address) ordering *only*
> applies if READ_ONCE is used consistently. This means we need to fix page
> table dereferences in the core code as well as the arch code to avoid this
> problem. The two RFC patches in this series fix arm64 (which is a bigger fix
> that necessary since I clean things up too) and page_vma_mapped_walk.
>
> Comments welcome.
Hi Will,
This fix works for me, tested with 15 iterations of LTP rwtest on Qualcomm Centriq 2400.
Compiler: gcc version 5.2.1 20151005 (Linaro GCC 5.2-2015.11-1)

Tested-by: Richard Ruigrok <rruigrok@codeaurora.org>

Thanks,
Richard
> Will
>
> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/532786.html
> [2]
>
> // page_vma_mapped_walk
> // pvmw->pmd = pmd_offset(pud, pvmw->address);
> ldr     x0, [x19, #24]		// pvmw->pmd
>
> // if (pmd_trans_huge(*pvmw->pmd) || is_pmd_migration_entry(*pvmw->pmd)) {
> ldr     x1, [x0]		// *pvmw->pmd
> cbz     x1, ffff0000082336a0 <page_vma_mapped_walk+0x228>
> tbz     w1, #1, ffff000008233788 <page_vma_mapped_walk+0x310>	// pmd_trans_huge?
>
> // else if (!check_pmd(pvmw))
> ldr     x0, [x0]		// READ_ONCE in check_pmd
> tst     x0, x24			// pmd_present?
> b.eq    ffff000008233538 <page_vma_mapped_walk+0xc0>  // b.none
> tbz     w0, #1, ffff000008233538 <page_vma_mapped_walk+0xc0>	// pmd_trans_huge?
>
> // if (!map_pte(pvmw))
> ldr     x0, [x19, #16]		// pvmw->address
>
> // pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
> and     x1, x1, #0xfffffffff000	// Reusing the old value of *pvmw->pmd!!!
> [...]
>
> --->8
>
> Will Deacon (2):
>   arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
>   mm: page_vma_mapped: Ensure pmd is loaded with READ_ONCE outside of
>     lock
>
>  arch/arm64/include/asm/hugetlb.h     |   2 +-
>  arch/arm64/include/asm/kvm_mmu.h     |  18 +--
>  arch/arm64/include/asm/mmu_context.h |   4 +-
>  arch/arm64/include/asm/pgalloc.h     |  42 +++---
>  arch/arm64/include/asm/pgtable.h     |  29 ++--
>  arch/arm64/kernel/hibernate.c        | 148 +++++++++---------
>  arch/arm64/mm/dump.c                 |  54 ++++---
>  arch/arm64/mm/fault.c                |  44 +++---
>  arch/arm64/mm/hugetlbpage.c          |  94 ++++++------
>  arch/arm64/mm/kasan_init.c           |  62 ++++----
>  arch/arm64/mm/mmu.c                  | 281 ++++++++++++++++++-----------------
>  arch/arm64/mm/pageattr.c             |  30 ++--
>  mm/page_vma_mapped.c                 |  25 ++--
>  13 files changed, 427 insertions(+), 406 deletions(-)
>

-- 
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-28 15:43       ` Paul E. McKenney
  2017-09-28 15:49         ` Will Deacon
@ 2017-09-28 18:59         ` Michael Cree
  2017-09-29  0:58           ` Paul E. McKenney
  1 sibling, 1 reply; 27+ messages in thread
From: Michael Cree @ 2017-09-28 18:59 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, Peter Zijlstra, kirill.shutemov, linux-kernel,
	ynorov, rruigrok, linux-arch, akpm, catalin.marinas, rth, ink,
	mattst88, linux-alpha

On Thu, Sep 28, 2017 at 08:43:54AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> > On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > > In many cases, page tables can be accessed concurrently by either another
> > > > CPU (due to things like fast gup) or by the hardware page table walker
> > > > itself, which may set access/dirty bits. In such cases, it is important
> > > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > > 
> > > In fact, we should use lockless_dereference() for many of them. Yes
> > > Alpha is the only one that cares about the difference between that and
> > > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > > this, we might as well do it 'right' :-)
> > 
> > I know this sounds daft, but I think one of the big reasons why
> > lockless_dereference() doesn't get an awful lot of use is because it's
> > such a mouthful! Why don't we just move the smp_read_barrier_depends()
> > into READ_ONCE? Would anybody actually care about the potential impact on
> > Alpha (which, frankly, is treading on thin ice given the low adoption of
> > lockless_dereference())?
> 
> This is my cue to ask my usual question...  ;-)
> 
> Are people still running mainline kernels on Alpha?  (Added Alpha folks.)

Yes.  I run two Alpha build daemons that build the unofficial
debian-alpha port.  Debian popcon reports nine machines running
Alpha, which are likely to be running the 4.12.y kernel which
is currently in debian-alpha, (and presumably soon to be 4.13.y
which is now built on Alpha in experimental).

Cheers
Michael

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-27 15:49 ` [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables Will Deacon
  2017-09-28  8:38   ` Peter Zijlstra
@ 2017-09-28 19:18   ` Timur Tabi
  1 sibling, 0 replies; 27+ messages in thread
From: Timur Tabi @ 2017-09-28 19:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, paulmck, kirill.shutemov, lkml, Yury Norov,
	rruigrok, linux-arch, akpm, Catalin Marinas

On Wed, Sep 27, 2017 at 10:49 AM, Will Deacon <will.deacon@arm.com> wrote:
> This patch consistently uses these macros in the arch
> code, as well as explicitly namespacing pointers to page table entries
> from the entries themselves by using adopting a 'p' suffix for the former
> (as is sometimes used elsewhere in the kernel source).

Would you consider splitting up this patch into two, where the second
patch makes all the cosmetic changes?  That would make the "meatier"
patch easier to back-port and review.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes
  2017-09-27 15:49 [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Will Deacon
                   ` (3 preceding siblings ...)
  2017-09-28 17:30 ` Richard Ruigrok
@ 2017-09-28 19:38 ` Jon Masters
  2017-09-29  8:56   ` Will Deacon
  4 siblings, 1 reply; 27+ messages in thread
From: Jon Masters @ 2017-09-28 19:38 UTC (permalink / raw)
  To: Will Deacon, peterz, paulmck, kirill.shutemov
  Cc: linux-kernel, ynorov, rruigrok, linux-arch, akpm, catalin.marinas

On 09/27/2017 11:49 AM, Will Deacon wrote:

> The moral of the story is that read-after-read (same address) ordering *only*
> applies if READ_ONCE is used consistently. This means we need to fix page
> table dereferences in the core code as well as the arch code to avoid this
> problem. The two RFC patches in this series fix arm64 (which is a bigger fix
> that necessary since I clean things up too) and page_vma_mapped_walk.
> 
> Comments welcome.

Thanks for this Will. I'll echo Timur's comment that it would be ideal
to split this up into the critical piece needed for ordering
access/update to the PMD in the face of a THP split and separately have
the cosmetic cleanups. Needless to say, we've got a bunch of people who
are tracking this one and tracking it ready for backport. We just got
THP re-enabled so I'm pretty keen that we not have to disable again.

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-28 18:59         ` Michael Cree
@ 2017-09-29  0:58           ` Paul E. McKenney
  2017-09-29  9:08             ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2017-09-29  0:58 UTC (permalink / raw)
  To: Michael Cree, Will Deacon, Peter Zijlstra, kirill.shutemov,
	linux-kernel, ynorov, rruigrok, linux-arch, akpm,
	catalin.marinas, rth, ink, mattst88, linux-alpha

On Fri, Sep 29, 2017 at 07:59:09AM +1300, Michael Cree wrote:
> On Thu, Sep 28, 2017 at 08:43:54AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> > > On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > > > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > > > In many cases, page tables can be accessed concurrently by either another
> > > > > CPU (due to things like fast gup) or by the hardware page table walker
> > > > > itself, which may set access/dirty bits. In such cases, it is important
> > > > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > > > 
> > > > In fact, we should use lockless_dereference() for many of them. Yes
> > > > Alpha is the only one that cares about the difference between that and
> > > > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > > > this, we might as well do it 'right' :-)
> > > 
> > > I know this sounds daft, but I think one of the big reasons why
> > > lockless_dereference() doesn't get an awful lot of use is because it's
> > > such a mouthful! Why don't we just move the smp_read_barrier_depends()
> > > into READ_ONCE? Would anybody actually care about the potential impact on
> > > Alpha (which, frankly, is treading on thin ice given the low adoption of
> > > lockless_dereference())?
> > 
> > This is my cue to ask my usual question...  ;-)
> > 
> > Are people still running mainline kernels on Alpha?  (Added Alpha folks.)
> 
> Yes.  I run two Alpha build daemons that build the unofficial
> debian-alpha port.  Debian popcon reports nine machines running
> Alpha, which are likely to be running the 4.12.y kernel which
> is currently in debian-alpha, (and presumably soon to be 4.13.y
> which is now built on Alpha in experimental).

I salute your dedication to Alpha!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes
  2017-09-28 19:38 ` Jon Masters
@ 2017-09-29  8:56   ` Will Deacon
  2017-10-03  6:36     ` Jon Masters
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2017-09-29  8:56 UTC (permalink / raw)
  To: Jon Masters
  Cc: peterz, paulmck, kirill.shutemov, linux-kernel, ynorov, rruigrok,
	linux-arch, akpm, catalin.marinas, timur

[+ Timur]

On Thu, Sep 28, 2017 at 03:38:00PM -0400, Jon Masters wrote:
> On 09/27/2017 11:49 AM, Will Deacon wrote:
> 
> > The moral of the story is that read-after-read (same address) ordering *only*
> > applies if READ_ONCE is used consistently. This means we need to fix page
> > table dereferences in the core code as well as the arch code to avoid this
> > problem. The two RFC patches in this series fix arm64 (which is a bigger fix
> > that necessary since I clean things up too) and page_vma_mapped_walk.
> > 
> > Comments welcome.
> 
> Thanks for this Will. I'll echo Timur's comment that it would be ideal
> to split this up into the critical piece needed for ordering
> access/update to the PMD in the face of a THP split and separately have
> the cosmetic cleanups. Needless to say, we've got a bunch of people who
> are tracking this one and tracking it ready for backport. We just got
> THP re-enabled so I'm pretty keen that we not have to disable again.

Yeah, of course. I already posted a point diff to Yury in the original
thread:

http://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/533299.html

so I'd like to queue that as an arm64 fix after we've worked out the general
direction of the full fix. I also don't see why other architectures
(including x86) can't be hit by this, so an alternative (completely
untested) approach would just be to take patch 2 of this series.

The full fix isn't just cosmetic; it's also addressing the wider problem
of unannotated racing page table accesses outside of the specific failure
case we've run into.
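
To make that concrete, a sketch of the kind of conversion patch 1 does
all over arch/arm64 (illustrative helpers, not code taken from the patch):

/* Before: two independent, unannotated loads of *pmdp. The compiler may
 * also reuse an earlier cached value, so the checks need not agree with
 * each other if a concurrent THP split rewrites the entry. */
static bool pmd_ok_unannotated(pmd_t *pmdp)
{
	return pmd_present(*pmdp) && !pmd_trans_huge(*pmdp);
}

/* After: one READ_ONCE() snapshot, with every check made against that
 * same value; read-after-read ordering is also preserved relative to
 * other READ_ONCE() loads of the same entry. */
static bool pmd_ok_annotated(pmd_t *pmdp)
{
	pmd_t pmd = READ_ONCE(*pmdp);

	return pmd_present(pmd) && !pmd_trans_huge(pmd);
}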

Will

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-29  0:58           ` Paul E. McKenney
@ 2017-09-29  9:08             ` Will Deacon
  2017-09-29 16:29               ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2017-09-29  9:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael Cree, Peter Zijlstra, kirill.shutemov, linux-kernel,
	ynorov, rruigrok, linux-arch, akpm, catalin.marinas, rth, ink,
	mattst88, linux-alpha

On Thu, Sep 28, 2017 at 05:58:30PM -0700, Paul E. McKenney wrote:
> On Fri, Sep 29, 2017 at 07:59:09AM +1300, Michael Cree wrote:
> > On Thu, Sep 28, 2017 at 08:43:54AM -0700, Paul E. McKenney wrote:
> > > On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> > > > On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > > > > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > > > > In many cases, page tables can be accessed concurrently by either another
> > > > > > CPU (due to things like fast gup) or by the hardware page table walker
> > > > > > itself, which may set access/dirty bits. In such cases, it is important
> > > > > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > > > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > > > > 
> > > > > In fact, we should use lockless_dereference() for many of them. Yes
> > > > > Alpha is the only one that cares about the difference between that and
> > > > > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > > > > this, we might as well do it 'right' :-)
> > > > 
> > > > I know this sounds daft, but I think one of the big reasons why
> > > > lockless_dereference() doesn't get an awful lot of use is because it's
> > > > such a mouthful! Why don't we just move the smp_read_barrier_depends()
> > > > into READ_ONCE? Would anybody actually care about the potential impact on
> > > > Alpha (which, frankly, is treading on thin ice given the low adoption of
> > > > lockless_dereference())?
> > > 
> > > This is my cue to ask my usual question...  ;-)
> > > 
> > > Are people still running mainline kernels on Alpha?  (Added Alpha folks.)
> > 
> > Yes.  I run two Alpha build daemons that build the unofficial
> > debian-alpha port.  Debian popcon reports nine machines running
> > Alpha, which are likely to be running the 4.12.y kernel which
> > is currently in debian-alpha, (and presumably soon to be 4.13.y
> > which is now built on Alpha in experimental).
> 
> I salute your dedication to Alpha!  ;-)

Ok, but where does that leave us wrt my initial proposal of moving
smp_read_barrier_depends() into READ_ONCE and getting rid of
lockless_dereference?

Michael (or anybody else running mainline on SMP Alpha) -- would you be
able to give the diff below a spin and see whether there's a measurable
performance impact?

Cheers,

Will

--->8

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e95a2631e545..0ce21e25492a 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -340,6 +340,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 		__read_once_size(&(x), __u.__c, sizeof(x));		\
 	else								\
 		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
+	smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
 	__u.__val;							\
 })
 #define READ_ONCE(x) __READ_ONCE(x, 1)

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-29  9:08             ` Will Deacon
@ 2017-09-29 16:29               ` Paul E. McKenney
  2017-09-29 16:33                 ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2017-09-29 16:29 UTC (permalink / raw)
  To: Will Deacon
  Cc: Michael Cree, Peter Zijlstra, kirill.shutemov, linux-kernel,
	ynorov, rruigrok, linux-arch, akpm, catalin.marinas, rth, ink,
	mattst88, linux-alpha

On Fri, Sep 29, 2017 at 10:08:43AM +0100, Will Deacon wrote:
> On Thu, Sep 28, 2017 at 05:58:30PM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 29, 2017 at 07:59:09AM +1300, Michael Cree wrote:
> > > On Thu, Sep 28, 2017 at 08:43:54AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> > > > > On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > > > > > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > > > > > In many cases, page tables can be accessed concurrently by either another
> > > > > > > CPU (due to things like fast gup) or by the hardware page table walker
> > > > > > > itself, which may set access/dirty bits. In such cases, it is important
> > > > > > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > > > > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > > > > > 
> > > > > > In fact, we should use lockless_dereference() for many of them. Yes
> > > > > > Alpha is the only one that cares about the difference between that and
> > > > > > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > > > > > this, we might as well do it 'right' :-)
> > > > > 
> > > > > I know this sounds daft, but I think one of the big reasons why
> > > > > lockless_dereference() doesn't get an awful lot of use is because it's
> > > > > such a mouthful! Why don't we just move the smp_read_barrier_depends()
> > > > > into READ_ONCE? Would anybody actually care about the potential impact on
> > > > > Alpha (which, frankly, is treading on thin ice given the low adoption of
> > > > > lockless_dereference())?
> > > > 
> > > > This is my cue to ask my usual question...  ;-)
> > > > 
> > > > Are people still running mainline kernels on Alpha?  (Added Alpha folks.)
> > > 
> > > Yes.  I run two Alpha build daemons that build the unofficial
> > > debian-alpha port.  Debian popcon reports nine machines running
> > > Alpha, which are likely to be running the 4.12.y kernel which
> > > is currently in debian-alpha, (and presumably soon to be 4.13.y
> > > which is now built on Alpha in experimental).
> > 
> > I salute your dedication to Alpha!  ;-)
> 
> Ok, but where does that leave us wrt my initial proposal of moving
> smp_read_barrier_depends() into READ_ONCE and getting rid of
> lockless_dereference?
> 
> Michael (or anybody else running mainline on SMP Alpha) -- would you be
> able to give the diff below a spin and see whether there's a measurable
> performance impact?

This will be a sensitive test.  The smp_read_barrier_depends() can be
removed from lockless_dereference().  Without this removal Alpha will
get two memory barriers from rcu_dereference() and friends.

							Thanx, Paul

> Cheers,
> 
> Will
> 
> --->8
> 
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index e95a2631e545..0ce21e25492a 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -340,6 +340,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>  		__read_once_size(&(x), __u.__c, sizeof(x));		\
>  	else								\
>  		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
> +	smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
>  	__u.__val;							\
>  })
>  #define READ_ONCE(x) __READ_ONCE(x, 1)
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-29 16:29               ` Paul E. McKenney
@ 2017-09-29 16:33                 ` Will Deacon
  2017-10-03 19:11                   ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2017-09-29 16:33 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael Cree, Peter Zijlstra, kirill.shutemov, linux-kernel,
	ynorov, rruigrok, linux-arch, akpm, catalin.marinas, rth, ink,
	mattst88, linux-alpha

On Fri, Sep 29, 2017 at 09:29:39AM -0700, Paul E. McKenney wrote:
> On Fri, Sep 29, 2017 at 10:08:43AM +0100, Will Deacon wrote:
> > On Thu, Sep 28, 2017 at 05:58:30PM -0700, Paul E. McKenney wrote:
> > > On Fri, Sep 29, 2017 at 07:59:09AM +1300, Michael Cree wrote:
> > > > On Thu, Sep 28, 2017 at 08:43:54AM -0700, Paul E. McKenney wrote:
> > > > > On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> > > > > > On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > > > > > > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > > > > > > In many cases, page tables can be accessed concurrently by either another
> > > > > > > > CPU (due to things like fast gup) or by the hardware page table walker
> > > > > > > > itself, which may set access/dirty bits. In such cases, it is important
> > > > > > > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > > > > > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > > > > > > 
> > > > > > > In fact, we should use lockless_dereference() for many of them. Yes
> > > > > > > Alpha is the only one that cares about the difference between that and
> > > > > > > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > > > > > > this, we might as well do it 'right' :-)
> > > > > > 
> > > > > > I know this sounds daft, but I think one of the big reasons why
> > > > > > lockless_dereference() doesn't get an awful lot of use is because it's
> > > > > > such a mouthful! Why don't we just move the smp_read_barrier_depends()
> > > > > > into READ_ONCE? Would anybody actually care about the potential impact on
> > > > > > Alpha (which, frankly, is treading on thin ice given the low adoption of
> > > > > > lockless_dereference())?
> > > > > 
> > > > > This is my cue to ask my usual question...  ;-)
> > > > > 
> > > > > Are people still running mainline kernels on Alpha?  (Added Alpha folks.)
> > > > 
> > > > Yes.  I run two Alpha build daemons that build the unofficial
> > > > debian-alpha port.  Debian popcon reports nine machines running
> > > > Alpha, which are likely to be running the 4.12.y kernel which
> > > > is currently in debian-alpha, (and presumably soon to be 4.13.y
> > > > which is now built on Alpha in experimental).
> > > 
> > > I salute your dedication to Alpha!  ;-)
> > 
> > Ok, but where does that leave us wrt my initial proposal of moving
> > smp_read_barrier_depends() into READ_ONCE and getting rid of
> > lockless_dereference?
> > 
> > Michael (or anybody else running mainline on SMP Alpha) -- would you be
> > able to give the diff below a spin and see whether there's a measurable
> > performance impact?
> 
> This will be a sensitive test.  The smp_read_barrier_depends() can be
> removed from lockless_dereference().  Without this removal Alpha will
> get two memory barriers from rcu_dereference() and friends.

Oh yes, good point. I was trying to keep the diff simple, but you're
right that this is packing too many barriers. Fixed diff below.

Thanks,

Will

--->8

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e95a2631e545..c4ee9d6d8f2d 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -340,6 +340,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 		__read_once_size(&(x), __u.__c, sizeof(x));		\
 	else								\
 		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
+	smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
 	__u.__val;							\
 })
 #define READ_ONCE(x) __READ_ONCE(x, 1)
@@ -620,7 +621,6 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 ({ \
 	typeof(p) _________p1 = READ_ONCE(p); \
 	typeof(*(p)) *___typecheck_p __maybe_unused; \
-	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
 	(_________p1); \
 })
 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes
  2017-09-29  8:56   ` Will Deacon
@ 2017-10-03  6:36     ` Jon Masters
  2017-10-05 16:54       ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Jon Masters @ 2017-10-03  6:36 UTC (permalink / raw)
  To: Will Deacon, Jon Masters
  Cc: peterz, paulmck, kirill.shutemov, linux-kernel, ynorov, rruigrok,
	linux-arch, akpm, catalin.marinas, timur

On 09/29/2017 04:56 AM, Will Deacon wrote:

> The full fix isn't just cosmetic; it's also addressing the wider problem
> of unannotated racing page table accesses outside of the specific failure
> case we've run into.

Let us know if there are additional tests we should be running on the
Red Hat end. We've got high hundreds of ARM server systems at this
point, including pretty much everything out there.

Jon.

-- 
Computer Architect | Sent from my Fedora powered laptop

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-09-29 16:33                 ` Will Deacon
@ 2017-10-03 19:11                   ` Paul E. McKenney
  2017-10-05 16:31                       ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2017-10-03 19:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: Michael Cree, Peter Zijlstra, kirill.shutemov, linux-kernel,
	ynorov, rruigrok, linux-arch, akpm, catalin.marinas, rth, ink,
	mattst88, linux-alpha

On Fri, Sep 29, 2017 at 05:33:49PM +0100, Will Deacon wrote:
> On Fri, Sep 29, 2017 at 09:29:39AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 29, 2017 at 10:08:43AM +0100, Will Deacon wrote:
> > > On Thu, Sep 28, 2017 at 05:58:30PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Sep 29, 2017 at 07:59:09AM +1300, Michael Cree wrote:
> > > > > On Thu, Sep 28, 2017 at 08:43:54AM -0700, Paul E. McKenney wrote:
> > > > > > On Thu, Sep 28, 2017 at 09:45:35AM +0100, Will Deacon wrote:
> > > > > > > On Thu, Sep 28, 2017 at 10:38:01AM +0200, Peter Zijlstra wrote:
> > > > > > > > On Wed, Sep 27, 2017 at 04:49:28PM +0100, Will Deacon wrote:
> > > > > > > > > In many cases, page tables can be accessed concurrently by either another
> > > > > > > > > CPU (due to things like fast gup) or by the hardware page table walker
> > > > > > > > > itself, which may set access/dirty bits. In such cases, it is important
> > > > > > > > > to use READ_ONCE/WRITE_ONCE when accessing page table entries so that
> > > > > > > > > entries cannot be torn, merged or subject to apparent loss of coherence.
> > > > > > > > 
> > > > > > > > In fact, we should use lockless_dereference() for many of them. Yes
> > > > > > > > Alpha is the only one that cares about the difference between that and
> > > > > > > > READ_ONCE() and they do have the extra barrier, but if we're going to do
> > > > > > > > this, we might as well do it 'right' :-)
> > > > > > > 
> > > > > > > I know this sounds daft, but I think one of the big reasons why
> > > > > > > lockless_dereference() doesn't get an awful lot of use is because it's
> > > > > > > such a mouthful! Why don't we just move the smp_read_barrier_depends()
> > > > > > > into READ_ONCE? Would anybody actually care about the potential impact on
> > > > > > > Alpha (which, frankly, is treading on thin ice given the low adoption of
> > > > > > > lockless_dereference())?
> > > > > > 
> > > > > > This is my cue to ask my usual question...  ;-)
> > > > > > 
> > > > > > Are people still running mainline kernels on Alpha?  (Added Alpha folks.)
> > > > > 
> > > > > Yes.  I run two Alpha build daemons that build the unofficial
> > > > > debian-alpha port.  Debian popcon reports nine machines running
> > > > > Alpha, which are likely to be running the 4.12.y kernel which
> > > > > is currently in debian-alpha (and presumably soon to be 4.13.y
> > > > > which is now built on Alpha in experimental).
> > > > 
> > > > I salute your dedication to Alpha!  ;-)
> > > 
> > > Ok, but where does that leave us wrt my initial proposal of moving
> > > smp_read_barrier_depends() into READ_ONCE and getting rid of
> > > lockless_dereference?
> > > 
> > > Michael (or anybody else running mainline on SMP Alpha) -- would you be
> > > able to give the diff below a spin and see whether there's a measurable
> > > performance impact?
> > 
> > This will be a sensitive test.  The smp_read_barrier_depends() can be
> > removed from lockless_dereference().  Without this removal Alpha will
> > get two memory barriers from rcu_dereference() and friends.
> 
> Oh yes, good point. I was trying to keep the diff simple, but you're
> right that this is packing too many barriers. Fixed diff below.

Not seeing any objections thus far.  If there are none by (say) the
end of this week, I would be happy to queue a patch for the 4.16
merge window.  That should give ample opportunity for further review
and testing.

							Thanx, Paul

> Thanks,
> 
> Will
> 
> --->8
> 
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index e95a2631e545..c4ee9d6d8f2d 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -340,6 +340,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>  		__read_once_size(&(x), __u.__c, sizeof(x));		\
>  	else								\
>  		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
> +	smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
>  	__u.__val;							\
>  })
>  #define READ_ONCE(x) __READ_ONCE(x, 1)
> @@ -620,7 +621,6 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>  ({ \
>  	typeof(p) _________p1 = READ_ONCE(p); \
>  	typeof(*(p)) *___typecheck_p __maybe_unused; \
> -	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
>  	(_________p1); \
>  })
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-10-03 19:11                   ` Paul E. McKenney
  2017-10-05 16:31                       ` Will Deacon
@ 2017-10-05 16:31                       ` Will Deacon
  0 siblings, 0 replies; 27+ messages in thread
From: Will Deacon @ 2017-10-05 16:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Michael Cree, Peter Zijlstra, kirill.shutemov, linux-kernel,
	ynorov, rruigrok, linux-arch, akpm, catalin.marinas, rth, ink,
	mattst88, linux-alpha

Hi Paul,

On Tue, Oct 03, 2017 at 12:11:10PM -0700, Paul E. McKenney wrote:
> On Fri, Sep 29, 2017 at 05:33:49PM +0100, Will Deacon wrote:
> > On Fri, Sep 29, 2017 at 09:29:39AM -0700, Paul E. McKenney wrote:
> > > On Fri, Sep 29, 2017 at 10:08:43AM +0100, Will Deacon wrote:
> > > > Ok, but where does that leave us wrt my initial proposal of moving
> > > > smp_read_barrier_depends() into READ_ONCE and getting rid of
> > > > lockless_dereference?
> > > > 
> > > > Michael (or anybody else running mainline on SMP Alpha) -- would you be
> > > > able to give the diff below a spin and see whether there's a measurable
> > > > performance impact?
> > > 
> > > This will be a sensitive test.  The smp_read_barrier_depends() can be
> > > removed from lockless_dereference().  Without this removal Alpha will
> > > get two memory barriers from rcu_dereference() and friends.
> > 
> > Oh yes, good point. I was trying to keep the diff simple, but you're
> > right that this is packing too many barriers. Fixed diff below.
> 
> Not seeing any objections thus far.  If there are none by (say) the
> end of this week, I would be happy to queue a patch for the 4.16
> merge window.  That should give ample opportunity for further review
> and testing.

Ok, full patch below.

Will

--->8

From 15956d0cc6b37208d8542b1858a8d8b64227acf4 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Thu, 5 Oct 2017 16:57:36 +0100
Subject: [PATCH] locking/barriers: Kill lockless_dereference

lockless_dereference is a nice idea, but it's gained little traction in
kernel code since its introduction three years ago. This is partly
because it's a pain to type, but also because using READ_ONCE instead
will work correctly on all architectures apart from Alpha, which is a
fully supported but somewhat niche architecture these days.

This patch moves smp_read_barrier_depends() (a NOP on all architectures
other than Alpha) from lockless_dereference into READ_ONCE, converts
the few actual users over to READ_ONCE and then finally removes
lockless_dereference altogether.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 Documentation/memory-barriers.txt                   | 12 ------------
 .../translations/ko_KR/memory-barriers.txt          | 12 ------------
 arch/x86/events/core.c                              |  2 +-
 arch/x86/include/asm/mmu_context.h                  |  4 ++--
 arch/x86/kernel/ldt.c                               |  2 +-
 drivers/md/dm-mpath.c                               | 20 ++++++++++----------
 fs/dcache.c                                         |  4 ++--
 fs/overlayfs/ovl_entry.h                            |  2 +-
 fs/overlayfs/readdir.c                              |  2 +-
 include/linux/compiler.h                            | 21 +--------------------
 include/linux/rculist.h                             |  4 ++--
 include/linux/rcupdate.h                            |  4 ++--
 kernel/events/core.c                                |  4 ++--
 kernel/seccomp.c                                    |  2 +-
 kernel/task_work.c                                  |  2 +-
 mm/slab.h                                           |  2 +-
 16 files changed, 28 insertions(+), 71 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index b759a60624fd..470a682f3fa4 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1886,18 +1886,6 @@ There are some more advanced barrier functions:
      See Documentation/atomic_{t,bitops}.txt for more information.
 
 
- (*) lockless_dereference();
-
-     This can be thought of as a pointer-fetch wrapper around the
-     smp_read_barrier_depends() data-dependency barrier.
-
-     This is also similar to rcu_dereference(), but in cases where
-     object lifetime is handled by some mechanism other than RCU, for
-     example, when the objects removed only when the system goes down.
-     In addition, lockless_dereference() is used in some data structures
-     that can be used both with and without RCU.
-
-
  (*) dma_wmb();
  (*) dma_rmb();
 
diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt
index a7a813258013..ec3b46e27b7a 100644
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -1858,18 +1858,6 @@ Mandatory 배리어들은 SMP 시스템에서도 UP 시스템에서도 SMP 효
      참고하세요.
 
 
- (*) lockless_dereference();
-
-     이 함수는 smp_read_barrier_depends() 데이터 의존성 배리어를 사용하는
-     포인터 읽어오기 래퍼(wrapper) 함수로 생각될 수 있습니다.
-
-     객체의 라이프타임이 RCU 외의 메커니즘으로 관리된다는 점을 제외하면
-     rcu_dereference() 와도 유사한데, 예를 들면 객체가 시스템이 꺼질 때에만
-     제거되는 경우 등입니다.  또한, lockless_dereference() 은 RCU 와 함께
-     사용될수도, RCU 없이 사용될 수도 있는 일부 데이터 구조에 사용되고
-     있습니다.
-
-
  (*) dma_wmb();
  (*) dma_rmb();
 
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 80534d3c2480..589af1eec7c1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2371,7 +2371,7 @@ static unsigned long get_segment_base(unsigned int segment)
 		struct ldt_struct *ldt;
 
 		/* IRQs are off, so this synchronizes with smp_store_release */
-		ldt = lockless_dereference(current->active_mm->context.ldt);
+		ldt = READ_ONCE(current->active_mm->context.ldt);
 		if (!ldt || idx >= ldt->nr_entries)
 			return 0;
 
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..9037a4e546e8 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -72,8 +72,8 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
 
-	/* lockless_dereference synchronizes with smp_store_release */
-	ldt = lockless_dereference(mm->context.ldt);
+	/* READ_ONCE synchronizes with smp_store_release */
+	ldt = READ_ONCE(mm->context.ldt);
 
 	/*
 	 * Any change to mm->context.ldt is followed by an IPI to all
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index f0e64db18ac8..0a21390642c4 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -101,7 +101,7 @@ static void finalize_ldt_struct(struct ldt_struct *ldt)
 static void install_ldt(struct mm_struct *current_mm,
 			struct ldt_struct *ldt)
 {
-	/* Synchronizes with lockless_dereference in load_mm_ldt. */
+	/* Synchronizes with READ_ONCE in load_mm_ldt. */
 	smp_store_release(&current_mm->context.ldt, ldt);
 
 	/* Activate the LDT for all CPUs using current_mm. */
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 11f273d2f018..3f88c9d32f7e 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -366,7 +366,7 @@ static struct pgpath *choose_path_in_pg(struct multipath *m,
 
 	pgpath = path_to_pgpath(path);
 
-	if (unlikely(lockless_dereference(m->current_pg) != pg)) {
+	if (unlikely(READ_ONCE(m->current_pg) != pg)) {
 		/* Only update current_pgpath if pg changed */
 		spin_lock_irqsave(&m->lock, flags);
 		m->current_pgpath = pgpath;
@@ -390,7 +390,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
 	}
 
 	/* Were we instructed to switch PG? */
-	if (lockless_dereference(m->next_pg)) {
+	if (READ_ONCE(m->next_pg)) {
 		spin_lock_irqsave(&m->lock, flags);
 		pg = m->next_pg;
 		if (!pg) {
@@ -406,7 +406,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
 
 	/* Don't change PG until it has no remaining paths */
 check_current_pg:
-	pg = lockless_dereference(m->current_pg);
+	pg = READ_ONCE(m->current_pg);
 	if (pg) {
 		pgpath = choose_path_in_pg(m, pg, nr_bytes);
 		if (!IS_ERR_OR_NULL(pgpath))
@@ -473,7 +473,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 	struct request *clone;
 
 	/* Do we need to select a new pgpath? */
-	pgpath = lockless_dereference(m->current_pgpath);
+	pgpath = READ_ONCE(m->current_pgpath);
 	if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
 		pgpath = choose_pgpath(m, nr_bytes);
 
@@ -535,7 +535,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
 	bool queue_io;
 
 	/* Do we need to select a new pgpath? */
-	pgpath = lockless_dereference(m->current_pgpath);
+	pgpath = READ_ONCE(m->current_pgpath);
 	queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
 	if (!pgpath || !queue_io)
 		pgpath = choose_pgpath(m, nr_bytes);
@@ -1804,7 +1804,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 	struct pgpath *current_pgpath;
 	int r;
 
-	current_pgpath = lockless_dereference(m->current_pgpath);
+	current_pgpath = READ_ONCE(m->current_pgpath);
 	if (!current_pgpath)
 		current_pgpath = choose_pgpath(m, 0);
 
@@ -1826,7 +1826,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 	}
 
 	if (r == -ENOTCONN) {
-		if (!lockless_dereference(m->current_pg)) {
+		if (!READ_ONCE(m->current_pg)) {
 			/* Path status changed, redo selection */
 			(void) choose_pgpath(m, 0);
 		}
@@ -1895,9 +1895,9 @@ static int multipath_busy(struct dm_target *ti)
 		return (m->queue_mode != DM_TYPE_MQ_REQUEST_BASED);
 
 	/* Guess which priority_group will be used at next mapping time */
-	pg = lockless_dereference(m->current_pg);
-	next_pg = lockless_dereference(m->next_pg);
-	if (unlikely(!lockless_dereference(m->current_pgpath) && next_pg))
+	pg = READ_ONCE(m->current_pg);
+	next_pg = READ_ONCE(m->next_pg);
+	if (unlikely(!READ_ONCE(m->current_pgpath) && next_pg))
 		pg = next_pg;
 
 	if (!pg) {
diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..34c852af215c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 {
 	/*
 	 * Be careful about RCU walk racing with rename:
-	 * use 'lockless_dereference' to fetch the name pointer.
+	 * use 'READ_ONCE' to fetch the name pointer.
 	 *
 	 * NOTE! Even if a rename will mean that the length
 	 * was not loaded atomically, we don't care. The
@@ -245,7 +245,7 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 	 * early because the data cannot match (there can
 	 * be no NUL in the ct/tcount data)
 	 */
-	const unsigned char *cs = lockless_dereference(dentry->d_name.name);
+	const unsigned char *cs = READ_ONCE(dentry->d_name.name);
 
 	return dentry_string_cmp(cs, ct, tcount);
 }
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 878a750986dd..0f6809fa6628 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -74,5 +74,5 @@ static inline struct ovl_inode *OVL_I(struct inode *inode)
 
 static inline struct dentry *ovl_upperdentry_dereference(struct ovl_inode *oi)
 {
-	return lockless_dereference(oi->__upperdentry);
+	return READ_ONCE(oi->__upperdentry);
 }
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 62e9b22a2077..0b389d330613 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -754,7 +754,7 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
 	if (!od->is_upper && OVL_TYPE_UPPER(ovl_path_type(dentry))) {
 		struct inode *inode = file_inode(file);
 
-		realfile = lockless_dereference(od->upperfile);
+		realfile = READ_ONCE(od->upperfile);
 		if (!realfile) {
 			struct path upperpath;
 
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e95a2631e545..f260ff39f90f 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -340,6 +340,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 		__read_once_size(&(x), __u.__c, sizeof(x));		\
 	else								\
 		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
+	smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
 	__u.__val;							\
 })
 #define READ_ONCE(x) __READ_ONCE(x, 1)
@@ -604,24 +605,4 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 	(volatile typeof(x) *)&(x); })
 #define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))
 
-/**
- * lockless_dereference() - safely load a pointer for later dereference
- * @p: The pointer to load
- *
- * Similar to rcu_dereference(), but for situations where the pointed-to
- * object's lifetime is managed by something other than RCU.  That
- * "something other" might be reference counting or simple immortality.
- *
- * The seemingly unused variable ___typecheck_p validates that @p is
- * indeed a pointer type by using a pointer to typeof(*p) as the type.
- * Taking a pointer to typeof(*p) again is needed in case p is void *.
- */
-#define lockless_dereference(p) \
-({ \
-	typeof(p) _________p1 = READ_ONCE(p); \
-	typeof(*(p)) *___typecheck_p __maybe_unused; \
-	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
-	(_________p1); \
-})
-
 #endif /* __LINUX_COMPILER_H */
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index b1fd8bf85fdc..3a2bb7d8ed4d 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -274,7 +274,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
  * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
  */
 #define list_entry_rcu(ptr, type, member) \
-	container_of(lockless_dereference(ptr), type, member)
+	container_of(READ_ONCE(ptr), type, member)
 
 /**
  * Where are list_empty_rcu() and list_first_entry_rcu()?
@@ -367,7 +367,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
  * example is when items are added to the list, but never deleted.
  */
 #define list_entry_lockless(ptr, type, member) \
-	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
+	container_of((typeof(ptr))READ_ONCE(ptr), type, member)
 
 /**
  * list_for_each_entry_lockless - iterate over rcu list of given type
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index de50d8a4cf41..380a3aeb09d7 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -346,7 +346,7 @@ static inline void rcu_preempt_sleep_check(void) { }
 #define __rcu_dereference_check(p, c, space) \
 ({ \
 	/* Dependency order vs. p above. */ \
-	typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p); \
+	typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
 	RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
 	rcu_dereference_sparse(p, space); \
 	((typeof(*p) __force __kernel *)(________p1)); \
@@ -360,7 +360,7 @@ static inline void rcu_preempt_sleep_check(void) { }
 #define rcu_dereference_raw(p) \
 ({ \
 	/* Dependency order vs. p above. */ \
-	typeof(p) ________p1 = lockless_dereference(p); \
+	typeof(p) ________p1 = READ_ONCE(p); \
 	((typeof(*p) __force __kernel *)(________p1)); \
 })
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6bc21e202ae4..417812ce0099 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4231,7 +4231,7 @@ static void perf_remove_from_owner(struct perf_event *event)
 	 * indeed free this event, otherwise we need to serialize on
 	 * owner->perf_event_mutex.
 	 */
-	owner = lockless_dereference(event->owner);
+	owner = READ_ONCE(event->owner);
 	if (owner) {
 		/*
 		 * Since delayed_put_task_struct() also drops the last
@@ -4328,7 +4328,7 @@ int perf_event_release_kernel(struct perf_event *event)
 		 * Cannot change, child events are not migrated, see the
 		 * comment with perf_event_ctx_lock_nested().
 		 */
-		ctx = lockless_dereference(child->ctx);
+		ctx = READ_ONCE(child->ctx);
 		/*
 		 * Since child_mutex nests inside ctx::mutex, we must jump
 		 * through hoops. We start by grabbing a reference on the ctx.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index bb3a38005b9c..1daa8b61a268 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -189,7 +189,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	u32 ret = SECCOMP_RET_ALLOW;
 	/* Make sure cross-thread synced filter points somewhere sane. */
 	struct seccomp_filter *f =
-			lockless_dereference(current->seccomp.filter);
+			READ_ONCE(current->seccomp.filter);
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (unlikely(WARN_ON(f == NULL)))
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 836a72a66fba..9a9f262fc53d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -67,7 +67,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
 	 * we raced with task_work_run(), *pprev == NULL/exited.
 	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	while ((work = lockless_dereference(*pprev))) {
+	while ((work = READ_ONCE(*pprev))) {
 		if (work->func != func)
 			pprev = &work->next;
 		else if (cmpxchg(pprev, work, work->next) == work)
diff --git a/mm/slab.h b/mm/slab.h
index 073362816acc..8894f811a89d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -258,7 +258,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
 	 * memcg_caches issues a write barrier to match this (see
 	 * memcg_create_kmem_cache()).
 	 */
-	cachep = lockless_dereference(arr->entries[idx]);
+	cachep = READ_ONCE(arr->entries[idx]);
 	rcu_read_unlock();
 
 	return cachep;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes
  2017-10-03  6:36     ` Jon Masters
@ 2017-10-05 16:54       ` Will Deacon
  0 siblings, 0 replies; 27+ messages in thread
From: Will Deacon @ 2017-10-05 16:54 UTC (permalink / raw)
  To: Jon Masters
  Cc: peterz, paulmck, kirill.shutemov, linux-kernel, ynorov, rruigrok,
	linux-arch, akpm, catalin.marinas, timur

On Tue, Oct 03, 2017 at 02:36:42AM -0400, Jon Masters wrote:
> On 09/29/2017 04:56 AM, Will Deacon wrote:
> 
> > The full fix isn't just cosmetic; it's also addressing the wider problem
> > of unannotated racing page table accesses outside of the specific failure
> > case we've run into.
> 
> Let us know if there are additional tests we should be running on the
> Red Hat end. We've got high hundreds of ARM server systems at this
> point, including pretty much everything out there.

TBH, there's nothing ARM-specific about this issue afaict and it should
be reproducible on x86 if the compiler can keep the initially loaded pmd
live in a GPR for long enough.
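
As a user-space illustration of that register-reuse hazard, here's a
minimal sketch (hypothetical pmd_t and pmd_trans_huge() stand-ins plus a
simplified READ_ONCE(), not the kernel's actual definitions):

#include <stdint.h>

typedef uint64_t pmd_t;                         /* hypothetical stand-in */
#define READ_ONCE(x) (*(const volatile typeof(x) *)&(x))  /* simplified */

static int pmd_trans_huge(pmd_t pmd)            /* hypothetical stand-in */
{
        return pmd & 1;
}

uint64_t walk_plain(pmd_t *pmdp)
{
        /*
         * Plain loads: the compiler is free to load *pmdp once, keep it
         * live in a register and fold the later "re-reads" into that
         * stale value.
         */
        if (pmd_trans_huge(*pmdp))
                return *pmdp;
        return *pmdp >> 12;
}

uint64_t walk_once(pmd_t *pmdp)
{
        /*
         * READ_ONCE() forces a fresh load each time, so a concurrent
         * update between the two reads is actually observed.
         */
        pmd_t pmd = READ_ONCE(*pmdp);

        if (pmd_trans_huge(pmd))
                return pmd;
        return READ_ONCE(*pmdp) >> 12;
}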

As for wider problems, you want to stress anything that does page table
modification concurrently with lockless walkers (although GUP looks mostly
ok modulo the lack of pud_trans_huge support, which I'll try to fix if
I find time).

Will

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-10-05 16:31                       ` Will Deacon
@ 2017-10-05 19:22                       ` Paul E. McKenney
  -1 siblings, 0 replies; 27+ messages in thread
From: Paul E. McKenney @ 2017-10-05 19:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Michael Cree, Peter Zijlstra, kirill.shutemov, linux-kernel,
	ynorov, rruigrok, linux-arch, akpm, catalin.marinas, rth, ink,
	mattst88, linux-alpha

On Thu, Oct 05, 2017 at 05:31:54PM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Tue, Oct 03, 2017 at 12:11:10PM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 29, 2017 at 05:33:49PM +0100, Will Deacon wrote:
> > > On Fri, Sep 29, 2017 at 09:29:39AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Sep 29, 2017 at 10:08:43AM +0100, Will Deacon wrote:
> > > > > Ok, but where does that leave us wrt my initial proposal of moving
> > > > > smp_read_barrier_depends() into READ_ONCE and getting rid of
> > > > > lockless_dereference?
> > > > > 
> > > > > Michael (or anybody else running mainline on SMP Alpha) -- would you be
> > > > > able to give the diff below a spin and see whether there's a measurable
> > > > > performance impact?
> > > > 
> > > > This will be a sensitive test.  The smp_read_barrier_depends() can be
> > > > removed from lockless_dereference().  Without this removal Alpha will
> > > > get two memory barriers from rcu_dereference() and friends.
> > > 
> > > Oh yes, good point. I was trying to keep the diff simple, but you're
> > > right that this is packing too many barriers. Fixed diff below.
> > 
> > Not seeing any objections thus far.  If there are none by (say) the
> > end of this week, I would be happy to queue a patch for the 4.16
> > merge window.  That should give ample opportunity for further review
> > and testing.
> 
> Ok, full patch below.

Very good, applied for testing and review.  I did have to adjust context
a bit for other -rcu commits, and the result is below.  (Removed a single
"*" in a comment.)

							Thanx, Paul

------------------------------------------------------------------------

commit 7e3675cc18bbf4d84f60bfc02ff563ae3764ad35
Author: Will Deacon <will.deacon@arm.com>
Date:   Thu Oct 5 17:31:54 2017 +0100

    locking/barriers: Kill lockless_dereference
    
    lockless_dereference is a nice idea, but it's gained little traction in
    kernel code since its introduction three years ago. This is partly
    because it's a pain to type, but also because using READ_ONCE instead
    will work correctly on all architectures apart from Alpha, which is a
    fully supported but somewhat niche architecture these days.
    
    This patch moves smp_read_barrier_depends() (a NOP on all architectures
    other than Alpha) from lockless_dereference into READ_ONCE, converts
    the few actual users over to READ_ONCE and then finally removes
    lockless_dereference altogether.
    
    Signed-off-by: Will Deacon <will.deacon@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 519940ec767f..479ecec80593 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1880,18 +1880,6 @@ There are some more advanced barrier functions:
      See Documentation/atomic_{t,bitops}.txt for more information.
 
 
- (*) lockless_dereference();
-
-     This can be thought of as a pointer-fetch wrapper around the
-     smp_read_barrier_depends() data-dependency barrier.
-
-     This is also similar to rcu_dereference(), but in cases where
-     object lifetime is handled by some mechanism other than RCU, for
-     example, when the objects removed only when the system goes down.
-     In addition, lockless_dereference() is used in some data structures
-     that can be used both with and without RCU.
-
-
  (*) dma_wmb();
  (*) dma_rmb();
 
diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt
index a7a813258013..ec3b46e27b7a 100644
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -1858,18 +1858,6 @@ Mandatory 배리어들은 SMP 시스템에서도 UP 시스템에서도 SMP 효
      참고하세요.
 
 
- (*) lockless_dereference();
-
-     이 함수는 smp_read_barrier_depends() 데이터 의존성 배리어를 사용하는
-     포인터 읽어오기 래퍼(wrapper) 함수로 생각될 수 있습니다.
-
-     객체의 라이프타임이 RCU 외의 메커니즘으로 관리된다는 점을 제외하면
-     rcu_dereference() 와도 유사한데, 예를 들면 객체가 시스템이 꺼질 때에만
-     제거되는 경우 등입니다.  또한, lockless_dereference() 은 RCU 와 함께
-     사용될수도, RCU 없이 사용될 수도 있는 일부 데이터 구조에 사용되고
-     있습니다.
-
-
  (*) dma_wmb();
  (*) dma_rmb();
 
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 80534d3c2480..589af1eec7c1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2371,7 +2371,7 @@ static unsigned long get_segment_base(unsigned int segment)
 		struct ldt_struct *ldt;
 
 		/* IRQs are off, so this synchronizes with smp_store_release */
-		ldt = lockless_dereference(current->active_mm->context.ldt);
+		ldt = READ_ONCE(current->active_mm->context.ldt);
 		if (!ldt || idx >= ldt->nr_entries)
 			return 0;
 
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c120b5db178a..9037a4e546e8 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -72,8 +72,8 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
 
-	/* lockless_dereference synchronizes with smp_store_release */
-	ldt = lockless_dereference(mm->context.ldt);
+	/* READ_ONCE synchronizes with smp_store_release */
+	ldt = READ_ONCE(mm->context.ldt);
 
 	/*
 	 * Any change to mm->context.ldt is followed by an IPI to all
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index f0e64db18ac8..0a21390642c4 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -101,7 +101,7 @@ static void finalize_ldt_struct(struct ldt_struct *ldt)
 static void install_ldt(struct mm_struct *current_mm,
 			struct ldt_struct *ldt)
 {
-	/* Synchronizes with lockless_dereference in load_mm_ldt. */
+	/* Synchronizes with READ_ONCE in load_mm_ldt. */
 	smp_store_release(&current_mm->context.ldt, ldt);
 
 	/* Activate the LDT for all CPUs using current_mm. */
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 11f273d2f018..3f88c9d32f7e 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -366,7 +366,7 @@ static struct pgpath *choose_path_in_pg(struct multipath *m,
 
 	pgpath = path_to_pgpath(path);
 
-	if (unlikely(lockless_dereference(m->current_pg) != pg)) {
+	if (unlikely(READ_ONCE(m->current_pg) != pg)) {
 		/* Only update current_pgpath if pg changed */
 		spin_lock_irqsave(&m->lock, flags);
 		m->current_pgpath = pgpath;
@@ -390,7 +390,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
 	}
 
 	/* Were we instructed to switch PG? */
-	if (lockless_dereference(m->next_pg)) {
+	if (READ_ONCE(m->next_pg)) {
 		spin_lock_irqsave(&m->lock, flags);
 		pg = m->next_pg;
 		if (!pg) {
@@ -406,7 +406,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
 
 	/* Don't change PG until it has no remaining paths */
 check_current_pg:
-	pg = lockless_dereference(m->current_pg);
+	pg = READ_ONCE(m->current_pg);
 	if (pg) {
 		pgpath = choose_path_in_pg(m, pg, nr_bytes);
 		if (!IS_ERR_OR_NULL(pgpath))
@@ -473,7 +473,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 	struct request *clone;
 
 	/* Do we need to select a new pgpath? */
-	pgpath = lockless_dereference(m->current_pgpath);
+	pgpath = READ_ONCE(m->current_pgpath);
 	if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
 		pgpath = choose_pgpath(m, nr_bytes);
 
@@ -535,7 +535,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
 	bool queue_io;
 
 	/* Do we need to select a new pgpath? */
-	pgpath = lockless_dereference(m->current_pgpath);
+	pgpath = READ_ONCE(m->current_pgpath);
 	queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
 	if (!pgpath || !queue_io)
 		pgpath = choose_pgpath(m, nr_bytes);
@@ -1804,7 +1804,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 	struct pgpath *current_pgpath;
 	int r;
 
-	current_pgpath = lockless_dereference(m->current_pgpath);
+	current_pgpath = READ_ONCE(m->current_pgpath);
 	if (!current_pgpath)
 		current_pgpath = choose_pgpath(m, 0);
 
@@ -1826,7 +1826,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
 	}
 
 	if (r == -ENOTCONN) {
-		if (!lockless_dereference(m->current_pg)) {
+		if (!READ_ONCE(m->current_pg)) {
 			/* Path status changed, redo selection */
 			(void) choose_pgpath(m, 0);
 		}
@@ -1895,9 +1895,9 @@ static int multipath_busy(struct dm_target *ti)
 		return (m->queue_mode != DM_TYPE_MQ_REQUEST_BASED);
 
 	/* Guess which priority_group will be used at next mapping time */
-	pg = lockless_dereference(m->current_pg);
-	next_pg = lockless_dereference(m->next_pg);
-	if (unlikely(!lockless_dereference(m->current_pgpath) && next_pg))
+	pg = READ_ONCE(m->current_pg);
+	next_pg = READ_ONCE(m->next_pg);
+	if (unlikely(!READ_ONCE(m->current_pgpath) && next_pg))
 		pg = next_pg;
 
 	if (!pg) {
diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..34c852af215c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -231,7 +231,7 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 {
 	/*
 	 * Be careful about RCU walk racing with rename:
-	 * use 'lockless_dereference' to fetch the name pointer.
+	 * use 'READ_ONCE' to fetch the name pointer.
 	 *
 	 * NOTE! Even if a rename will mean that the length
 	 * was not loaded atomically, we don't care. The
@@ -245,7 +245,7 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
 	 * early because the data cannot match (there can
 	 * be no NUL in the ct/tcount data)
 	 */
-	const unsigned char *cs = lockless_dereference(dentry->d_name.name);
+	const unsigned char *cs = READ_ONCE(dentry->d_name.name);
 
 	return dentry_string_cmp(cs, ct, tcount);
 }
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 878a750986dd..0f6809fa6628 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -74,5 +74,5 @@ static inline struct ovl_inode *OVL_I(struct inode *inode)
 
 static inline struct dentry *ovl_upperdentry_dereference(struct ovl_inode *oi)
 {
-	return lockless_dereference(oi->__upperdentry);
+	return READ_ONCE(oi->__upperdentry);
 }
diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
index 62e9b22a2077..0b389d330613 100644
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -754,7 +754,7 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
 	if (!od->is_upper && OVL_TYPE_UPPER(ovl_path_type(dentry))) {
 		struct inode *inode = file_inode(file);
 
-		realfile = lockless_dereference(od->upperfile);
+		realfile = READ_ONCE(od->upperfile);
 		if (!realfile) {
 			struct path upperpath;
 
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e95a2631e545..f260ff39f90f 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -340,6 +340,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 		__read_once_size(&(x), __u.__c, sizeof(x));		\
 	else								\
 		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
+	smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
 	__u.__val;							\
 })
 #define READ_ONCE(x) __READ_ONCE(x, 1)
@@ -604,24 +605,4 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
 	(volatile typeof(x) *)&(x); })
 #define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))
 
-/**
- * lockless_dereference() - safely load a pointer for later dereference
- * @p: The pointer to load
- *
- * Similar to rcu_dereference(), but for situations where the pointed-to
- * object's lifetime is managed by something other than RCU.  That
- * "something other" might be reference counting or simple immortality.
- *
- * The seemingly unused variable ___typecheck_p validates that @p is
- * indeed a pointer type by using a pointer to typeof(*p) as the type.
- * Taking a pointer to typeof(*p) again is needed in case p is void *.
- */
-#define lockless_dereference(p) \
-({ \
-	typeof(p) _________p1 = READ_ONCE(p); \
-	typeof(*(p)) *___typecheck_p __maybe_unused; \
-	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
-	(_________p1); \
-})
-
 #endif /* __LINUX_COMPILER_H */
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 2bea1d5e9930..5ed091c064b2 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -274,7 +274,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
  * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
  */
 #define list_entry_rcu(ptr, type, member) \
-	container_of(lockless_dereference(ptr), type, member)
+	container_of(READ_ONCE(ptr), type, member)
 
 /*
  * Where are list_empty_rcu() and list_first_entry_rcu()?
@@ -367,7 +367,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
  * example is when items are added to the list, but never deleted.
  */
 #define list_entry_lockless(ptr, type, member) \
-	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
+	container_of((typeof(ptr))READ_ONCE(ptr), type, member)
 
 /**
  * list_for_each_entry_lockless - iterate over rcu list of given type
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1a9f70d44af9..a6ddc42f87a5 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -346,7 +346,7 @@ static inline void rcu_preempt_sleep_check(void) { }
 #define __rcu_dereference_check(p, c, space) \
 ({ \
 	/* Dependency order vs. p above. */ \
-	typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p); \
+	typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
 	RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
 	rcu_dereference_sparse(p, space); \
 	((typeof(*p) __force __kernel *)(________p1)); \
@@ -360,7 +360,7 @@ static inline void rcu_preempt_sleep_check(void) { }
 #define rcu_dereference_raw(p) \
 ({ \
 	/* Dependency order vs. p above. */ \
-	typeof(p) ________p1 = lockless_dereference(p); \
+	typeof(p) ________p1 = READ_ONCE(p); \
 	((typeof(*p) __force __kernel *)(________p1)); \
 })
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6bc21e202ae4..417812ce0099 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4231,7 +4231,7 @@ static void perf_remove_from_owner(struct perf_event *event)
 	 * indeed free this event, otherwise we need to serialize on
 	 * owner->perf_event_mutex.
 	 */
-	owner = lockless_dereference(event->owner);
+	owner = READ_ONCE(event->owner);
 	if (owner) {
 		/*
 		 * Since delayed_put_task_struct() also drops the last
@@ -4328,7 +4328,7 @@ int perf_event_release_kernel(struct perf_event *event)
 		 * Cannot change, child events are not migrated, see the
 		 * comment with perf_event_ctx_lock_nested().
 		 */
-		ctx = lockless_dereference(child->ctx);
+		ctx = READ_ONCE(child->ctx);
 		/*
 		 * Since child_mutex nests inside ctx::mutex, we must jump
 		 * through hoops. We start by grabbing a reference on the ctx.
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index bb3a38005b9c..1daa8b61a268 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -189,7 +189,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
 	u32 ret = SECCOMP_RET_ALLOW;
 	/* Make sure cross-thread synced filter points somewhere sane. */
 	struct seccomp_filter *f =
-			lockless_dereference(current->seccomp.filter);
+			READ_ONCE(current->seccomp.filter);
 
 	/* Ensure unexpected behavior doesn't result in failing open. */
 	if (unlikely(WARN_ON(f == NULL)))
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 836a72a66fba..9a9f262fc53d 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -67,7 +67,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
 	 * we raced with task_work_run(), *pprev == NULL/exited.
 	 */
 	raw_spin_lock_irqsave(&task->pi_lock, flags);
-	while ((work = lockless_dereference(*pprev))) {
+	while ((work = READ_ONCE(*pprev))) {
 		if (work->func != func)
 			pprev = &work->next;
 		else if (cmpxchg(pprev, work, work->next) == work)
diff --git a/mm/slab.h b/mm/slab.h
index 073362816acc..8894f811a89d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -258,7 +258,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
 	 * memcg_caches issues a write barrier to match this (see
 	 * memcg_create_kmem_cache()).
 	 */
-	cachep = lockless_dereference(arr->entries[idx]);
+	cachep = READ_ONCE(arr->entries[idx]);
 	rcu_read_unlock();
 
 	return cachep;

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-10-05 16:31                       ` Will Deacon
                                         ` (2 preceding siblings ...)
  (?)
@ 2017-10-05 19:31                       ` Andrea Parri
  2017-10-05 20:09                         ` Paul E. McKenney
  -1 siblings, 1 reply; 27+ messages in thread
From: Andrea Parri @ 2017-10-05 19:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Michael Cree, Peter Zijlstra, kirill.shutemov,
	linux-kernel, ynorov, rruigrok, linux-arch, akpm,
	catalin.marinas, rth, ink, mattst88, linux-alpha

Hi Will,

none of my comments below represent objections to this patch, but
let me remark:


On Thu, Oct 05, 2017 at 05:31:54PM +0100, Will Deacon wrote:
> Hi Paul,
> 
> On Tue, Oct 03, 2017 at 12:11:10PM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 29, 2017 at 05:33:49PM +0100, Will Deacon wrote:
> > > On Fri, Sep 29, 2017 at 09:29:39AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Sep 29, 2017 at 10:08:43AM +0100, Will Deacon wrote:
> > > > > Ok, but where does that leave us wrt my initial proposal of moving
> > > > > smp_read_barrier_depends() into READ_ONCE and getting rid of
> > > > > lockless_dereference?
> > > > > 
> > > > > Michael (or anybody else running mainline on SMP Alpha) -- would you be
> > > > > able to give the diff below a spin and see whether there's a measurable
> > > > > performance impact?
> > > > 
> > > > This will be a sensitive test.  The smp_read_barrier_depends() can be
> > > > removed from lockless_dereference().  Without this removal Alpha will
> > > > get two memory barriers from rcu_dereference() and friends.
> > > 
> > > Oh yes, good point. I was trying to keep the diff simple, but you're
> > > right that this is packing too many barriers. Fixed diff below.
> > 
> > Not seeing any objections thus far.  If there are none by (say) the
> > end of this week, I would be happy to queue a patch for the 4.16
> > merge window.  That should give ample opportunity for further review
> > and testing.
> 
> Ok, full patch below.
> 
> Will
> 
> --->8
> 
> From 15956d0cc6b37208d8542b1858a8d8b64227acf4 Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Thu, 5 Oct 2017 16:57:36 +0100
> Subject: [PATCH] locking/barriers: Kill lockless_dereference
> 
> lockless_dereference is a nice idea, but it's gained little traction in
> kernel code since its introduction three years ago. This is partly
> because it's a pain to type, but also because using READ_ONCE instead
> will work correctly on all architectures apart from Alpha, which is a
> fully supported but somewhat niche architecture these days.

lockless_dereference might be a mouthful, but it does (explicitly)
say/remark: "Yep, we are relying on the following address dep. to
be "in strong-ppo" ".

Such information will be lost or, at least, not immediately clear
by just reading a READ_ONCE(). (And Yes, this information is only
relevant when we "include" Alpha in the picture/analysis.)
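
As a purely illustrative sketch (the 'struct foo' and 'gp' below are
made up, not taken from the patch), here are the two idioms side by
side; the intent is spelled out by the first and silent in the second,
even though both now give the same ordering guarantee:

	struct foo { int val; };
	struct foo *gp;	/* published elsewhere, e.g. via smp_store_release() */

	/* Before: the name documents the reliance on address-dependency
	 * ordering for the dereference that follows (Alpha included).
	 */
	struct foo *p = lockless_dereference(gp);
	int r = p->val;

	/* After: same guarantee, since READ_ONCE() now ends in
	 * smp_read_barrier_depends(), but the intent is no longer
	 * visible in the source.
	 */
	struct foo *q = READ_ONCE(gp);
	int s = q->val;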


> 
> This patch moves smp_read_barrier_depends() (a NOP on all architectures
> other than Alpha) from lockless_dereference into READ_ONCE, converts
> the few actual users over to READ_ONCE and then finally removes
> lockless_dereference altogether.

Notice that several "potential users" of lockless_dereference are
currently hidden in other call sites for smp_read_barrier_depends
(i.e., cases where this barrier is not called from within a lock-
less or an RCU dereference).

Some of these usages (e.g.,

  include/linux/percpu-refcount.h:__ref_is_percpu,
  mm/ksm.c:get_ksm_page,
  security/keys/keyring.c:search_nested_keyrings )

precede this barrier with a READ_ONCE; others (e.g.,

  arch/alpha/include/asm/pgtable.h:pmd_offset,
  net/ipv4/netfilter/arp_tables.c:arpt_do_table,
  kernel/events/uprobes.c:get_trampoline_vaddr )

with a plain read.
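
To make the two flavours concrete, a hedged sketch with made-up names
(none of the call sites above is reproduced here):

	struct item { int data; };
	struct item *head;

	/* Flavour 1: READ_ONCE() followed by the dependency barrier.
	 * Once the barrier moves into READ_ONCE() itself, the explicit
	 * smp_read_barrier_depends() here becomes redundant.
	 */
	struct item *p = READ_ONCE(head);
	smp_read_barrier_depends();
	int x = p->data;

	/* Flavour 2: plain load followed by the dependency barrier.
	 * Folding the barrier into READ_ONCE() does not cover this one
	 * until the plain load itself is converted to READ_ONCE().
	 */
	struct item *q = head;
	smp_read_barrier_depends();
	int y = q->data;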

There also appear to be cases where the barrier is preceded by an
ACCESS_ONCE (cf. fs/dcache.c:prepend_name) or by an xchg_release
(cf. kernel/locking/qspinlock.c:queued_spin_lock_slowpath), and
it would not be difficult to imagine/create different usages.


> 
> Signed-off-by: Will Deacon <will.deacon@arm.com>

I understand that we all agree we're missing a Tested-by here ;-).

  Andrea


> ---
>  Documentation/memory-barriers.txt                   | 12 ------------
>  .../translations/ko_KR/memory-barriers.txt          | 12 ------------
>  arch/x86/events/core.c                              |  2 +-
>  arch/x86/include/asm/mmu_context.h                  |  4 ++--
>  arch/x86/kernel/ldt.c                               |  2 +-
>  drivers/md/dm-mpath.c                               | 20 ++++++++++----------
>  fs/dcache.c                                         |  4 ++--
>  fs/overlayfs/ovl_entry.h                            |  2 +-
>  fs/overlayfs/readdir.c                              |  2 +-
>  include/linux/compiler.h                            | 21 +--------------------
>  include/linux/rculist.h                             |  4 ++--
>  include/linux/rcupdate.h                            |  4 ++--
>  kernel/events/core.c                                |  4 ++--
>  kernel/seccomp.c                                    |  2 +-
>  kernel/task_work.c                                  |  2 +-
>  mm/slab.h                                           |  2 +-
>  16 files changed, 28 insertions(+), 71 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index b759a60624fd..470a682f3fa4 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1886,18 +1886,6 @@ There are some more advanced barrier functions:
>       See Documentation/atomic_{t,bitops}.txt for more information.
>  
>  
> - (*) lockless_dereference();
> -
> -     This can be thought of as a pointer-fetch wrapper around the
> -     smp_read_barrier_depends() data-dependency barrier.
> -
> -     This is also similar to rcu_dereference(), but in cases where
> -     object lifetime is handled by some mechanism other than RCU, for
> -     example, when the objects removed only when the system goes down.
> -     In addition, lockless_dereference() is used in some data structures
> -     that can be used both with and without RCU.
> -
> -
>   (*) dma_wmb();
>   (*) dma_rmb();
>  
> diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt
> index a7a813258013..ec3b46e27b7a 100644
> --- a/Documentation/translations/ko_KR/memory-barriers.txt
> +++ b/Documentation/translations/ko_KR/memory-barriers.txt
> @@ -1858,18 +1858,6 @@ Mandatory 배리어들은 SMP 시스템에서도 UP 시스템에서도 SMP 효
>       참고하세요.
>  
>  
> - (*) lockless_dereference();
> -
> -     이 함수는 smp_read_barrier_depends() 데이터 의존성 배리어를 사용하는
> -     포인터 읽어오기 래퍼(wrapper) 함수로 생각될 수 있습니다.
> -
> -     객체의 라이프타임이 RCU 외의 메커니즘으로 관리된다는 점을 제외하면
> -     rcu_dereference() 와도 유사한데, 예를 들면 객체가 시스템이 꺼질 때에만
> -     제거되는 경우 등입니다.  또한, lockless_dereference() 은 RCU 와 함께
> -     사용될수도, RCU 없이 사용될 수도 있는 일부 데이터 구조에 사용되고
> -     있습니다.
> -
> -
>   (*) dma_wmb();
>   (*) dma_rmb();
>  
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 80534d3c2480..589af1eec7c1 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2371,7 +2371,7 @@ static unsigned long get_segment_base(unsigned int segment)
>  		struct ldt_struct *ldt;
>  
>  		/* IRQs are off, so this synchronizes with smp_store_release */
> -		ldt = lockless_dereference(current->active_mm->context.ldt);
> +		ldt = READ_ONCE(current->active_mm->context.ldt);
>  		if (!ldt || idx >= ldt->nr_entries)
>  			return 0;
>  
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index c120b5db178a..9037a4e546e8 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -72,8 +72,8 @@ static inline void load_mm_ldt(struct mm_struct *mm)
>  #ifdef CONFIG_MODIFY_LDT_SYSCALL
>  	struct ldt_struct *ldt;
>  
> -	/* lockless_dereference synchronizes with smp_store_release */
> -	ldt = lockless_dereference(mm->context.ldt);
> +	/* READ_ONCE synchronizes with smp_store_release */
> +	ldt = READ_ONCE(mm->context.ldt);
>  
>  	/*
>  	 * Any change to mm->context.ldt is followed by an IPI to all
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index f0e64db18ac8..0a21390642c4 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -101,7 +101,7 @@ static void finalize_ldt_struct(struct ldt_struct *ldt)
>  static void install_ldt(struct mm_struct *current_mm,
>  			struct ldt_struct *ldt)
>  {
> -	/* Synchronizes with lockless_dereference in load_mm_ldt. */
> +	/* Synchronizes with READ_ONCE in load_mm_ldt. */
>  	smp_store_release(&current_mm->context.ldt, ldt);
>  
>  	/* Activate the LDT for all CPUs using current_mm. */
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 11f273d2f018..3f88c9d32f7e 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -366,7 +366,7 @@ static struct pgpath *choose_path_in_pg(struct multipath *m,
>  
>  	pgpath = path_to_pgpath(path);
>  
> -	if (unlikely(lockless_dereference(m->current_pg) != pg)) {
> +	if (unlikely(READ_ONCE(m->current_pg) != pg)) {
>  		/* Only update current_pgpath if pg changed */
>  		spin_lock_irqsave(&m->lock, flags);
>  		m->current_pgpath = pgpath;
> @@ -390,7 +390,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
>  	}
>  
>  	/* Were we instructed to switch PG? */
> -	if (lockless_dereference(m->next_pg)) {
> +	if (READ_ONCE(m->next_pg)) {
>  		spin_lock_irqsave(&m->lock, flags);
>  		pg = m->next_pg;
>  		if (!pg) {
> @@ -406,7 +406,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
>  
>  	/* Don't change PG until it has no remaining paths */
>  check_current_pg:
> -	pg = lockless_dereference(m->current_pg);
> +	pg = READ_ONCE(m->current_pg);
>  	if (pg) {
>  		pgpath = choose_path_in_pg(m, pg, nr_bytes);
>  		if (!IS_ERR_OR_NULL(pgpath))
> @@ -473,7 +473,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>  	struct request *clone;
>  
>  	/* Do we need to select a new pgpath? */
> -	pgpath = lockless_dereference(m->current_pgpath);
> +	pgpath = READ_ONCE(m->current_pgpath);
>  	if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
>  		pgpath = choose_pgpath(m, nr_bytes);
>  
> @@ -535,7 +535,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
>  	bool queue_io;
>  
>  	/* Do we need to select a new pgpath? */
> -	pgpath = lockless_dereference(m->current_pgpath);
> +	pgpath = READ_ONCE(m->current_pgpath);
>  	queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
>  	if (!pgpath || !queue_io)
>  		pgpath = choose_pgpath(m, nr_bytes);
> @@ -1804,7 +1804,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
>  	struct pgpath *current_pgpath;
>  	int r;
>  
> -	current_pgpath = lockless_dereference(m->current_pgpath);
> +	current_pgpath = READ_ONCE(m->current_pgpath);
>  	if (!current_pgpath)
>  		current_pgpath = choose_pgpath(m, 0);
>  
> @@ -1826,7 +1826,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
>  	}
>  
>  	if (r == -ENOTCONN) {
> -		if (!lockless_dereference(m->current_pg)) {
> +		if (!READ_ONCE(m->current_pg)) {
>  			/* Path status changed, redo selection */
>  			(void) choose_pgpath(m, 0);
>  		}
> @@ -1895,9 +1895,9 @@ static int multipath_busy(struct dm_target *ti)
>  		return (m->queue_mode != DM_TYPE_MQ_REQUEST_BASED);
>  
>  	/* Guess which priority_group will be used at next mapping time */
> -	pg = lockless_dereference(m->current_pg);
> -	next_pg = lockless_dereference(m->next_pg);
> -	if (unlikely(!lockless_dereference(m->current_pgpath) && next_pg))
> +	pg = READ_ONCE(m->current_pg);
> +	next_pg = READ_ONCE(m->next_pg);
> +	if (unlikely(!READ_ONCE(m->current_pgpath) && next_pg))
>  		pg = next_pg;
>  
>  	if (!pg) {
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f90141387f01..34c852af215c 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -231,7 +231,7 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
>  {
>  	/*
>  	 * Be careful about RCU walk racing with rename:
> -	 * use 'lockless_dereference' to fetch the name pointer.
> +	 * use 'READ_ONCE' to fetch the name pointer.
>  	 *
>  	 * NOTE! Even if a rename will mean that the length
>  	 * was not loaded atomically, we don't care. The
> @@ -245,7 +245,7 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
>  	 * early because the data cannot match (there can
>  	 * be no NUL in the ct/tcount data)
>  	 */
> -	const unsigned char *cs = lockless_dereference(dentry->d_name.name);
> +	const unsigned char *cs = READ_ONCE(dentry->d_name.name);
>  
>  	return dentry_string_cmp(cs, ct, tcount);
>  }
> diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> index 878a750986dd..0f6809fa6628 100644
> --- a/fs/overlayfs/ovl_entry.h
> +++ b/fs/overlayfs/ovl_entry.h
> @@ -74,5 +74,5 @@ static inline struct ovl_inode *OVL_I(struct inode *inode)
>  
>  static inline struct dentry *ovl_upperdentry_dereference(struct ovl_inode *oi)
>  {
> -	return lockless_dereference(oi->__upperdentry);
> +	return READ_ONCE(oi->__upperdentry);
>  }
> diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> index 62e9b22a2077..0b389d330613 100644
> --- a/fs/overlayfs/readdir.c
> +++ b/fs/overlayfs/readdir.c
> @@ -754,7 +754,7 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
>  	if (!od->is_upper && OVL_TYPE_UPPER(ovl_path_type(dentry))) {
>  		struct inode *inode = file_inode(file);
>  
> -		realfile = lockless_dereference(od->upperfile);
> +		realfile = READ_ONCE(od->upperfile);
>  		if (!realfile) {
>  			struct path upperpath;
>  
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index e95a2631e545..f260ff39f90f 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -340,6 +340,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>  		__read_once_size(&(x), __u.__c, sizeof(x));		\
>  	else								\
>  		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
> +	smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
>  	__u.__val;							\
>  })
>  #define READ_ONCE(x) __READ_ONCE(x, 1)
> @@ -604,24 +605,4 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
>  	(volatile typeof(x) *)&(x); })
>  #define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))
>  
> -/**
> - * lockless_dereference() - safely load a pointer for later dereference
> - * @p: The pointer to load
> - *
> - * Similar to rcu_dereference(), but for situations where the pointed-to
> - * object's lifetime is managed by something other than RCU.  That
> - * "something other" might be reference counting or simple immortality.
> - *
> - * The seemingly unused variable ___typecheck_p validates that @p is
> - * indeed a pointer type by using a pointer to typeof(*p) as the type.
> - * Taking a pointer to typeof(*p) again is needed in case p is void *.
> - */
> -#define lockless_dereference(p) \
> -({ \
> -	typeof(p) _________p1 = READ_ONCE(p); \
> -	typeof(*(p)) *___typecheck_p __maybe_unused; \
> -	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
> -	(_________p1); \
> -})
> -
>  #endif /* __LINUX_COMPILER_H */
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index b1fd8bf85fdc..3a2bb7d8ed4d 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -274,7 +274,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>   * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>   */
>  #define list_entry_rcu(ptr, type, member) \
> -	container_of(lockless_dereference(ptr), type, member)
> +	container_of(READ_ONCE(ptr), type, member)
>  
>  /**
>   * Where are list_empty_rcu() and list_first_entry_rcu()?
> @@ -367,7 +367,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>   * example is when items are added to the list, but never deleted.
>   */
>  #define list_entry_lockless(ptr, type, member) \
> -	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
> +	container_of((typeof(ptr))READ_ONCE(ptr), type, member)
>  
>  /**
>   * list_for_each_entry_lockless - iterate over rcu list of given type
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index de50d8a4cf41..380a3aeb09d7 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -346,7 +346,7 @@ static inline void rcu_preempt_sleep_check(void) { }
>  #define __rcu_dereference_check(p, c, space) \
>  ({ \
>  	/* Dependency order vs. p above. */ \
> -	typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p); \
> +	typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
>  	RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
>  	rcu_dereference_sparse(p, space); \
>  	((typeof(*p) __force __kernel *)(________p1)); \
> @@ -360,7 +360,7 @@ static inline void rcu_preempt_sleep_check(void) { }
>  #define rcu_dereference_raw(p) \
>  ({ \
>  	/* Dependency order vs. p above. */ \
> -	typeof(p) ________p1 = lockless_dereference(p); \
> +	typeof(p) ________p1 = READ_ONCE(p); \
>  	((typeof(*p) __force __kernel *)(________p1)); \
>  })
>  
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6bc21e202ae4..417812ce0099 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4231,7 +4231,7 @@ static void perf_remove_from_owner(struct perf_event *event)
>  	 * indeed free this event, otherwise we need to serialize on
>  	 * owner->perf_event_mutex.
>  	 */
> -	owner = lockless_dereference(event->owner);
> +	owner = READ_ONCE(event->owner);
>  	if (owner) {
>  		/*
>  		 * Since delayed_put_task_struct() also drops the last
> @@ -4328,7 +4328,7 @@ int perf_event_release_kernel(struct perf_event *event)
>  		 * Cannot change, child events are not migrated, see the
>  		 * comment with perf_event_ctx_lock_nested().
>  		 */
> -		ctx = lockless_dereference(child->ctx);
> +		ctx = READ_ONCE(child->ctx);
>  		/*
>  		 * Since child_mutex nests inside ctx::mutex, we must jump
>  		 * through hoops. We start by grabbing a reference on the ctx.
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index bb3a38005b9c..1daa8b61a268 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -189,7 +189,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
>  	u32 ret = SECCOMP_RET_ALLOW;
>  	/* Make sure cross-thread synced filter points somewhere sane. */
>  	struct seccomp_filter *f =
> -			lockless_dereference(current->seccomp.filter);
> +			READ_ONCE(current->seccomp.filter);
>  
>  	/* Ensure unexpected behavior doesn't result in failing open. */
>  	if (unlikely(WARN_ON(f == NULL)))
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 836a72a66fba..9a9f262fc53d 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -67,7 +67,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
>  	 * we raced with task_work_run(), *pprev == NULL/exited.
>  	 */
>  	raw_spin_lock_irqsave(&task->pi_lock, flags);
> -	while ((work = lockless_dereference(*pprev))) {
> +	while ((work = READ_ONCE(*pprev))) {
>  		if (work->func != func)
>  			pprev = &work->next;
>  		else if (cmpxchg(pprev, work, work->next) == work)
> diff --git a/mm/slab.h b/mm/slab.h
> index 073362816acc..8894f811a89d 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -258,7 +258,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
>  	 * memcg_caches issues a write barrier to match this (see
>  	 * memcg_create_kmem_cache()).
>  	 */
> -	cachep = lockless_dereference(arr->entries[idx]);
> +	cachep = READ_ONCE(arr->entries[idx]);
>  	rcu_read_unlock();
>  
>  	return cachep;
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables
  2017-10-05 19:31                       ` Andrea Parri
@ 2017-10-05 20:09                         ` Paul E. McKenney
  0 siblings, 0 replies; 27+ messages in thread
From: Paul E. McKenney @ 2017-10-05 20:09 UTC (permalink / raw)
  To: Andrea Parri
  Cc: Will Deacon, Michael Cree, Peter Zijlstra, kirill.shutemov,
	linux-kernel, ynorov, rruigrok, linux-arch, akpm,
	catalin.marinas, rth, ink, mattst88, linux-alpha

On Thu, Oct 05, 2017 at 09:31:48PM +0200, Andrea Parri wrote:
> Hi Will,
> 
> none of my comments below represent objections to this patch, but
> let me remark:
> 
> 
> On Thu, Oct 05, 2017 at 05:31:54PM +0100, Will Deacon wrote:
> > Hi Paul,
> > 
> > On Tue, Oct 03, 2017 at 12:11:10PM -0700, Paul E. McKenney wrote:
> > > On Fri, Sep 29, 2017 at 05:33:49PM +0100, Will Deacon wrote:
> > > > On Fri, Sep 29, 2017 at 09:29:39AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Sep 29, 2017 at 10:08:43AM +0100, Will Deacon wrote:
> > > > > > Ok, but where does that leave us wrt my initial proposal of moving
> > > > > > smp_read_barrier_depends() into READ_ONCE and getting rid of
> > > > > > lockless_dereference?
> > > > > > 
> > > > > > Michael (or anybody else running mainline on SMP Alpha) -- would you be
> > > > > > able to give the diff below a spin and see whether there's a measurable
> > > > > > performance impact?
> > > > > 
> > > > > This will be a sensitive test.  The smp_read_barrier_depends() can be
> > > > > removed from lockless_dereference().  Without this removal Alpha will
> > > > > get two memory barriers from rcu_dereference() and friends.
> > > > 
> > > > Oh yes, good point. I was trying to keep the diff simple, but you're
> > > > right that this is packing too many barriers. Fixed diff below.
> > > 
> > > Not seeing any objections thus far.  If there are none by (say) the
> > > end of this week, I would be happy to queue a patch for the 4.16
> > > merge window.  That should give ample opportunity for further review
> > > and testing.
> > 
> > Ok, full patch below.
> > 
> > Will
> > 
> > --->8
> > 
> > From 15956d0cc6b37208d8542b1858a8d8b64227acf4 Mon Sep 17 00:00:00 2001
> > From: Will Deacon <will.deacon@arm.com>
> > Date: Thu, 5 Oct 2017 16:57:36 +0100
> > Subject: [PATCH] locking/barriers: Kill lockless_dereference
> > 
> > lockless_dereference is a nice idea, but it's gained little traction in
> > kernel code since its introduction three years ago. This is partly
> > because it's a pain to type, but also because using READ_ONCE instead
> > will work correctly on all architectures apart from Alpha, which is a
> > fully supported but somewhat niche architecture these days.
> 
> lockless_dereference might be a mouthful, but it does (explicitly)
> say/remark: "Yep, we are relying on the following address dep. to
> be "in strong-ppo" ".
> 
> Such information will be lost or, at least, not immediately clear
> by just reading a READ_ONCE(). (And Yes, this information is only
> relevant when we "include" Alpha in the picture/analysis.)

It is possible to argue that lockless_dereference() should remain for
this reason, even given READ_ONCE() containing smp_read_barrier_depends().
However, such arguments would be much more compelling if there were
tools that cared about the difference.

> > This patch moves smp_read_barrier_depends() (a NOP on all architectures
> > other than Alpha) from lockless_dereference into READ_ONCE, converts
> > the few actual users over to READ_ONCE and then finally removes
> > lockless_dereference altogether.
> 
> Notice that several "potential users" of lockless_dereference are
> currently hidden in other call sites for smp_read_barrier_depends
> (i.e., cases where this barrier is not called from within a lock-
> less or an RCU dereference).
> 
> Some of these usages (e.g.,
> 
>   include/linux/percpu-refcount.h:__ref_is_percpu,
>   mm/ksm.c:get_ksm_page,
>   security/keys/keyring.c:search_nested_keyrings )
> 
> precede this barrier with a READ_ONCE; others (e.g.,
> 
>   arch/alpha/include/asm/pgtable.h:pmd_offset,
>   net/ipv4/netfilter/arp_tables.c:arpt_do_table,
>   kernel/events/uprobes.c:get_trampoline_vaddr )
> 
> with a plain read.

I would welcome patches for the cases where smp_read_barrier_depends() is
preceded by READ_ONCE().  Perhaps the others should gain a READ_ONCE(),
and I suspect that they should, but ultimately that decision is in
the hands of the relevant maintainer, so any such patches need to be
separated and will need at least an ack from the relevant maintainers.
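
For instance (an illustrative conversion with invented names, not a
hunk against any particular file), a call site of the first kind would
simply shrink once this patch is in:

	struct obj { void *ptr; };
	struct obj *o;
	void *p;

	/* Today: dependency ordering requested explicitly. */
	p = READ_ONCE(o->ptr);
	smp_read_barrier_depends();

	/* With smp_read_barrier_depends() folded into READ_ONCE(), the
	 * explicit barrier above is redundant and can simply be dropped.
	 */
	p = READ_ONCE(o->ptr);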

> There also appear to be cases where the barrier is preceded by an
> ACCESS_ONCE (cf. fs/dcache.c:prepend_name) or by an xchg_release
> (cf. kernel/locking/qspinlock.c:queued_spin_lock_slowpath), and
> it would not be difficult to imagine/create different usages.

It would indeed be good to replace ACCESS_ONCE() with READ_ONCE() or
with WRITE_ONCE() where this works.  And yes, I agree that there are
other usage patterns possible.
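
As a (hypothetical) before/after for that conversion:

	struct shared_state { int state; };
	struct shared_state *shared;
	int val;

	/* Old style: ACCESS_ONCE() does not say whether the access is a
	 * load or a store.
	 */
	val = ACCESS_ONCE(shared->state);
	ACCESS_ONCE(shared->state) = val + 1;

	/* Preferred: the direction of each access is explicit, and pointer
	 * loads additionally gain the dependency ordering discussed above.
	 */
	val = READ_ONCE(shared->state);
	WRITE_ONCE(shared->state, val + 1);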

> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> 
> I understand that we all agree we're missing a Tested-by here ;-).

Indeed, hence my "applied for testing and review".  ;-)

							Thanx, Paul

>   Andrea
> 
> 
> > ---
> >  Documentation/memory-barriers.txt                   | 12 ------------
> >  .../translations/ko_KR/memory-barriers.txt          | 12 ------------
> >  arch/x86/events/core.c                              |  2 +-
> >  arch/x86/include/asm/mmu_context.h                  |  4 ++--
> >  arch/x86/kernel/ldt.c                               |  2 +-
> >  drivers/md/dm-mpath.c                               | 20 ++++++++++----------
> >  fs/dcache.c                                         |  4 ++--
> >  fs/overlayfs/ovl_entry.h                            |  2 +-
> >  fs/overlayfs/readdir.c                              |  2 +-
> >  include/linux/compiler.h                            | 21 +--------------------
> >  include/linux/rculist.h                             |  4 ++--
> >  include/linux/rcupdate.h                            |  4 ++--
> >  kernel/events/core.c                                |  4 ++--
> >  kernel/seccomp.c                                    |  2 +-
> >  kernel/task_work.c                                  |  2 +-
> >  mm/slab.h                                           |  2 +-
> >  16 files changed, 28 insertions(+), 71 deletions(-)
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index b759a60624fd..470a682f3fa4 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1886,18 +1886,6 @@ There are some more advanced barrier functions:
> >       See Documentation/atomic_{t,bitops}.txt for more information.
> >  
> >  
> > - (*) lockless_dereference();
> > -
> > -     This can be thought of as a pointer-fetch wrapper around the
> > -     smp_read_barrier_depends() data-dependency barrier.
> > -
> > -     This is also similar to rcu_dereference(), but in cases where
> > -     object lifetime is handled by some mechanism other than RCU, for
> > -     example, when the objects removed only when the system goes down.
> > -     In addition, lockless_dereference() is used in some data structures
> > -     that can be used both with and without RCU.
> > -
> > -
> >   (*) dma_wmb();
> >   (*) dma_rmb();
> >  
> > diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt
> > index a7a813258013..ec3b46e27b7a 100644
> > --- a/Documentation/translations/ko_KR/memory-barriers.txt
> > +++ b/Documentation/translations/ko_KR/memory-barriers.txt
> > @@ -1858,18 +1858,6 @@ Mandatory 배리어들은 SMP 시스템에서도 UP 시스템에서도 SMP 효
> >       참고하세요.
> >  
> >  
> > - (*) lockless_dereference();
> > -
> > -     이 함수는 smp_read_barrier_depends() 데이터 의존성 배리어를 사용하는
> > -     포인터 읽어오기 래퍼(wrapper) 함수로 생각될 수 있습니다.
> > -
> > -     객체의 라이프타임이 RCU 외의 메커니즘으로 관리된다는 점을 제외하면
> > -     rcu_dereference() 와도 유사한데, 예를 들면 객체가 시스템이 꺼질 때에만
> > -     제거되는 경우 등입니다.  또한, lockless_dereference() 은 RCU 와 함께
> > -     사용될수도, RCU 없이 사용될 수도 있는 일부 데이터 구조에 사용되고
> > -     있습니다.
> > -
> > -
> >   (*) dma_wmb();
> >   (*) dma_rmb();
> >  
> > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> > index 80534d3c2480..589af1eec7c1 100644
> > --- a/arch/x86/events/core.c
> > +++ b/arch/x86/events/core.c
> > @@ -2371,7 +2371,7 @@ static unsigned long get_segment_base(unsigned int segment)
> >  		struct ldt_struct *ldt;
> >  
> >  		/* IRQs are off, so this synchronizes with smp_store_release */
> > -		ldt = lockless_dereference(current->active_mm->context.ldt);
> > +		ldt = READ_ONCE(current->active_mm->context.ldt);
> >  		if (!ldt || idx >= ldt->nr_entries)
> >  			return 0;
> >  
> > diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> > index c120b5db178a..9037a4e546e8 100644
> > --- a/arch/x86/include/asm/mmu_context.h
> > +++ b/arch/x86/include/asm/mmu_context.h
> > @@ -72,8 +72,8 @@ static inline void load_mm_ldt(struct mm_struct *mm)
> >  #ifdef CONFIG_MODIFY_LDT_SYSCALL
> >  	struct ldt_struct *ldt;
> >  
> > -	/* lockless_dereference synchronizes with smp_store_release */
> > -	ldt = lockless_dereference(mm->context.ldt);
> > +	/* READ_ONCE synchronizes with smp_store_release */
> > +	ldt = READ_ONCE(mm->context.ldt);
> >  
> >  	/*
> >  	 * Any change to mm->context.ldt is followed by an IPI to all
> > diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> > index f0e64db18ac8..0a21390642c4 100644
> > --- a/arch/x86/kernel/ldt.c
> > +++ b/arch/x86/kernel/ldt.c
> > @@ -101,7 +101,7 @@ static void finalize_ldt_struct(struct ldt_struct *ldt)
> >  static void install_ldt(struct mm_struct *current_mm,
> >  			struct ldt_struct *ldt)
> >  {
> > -	/* Synchronizes with lockless_dereference in load_mm_ldt. */
> > +	/* Synchronizes with READ_ONCE in load_mm_ldt. */
> >  	smp_store_release(&current_mm->context.ldt, ldt);
> >  
> >  	/* Activate the LDT for all CPUs using current_mm. */
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 11f273d2f018..3f88c9d32f7e 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -366,7 +366,7 @@ static struct pgpath *choose_path_in_pg(struct multipath *m,
> >  
> >  	pgpath = path_to_pgpath(path);
> >  
> > -	if (unlikely(lockless_dereference(m->current_pg) != pg)) {
> > +	if (unlikely(READ_ONCE(m->current_pg) != pg)) {
> >  		/* Only update current_pgpath if pg changed */
> >  		spin_lock_irqsave(&m->lock, flags);
> >  		m->current_pgpath = pgpath;
> > @@ -390,7 +390,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
> >  	}
> >  
> >  	/* Were we instructed to switch PG? */
> > -	if (lockless_dereference(m->next_pg)) {
> > +	if (READ_ONCE(m->next_pg)) {
> >  		spin_lock_irqsave(&m->lock, flags);
> >  		pg = m->next_pg;
> >  		if (!pg) {
> > @@ -406,7 +406,7 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
> >  
> >  	/* Don't change PG until it has no remaining paths */
> >  check_current_pg:
> > -	pg = lockless_dereference(m->current_pg);
> > +	pg = READ_ONCE(m->current_pg);
> >  	if (pg) {
> >  		pgpath = choose_path_in_pg(m, pg, nr_bytes);
> >  		if (!IS_ERR_OR_NULL(pgpath))
> > @@ -473,7 +473,7 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
> >  	struct request *clone;
> >  
> >  	/* Do we need to select a new pgpath? */
> > -	pgpath = lockless_dereference(m->current_pgpath);
> > +	pgpath = READ_ONCE(m->current_pgpath);
> >  	if (!pgpath || !test_bit(MPATHF_QUEUE_IO, &m->flags))
> >  		pgpath = choose_pgpath(m, nr_bytes);
> >  
> > @@ -535,7 +535,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
> >  	bool queue_io;
> >  
> >  	/* Do we need to select a new pgpath? */
> > -	pgpath = lockless_dereference(m->current_pgpath);
> > +	pgpath = READ_ONCE(m->current_pgpath);
> >  	queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
> >  	if (!pgpath || !queue_io)
> >  		pgpath = choose_pgpath(m, nr_bytes);
> > @@ -1804,7 +1804,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
> >  	struct pgpath *current_pgpath;
> >  	int r;
> >  
> > -	current_pgpath = lockless_dereference(m->current_pgpath);
> > +	current_pgpath = READ_ONCE(m->current_pgpath);
> >  	if (!current_pgpath)
> >  		current_pgpath = choose_pgpath(m, 0);
> >  
> > @@ -1826,7 +1826,7 @@ static int multipath_prepare_ioctl(struct dm_target *ti,
> >  	}
> >  
> >  	if (r == -ENOTCONN) {
> > -		if (!lockless_dereference(m->current_pg)) {
> > +		if (!READ_ONCE(m->current_pg)) {
> >  			/* Path status changed, redo selection */
> >  			(void) choose_pgpath(m, 0);
> >  		}
> > @@ -1895,9 +1895,9 @@ static int multipath_busy(struct dm_target *ti)
> >  		return (m->queue_mode != DM_TYPE_MQ_REQUEST_BASED);
> >  
> >  	/* Guess which priority_group will be used at next mapping time */
> > -	pg = lockless_dereference(m->current_pg);
> > -	next_pg = lockless_dereference(m->next_pg);
> > -	if (unlikely(!lockless_dereference(m->current_pgpath) && next_pg))
> > +	pg = READ_ONCE(m->current_pg);
> > +	next_pg = READ_ONCE(m->next_pg);
> > +	if (unlikely(!READ_ONCE(m->current_pgpath) && next_pg))
> >  		pg = next_pg;
> >  
> >  	if (!pg) {
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index f90141387f01..34c852af215c 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -231,7 +231,7 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
> >  {
> >  	/*
> >  	 * Be careful about RCU walk racing with rename:
> > -	 * use 'lockless_dereference' to fetch the name pointer.
> > +	 * use 'READ_ONCE' to fetch the name pointer.
> >  	 *
> >  	 * NOTE! Even if a rename will mean that the length
> >  	 * was not loaded atomically, we don't care. The
> > @@ -245,7 +245,7 @@ static inline int dentry_cmp(const struct dentry *dentry, const unsigned char *c
> >  	 * early because the data cannot match (there can
> >  	 * be no NUL in the ct/tcount data)
> >  	 */
> > -	const unsigned char *cs = lockless_dereference(dentry->d_name.name);
> > +	const unsigned char *cs = READ_ONCE(dentry->d_name.name);
> >  
> >  	return dentry_string_cmp(cs, ct, tcount);
> >  }
> > diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
> > index 878a750986dd..0f6809fa6628 100644
> > --- a/fs/overlayfs/ovl_entry.h
> > +++ b/fs/overlayfs/ovl_entry.h
> > @@ -74,5 +74,5 @@ static inline struct ovl_inode *OVL_I(struct inode *inode)
> >  
> >  static inline struct dentry *ovl_upperdentry_dereference(struct ovl_inode *oi)
> >  {
> > -	return lockless_dereference(oi->__upperdentry);
> > +	return READ_ONCE(oi->__upperdentry);
> >  }
> > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > index 62e9b22a2077..0b389d330613 100644
> > --- a/fs/overlayfs/readdir.c
> > +++ b/fs/overlayfs/readdir.c
> > @@ -754,7 +754,7 @@ static int ovl_dir_fsync(struct file *file, loff_t start, loff_t end,
> >  	if (!od->is_upper && OVL_TYPE_UPPER(ovl_path_type(dentry))) {
> >  		struct inode *inode = file_inode(file);
> >  
> > -		realfile = lockless_dereference(od->upperfile);
> > +		realfile = READ_ONCE(od->upperfile);
> >  		if (!realfile) {
> >  			struct path upperpath;
> >  
> > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > index e95a2631e545..f260ff39f90f 100644
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -340,6 +340,7 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> >  		__read_once_size(&(x), __u.__c, sizeof(x));		\
> >  	else								\
> >  		__read_once_size_nocheck(&(x), __u.__c, sizeof(x));	\
> > +	smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
> >  	__u.__val;							\
> >  })
> >  #define READ_ONCE(x) __READ_ONCE(x, 1)
> > @@ -604,24 +605,4 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s
> >  	(volatile typeof(x) *)&(x); })
> >  #define ACCESS_ONCE(x) (*__ACCESS_ONCE(x))
> >  
> > -/**
> > - * lockless_dereference() - safely load a pointer for later dereference
> > - * @p: The pointer to load
> > - *
> > - * Similar to rcu_dereference(), but for situations where the pointed-to
> > - * object's lifetime is managed by something other than RCU.  That
> > - * "something other" might be reference counting or simple immortality.
> > - *
> > - * The seemingly unused variable ___typecheck_p validates that @p is
> > - * indeed a pointer type by using a pointer to typeof(*p) as the type.
> > - * Taking a pointer to typeof(*p) again is needed in case p is void *.
> > - */
> > -#define lockless_dereference(p) \
> > -({ \
> > -	typeof(p) _________p1 = READ_ONCE(p); \
> > -	typeof(*(p)) *___typecheck_p __maybe_unused; \
> > -	smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
> > -	(_________p1); \
> > -})
> > -
> >  #endif /* __LINUX_COMPILER_H */
> > diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> > index b1fd8bf85fdc..3a2bb7d8ed4d 100644
> > --- a/include/linux/rculist.h
> > +++ b/include/linux/rculist.h
> > @@ -274,7 +274,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >   * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
> >   */
> >  #define list_entry_rcu(ptr, type, member) \
> > -	container_of(lockless_dereference(ptr), type, member)
> > +	container_of(READ_ONCE(ptr), type, member)
> >  
> >  /**
> >   * Where are list_empty_rcu() and list_first_entry_rcu()?
> > @@ -367,7 +367,7 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
> >   * example is when items are added to the list, but never deleted.
> >   */
> >  #define list_entry_lockless(ptr, type, member) \
> > -	container_of((typeof(ptr))lockless_dereference(ptr), type, member)
> > +	container_of((typeof(ptr))READ_ONCE(ptr), type, member)
> >  
> >  /**
> >   * list_for_each_entry_lockless - iterate over rcu list of given type
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index de50d8a4cf41..380a3aeb09d7 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -346,7 +346,7 @@ static inline void rcu_preempt_sleep_check(void) { }
> >  #define __rcu_dereference_check(p, c, space) \
> >  ({ \
> >  	/* Dependency order vs. p above. */ \
> > -	typeof(*p) *________p1 = (typeof(*p) *__force)lockless_dereference(p); \
> > +	typeof(*p) *________p1 = (typeof(*p) *__force)READ_ONCE(p); \
> >  	RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
> >  	rcu_dereference_sparse(p, space); \
> >  	((typeof(*p) __force __kernel *)(________p1)); \
> > @@ -360,7 +360,7 @@ static inline void rcu_preempt_sleep_check(void) { }
> >  #define rcu_dereference_raw(p) \
> >  ({ \
> >  	/* Dependency order vs. p above. */ \
> > -	typeof(p) ________p1 = lockless_dereference(p); \
> > +	typeof(p) ________p1 = READ_ONCE(p); \
> >  	((typeof(*p) __force __kernel *)(________p1)); \
> >  })
> >  
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 6bc21e202ae4..417812ce0099 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -4231,7 +4231,7 @@ static void perf_remove_from_owner(struct perf_event *event)
> >  	 * indeed free this event, otherwise we need to serialize on
> >  	 * owner->perf_event_mutex.
> >  	 */
> > -	owner = lockless_dereference(event->owner);
> > +	owner = READ_ONCE(event->owner);
> >  	if (owner) {
> >  		/*
> >  		 * Since delayed_put_task_struct() also drops the last
> > @@ -4328,7 +4328,7 @@ int perf_event_release_kernel(struct perf_event *event)
> >  		 * Cannot change, child events are not migrated, see the
> >  		 * comment with perf_event_ctx_lock_nested().
> >  		 */
> > -		ctx = lockless_dereference(child->ctx);
> > +		ctx = READ_ONCE(child->ctx);
> >  		/*
> >  		 * Since child_mutex nests inside ctx::mutex, we must jump
> >  		 * through hoops. We start by grabbing a reference on the ctx.
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index bb3a38005b9c..1daa8b61a268 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -189,7 +189,7 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd,
> >  	u32 ret = SECCOMP_RET_ALLOW;
> >  	/* Make sure cross-thread synced filter points somewhere sane. */
> >  	struct seccomp_filter *f =
> > -			lockless_dereference(current->seccomp.filter);
> > +			READ_ONCE(current->seccomp.filter);
> >  
> >  	/* Ensure unexpected behavior doesn't result in failing open. */
> >  	if (unlikely(WARN_ON(f == NULL)))
> > diff --git a/kernel/task_work.c b/kernel/task_work.c
> > index 836a72a66fba..9a9f262fc53d 100644
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -67,7 +67,7 @@ task_work_cancel(struct task_struct *task, task_work_func_t func)
> >  	 * we raced with task_work_run(), *pprev == NULL/exited.
> >  	 */
> >  	raw_spin_lock_irqsave(&task->pi_lock, flags);
> > -	while ((work = lockless_dereference(*pprev))) {
> > +	while ((work = READ_ONCE(*pprev))) {
> >  		if (work->func != func)
> >  			pprev = &work->next;
> >  		else if (cmpxchg(pprev, work, work->next) == work)
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 073362816acc..8894f811a89d 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -258,7 +258,7 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
> >  	 * memcg_caches issues a write barrier to match this (see
> >  	 * memcg_create_kmem_cache()).
> >  	 */
> > -	cachep = lockless_dereference(arr->entries[idx]);
> > +	cachep = READ_ONCE(arr->entries[idx]);
> >  	rcu_read_unlock();
> >  
> >  	return cachep;
> > -- 
> > 2.1.4
> > 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-10-05 20:10 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-27 15:49 [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Will Deacon
2017-09-27 15:49 ` [RFC PATCH 1/2] arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables Will Deacon
2017-09-28  8:38   ` Peter Zijlstra
2017-09-28  8:45     ` Will Deacon
2017-09-28 15:43       ` Paul E. McKenney
2017-09-28 15:49         ` Will Deacon
2017-09-28 16:07           ` Paul E. McKenney
2017-09-28 18:59         ` Michael Cree
2017-09-29  0:58           ` Paul E. McKenney
2017-09-29  9:08             ` Will Deacon
2017-09-29 16:29               ` Paul E. McKenney
2017-09-29 16:33                 ` Will Deacon
2017-10-03 19:11                   ` Paul E. McKenney
2017-10-05 16:31                     ` Will Deacon
2017-10-05 16:31                       ` Will Deacon
2017-10-05 16:31                       ` Will Deacon
2017-10-05 19:22                       ` Paul E. McKenney
2017-10-05 19:31                       ` Andrea Parri
2017-10-05 20:09                         ` Paul E. McKenney
2017-09-28 19:18   ` Timur Tabi
2017-09-27 15:49 ` [RFC PATCH 2/2] mm: page_vma_mapped: Ensure pmd is loaded with READ_ONCE outside of lock Will Deacon
2017-09-27 22:01 ` [RFC PATCH 0/2] Missing READ_ONCE in core and arch-specific pgtable code leading to crashes Yury Norov
2017-09-28 17:30 ` Richard Ruigrok
2017-09-28 19:38 ` Jon Masters
2017-09-29  8:56   ` Will Deacon
2017-10-03  6:36     ` Jon Masters
2017-10-05 16:54       ` Will Deacon
