linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 0/6] tlb: Fix access and (soft-)dirty bit management
@ 2020-11-20 14:35 Will Deacon
  2020-11-20 14:35 ` [PATCH 1/6] arm64: pgtable: Fix pte_accessible() Will Deacon
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-20 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	Linus Torvalds, linux-mm, Minchan Kim, Catalin Marinas,
	Will Deacon, linux-arm-kernel

Hi all,

This series attempts to fix some issues in our access and (soft-)dirty bit
management with respect to TLB invalidation. It's a bit all over the place
because I kept running into new issues as I was trying to figure it out.

The first patch fixes a crash we've seen in practice. The other patches
are all addressing things that I found by code inspection and I would
_really_ appreciate others having a look. In particular, what can go
wrong in practice if a CPU has a stale, writable entry in the TLB for a
pte which is !pte_write()? It feels intuitively bad, but I couldn't find
anywhere that would explode (the CoW path looks alright, for example).

Cheers,

Will

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: linux-mm@kvack.org
Cc: linux-arm-kernel@lists.infradead.org

--->8

Will Deacon (6):
  arm64: pgtable: Fix pte_accessible()
  arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
  tlb: mmu_gather: Remove unused start/end arguments from
    tlb_finish_mmu()
  mm: proc: Invalidate TLB after clearing soft-dirty page state
  tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm()
  mm: proc: Avoid fullmm flush for young/dirty bit toggling

 arch/arm64/include/asm/pgtable.h | 31 +++++++++++++++----------------
 arch/ia64/include/asm/tlb.h      |  2 +-
 arch/x86/kernel/ldt.c            |  2 +-
 fs/exec.c                        |  2 +-
 fs/proc/task_mmu.c               | 22 +++++++++++++---------
 include/asm-generic/tlb.h        |  6 ++++--
 include/linux/mm_types.h         |  4 ++--
 mm/hugetlb.c                     |  2 +-
 mm/madvise.c                     |  6 +++---
 mm/memory.c                      |  4 ++--
 mm/mmap.c                        |  6 +++---
 mm/mmu_gather.c                  | 21 +++++++++++++++------
 mm/oom_kill.c                    |  4 ++--
 13 files changed, 63 insertions(+), 49 deletions(-)

-- 
2.29.2.454.gaff20da3a2-goog



* [PATCH 1/6] arm64: pgtable: Fix pte_accessible()
  2020-11-20 14:35 [PATCH 0/6] tlb: Fix access and (soft-)dirty bit management Will Deacon
@ 2020-11-20 14:35 ` Will Deacon
  2020-11-20 16:03   ` Minchan Kim
                     ` (3 more replies)
  2020-11-20 14:35 ` [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() Will Deacon
                   ` (4 subsequent siblings)
  5 siblings, 4 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-20 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	Linus Torvalds, stable, linux-mm, Minchan Kim, Catalin Marinas,
	Will Deacon, linux-arm-kernel

pte_accessible() is used by ptep_clear_flush() to figure out whether TLB
invalidation is necessary when unmapping pages for reclaim. Although our
implementation is correct according to the architecture, returning true
only for valid, young ptes in the absence of racing page-table
modifications, this is in fact flawed due to lazy invalidation of old
ptes in ptep_clear_flush_young() where we elide the expensive DSB
instruction for completing the TLB invalidation.

Rather than penalise the aging path, adjust pte_accessible() to return
true for any valid pte, even if the access flag is cleared.

Cc: <stable@vger.kernel.org>
Fixes: 76c714be0e5e ("arm64: pgtable: implement pte_accessible()")
Reported-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/pgtable.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4ff12a7adcfd..1bdf51f01e73 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -115,8 +115,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
 #define pte_valid(pte)		(!!(pte_val(pte) & PTE_VALID))
 #define pte_valid_not_user(pte) \
 	((pte_val(pte) & (PTE_VALID | PTE_USER)) == PTE_VALID)
-#define pte_valid_young(pte) \
-	((pte_val(pte) & (PTE_VALID | PTE_AF)) == (PTE_VALID | PTE_AF))
 #define pte_valid_user(pte) \
 	((pte_val(pte) & (PTE_VALID | PTE_USER)) == (PTE_VALID | PTE_USER))
 
@@ -126,7 +124,7 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
  * remapped as PROT_NONE but are yet to be flushed from the TLB.
  */
 #define pte_accessible(mm, pte)	\
-	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
+	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
 
 /*
  * p??_access_permitted() is true for valid user mappings (subject to the
-- 
2.29.2.454.gaff20da3a2-goog
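For context, the generic ptep_clear_flush() consults pte_accessible() roughly
as follows (a paraphrased sketch of mm/pgtable-generic.c, not a verbatim
quote), which is why returning false for a valid-but-old pte skips the flush
entirely:

	pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
			       pte_t *ptep)
	{
		struct mm_struct *mm = vma->vm_mm;
		pte_t pte;

		pte = ptep_get_and_clear(mm, address, ptep);
		if (pte_accessible(mm, pte))
			flush_tlb_page(vma, address);
		return pte;
	}

With lazy ptep_clear_flush_young() on arm64, an "old" pte can still have a
live TLB entry, so checking only for a valid pte is the safe option.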



* [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
  2020-11-20 14:35 [PATCH 0/6] tlb: Fix access and (soft-)dirty bit management Will Deacon
  2020-11-20 14:35 ` [PATCH 1/6] arm64: pgtable: Fix pte_accessible() Will Deacon
@ 2020-11-20 14:35 ` Will Deacon
  2020-11-20 17:09   ` Minchan Kim
  2020-11-23 14:22   ` Catalin Marinas
  2020-11-20 14:35 ` [PATCH 3/6] tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() Will Deacon
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-20 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	Linus Torvalds, stable, linux-mm, Minchan Kim, Catalin Marinas,
	Will Deacon, linux-arm-kernel

With hardware dirty bit management, calling pte_wrprotect() on a writable,
dirty PTE will lose the dirty state and return a read-only, clean entry.

Move the logic from ptep_set_wrprotect() into pte_wrprotect() to ensure that
the dirty bit is preserved for writable entries, as this is required for
soft-dirty bit management if we enable it in the future.

Cc: <stable@vger.kernel.org>
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/pgtable.h | 27 ++++++++++++++-------------
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1bdf51f01e73..a155551863c9 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -162,13 +162,6 @@ static inline pmd_t set_pmd_bit(pmd_t pmd, pgprot_t prot)
 	return pmd;
 }
 
-static inline pte_t pte_wrprotect(pte_t pte)
-{
-	pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
-	pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
-	return pte;
-}
-
 static inline pte_t pte_mkwrite(pte_t pte)
 {
 	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
@@ -194,6 +187,20 @@ static inline pte_t pte_mkdirty(pte_t pte)
 	return pte;
 }
 
+static inline pte_t pte_wrprotect(pte_t pte)
+{
+	/*
+	 * If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY
+	 * clear), set the PTE_DIRTY bit.
+	 */
+	if (pte_hw_dirty(pte))
+		pte = pte_mkdirty(pte);
+
+	pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
+	pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
+	return pte;
+}
+
 static inline pte_t pte_mkold(pte_t pte)
 {
 	return clear_pte_bit(pte, __pgprot(PTE_AF));
@@ -843,12 +850,6 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 	pte = READ_ONCE(*ptep);
 	do {
 		old_pte = pte;
-		/*
-		 * If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY
-		 * clear), set the PTE_DIRTY bit.
-		 */
-		if (pte_hw_dirty(pte))
-			pte = pte_mkdirty(pte);
 		pte = pte_wrprotect(pte);
 		pte_val(pte) = cmpxchg_relaxed(&pte_val(*ptep),
 					       pte_val(old_pte), pte_val(pte));
-- 
2.29.2.454.gaff20da3a2-goog
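As a rough illustration of the problem (sketch only; 'ptep' is a hypothetical
pointer to a writable, hardware-dirty entry, i.e. PTE_WRITE/DBM set and
PTE_RDONLY clear, as described in the comment above):

	pte_t pte = READ_ONCE(*ptep);	/* writable + hardware-dirty		*/

	pte = pte_wrprotect(pte);	/* old behaviour: PTE_WRITE cleared,
					 * PTE_RDONLY set, no PTE_DIRTY
					 * => pte_dirty() is now false		*/

With this patch, pte_wrprotect() first transfers the hardware dirty state into
the software PTE_DIRTY bit, so pte_dirty() remains true for such an entry.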



* [PATCH 3/6] tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()
  2020-11-20 14:35 [PATCH 0/6] tlb: Fix access and (soft-)dirty bit management Will Deacon
  2020-11-20 14:35 ` [PATCH 1/6] arm64: pgtable: Fix pte_accessible() Will Deacon
  2020-11-20 14:35 ` [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() Will Deacon
@ 2020-11-20 14:35 ` Will Deacon
  2020-11-20 17:20   ` Linus Torvalds
  2020-11-20 14:35 ` [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state Will Deacon
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2020-11-20 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	Linus Torvalds, linux-mm, Minchan Kim, Catalin Marinas,
	Will Deacon, linux-arm-kernel

tlb_finish_mmu() takes two confusing and unused 'start'/'end' address
arguments. Remove them.

Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/ia64/include/asm/tlb.h | 2 +-
 arch/x86/kernel/ldt.c       | 2 +-
 fs/exec.c                   | 2 +-
 fs/proc/task_mmu.c          | 2 +-
 include/linux/mm_types.h    | 3 +--
 mm/hugetlb.c                | 2 +-
 mm/madvise.c                | 6 +++---
 mm/memory.c                 | 4 ++--
 mm/mmap.c                   | 4 ++--
 mm/mmu_gather.c             | 5 +----
 mm/oom_kill.c               | 4 ++--
 11 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/ia64/include/asm/tlb.h b/arch/ia64/include/asm/tlb.h
index 8d9da6f08a62..7059eb2e867a 100644
--- a/arch/ia64/include/asm/tlb.h
+++ b/arch/ia64/include/asm/tlb.h
@@ -36,7 +36,7 @@
  *	    tlb_end_vma(tlb, vma);
  *	  }
  *	}
- *	tlb_finish_mmu(tlb, start, end);	// finish unmap for address space MM
+ *	tlb_finish_mmu(tlb);				// finish unmap for address space MM
  */
 #include <linux/mm.h>
 #include <linux/pagemap.h>
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index b8aee71840ae..0d4e1253c9c9 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -400,7 +400,7 @@ static void free_ldt_pgtables(struct mm_struct *mm)
 
 	tlb_gather_mmu(&tlb, mm, start, end);
 	free_pgd_range(&tlb, start, end, start, end);
-	tlb_finish_mmu(&tlb, start, end);
+	tlb_finish_mmu(&tlb);
 #endif
 }
 
diff --git a/fs/exec.c b/fs/exec.c
index 547a2390baf5..aa846c6ec2f0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -724,7 +724,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 		free_pgd_range(&tlb, old_start, old_end, new_end,
 			vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
 	}
-	tlb_finish_mmu(&tlb, old_start, old_end);
+	tlb_finish_mmu(&tlb);
 
 	/*
 	 * Shrink the vma to just the new range.  Always succeeds.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 217aa2705d5d..cd03ab9087b0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1260,7 +1260,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 				&cp);
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_end(&range);
-		tlb_finish_mmu(&tlb, 0, -1);
+		tlb_finish_mmu(&tlb);
 		mmap_read_unlock(mm);
 out_mm:
 		mmput(mm);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5a9238f6caad..7b90058a62be 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -585,8 +585,7 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 				unsigned long start, unsigned long end);
-extern void tlb_finish_mmu(struct mmu_gather *tlb,
-				unsigned long start, unsigned long end);
+extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 37f15c3c24dc..4c0235122464 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3985,7 +3985,7 @@ void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 
 	tlb_gather_mmu(&tlb, mm, tlb_start, tlb_end);
 	__unmap_hugepage_range(&tlb, vma, start, end, ref_page);
-	tlb_finish_mmu(&tlb, tlb_start, tlb_end);
+	tlb_finish_mmu(&tlb);
 }
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index 416a56b8e757..29cd3d4172f5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -508,7 +508,7 @@ static long madvise_cold(struct vm_area_struct *vma,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
 	madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
-	tlb_finish_mmu(&tlb, start_addr, end_addr);
+	tlb_finish_mmu(&tlb);
 
 	return 0;
 }
@@ -560,7 +560,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
 	madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
-	tlb_finish_mmu(&tlb, start_addr, end_addr);
+	tlb_finish_mmu(&tlb);
 
 	return 0;
 }
@@ -732,7 +732,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 			&madvise_free_walk_ops, &tlb);
 	tlb_end_vma(&tlb, vma);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb, range.start, range.end);
+	tlb_finish_mmu(&tlb);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e502..04a88c15e076 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1529,7 +1529,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
 		unmap_single_vma(&tlb, vma, start, range.end, NULL);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb, start, range.end);
+	tlb_finish_mmu(&tlb);
 }
 
 /**
@@ -1555,7 +1555,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
 	mmu_notifier_invalidate_range_start(&range);
 	unmap_single_vma(&tlb, vma, address, range.end, details);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_finish_mmu(&tlb, address, range.end);
+	tlb_finish_mmu(&tlb);
 }
 
 /**
diff --git a/mm/mmap.c b/mm/mmap.c
index d91ecb00d38c..6d94b2ee9c45 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2678,7 +2678,7 @@ static void unmap_region(struct mm_struct *mm,
 	unmap_vmas(&tlb, vma, start, end);
 	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
 				 next ? next->vm_start : USER_PGTABLES_CEILING);
-	tlb_finish_mmu(&tlb, start, end);
+	tlb_finish_mmu(&tlb);
 }
 
 /*
@@ -3221,7 +3221,7 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
-	tlb_finish_mmu(&tlb, 0, -1);
+	tlb_finish_mmu(&tlb);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 03c33c93a582..b0be5a7aa08f 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -290,14 +290,11 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 /**
  * tlb_finish_mmu - finish an mmu_gather structure
  * @tlb: the mmu_gather structure to finish
- * @start: start of the region that will be removed from the page-table
- * @end: end of the region that will be removed from the page-table
  *
  * Called at the end of the shootdown operation to free up any resources that
  * were required.
  */
-void tlb_finish_mmu(struct mmu_gather *tlb,
-		unsigned long start, unsigned long end)
+void tlb_finish_mmu(struct mmu_gather *tlb)
 {
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8b84661a6410..c7936196a4ae 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -546,13 +546,13 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 						vma->vm_end);
 			tlb_gather_mmu(&tlb, mm, range.start, range.end);
 			if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
-				tlb_finish_mmu(&tlb, range.start, range.end);
+				tlb_finish_mmu(&tlb);
 				ret = false;
 				continue;
 			}
 			unmap_page_range(&tlb, vma, range.start, range.end, NULL);
 			mmu_notifier_invalidate_range_end(&range);
-			tlb_finish_mmu(&tlb, range.start, range.end);
+			tlb_finish_mmu(&tlb);
 		}
 	}
 
-- 
2.29.2.454.gaff20da3a2-goog



* [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 14:35 [PATCH 0/6] tlb: Fix access and (soft-)dirty bit management Will Deacon
                   ` (2 preceding siblings ...)
  2020-11-20 14:35 ` [PATCH 3/6] tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() Will Deacon
@ 2020-11-20 14:35 ` Will Deacon
  2020-11-20 15:00   ` Peter Zijlstra
  2020-11-20 20:22   ` Yu Zhao
  2020-11-20 14:35 ` [PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm() Will Deacon
  2020-11-20 14:35 ` [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling Will Deacon
  5 siblings, 2 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-20 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	Linus Torvalds, linux-mm, Minchan Kim, Catalin Marinas,
	Will Deacon, linux-arm-kernel

Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
via the tlb_remove_*() functions. Consequently, the page-table modifications
performed by clear_refs_write() in response to a write to
/proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
fine when simply aging the ptes, in the case of clearing the "soft-dirty"
state we can end up with entries where pte_write() is false, yet a
writable mapping remains in the TLB.

Fix this by calling tlb_remove_tlb_entry() for each entry being
write-protected when clearing soft-dirty.

Signed-off-by: Will Deacon <will@kernel.org>
---
 fs/proc/task_mmu.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index cd03ab9087b0..3308292ee5c5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1032,11 +1032,12 @@ enum clear_refs_types {
 
 struct clear_refs_private {
 	enum clear_refs_types type;
+	struct mmu_gather *tlb;
 };
 
 #ifdef CONFIG_MEM_SOFT_DIRTY
 static inline void clear_soft_dirty(struct vm_area_struct *vma,
-		unsigned long addr, pte_t *pte)
+		unsigned long addr, pte_t *pte, struct mmu_gather *tlb)
 {
 	/*
 	 * The soft-dirty tracker uses #PF-s to catch writes
@@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 		ptent = pte_wrprotect(old_pte);
 		ptent = pte_clear_soft_dirty(ptent);
 		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+		tlb_remove_tlb_entry(tlb, pte, addr);
 	} else if (is_swap_pte(ptent)) {
 		ptent = pte_swp_clear_soft_dirty(ptent);
 		set_pte_at(vma->vm_mm, addr, pte, ptent);
@@ -1060,14 +1062,14 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #else
 static inline void clear_soft_dirty(struct vm_area_struct *vma,
-		unsigned long addr, pte_t *pte)
+		unsigned long addr, pte_t *pte, struct mmu_gather *tlb)
 {
 }
 #endif
 
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t *pmdp)
+		unsigned long addr, pmd_t *pmdp, struct mmu_gather *tlb)
 {
 	pmd_t old, pmd = *pmdp;
 
@@ -1081,6 +1083,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 
 		pmd = pmd_wrprotect(pmd);
 		pmd = pmd_clear_soft_dirty(pmd);
+		tlb_remove_pmd_tlb_entry(tlb, pmdp, addr);
 
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
@@ -1090,7 +1093,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 }
 #else
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t *pmdp)
+		unsigned long addr, pmd_t *pmdp, struct mmu_gather *tlb)
 {
 }
 #endif
@@ -1107,7 +1110,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
-			clear_soft_dirty_pmd(vma, addr, pmd);
+			clear_soft_dirty_pmd(vma, addr, pmd, cp->tlb);
 			goto out;
 		}
 
@@ -1133,7 +1136,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 		ptent = *pte;
 
 		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
-			clear_soft_dirty(vma, addr, pte);
+			clear_soft_dirty(vma, addr, pte, cp->tlb);
 			continue;
 		}
 
@@ -1212,7 +1215,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 	if (mm) {
 		struct mmu_notifier_range range;
 		struct clear_refs_private cp = {
-			.type = type,
+			.type	= type,
+			.tlb	= &tlb,
 		};
 
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
-- 
2.29.2.454.gaff20da3a2-goog
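Condensed, the pattern this patch relies on looks roughly like the following
(a sketch of the paths touched above, not a literal excerpt; 'old_pte' comes
from ptep_modify_prot_start() as in clear_soft_dirty()):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, 0, -1);	/* still fullmm here; reworked
						 * in patches 5 and 6		*/
	/* for each pte being write-protected while clearing soft-dirty: */
	ptent = pte_wrprotect(old_pte);
	ptent = pte_clear_soft_dirty(ptent);
	ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
	tlb_remove_tlb_entry(&tlb, pte, addr);	/* batch this address		*/

	tlb_finish_mmu(&tlb);			/* entries were batched, so the
						 * invalidation is no longer
						 * elided (0758cd830494)	*/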



* [PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm()
  2020-11-20 14:35 [PATCH 0/6] tlb: Fix access and (soft-)dirty bit management Will Deacon
                   ` (3 preceding siblings ...)
  2020-11-20 14:35 ` [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state Will Deacon
@ 2020-11-20 14:35 ` Will Deacon
  2020-11-20 17:22   ` Linus Torvalds
       [not found]   ` <20201122151158.GK2390@xsang-OptiPlex-9020>
  2020-11-20 14:35 ` [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling Will Deacon
  5 siblings, 2 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-20 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	Linus Torvalds, linux-mm, Minchan Kim, Catalin Marinas,
	Will Deacon, linux-arm-kernel

Passing the range '0, -1' to tlb_gather_mmu() sets the 'fullmm' flag,
which indicates that the mm_struct being operated on is going away. In
this case, some architectures (such as arm64) can elide TLB invalidation
by ensuring that the TLB tag (ASID) associated with this mm is not
immediately reclaimed. Although this behaviour is documented in
asm-generic/tlb.h, it's subtle and easily missed. Consequently, the
/proc walker for manipulating the young and soft-dirty bits passes this
range regardless.

Introduce tlb_gather_mmu_fullmm() to make it clearer that this is for the
entire mm and WARN() if tlb_gather_mmu() is called with an 'end' address
greated than TASK_SIZE.

Signed-off-by: Will Deacon <will@kernel.org>
---
 fs/proc/task_mmu.c        |  2 +-
 include/asm-generic/tlb.h |  6 ++++--
 include/linux/mm_types.h  |  1 +
 mm/mmap.c                 |  2 +-
 mm/mmu_gather.c           | 16 ++++++++++++++--
 5 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3308292ee5c5..a76d339b5754 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			count = -EINTR;
 			goto out_mm;
 		}
-		tlb_gather_mmu(&tlb, mm, 0, -1);
+		tlb_gather_mmu_fullmm(&tlb, mm);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 6661ee1cff47..2c68a545ffa7 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -46,7 +46,9 @@
  *
  * The mmu_gather API consists of:
  *
- *  - tlb_gather_mmu() / tlb_finish_mmu(); start and finish a mmu_gather
+ *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_finish_mmu()
+ *
+ *    start and finish a mmu_gather
  *
  *    Finish in particular will issue a (final) TLB invalidate and free
  *    all (remaining) queued pages.
@@ -91,7 +93,7 @@
  *
  *  - mmu_gather::fullmm
  *
- *    A flag set by tlb_gather_mmu() to indicate we're going to free
+ *    A flag set by tlb_gather_mmu_fullmm() to indicate we're going to free
  *    the entire mm; this allows a number of optimizations.
  *
  *    - We can ignore tlb_{start,end}_vma(); because we don't
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7b90058a62be..42231729affe 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -585,6 +585,7 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 				unsigned long start, unsigned long end);
+extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
diff --git a/mm/mmap.c b/mm/mmap.c
index 6d94b2ee9c45..4b2809fbbd4a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3216,7 +3216,7 @@ void exit_mmap(struct mm_struct *mm)
 
 	lru_add_drain();
 	flush_cache_mm(mm);
-	tlb_gather_mmu(&tlb, mm, 0, -1);
+	tlb_gather_mmu_fullmm(&tlb, mm);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index b0be5a7aa08f..87b48444e7e5 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -261,8 +261,8 @@ void tlb_flush_mmu(struct mmu_gather *tlb)
  * respectively when @mm is without users and we're going to destroy
  * the full address space (exit/execve).
  */
-void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
-			unsigned long start, unsigned long end)
+static void __tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+			     unsigned long start, unsigned long end)
 {
 	tlb->mm = mm;
 
@@ -287,6 +287,18 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
 	inc_tlb_flush_pending(tlb->mm);
 }
 
+void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm,
+		    unsigned long start, unsigned long end)
+{
+	WARN_ON(end > TASK_SIZE);
+	__tlb_gather_mmu(tlb, mm, start, end);
+}
+
+void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
+{
+	__tlb_gather_mmu(tlb, mm, 0, -1);
+}
+
 /**
  * tlb_finish_mmu - finish an mmu_gather structure
  * @tlb: the mmu_gather structure to finish
-- 
2.29.2.454.gaff20da3a2-goog
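In other words, after this patch the two entry points are used roughly like
this (illustrative sketch; 'start'/'end' stand for a real user address range):

	/* Tearing down the whole address space (exit/execve): the mm is going
	 * away, so architectures such as arm64 may elide the invalidation by
	 * not immediately reusing the ASID. */
	tlb_gather_mmu_fullmm(&tlb, mm);
	unmap_vmas(&tlb, vma, 0, -1);
	tlb_finish_mmu(&tlb);

	/* Operating on a live mm: pass the real range; tlb_gather_mmu() now
	 * WARNs if 'end' is above TASK_SIZE. */
	tlb_gather_mmu(&tlb, mm, start, end);
	/* ... unmap or write-protect the range ... */
	tlb_finish_mmu(&tlb);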



* [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-20 14:35 [PATCH 0/6] tlb: Fix access and (soft-)dirty bit management Will Deacon
                   ` (4 preceding siblings ...)
  2020-11-20 14:35 ` [PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm() Will Deacon
@ 2020-11-20 14:35 ` Will Deacon
  2020-11-20 17:41   ` Linus Torvalds
  2020-11-20 20:40   ` Yu Zhao
  5 siblings, 2 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-20 14:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	Linus Torvalds, linux-mm, Minchan Kim, Catalin Marinas,
	Will Deacon, linux-arm-kernel

clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
updating the page-tables for the current mm. However, since the mm is not
being freed, this can result in stale TLB entries on architectures which
elide 'fullmm' invalidation.

Ensure that TLB invalidation is performed after updating soft-dirty
entries via clear_refs_write() by using the non-fullmm mmu_gather API.

Signed-off-by: Will Deacon <will@kernel.org>
---
 fs/proc/task_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a76d339b5754..316af047f1aa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			count = -EINTR;
 			goto out_mm;
 		}
-		tlb_gather_mmu_fullmm(&tlb, mm);
+		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
-- 
2.29.2.454.gaff20da3a2-goog
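Taken together with patches 4 and 5, the clear_refs_write() sequence for
soft-dirty clearing ends up looking roughly like this (sketch; the page-walk
call is paraphrased rather than quoted):

	tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);	/* ranged gather, not fullmm	*/
	/* walk the page tables; clear_soft_dirty{,_pmd}() batch each
	 * write-protected entry via tlb_remove_tlb_entry() /
	 * tlb_remove_pmd_tlb_entry() */
	tlb_finish_mmu(&tlb);			/* now actually invalidates	*/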



* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 14:35 ` [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state Will Deacon
@ 2020-11-20 15:00   ` Peter Zijlstra
  2020-11-20 15:09     ` Peter Zijlstra
                       ` (2 more replies)
  2020-11-20 20:22   ` Yu Zhao
  1 sibling, 3 replies; 42+ messages in thread
From: Peter Zijlstra @ 2020-11-20 15:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Catalin Marinas, Linus Torvalds,
	linux-kernel, linux-mm, Minchan Kim, kernel-team,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> via the tlb_remove_*() functions. Consequently, the page-table modifications
> performed by clear_refs_write() in response to a write to
> /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> state we can end up with entries where pte_write() is false, yet a
> writable mapping remains in the TLB.
> 
> Fix this by calling tlb_remove_tlb_entry() for each entry being
> write-protected when cleating soft-dirty.
> 

> @@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  		ptent = pte_wrprotect(old_pte);
>  		ptent = pte_clear_soft_dirty(ptent);
>  		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
> +		tlb_remove_tlb_entry(tlb, pte, addr);
>  	} else if (is_swap_pte(ptent)) {
>  		ptent = pte_swp_clear_soft_dirty(ptent);
>  		set_pte_at(vma->vm_mm, addr, pte, ptent);

Oh!

Yesterday when you had me look at this code, I figured the sane thing
to do was to make it look more like mprotect().

Why did you choose to make it work with mmu_gather instead? I'll grant
you that it's probably the smaller patch, but I still think it's weird
to use mmu_gather here.

Also, is tlb_remove_tlb_entry() actually correct? If you look at
__tlb_remove_tlb_entry() you'll find that Power-Hash-32 will clear the
entry, which might not be what we want here; we want to update the
entry.





* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 15:00   ` Peter Zijlstra
@ 2020-11-20 15:09     ` Peter Zijlstra
  2020-11-20 15:15     ` Will Deacon
  2020-11-20 15:55     ` Minchan Kim
  2 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2020-11-20 15:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Catalin Marinas, Linus Torvalds,
	linux-kernel, linux-mm, Minchan Kim, kernel-team,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 04:00:23PM +0100, Peter Zijlstra wrote:

> If you look at __tlb_remove_tlb_entry()

... you'll also find we can probably do this ... :-)

diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index e841cae544c2..779a5a0f0608 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -24,7 +24,6 @@ void flush_tlb_pending(void);
 
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
-#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb)	flush_tlb_pending()
 
 /*
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 820082bd6880..1bfe979bb9bc 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -4,7 +4,6 @@
 
 #define tlb_start_vma(tlb, vma) do { } while (0)
 #define tlb_end_vma(tlb, vma) do { } while (0)
-#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);


* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 15:00   ` Peter Zijlstra
  2020-11-20 15:09     ` Peter Zijlstra
@ 2020-11-20 15:15     ` Will Deacon
  2020-11-20 15:27       ` Peter Zijlstra
  2020-11-20 15:55     ` Minchan Kim
  2 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2020-11-20 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yu Zhao, Anshuman Khandual, Catalin Marinas, Linus Torvalds,
	linux-kernel, linux-mm, Minchan Kim, kernel-team,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 04:00:23PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > performed by clear_refs_write() in response to a write to
> > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > state we can end up with entries where pte_write() is false, yet a
> > writable mapping remains in the TLB.
> > 
> > Fix this by calling tlb_remove_tlb_entry() for each entry being
> > write-protected when cleating soft-dirty.
> > 
> 
> > @@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> >  		ptent = pte_wrprotect(old_pte);
> >  		ptent = pte_clear_soft_dirty(ptent);
> >  		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
> > +		tlb_remove_tlb_entry(tlb, pte, addr);
> >  	} else if (is_swap_pte(ptent)) {
> >  		ptent = pte_swp_clear_soft_dirty(ptent);
> >  		set_pte_at(vma->vm_mm, addr, pte, ptent);
> 
> Oh!
> 
> Yesterday when you had me look at this code; I figured the sane thing
> to do was to make it look more like mprotect().

Ah, so you mean ditch the mmu_gather altogether?

> Why did you chose to make it work with mmu_gather instead? I'll grant
> you that it's probably the smaller patch, but I still think it's weird
> to use mmu_gather here.
> 
> Also, is tlb_remote_tlb_entry() actually correct? If you look at
> __tlb_remove_tlb_entry() you'll find that Power-Hash-32 will clear the
> entry, which might not be what we want here, we want to update the
> entrty.

Hmm, I didn't spot that, although ptep_modify_prot_start() does actually
clear the pte so we could just move this up a few lines.

Will

> 
> 
> 


* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 15:15     ` Will Deacon
@ 2020-11-20 15:27       ` Peter Zijlstra
  2020-11-23 18:23         ` Will Deacon
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2020-11-20 15:27 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Catalin Marinas, Linus Torvalds,
	linux-kernel, linux-mm, Minchan Kim, kernel-team,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 03:15:24PM +0000, Will Deacon wrote:
> On Fri, Nov 20, 2020 at 04:00:23PM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > performed by clear_refs_write() in response to a write to
> > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > state we can end up with entries where pte_write() is false, yet a
> > > writable mapping remains in the TLB.
> > > 
> > > Fix this by calling tlb_remove_tlb_entry() for each entry being
> > > write-protected when cleating soft-dirty.
> > > 
> > 
> > > @@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> > >  		ptent = pte_wrprotect(old_pte);
> > >  		ptent = pte_clear_soft_dirty(ptent);
> > >  		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
> > > +		tlb_remove_tlb_entry(tlb, pte, addr);
> > >  	} else if (is_swap_pte(ptent)) {
> > >  		ptent = pte_swp_clear_soft_dirty(ptent);
> > >  		set_pte_at(vma->vm_mm, addr, pte, ptent);
> > 
> > Oh!
> > 
> > Yesterday when you had me look at this code; I figured the sane thing
> > to do was to make it look more like mprotect().
> 
> Ah, so you mean ditch the mmu_gather altogether?

Yes. Alternatively, if we decide mmu_gather is 'right', then we should
probably look at converting mprotect().

That is, I see no reason why this and mprotect should differ on this
point.

> > Why did you chose to make it work with mmu_gather instead? I'll grant
> > you that it's probably the smaller patch, but I still think it's weird
> > to use mmu_gather here.
> > 
> > Also, is tlb_remote_tlb_entry() actually correct? If you look at
> > __tlb_remove_tlb_entry() you'll find that Power-Hash-32 will clear the
> > entry, which might not be what we want here, we want to update the
> > entrty.
> 
> Hmm, I didn't spot that, although ptep_modify_prot_start() does actually
> clear the pte so we could just move this up a few lines.

Yes, but hash-entry != pte. If I'm not mistaken (and I could very well
be, it's Friday and Power-MMUs being the maze they are), the end result
here is an updated PTE but an empty hash-entry.


* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 15:00   ` Peter Zijlstra
  2020-11-20 15:09     ` Peter Zijlstra
  2020-11-20 15:15     ` Will Deacon
@ 2020-11-20 15:55     ` Minchan Kim
  2020-11-23 18:41       ` Will Deacon
  2 siblings, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2020-11-20 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yu Zhao, Will Deacon, Anshuman Khandual, Linus Torvalds,
	linux-kernel, linux-mm, Catalin Marinas, kernel-team,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 04:00:23PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > performed by clear_refs_write() in response to a write to
> > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > state we can end up with entries where pte_write() is false, yet a
> > writable mapping remains in the TLB.
> > 
> > Fix this by calling tlb_remove_tlb_entry() for each entry being
> > write-protected when cleating soft-dirty.
> > 
> 
> > @@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> >  		ptent = pte_wrprotect(old_pte);
> >  		ptent = pte_clear_soft_dirty(ptent);
> >  		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
> > +		tlb_remove_tlb_entry(tlb, pte, addr);
> >  	} else if (is_swap_pte(ptent)) {
> >  		ptent = pte_swp_clear_soft_dirty(ptent);
> >  		set_pte_at(vma->vm_mm, addr, pte, ptent);
> 
> Oh!
> 
> Yesterday when you had me look at this code; I figured the sane thing
> to do was to make it look more like mprotect().
> 
> Why did you chose to make it work with mmu_gather instead? I'll grant
> you that it's probably the smaller patch, but I still think it's weird
> to use mmu_gather here.

I agree. The reason why clear_refs_write() used the gather API was [1], and
it seems like overkill to me.

We could just do something like [inc|dec]_tlb_flush_pending with a
flush_tlb_mm right before dec_tlb_flush_pending, instead of the gather API.

Thoughts?

[1] b3a81d0841a95, mm: fix KSM data corruption
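Presumably something along these lines (an untested sketch of the suggestion;
the existing mmu_notifier calls and error handling are omitted):

	inc_tlb_flush_pending(mm);
	/* ... walk and write-protect the soft-dirty ptes as before ... */
	flush_tlb_mm(mm);		/* one flush of the whole mm */
	dec_tlb_flush_pending(mm);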


* Re: [PATCH 1/6] arm64: pgtable: Fix pte_accessible()
  2020-11-20 14:35 ` [PATCH 1/6] arm64: pgtable: Fix pte_accessible() Will Deacon
@ 2020-11-20 16:03   ` Minchan Kim
  2020-11-20 19:53   ` Yu Zhao
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2020-11-20 16:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Yu Zhao, Anshuman Khandual, Peter Zijlstra,
	Catalin Marinas, linux-kernel, stable, linux-mm, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 02:35:52PM +0000, Will Deacon wrote:
> pte_accessible() is used by ptep_clear_flush() to figure out whether TLB
> invalidation is necessary when unmapping pages for reclaim. Although our
> implementation is correct according to the architecture, returning true
> only for valid, young ptes in the absence of racing page-table
> modifications, this is in fact flawed due to lazy invalidation of old
> ptes in ptep_clear_flush_young() where we elide the expensive DSB
> instruction for completing the TLB invalidation.
> 
> Rather than penalise the aging path, adjust pte_accessible() to return
> true for any valid pte, even if the access flag is cleared.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: 76c714be0e5e ("arm64: pgtable: implement pte_accessible()")
> Reported-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: Will Deacon <will@kernel.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>


* Re: [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
  2020-11-20 14:35 ` [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() Will Deacon
@ 2020-11-20 17:09   ` Minchan Kim
  2020-11-23 14:31     ` Catalin Marinas
  2020-11-23 14:22   ` Catalin Marinas
  1 sibling, 1 reply; 42+ messages in thread
From: Minchan Kim @ 2020-11-20 17:09 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Yu Zhao, Anshuman Khandual, Peter Zijlstra,
	Catalin Marinas, linux-kernel, stable, linux-mm, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 02:35:53PM +0000, Will Deacon wrote:
> With hardware dirty bit management, calling pte_wrprotect() on a writable,
> dirty PTE will lose the dirty state and return a read-only, clean entry.
> 
> Move the logic from ptep_set_wrprotect() into pte_wrprotect() to ensure that
> the dirty bit is preserved for writable entries, as this is required for
> soft-dirty bit management if we enable it in the future.
> 
> Cc: <stable@vger.kernel.org>

Is this stable material if it would only become a problem once ARM64
supports soft-dirty in the future?

> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/pgtable.h | 27 ++++++++++++++-------------
>  1 file changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 1bdf51f01e73..a155551863c9 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -162,13 +162,6 @@ static inline pmd_t set_pmd_bit(pmd_t pmd, pgprot_t prot)
>  	return pmd;
>  }
>  
> -static inline pte_t pte_wrprotect(pte_t pte)
> -{
> -	pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
> -	pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
> -	return pte;
> -}
> -
>  static inline pte_t pte_mkwrite(pte_t pte)
>  {
>  	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
> @@ -194,6 +187,20 @@ static inline pte_t pte_mkdirty(pte_t pte)
>  	return pte;
>  }
>  
> +static inline pte_t pte_wrprotect(pte_t pte)
> +{
> +	/*
> +	 * If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY
> +	 * clear), set the PTE_DIRTY bit.
> +	 */
> +	if (pte_hw_dirty(pte))
> +		pte = pte_mkdirty(pte);
> +
> +	pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
> +	pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
> +	return pte;
> +}
> +
>  static inline pte_t pte_mkold(pte_t pte)
>  {
>  	return clear_pte_bit(pte, __pgprot(PTE_AF));
> @@ -843,12 +850,6 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>  	pte = READ_ONCE(*ptep);
>  	do {
>  		old_pte = pte;
> -		/*
> -		 * If hardware-dirty (PTE_WRITE/DBM bit set and PTE_RDONLY
> -		 * clear), set the PTE_DIRTY bit.
> -		 */
> -		if (pte_hw_dirty(pte))
> -			pte = pte_mkdirty(pte);
>  		pte = pte_wrprotect(pte);
>  		pte_val(pte) = cmpxchg_relaxed(&pte_val(*ptep),
>  					       pte_val(old_pte), pte_val(pte));
> -- 
> 2.29.2.454.gaff20da3a2-goog
> 


* Re: [PATCH 3/6] tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()
  2020-11-20 14:35 ` [PATCH 3/6] tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() Will Deacon
@ 2020-11-20 17:20   ` Linus Torvalds
  2020-11-23 16:48     ` Will Deacon
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2020-11-20 17:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	Linux Kernel Mailing List, Linux-MM, Minchan Kim,
	Android Kernel Team, Linux ARM

On Fri, Nov 20, 2020 at 6:36 AM Will Deacon <will@kernel.org> wrote:
>
> tlb_finish_mmu() takes two confusing and unused 'start'/'end' address
> arguments. Remove them.

Ack, but please add the history to it.

Those arguments were used, up until 7a30df49f63a ("mm: mmu_gather:
remove __tlb_reset_range() for force flush").

And the thing is, using a range flush in theory might be better, but
for simplicity it's now doing a "fullmm" one, which is why those
arguments no longer matter.

(And I think we track the range better now too, which may be another
reason they aren't needed)

          Linus


* Re: [PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm()
  2020-11-20 14:35 ` [PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm() Will Deacon
@ 2020-11-20 17:22   ` Linus Torvalds
  2020-11-20 17:31     ` Linus Torvalds
       [not found]   ` <20201122151158.GK2390@xsang-OptiPlex-9020>
  1 sibling, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2020-11-20 17:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	Linux Kernel Mailing List, Linux-MM, Minchan Kim,
	Android Kernel Team, Linux ARM

On Fri, Nov 20, 2020 at 6:36 AM Will Deacon <will@kernel.org> wrote:
>
> Introduce tlb_gather_mmu_fullmm() to make it clearer that this is for the
> entire mm and WARN() if tlb_gather_mmu() is called with an 'end' address
> greated than TASK_SIZE.

Ack (but with a spello note - "greated").

          Linus


* Re: [PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm()
  2020-11-20 17:22   ` Linus Torvalds
@ 2020-11-20 17:31     ` Linus Torvalds
  2020-11-23 16:48       ` Will Deacon
  0 siblings, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2020-11-20 17:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	Linux Kernel Mailing List, Linux-MM, Minchan Kim,
	Android Kernel Team, Linux ARM

Oh - wait.

Not ack.

Not because this is wrong, but because I think you should remove the
start/end arguments here too.

The _only_ thing they were used for was that "fullmm" flag, afaik. So
now they no longer make sense.

Hmm?

               Linus

On Fri, Nov 20, 2020 at 9:22 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Nov 20, 2020 at 6:36 AM Will Deacon <will@kernel.org> wrote:
> >
> > Introduce tlb_gather_mmu_fullmm() to make it clearer that this is for the
> > entire mm and WARN() if tlb_gather_mmu() is called with an 'end' address
> > greated than TASK_SIZE.
>
> Ack (but with a spello note - "greated").
>
>           Linus


* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-20 14:35 ` [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling Will Deacon
@ 2020-11-20 17:41   ` Linus Torvalds
  2020-11-20 17:45     ` Linus Torvalds
  2020-11-20 20:40   ` Yu Zhao
  1 sibling, 1 reply; 42+ messages in thread
From: Linus Torvalds @ 2020-11-20 17:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	Linux Kernel Mailing List, Linux-MM, Minchan Kim,
	Android Kernel Team, Linux ARM

On Fri, Nov 20, 2020 at 6:36 AM Will Deacon <will@kernel.org> wrote:
>
> Ensure that TLB invalidation is performed after updating soft-dirty
> entries via clear_refs_write() by using the non-fullmm API to MMU gather.

This code sequence looks bogus to begin with.

It does that

                tlb_gather_mmu(&tlb, mm, 0, -1);
     ..
                tlb_finish_mmu(&tlb, 0, -1);

around the loop (well, your patch series changes those arguments), but
it doesn't actually use "tlb" anywhere inside the loop itself that I
can see.

Yeah, yeah, it sets the flush_pending thing etc, but that still
sounds fundamentally wrong. It should do the proper range adjustments
if/when it actually walks the range. No?

If I read this all right, it will do a full TLB flush even when it
doesn't do anything (eg CLEAR_REFS_SOFT_DIRTY with no softdirty
pages).

So this looks all kinds of bogus. Not your patch, but the code it patches.

               Linus


* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-20 17:41   ` Linus Torvalds
@ 2020-11-20 17:45     ` Linus Torvalds
  0 siblings, 0 replies; 42+ messages in thread
From: Linus Torvalds @ 2020-11-20 17:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	Linux Kernel Mailing List, Linux-MM, Minchan Kim,
	Android Kernel Team, Linux ARM

On Fri, Nov 20, 2020 at 9:41 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This code sequence looks bogus to begin with.

Oh, never mind.

I was reading the patches out of order, because 4/6 showed up later in
my inbox since it had other replies.

You seem to have fixed that bogosity in 4/6.

             Linus


* Re: [PATCH 1/6] arm64: pgtable: Fix pte_accessible()
  2020-11-20 14:35 ` [PATCH 1/6] arm64: pgtable: Fix pte_accessible() Will Deacon
  2020-11-20 16:03   ` Minchan Kim
@ 2020-11-20 19:53   ` Yu Zhao
  2020-11-23 13:27   ` Catalin Marinas
  2020-11-24 10:02   ` Anshuman Khandual
  3 siblings, 0 replies; 42+ messages in thread
From: Yu Zhao @ 2020-11-20 19:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, stable, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 02:35:52PM +0000, Will Deacon wrote:
> pte_accessible() is used by ptep_clear_flush() to figure out whether TLB
> invalidation is necessary when unmapping pages for reclaim. Although our
> implementation is correct according to the architecture, returning true
> only for valid, young ptes in the absence of racing page-table
> modifications, this is in fact flawed due to lazy invalidation of old
> ptes in ptep_clear_flush_young() where we elide the expensive DSB
> instruction for completing the TLB invalidation.
> 
> Rather than penalise the aging path, adjust pte_accessible() to return
> true for any valid pte, even if the access flag is cleared.

The chance of a system hitting reclaim is proportional to how long
it's been running, and that of having mapped but yet-to-be-accessed
PTEs is reciprocal, so to speak. I don't reboot my devices every day,
and therefore:

Acked-by: Yu Zhao <yuzhao@google.com>

> Cc: <stable@vger.kernel.org>
> Fixes: 76c714be0e5e ("arm64: pgtable: implement pte_accessible()")
> Reported-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/pgtable.h | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 4ff12a7adcfd..1bdf51f01e73 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -115,8 +115,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
>  #define pte_valid(pte)		(!!(pte_val(pte) & PTE_VALID))
>  #define pte_valid_not_user(pte) \
>  	((pte_val(pte) & (PTE_VALID | PTE_USER)) == PTE_VALID)
> -#define pte_valid_young(pte) \
> -	((pte_val(pte) & (PTE_VALID | PTE_AF)) == (PTE_VALID | PTE_AF))
>  #define pte_valid_user(pte) \
>  	((pte_val(pte) & (PTE_VALID | PTE_USER)) == (PTE_VALID | PTE_USER))
>  
> @@ -126,7 +124,7 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
>   * remapped as PROT_NONE but are yet to be flushed from the TLB.
>   */
>  #define pte_accessible(mm, pte)	\
> -	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> +	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
>  
>  /*
>   * p??_access_permitted() is true for valid user mappings (subject to the
> -- 
> 2.29.2.454.gaff20da3a2-goog
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 14:35 ` [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state Will Deacon
  2020-11-20 15:00   ` Peter Zijlstra
@ 2020-11-20 20:22   ` Yu Zhao
  2020-11-21  2:49     ` Yu Zhao
  1 sibling, 1 reply; 42+ messages in thread
From: Yu Zhao @ 2020-11-20 20:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> via the tlb_remove_*() functions. Consequently, the page-table modifications
> performed by clear_refs_write() in response to a write to
> /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> state we can end up with entries where pte_write() is false, yet a
> writable mapping remains in the TLB.

I don't think we need a TLB flush in this context, same reason as we
don't have one in copy_present_pte() which uses ptep_set_wrprotect()
to write-protect a src PTE.

ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
either the dirty bit is set (when a PTE is still writable) or a PF
happens (when a PTE has become r/o) when the h/w page table walker races
with the kernel modifying a PTE using the two APIs.

> Fix this by calling tlb_remove_tlb_entry() for each entry being
> write-protected when clearing soft-dirty.
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  fs/proc/task_mmu.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index cd03ab9087b0..3308292ee5c5 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1032,11 +1032,12 @@ enum clear_refs_types {
>  
>  struct clear_refs_private {
>  	enum clear_refs_types type;
> +	struct mmu_gather *tlb;
>  };
>  
>  #ifdef CONFIG_MEM_SOFT_DIRTY
>  static inline void clear_soft_dirty(struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *pte)
> +		unsigned long addr, pte_t *pte, struct mmu_gather *tlb)
>  {
>  	/*
>  	 * The soft-dirty tracker uses #PF-s to catch writes
> @@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  		ptent = pte_wrprotect(old_pte);
>  		ptent = pte_clear_soft_dirty(ptent);
>  		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
> +		tlb_remove_tlb_entry(tlb, pte, addr);
>  	} else if (is_swap_pte(ptent)) {
>  		ptent = pte_swp_clear_soft_dirty(ptent);
>  		set_pte_at(vma->vm_mm, addr, pte, ptent);
> @@ -1060,14 +1062,14 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  }
>  #else
>  static inline void clear_soft_dirty(struct vm_area_struct *vma,
> -		unsigned long addr, pte_t *pte)
> +		unsigned long addr, pte_t *pte, struct mmu_gather *tlb)
>  {
>  }
>  #endif
>  
>  #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> -		unsigned long addr, pmd_t *pmdp)
> +		unsigned long addr, pmd_t *pmdp, struct mmu_gather *tlb)
>  {
>  	pmd_t old, pmd = *pmdp;
>  
> @@ -1081,6 +1083,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>  
>  		pmd = pmd_wrprotect(pmd);
>  		pmd = pmd_clear_soft_dirty(pmd);
> +		tlb_remove_pmd_tlb_entry(tlb, pmdp, addr);
>  
>  		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
>  	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
> @@ -1090,7 +1093,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>  }
>  #else
>  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
> -		unsigned long addr, pmd_t *pmdp)
> +		unsigned long addr, pmd_t *pmdp, struct mmu_gather *tlb)
>  {
>  }
>  #endif
> @@ -1107,7 +1110,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  	ptl = pmd_trans_huge_lock(pmd, vma);
>  	if (ptl) {
>  		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
> -			clear_soft_dirty_pmd(vma, addr, pmd);
> +			clear_soft_dirty_pmd(vma, addr, pmd, cp->tlb);
>  			goto out;
>  		}
>  
> @@ -1133,7 +1136,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
>  		ptent = *pte;
>  
>  		if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
> -			clear_soft_dirty(vma, addr, pte);
> +			clear_soft_dirty(vma, addr, pte, cp->tlb);
>  			continue;
>  		}
>  
> @@ -1212,7 +1215,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  	if (mm) {
>  		struct mmu_notifier_range range;
>  		struct clear_refs_private cp = {
> -			.type = type,
> +			.type	= type,
> +			.tlb	= &tlb,
>  		};
>  
>  		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
> -- 
> 2.29.2.454.gaff20da3a2-goog
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-20 14:35 ` [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling Will Deacon
  2020-11-20 17:41   ` Linus Torvalds
@ 2020-11-20 20:40   ` Yu Zhao
  2020-11-23 18:35     ` Will Deacon
  2020-11-24 14:46     ` Peter Zijlstra
  1 sibling, 2 replies; 42+ messages in thread
From: Yu Zhao @ 2020-11-20 20:40 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> updating the page-tables for the current mm. However, since the mm is not
> being freed, this can result in stale TLB entries on architectures which
> elide 'fullmm' invalidation.
> 
> Ensure that TLB invalidation is performed after updating soft-dirty
> entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  fs/proc/task_mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index a76d339b5754..316af047f1aa 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
>  			count = -EINTR;
>  			goto out_mm;
>  		}
> -		tlb_gather_mmu_fullmm(&tlb, mm);
> +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);

Let's assume my reply to patch 4 is wrong, and therefore we still need
tlb_gather/finish_mmu() here. But then wouldn't this change deprive
architectures other than ARM of the opportunity to optimize based on the
fact it's a full-mm flush?

It seems to me ARM's interpretation of tlb->fullmm is a special case,
not the other way around.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 20:22   ` Yu Zhao
@ 2020-11-21  2:49     ` Yu Zhao
  2020-11-23 19:21       ` Yu Zhao
  2020-11-23 22:04       ` Will Deacon
  0 siblings, 2 replies; 42+ messages in thread
From: Yu Zhao @ 2020-11-21  2:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > performed by clear_refs_write() in response to a write to
> > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > state we can end up with entries where pte_write() is false, yet a
> > writable mapping remains in the TLB.
> 
> I don't think we need a TLB flush in this context, same reason as we
> don't have one in copy_present_pte() which uses ptep_set_wrprotect()
> to write-protect a src PTE.
> 
> ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
> either the dirty bit is set (when a PTE is still writable) or a PF
> happens (when a PTE has become r/o) when h/w page table walker races
> with kernel that modifies a PTE using the two APIs.

After we remove the writable bit, if we end up with a clean PTE, any
subsequent write will trigger a page fault. We can't have a stale
writable tlb entry. The architecture-specific APIs guarantee this.

If we end up with a dirty PTE, then yes, there will be a stale
writable tlb entry. But this won't be a problem because when we
write-protect a page (not PTE), we always check both pte_dirty()
and pte_write(), i.e., write_protect_page() and page_mkclean_one().
When they see this dirty PTE, they will flush. And generally, only
callers of pte_mkclean() should flush the TLB; otherwise we end up with one
extra flush if callers of pte_mkclean() and pte_wrprotect() both flush.
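
To illustrate the pattern described above, here is a rough sketch
(simplified; not the exact write_protect_page()/page_mkclean_one() code):

	/*
	 * The flush decision is taken when write-protecting the *page*,
	 * based on both bits of each pte mapping it: a stale writable TLB
	 * entry is only possible if the pte was writable or dirty.
	 */
	if (pte_write(*ptep) || pte_dirty(*ptep) || mm_tlb_flush_pending(mm))
		entry = ptep_clear_flush(vma, addr, ptep);	/* flushes */
	else
		entry = ptep_get_and_clear(mm, addr, ptep);	/* no flush */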

Now let's take a step back and see why we got
tlb_gather/finish_mmu() here in the first place. Commit b3a81d0841a95
("mm: fix KSM data corruption") explains the problem clearly. But
to fix a problem created by two threads clearing pte_write() and
pte_dirty() independently, we only need one of them to set
mm_tlb_flush_pending(). Given only removing the writable bit requires
tlb flush, that thread should be the one, as I just explained. Adding
tlb_gather/finish_mmu() is unnecessary in that fix. And there is no
point in having the original flush_tlb_mm() either, given data
integrity is already guaranteed. Of course, with it we have more
accurate access tracking.

Does a similar problem exist for page_mkclean_one()? Possibly. It
checks pte_dirty() and pte_write() but not mm_tlb_flush_pending().
At the moment, madvise_free_pte_range() only supports anonymous
memory, which doesn't do writeback. But the missing
mm_tlb_flush_pending() just seems to be an accident waiting to happen.
E.g., clean_record_pte() calls pte_mkclean() and does batched flush.
I don't know what it's for, but if it's called on file VMAs, a similar
race involving 4 CPUs can happen. This time CPU 1 runs
clean_record_pte() and CPU 3 runs page_mkclean_one().


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/6] arm64: pgtable: Fix pte_accessible()
  2020-11-20 14:35 ` [PATCH 1/6] arm64: pgtable: Fix pte_accessible() Will Deacon
  2020-11-20 16:03   ` Minchan Kim
  2020-11-20 19:53   ` Yu Zhao
@ 2020-11-23 13:27   ` Catalin Marinas
  2020-11-24 10:02   ` Anshuman Khandual
  3 siblings, 0 replies; 42+ messages in thread
From: Catalin Marinas @ 2020-11-23 13:27 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	linux-kernel, stable, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 02:35:52PM +0000, Will Deacon wrote:
> pte_accessible() is used by ptep_clear_flush() to figure out whether TLB
> invalidation is necessary when unmapping pages for reclaim. Although our
> implementation is correct according to the architecture, returning true
> only for valid, young ptes in the absence of racing page-table
> modifications, this is in fact flawed due to lazy invalidation of old
> ptes in ptep_clear_flush_young() where we elide the expensive DSB
> instruction for completing the TLB invalidation.
> 
> Rather than penalise the aging path, adjust pte_accessible() to return
> true for any valid pte, even if the access flag is cleared.
> 
> Cc: <stable@vger.kernel.org>
> Fixes: 76c714be0e5e ("arm64: pgtable: implement pte_accessible()")
> Reported-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: Will Deacon <will@kernel.org>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
  2020-11-20 14:35 ` [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() Will Deacon
  2020-11-20 17:09   ` Minchan Kim
@ 2020-11-23 14:22   ` Catalin Marinas
  1 sibling, 0 replies; 42+ messages in thread
From: Catalin Marinas @ 2020-11-23 14:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, kernel-team,
	linux-kernel, stable, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 02:35:53PM +0000, Will Deacon wrote:
> With hardware dirty bit management, calling pte_wrprotect() on a writable,
> dirty PTE will lose the dirty state and return a read-only, clean entry.

My assumption at the time was that the caller of pte_wrprotect() already
moved the 'dirty' information to the underlying page. Most
pte_wrprotect() calls also do a pte_mkclean(). However, that doesn't seem
to always be the case (soft-dirty is one example, though we don't support it yet).

I was worried that we may inadvertently set the dirty bit when doing a
pte_wrprotect() on a freshly created pte (not read from memory, for
example __split_huge_pmd_locked()) but I think all our __P* and __S*
attributes start with a PTE_RDONLY, therefore the pte_hw_dirty() returns
false. A test for mm/debug_vm_pgtable.c, something like:

	for (i = 0; i < ARRAY_SIZE(protection_map); i++) {
		pte = pfn_pte(pfn, protection_map[i]);
		WARN_ON(pte_dirty(pte_wrprotect(pte)));
	}

(I'll leave this to Anshuman ;))

> Move the logic from ptep_set_wrprotect() into pte_wrprotect() to ensure that
> the dirty bit is preserved for writable entries, as this is required for
> soft-dirty bit management if we enable it in the future.
> 
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Will Deacon <will@kernel.org>

I think this could go back as far as the hardware AF/DBM support (v4.3):

Fixes: 2f4b829c625e ("arm64: Add support for hardware updates of the access and dirty pte bits")

If you limit this fix to 4.14, you probably don't need additional
commits. Otherwise, at least this one:

3bbf7157ac66 ("arm64: Convert pte handling from inline asm to using (cmp)xchg")

and a slightly more intrusive:

73e86cb03cf2 ("arm64: Move PTE_RDONLY bit handling out of set_pte_at()")

We also had some attempts at fixing ptep_set_wrprotect():

64c26841b349 ("arm64: Ignore hardware dirty bit updates in ptep_set_wrprotect()")

Fixed subsequently by:

8781bcbc5e69 ("arm64: mm: Fix pte_mkclean, pte_mkdirty semantics")

I have a hope that at some point we'll understand how this all works ;).

For this patch:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect()
  2020-11-20 17:09   ` Minchan Kim
@ 2020-11-23 14:31     ` Catalin Marinas
  0 siblings, 0 replies; 42+ messages in thread
From: Catalin Marinas @ 2020-11-23 14:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Will Deacon,
	Linus Torvalds, linux-kernel, stable, linux-mm, kernel-team,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 09:09:03AM -0800, Minchan Kim wrote:
> On Fri, Nov 20, 2020 at 02:35:53PM +0000, Will Deacon wrote:
> > With hardware dirty bit management, calling pte_wrprotect() on a writable,
> > dirty PTE will lose the dirty state and return a read-only, clean entry.
> > 
> > Move the logic from ptep_set_wrprotect() into pte_wrprotect() to ensure that
> > the dirty bit is preserved for writable entries, as this is required for
> > soft-dirty bit management if we enable it in the future.
> > 
> > Cc: <stable@vger.kernel.org>
> 
> Is this stable material if it would only be a problem once ARM64 supports
> soft-dirty in the future?

I don't think so. Arm64 did not have a hardware dirty mechanism from the
start; it was added later, in a way that can coexist with other CPUs or
peripherals that don't support it. So instead of setting a PTE_DIRTY bit
as one would expect, the CPU clears PTE_RDONLY on a write access to a
writable PTE (one with the PTE_DBM/PTE_WRITE bit set). Our pte_wrprotect()
therefore needs to set PTE_RDONLY and clear PTE_DBM (PTE_WRITE), but
!PTE_RDONLY is our only record of a pte having been dirtied, so we have
to transfer it to a software PTE_DIRTY bit. This is different from a
soft-dirty pte bit if we add it in the future.
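
As a rough sketch, this is more or less what patch 2/6 ends up doing in
pte_wrprotect() (simplified here for illustration):

	static inline pte_t pte_wrprotect(pte_t pte)
	{
		/*
		 * The pte is hardware-dirty if DBM is set and PTE_RDONLY is
		 * clear; transfer that state into the software PTE_DIRTY bit
		 * before it is lost by making the entry read-only.
		 */
		if (pte_hw_dirty(pte))
			pte = pte_mkdirty(pte);

		pte = clear_pte_bit(pte, __pgprot(PTE_WRITE));
		pte = set_pte_bit(pte, __pgprot(PTE_RDONLY));
		return pte;
	}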

-- 
Catalin


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm()
  2020-11-20 17:31     ` Linus Torvalds
@ 2020-11-23 16:48       ` Will Deacon
  0 siblings, 0 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-23 16:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	Linux Kernel Mailing List, Linux-MM, Minchan Kim,
	Android Kernel Team, Linux ARM

On Fri, Nov 20, 2020 at 09:31:09AM -0800, Linus Torvalds wrote:
> Oh - wait.
> 
> Not ack.
> 
> Not because this is wrong, but because I think you should remove the
> start/end arguments here too.
> 
> The _only_ thing they were used for was that "fullmm" flag, afaik. So
> now they no longer make sense.
> 
> Hmm?

Oh nice, well spotted. I'll drop them for v2.

Cheers,

Will


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 3/6] tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()
  2020-11-20 17:20   ` Linus Torvalds
@ 2020-11-23 16:48     ` Will Deacon
  0 siblings, 0 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-23 16:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	Linux Kernel Mailing List, Linux-MM, Minchan Kim,
	Android Kernel Team, Linux ARM

On Fri, Nov 20, 2020 at 09:20:43AM -0800, Linus Torvalds wrote:
> On Fri, Nov 20, 2020 at 6:36 AM Will Deacon <will@kernel.org> wrote:
> >
> > tlb_finish_mmu() takes two confusing and unused 'start'/'end' address
> > arguments. Remove them.
> 
> Ack, but please add the history to it.
> 
> Those arguments were used, up until 7a30df49f63a ("mm: mmu_gather:
> remove __tlb_reset_range() for force flush").
> 
> And the thing is, using a range flush in theory might be better, but
> for simplicity it's now doing a "fullmm" one, which is why those
> arguments no longer matter.
> 
> (And I think we track the range better now too, which may be another
> reason they aren't needed)

Yes, thanks for digging that up. I'll add it to the commit message in v2.

Will


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [tlb]  e242a269fa: WARNING:at_mm/mmu_gather.c:#tlb_gather_mmu
       [not found]   ` <20201122151158.GK2390@xsang-OptiPlex-9020>
@ 2020-11-23 17:51     ` Will Deacon
  0 siblings, 0 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-23 17:51 UTC (permalink / raw)
  To: kernel test robot
  Cc: Yu Zhao, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	Linus Torvalds, linux-kernel, linux-mm, Minchan Kim, lkp,
	kernel-team, linux-arm-kernel, 0day robot

Hmm, this is interesting but my x86-fu is a bit lacking:

On Sun, Nov 22, 2020 at 11:11:58PM +0800, kernel test robot wrote:
> commit: e242a269fa4b7aee0b157ce5b1d7d12179fc3c44 ("[PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm()")
> url: https://github.com/0day-ci/linux/commits/Will-Deacon/tlb-Fix-access-and-soft-dirty-bit-management/20201120-223809
> base: https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git for-next/core

[...]

> [   14.182822] WARNING: CPU: 0 PID: 1 at mm/mmu_gather.c:293 tlb_gather_mmu+0x40/0x99

This fires because free_ldt_pgtables() initialises an mmu_gather() with
an end address > TASK_SIZE. In other words, this code:

	unsigned long start = LDT_BASE_ADDR;
	unsigned long end = LDT_END_ADDR;

	if (!boot_cpu_has(X86_FEATURE_PTI))
		return;

	tlb_gather_mmu(&tlb, mm, start, end);

seems to be passing kernel addresses to tlb_gather_mmu(), which will cause
the range adjustment logic in __tlb_adjust_range() to round the base down
to TASK_SIZE afaict. At which point, I suspect the low-level invalidation
routine replaces the enormous range with a fullmm flush (see the check in
flush_tlb_mm_range()).

If that's the case (and I would appreciate some input from somebody who
knows what an LDT is), then I think the right answer is to replace this with
a call to tlb_gather_mmu_fullmm(), although I had never anticipated these
things being used on kernel addresses, and I'm not sure whether that would
do the right kind of invalidation for x86 w/ PTI. A quick read of the code
suggests it should work out...
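
For illustration, a rough sketch of that suggestion, assuming the
mmu_gather API as it looks with patches 3 and 5 applied and that fullmm
invalidation is indeed sufficient for the LDT range:

	static void free_ldt_pgtables(struct mm_struct *mm)
	{
		struct mmu_gather tlb;
		unsigned long start = LDT_BASE_ADDR;
		unsigned long end = LDT_END_ADDR;

		if (!boot_cpu_has(X86_FEATURE_PTI))
			return;

		/* Kernel-address range: skip range tracking entirely. */
		tlb_gather_mmu_fullmm(&tlb, mm);
		free_pgd_range(&tlb, start, end, start, end);
		tlb_finish_mmu(&tlb);
	}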

Will


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 15:27       ` Peter Zijlstra
@ 2020-11-23 18:23         ` Will Deacon
  0 siblings, 0 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-23 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yu Zhao, Anshuman Khandual, Catalin Marinas, Linus Torvalds,
	linux-kernel, linux-mm, Minchan Kim, kernel-team,
	linux-arm-kernel

Hi Peter,

On Fri, Nov 20, 2020 at 04:27:31PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 03:15:24PM +0000, Will Deacon wrote:
> > On Fri, Nov 20, 2020 at 04:00:23PM +0100, Peter Zijlstra wrote:
> > > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > > performed by clear_refs_write() in response to a write to
> > > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > > state we can end up with entries where pte_write() is false, yet a
> > > > writable mapping remains in the TLB.
> > > > 
> > > > Fix this by calling tlb_remove_tlb_entry() for each entry being
> > > > write-protected when clearing soft-dirty.
> > > > 
> > > 
> > > > @@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> > > >  		ptent = pte_wrprotect(old_pte);
> > > >  		ptent = pte_clear_soft_dirty(ptent);
> > > >  		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
> > > > +		tlb_remove_tlb_entry(tlb, pte, addr);
> > > >  	} else if (is_swap_pte(ptent)) {
> > > >  		ptent = pte_swp_clear_soft_dirty(ptent);
> > > >  		set_pte_at(vma->vm_mm, addr, pte, ptent);
> > > 
> > > Oh!
> > > 
> > > Yesterday when you had me look at this code; I figured the sane thing
> > > to do was to make it look more like mprotect().
> > 
> > Ah, so you mean ditch the mmu_gather altogether?
> 
> Yes. Alternatively, if we decide mmu_gather is 'right', then we should
> probably look at converting mprotect().
> 
> That is, I see no reason why this and mprotect should differ on this
> point.

I agree that we should aim for consistency, but it's worth pointing out
that madvise() uses the gather API in the same way that I'm proposing
here (see MADV_COLD/MADV_PAGEOUT).

Another thing to keep in mind is that, unlike mprotect(), we do actually
want to elide the TLB invalidation in clear_refs_write() when all we're
doing is making the pages old. The gather API lends itself quite nicely
to this, as we can only update the range when actually doing the write
protection on the soft-dirty path.

> > > Why did you chose to make it work with mmu_gather instead? I'll grant
> > > you that it's probably the smaller patch, but I still think it's weird
> > > to use mmu_gather here.
> > > 
> > > Also, is tlb_remove_tlb_entry() actually correct? If you look at
> > > __tlb_remove_tlb_entry() you'll find that Power-Hash-32 will clear the
> > > entry, which might not be what we want here; we want to update the
> > > entry.
> > 
> > Hmm, I didn't spot that, although ptep_modify_prot_start() does actually
> > clear the pte so we could just move this up a few lines.
> 
> Yes, but hash-entry != pte. If I'm not mistaken (and I could very well
> be, it's Friday and Power-MMUs being the maze they are), the end result
> here is an updated PTE but an empty hash-entry.

I had a look at the PPC code and, afaict, this should be fine. The next
access will fault, and we'll populate the hash entry from the pte.

Am I missing something?

If we _really_ wanted to, then we could extend the mmu gather API to add
something like tlb_update_tlb_entry(), which would call
tlb_remove_tlb_entry() under the hood, and set a flag on the gather
structure so that tlb_finish_mmu() ends up calling update_mmu_cache() to
preload the hash.
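
Something along these lines, purely hypothetical (the name and the
->update_cache flag are made up, not an existing API):

	static inline void tlb_update_tlb_entry(struct mmu_gather *tlb,
						pte_t *ptep, unsigned long addr)
	{
		tlb_remove_tlb_entry(tlb, ptep, addr);
		tlb->update_cache = 1;	/* hypothetical flag */
	}

	/* ... with tlb_finish_mmu() calling update_mmu_cache() when set. */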

However, I think this is purely a performance thing, and I'm wary about
pro-actively extending the API to optimise for the PPC hash.

Will


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-20 20:40   ` Yu Zhao
@ 2020-11-23 18:35     ` Will Deacon
  2020-11-23 20:04       ` Yu Zhao
  2020-11-24 14:46     ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: Will Deacon @ 2020-11-23 18:35 UTC (permalink / raw)
  To: Yu Zhao
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > updating the page-tables for the current mm. However, since the mm is not
> > being freed, this can result in stale TLB entries on architectures which
> > elide 'fullmm' invalidation.
> > 
> > Ensure that TLB invalidation is performed after updating soft-dirty
> > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > 
> > Signed-off-by: Will Deacon <will@kernel.org>
> > ---
> >  fs/proc/task_mmu.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index a76d339b5754..316af047f1aa 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> >  			count = -EINTR;
> >  			goto out_mm;
> >  		}
> > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> 
> Let's assume my reply to patch 4 is wrong, and therefore we still need
> tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> architectures other than ARM of the opportunity to optimize based on the
> fact it's a full-mm flush?

Only for the soft-dirty case, but I think TLB invalidation is required
there because we are write-protecting the entries and I don't see any
mechanism to handle lazy invalidation for that (compared with the aging
case, which is handled via pte_accessible()).

Furthermore, if we decide that we can relax the TLB invalidation
requirements here, then I'd much rather that was done deliberately, rather
than as an accidental side-effect of another commit (since I think the
current behaviour was a consequence of 7a30df49f63a).

> It seems to me ARM's interpretation of tlb->fullmm is a special case,
> not the other way around.

Although I agree that this is subtle and error-prone (which is why I'm
trying to make the API more explicit here), it _is_ documented clearly
in asm-generic/tlb.h:

 *  - mmu_gather::fullmm
 *
 *    A flag set by tlb_gather_mmu() to indicate we're going to free
 *    the entire mm; this allows a number of optimizations.
 *
 *    - We can ignore tlb_{start,end}_vma(); because we don't
 *      care about ranges. Everything will be shot down.
 *
 *    - (RISC) architectures that use ASIDs can cycle to a new ASID
 *      and delay the invalidation until ASID space runs out.

Will


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-20 15:55     ` Minchan Kim
@ 2020-11-23 18:41       ` Will Deacon
  2020-11-25 22:51         ` Minchan Kim
  0 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2020-11-23 18:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: kernel-team, Yu Zhao, Anshuman Khandual, Peter Zijlstra,
	Catalin Marinas, linux-kernel, linux-mm, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 07:55:14AM -0800, Minchan Kim wrote:
> On Fri, Nov 20, 2020 at 04:00:23PM +0100, Peter Zijlstra wrote:
> > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > performed by clear_refs_write() in response to a write to
> > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > state we can end up with entries where pte_write() is false, yet a
> > > writable mapping remains in the TLB.
> > > 
> > > Fix this by calling tlb_remove_tlb_entry() for each entry being
> > > write-protected when clearing soft-dirty.
> > > 
> > 
> > > @@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> > >  		ptent = pte_wrprotect(old_pte);
> > >  		ptent = pte_clear_soft_dirty(ptent);
> > >  		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
> > > +		tlb_remove_tlb_entry(tlb, pte, addr);
> > >  	} else if (is_swap_pte(ptent)) {
> > >  		ptent = pte_swp_clear_soft_dirty(ptent);
> > >  		set_pte_at(vma->vm_mm, addr, pte, ptent);
> > 
> > Oh!
> > 
> > Yesterday when you had me look at this code; I figured the sane thing
> > to do was to make it look more like mprotect().
> > 
> > Why did you chose to make it work with mmu_gather instead? I'll grant
> > you that it's probably the smaller patch, but I still think it's weird
> > to use mmu_gather here.
> 
> I agree. The reason why clear_refs_write() used the gather API was [1],
> and it seems like overkill to me.

I don't see why it's overkill. Prior to that commit, it called
flush_tlb_mm() directly.

> We could just do something like [inc|dec]_tlb_flush_pending() with a
> flush_tlb_mm() right before dec_tlb_flush_pending(), instead of the gather.
>
> Thoughts?

I'm not sure why this is better; it's different to the madvise() path, and
will need special logic to avoid the flush in the case where we're just
doing aging.
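
For reference, a rough sketch of that alternative in clear_refs_write()
(hedged; clear_refs_walk_ops and cp are the existing names there), with
the extra logic needed to skip the flush for the aging case:

	inc_tlb_flush_pending(mm);
	walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops, &cp);
	if (type == CLEAR_REFS_SOFT_DIRTY)
		flush_tlb_mm(mm);
	dec_tlb_flush_pending(mm);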

Will

> [1] b3a81d0841a95, mm: fix KSM data corruption


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-21  2:49     ` Yu Zhao
@ 2020-11-23 19:21       ` Yu Zhao
  2020-11-23 22:04       ` Will Deacon
  1 sibling, 0 replies; 42+ messages in thread
From: Yu Zhao @ 2020-11-23 19:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 07:49:22PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote:
> > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > performed by clear_refs_write() in response to a write to
> > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > state we can end up with entries where pte_write() is false, yet a
> > > writable mapping remains in the TLB.

I double checked my conclusion and I think it holds. But let me
correct some typos and add a summary.

> > I don't think we need a TLB flush in this context, same reason as we
                                ^^^^^ gather

> > don't have one in copy_present_pte() which uses ptep_set_wrprotect()
> > to write-protect a src PTE.
> > 
> > ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
> > either the dirty bit is set (when a PTE is still writable) or a PF
> > happens (when a PTE has become r/o) when h/w page table walker races
> > with kernel that modifies a PTE using the two APIs.
> 
> After we remove the writable bit, if we end up with a clean PTE, any
> subsequent write will trigger a page fault. We can't have a stale
> writable tlb entry. The architecture-specific APIs guarantee this.
> 
> If we end up with a dirty PTE, then yes, there will be a stale
> writable tlb entry. But this won't be a problem because when we
> write-protect a page (not PTE), we always check both pte_dirty()
> and pte_write(), i.e., write_protect_page() and page_mkclean_one().
> When they see this dirty PTE, they will flush. And generally, only
> callers of pte_mkclean() should flush the TLB; otherwise we end up with one
> extra flush if callers of pte_mkclean() and pte_wrprotect() both flush.
> 
> Now let's take a step back and see why we got
> tlb_gather/finish_mmu() here in the first place. Commit b3a81d0841a95
> ("mm: fix KSM data corruption") explains the problem clearly. But
> to fix a problem created by two threads clearing pte_write() and
> pte_dirty() independently, we only need one of them to set
> mm_tlb_flush_pending(). Given only removing the writable bit requires
                                                  ^^^^^^^^ dirty

> tlb flush, that thread should be the one, as I just explained. Adding
> tlb_gather/finish_mmu() is unnecessary in that fix. And there is no
> point in having the original flush_tlb_mm() either, given data
> integrity is already guaranteed.
(i.e., writable tlb entries are flushed when removing the dirty bit.)

> Of course, with it we have more accurate access tracking.
> 
> Does a similar problem exist for page_mkclean_one()? Possibly. It
> checks pte_dirty() and pte_write() but not mm_tlb_flush_pending().
> At the moment, madvise_free_pte_range() only supports anonymous
> memory, which doesn't do writeback. But the missing
> mm_tlb_flush_pending() just seems to be an accident waiting to happen.
> E.g., clean_record_pte() calls pte_mkclean() and does batched flush.
> I don't know what it's for, but if it's called on file VMAs, a similar
> race involving 4 CPUs can happen. This time CPU 1 runs
> clean_record_pte() and CPU 3 runs page_mkclean_one().

To summarize, IMO, we should 1) remove tlb_gather/finish_mmu() here;
2) check mm_tlb_flush_pending() in page_mkclean_one() and
dax_entry_mkclean().
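
A rough sketch of point 2, using page_mkclean_one() as the example (not a
real patch; the check could be folded in differently):

	if (pte_dirty(*pte) || pte_write(*pte) ||
	    mm_tlb_flush_pending(vma->vm_mm)) {
		pte_t entry;

		flush_cache_page(vma, address, pte_pfn(*pte));
		entry = ptep_clear_flush(vma, address, pte);	/* flush */
		entry = pte_wrprotect(entry);
		entry = pte_mkclean(entry);
		set_pte_at(vma->vm_mm, address, pte, entry);
	}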


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-23 18:35     ` Will Deacon
@ 2020-11-23 20:04       ` Yu Zhao
  2020-11-23 21:17         ` Will Deacon
  0 siblings, 1 reply; 42+ messages in thread
From: Yu Zhao @ 2020-11-23 20:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > updating the page-tables for the current mm. However, since the mm is not
> > > being freed, this can result in stale TLB entries on architectures which
> > > elide 'fullmm' invalidation.
> > > 
> > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > 
> > > Signed-off-by: Will Deacon <will@kernel.org>
> > > ---
> > >  fs/proc/task_mmu.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index a76d339b5754..316af047f1aa 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > >  			count = -EINTR;
> > >  			goto out_mm;
> > >  		}
> > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > 
> > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > architectures other than ARM of the opportunity to optimize based on the
> > fact it's a full-mm flush?

I double checked my conclusion on patch 4, and aside from a couple
of typos, it still seems correct after the weekend.

> Only for the soft-dirty case, but I think TLB invalidation is required
> there because we are write-protecting the entries and I don't see any
> mechanism to handle lazy invalidation for that (compared with the aging
> case, which is handled via pte_accessible()).

The lazy invalidation for that is done when we write-protect a page,
not an individual PTE. When we do so, our decision is based on both
the dirty bit and the writable bit on each PTE mapping this page. So
we only need to make sure we don't lose both on a PTE. And we don't
here.

> > Furthermore, if we decide that we can relax the TLB invalidation
> > requirements here, then I'd much rather that was done deliberately, rather
> than as an accidental side-effect of another commit (since I think the
> current behaviour was a consequence of 7a30df49f63a).

Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
("mm: fix KSM data corruption") in the first place.

> > It seems to me ARM's interpretation of tlb->fullmm is a special case,
> > not the other way around.
> 
> Although I agree that this is subtle and error-prone (which is why I'm
> trying to make the API more explicit here), it _is_ documented clearly
> in asm-generic/tlb.h:
> 
>  *  - mmu_gather::fullmm
>  *
>  *    A flag set by tlb_gather_mmu() to indicate we're going to free
>  *    the entire mm; this allows a number of optimizations.
>  *
>  *    - We can ignore tlb_{start,end}_vma(); because we don't
>  *      care about ranges. Everything will be shot down.
>  *
>  *    - (RISC) architectures that use ASIDs can cycle to a new ASID
>  *      and delay the invalidation until ASID space runs out.

I'd leave the original tlb_gather/finish_mmu() for the first case and
add a new API for the second case, the special case that only applies
to exit_mmap(). This way we won't change any existing behaviors on
other architectures, which seems important to me.

Additional cleanups to tlb_gather/finish_mmu() come thereafter.
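
Concretely, something like this split (a sketch; the fullmm variant is
what patch 5/6 introduces):

	/* exit_mmap(): the whole address space is going away */
	tlb_gather_mmu_fullmm(&tlb, mm);

	/* everywhere else, including clear_refs_write(): ranged gather */
	tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);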


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-23 20:04       ` Yu Zhao
@ 2020-11-23 21:17         ` Will Deacon
  2020-11-24  1:13           ` Yu Zhao
  0 siblings, 1 reply; 42+ messages in thread
From: Will Deacon @ 2020-11-23 21:17 UTC (permalink / raw)
  To: Yu Zhao
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > updating the page-tables for the current mm. However, since the mm is not
> > > > being freed, this can result in stale TLB entries on architectures which
> > > > elide 'fullmm' invalidation.
> > > > 
> > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > 
> > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > ---
> > > >  fs/proc/task_mmu.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index a76d339b5754..316af047f1aa 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > >  			count = -EINTR;
> > > >  			goto out_mm;
> > > >  		}
> > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > 
> > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > architectures other than ARM of the opportunity to optimize based on the
> > > fact it's a full-mm flush?
> 
> I double checked my conclusion on patch 4, and aside from a couple
> of typos, it still seems correct after the weekend.

I still need to digest that, but I would prefer that we restore the
invalidation first, and then have a subsequent commit to relax it. I find
it hard to believe that the behaviour in mainline at the moment is deliberate.

That is, I'm not against optimising this, but I'd rather get it "obviously
correct" first and the current code is definitely not that.

> > Only for the soft-dirty case, but I think TLB invalidation is required
> > there because we are write-protecting the entries and I don't see any
> > mechanism to handle lazy invalidation for that (compared with the aging
> > case, which is handled via pte_accessible()).
> 
> The lazy invalidation for that is done when we write-protect a page,
> not an individual PTE. When we do so, our decision is based on both
> the dirty bit and the writable bit on each PTE mapping this page. So
> we only need to make sure we don't lose both on a PTE. And we don't
> here.

Sorry, I don't follow what you're getting at here (page vs pte). Please can
you point me to the code you're referring to? The case I'm worried about is
code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
where !pte_write() and assuming (despite pte_dirty()) that there can't be
any concurrent modifications to the mapped page. Granted, I haven't found
anything doing that, but I could not convince myself that it would be a bug
to write such code, either.
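
Something like this hypothetical caller, made up purely for illustration
(the helper does not exist):

	spin_lock(ptl);
	if (!pte_write(*ptep))
		/* assumes no concurrent writers, despite pte_dirty() */
		do_something_assuming_page_is_stable(page);
	spin_unlock(ptl);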

> > > Furthermore, if we decide that we can relax the TLB invalidation
> > > requirements here, then I'd much rather that was done deliberately, rather
> > than as an accidental side-effect of another commit (since I think the
> > current behaviour was a consequence of 7a30df49f63a).
> 
> Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
> ("mm: fix KSM data corruption") in the first place.

Sure, but if you check out b3a81d0841a9 then you have a fullmm TLB
invalidation in tlb_finish_mmu(). 7a30df49f63a is what removed that, no?

Will


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-21  2:49     ` Yu Zhao
  2020-11-23 19:21       ` Yu Zhao
@ 2020-11-23 22:04       ` Will Deacon
  1 sibling, 0 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-23 22:04 UTC (permalink / raw)
  To: Yu Zhao
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 07:49:22PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote:
> > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > performed by clear_refs_write() in response to a write to
> > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > state we can end up with entries where pte_write() is false, yet a
> > > writable mapping remains in the TLB.
> > 
> > I don't think we need a TLB flush in this context, same reason as we
> > don't have one in copy_present_pte() which uses ptep_set_wrprotect()
> > to write-protect a src PTE.

Hmm. Afaict, copy_present_pte() is only called on the fork() path when
VM_WIPEONFORK is set. I think that's a bit different to the fault case,
and even then, there is a fullmm flush after the copy.

> > ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
> > either the dirty bit is set (when a PTE is still writable) or a PF
> > happens (when a PTE has become r/o) when h/w page table walker races
> > with kernel that modifies a PTE using the two APIs.
> 
> After we remove the writable bit, if we end up with a clean PTE, any
> subsequent write will trigger a page fault. We can't have a stale
> writable tlb entry. The architecture-specific APIs guarantee this.
> 
> If we end up with a dirty PTE, then yes, there will be a stale
> writable tlb entry. But this won't be a problem because when we
> write-protect a page (not PTE), we always check both pte_dirty()
> and pte_write(), i.e., write_protect_page() and page_mkclean_one().
> When they see this dirty PTE, they will flush. And generally, only
> > callers of pte_mkclean() should flush the TLB; otherwise we end up with one
> > extra flush if callers of pte_mkclean() and pte_wrprotect() both flush.

I just find this sort of analysis incredibly fragile: we're justifying the
lack of TLB invalidation on a case-by-case basis rather than some general
rules that mean it is not required by construction. Even if all current
users don't need it, what means that will still be true in six months time?
It's not like this stuff is easy to trigger in practice if we get it wrong.

Will


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-23 21:17         ` Will Deacon
@ 2020-11-24  1:13           ` Yu Zhao
  2020-11-24 14:31             ` Will Deacon
  2020-11-25 22:01             ` Minchan Kim
  0 siblings, 2 replies; 42+ messages in thread
From: Yu Zhao @ 2020-11-24  1:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Mon, Nov 23, 2020 at 09:17:51PM +0000, Will Deacon wrote:
> On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> > On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > > updating the page-tables for the current mm. However, since the mm is not
> > > > > being freed, this can result in stale TLB entries on architectures which
> > > > > elide 'fullmm' invalidation.
> > > > > 
> > > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > > 
> > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > ---
> > > > >  fs/proc/task_mmu.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index a76d339b5754..316af047f1aa 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > > >  			count = -EINTR;
> > > > >  			goto out_mm;
> > > > >  		}
> > > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > > 
> > > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > > architectures other than ARM of the opportunity to optimize based on the
> > > > fact it's a full-mm flush?
> > 
> > I double checked my conclusion on patch 4, and aside from a couple
> > of typos, it still seems correct after the weekend.
> 
> I still need to digest that, but I would prefer that we restore the
> invalidation first, and then have a subsequent commit to relax it. I find
> it hard to believe that the behaviour in mainline at the moment is deliberate.
> 
> That is, I'm not against optimising this, but I'd rather get it "obviously
> correct" first and the current code is definitely not that.

I wouldn't mind having this patch and patch 4 if the invalidation they
restore were in a correct state -- b3a81d0841a9 ("mm: fix KSM data
corruption") isn't correct to start with.

It is complicated, so please bear with me. Let's study this by looking
at examples this time.

> > > Only for the soft-dirty case, but I think TLB invalidation is required
> > > there because we are write-protecting the entries and I don't see any
> > > mechanism to handle lazy invalidation for that (compared with the aging
> > > case, which is handled via pte_accessible()).
> > 
> > The lazy invalidation for that is done when we write-protect a page,
> > not an individual PTE. When we do so, our decision is based on both
> > the dirty bit and the writable bit on each PTE mapping this page. So
> > we only need to make sure we don't lose both on a PTE. And we don't
> > here.
> 
> Sorry, I don't follow what you're getting at here (page vs pte). Please can
> you point me to the code you're referring to? The case I'm worried about is
> code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
> where !pte_write() and assuming (despite pte_dirty()) that there can't be
> any concurrent modifications to the mapped page. Granted, I haven't found
> anything doing that, but I could not convince myself that it would be a bug
> to write such code, either.

Example 1: memory corruption is still possible with patch 4 & 6

  CPU0        CPU1        CPU2        CPU3
  ----        ----        ----        ----
  userspace                           page writeback

  [cache writable
   PTE in TLB]

              inc_tlb_flush_pending()
              clean_record_pte()
              pte_mkclean()

                          tlb_gather_mmu()
                          [set mm_tlb_flush_pending()]
                          clear_refs_write()
                          pte_wrprotect()

                                      page_mkclean_one()
                                      !pte_dirty() && !pte_write()
                                      [true, no flush]

                                      write page to disk

  Write to page
  [using stale PTE]

                                      drop clean page
                                      [data integrity compromised]

              flush_tlb_range()

                          tlb_finish_mmu()
                          [flush (with patch 4)]

Example 2: why the lack of a flush when write-protecting is not a problem
(after we fix the problem correctly by adding the mm_tlb_flush_pending() check).

Case a:

  CPU0        CPU1        CPU2        CPU3
  ----        ----        ----        ----
  userspace                           page writeback

  [cache writable
   PTE in TLB]

              inc_tlb_flush_pending()
              clean_record_pte()
              pte_mkclean()

                          clear_refs_write()
                          pte_wrprotect()

                                      page_mkclean_one()
                                      !pte_dirty() && !pte_write() &&
                                      !mm_tlb_flush_pending()
                                      [false: flush]

                                      write page to disk

  Write to page
  [page fault]

                                      drop clean page
                                      [data integrity guaranteed]

              flush_tlb_range()

Case b:

  CPU0        CPU1        CPU2
  ----        ----        ----
  userspace               page writeback

  [cache writable
   PTE in TLB]

              clear_refs_write()
              pte_wrprotect()
              [pte_dirty() is false]

                          page_mkclean_one()
                          !pte_dirty() && !pte_write() &&
                          !mm_tlb_flush_pending()
                          [true: no flush]

                          write page to disk

  Write to page
  [h/w tries to set
   the dirty bit
   but sees write-
   protected PTE,
   page fault]

                          drop clean page
                          [data integrity guaranteed]

Case c:

  CPU0        CPU1        CPU2
  ----        ----        ----
  userspace               page writeback

  [cache writable
   PTE in TLB]

              clear_refs_write()
              pte_wrprotect()
              [pte_dirty() is true]

                          page_mkclean_one()
                          !pte_dirty() && !pte_write() &&
                          !mm_tlb_flush_pending()
                          [false: flush]

                          write page to disk

  Write to page
  [page fault]

                          drop clean page
                          [data integrity guaranteed]
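
In code form, the decision the cases above rely on is roughly the
following (sketch only: the helper name is made up, while pte_dirty(),
pte_write() and mm_tlb_flush_pending() are the real helpers):

static bool can_skip_wb_flush(struct mm_struct *mm, pte_t pte)
{
	/*
	 * Skipping the flush before writeback is only safe when the
	 * PTE is clean, already write-protected, and nobody has
	 * advertised a pending batched TLB flush for this mm.
	 */
	return !pte_dirty(pte) && !pte_write(pte) &&
	       !mm_tlb_flush_pending(mm);
}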

> > > Furthermore, If we decide that we can relax the TLB invalidation
> > > requirements here, then I'd much rather than was done deliberately, rather
> > > than as an accidental side-effect of another commit (since I think the
> > > current behaviour was a consequence of 7a30df49f63a).
> > 
> > Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
                                  ^^^^^^ shouldn't

Another typo, I apologize.

> > ("mm: fix KSM data corruption") in the first place.
> 
> Sure, but if you check out b3a81d0841a9 then you have a fullmm TLB
> invalidation in tlb_finish_mmu(). 7a30df49f63a is what removed that, no?
> 
> Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/6] arm64: pgtable: Fix pte_accessible()
  2020-11-20 14:35 ` [PATCH 1/6] arm64: pgtable: Fix pte_accessible() Will Deacon
                     ` (2 preceding siblings ...)
  2020-11-23 13:27   ` Catalin Marinas
@ 2020-11-24 10:02   ` Anshuman Khandual
  3 siblings, 0 replies; 42+ messages in thread
From: Anshuman Khandual @ 2020-11-24 10:02 UTC (permalink / raw)
  To: Will Deacon, linux-kernel
  Cc: kernel-team, Yu Zhao, linux-mm, Peter Zijlstra, Catalin Marinas,
	stable, Minchan Kim, Linus Torvalds, linux-arm-kernel


On 11/20/20 8:05 PM, Will Deacon wrote:
> pte_accessible() is used by ptep_clear_flush() to figure out whether TLB
> invalidation is necessary when unmapping pages for reclaim. Although our
> implementation is correct according to the architecture, returning true
> only for valid, young ptes in the absence of racing page-table

Just curious: per the architecture, can a PTE mapping be cached in the TLB
only if it is a young one, i.e. one with the PTE_AF bit set?

> modifications, this is in fact flawed due to lazy invalidation of old
> ptes in ptep_clear_flush_young() where we elide the expensive DSB
> instruction for completing the TLB invalidation.

IOW, an old PTE might have missed the required TLB invalidation via
ptep_clear_flush_young() because that path invalidates lazily. Hence old but
valid PTEs are now included in pte_accessible(), so that the TLB invalidation
can be completed in the ptep_clear_flush() path instead. Alternatively, maybe
the TLB flush could be completed for every PTE in ptep_clear_flush_young(),
irrespective of its PTE_AF bit.
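
For reference, the lazy path described here looks roughly like this on
arm64 (a from-memory paraphrase of the arch code around this kernel
version, so treat it as a sketch rather than a verbatim copy):

static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
					 unsigned long address, pte_t *ptep)
{
	int young = ptep_test_and_clear_young(vma, address, ptep);

	if (young) {
		/*
		 * The TLBI is issued but the trailing DSB is elided, so
		 * another CPU may keep using the old entry until the next
		 * context switch completes the invalidation.
		 */
		flush_tlb_page_nosync(vma, address);
	}

	return young;
}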

> 
> Rather than penalise the aging path, adjust pte_accessible() to return
> true for any valid pte, even if the access flag is cleared.

But won't this cause more (possibly unnecessary) TLB invalidation in the
normal unmapping paths? The cover letter mentions that this patch fixes a
real-world crash. Shouldn't the crash also be described here in the commit
message, given that this patch is marked for stable and carries a
"Fixes: " tag?

> 
> Cc: <stable@vger.kernel.org>
> Fixes: 76c714be0e5e ("arm64: pgtable: implement pte_accessible()")
> Reported-by: Yu Zhao <yuzhao@google.com>
> Signed-off-by: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/include/asm/pgtable.h | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 4ff12a7adcfd..1bdf51f01e73 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -115,8 +115,6 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
>  #define pte_valid(pte)		(!!(pte_val(pte) & PTE_VALID))
>  #define pte_valid_not_user(pte) \
>  	((pte_val(pte) & (PTE_VALID | PTE_USER)) == PTE_VALID)
> -#define pte_valid_young(pte) \
> -	((pte_val(pte) & (PTE_VALID | PTE_AF)) == (PTE_VALID | PTE_AF))
>  #define pte_valid_user(pte) \
>  	((pte_val(pte) & (PTE_VALID | PTE_USER)) == (PTE_VALID | PTE_USER))
>  
> @@ -126,7 +124,7 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
>   * remapped as PROT_NONE but are yet to be flushed from the TLB.
>   */
>  #define pte_accessible(mm, pte)	\
> -	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid_young(pte))
> +	(mm_tlb_flush_pending(mm) ? pte_present(pte) : pte_valid(pte))
>  
>  /*
>   * p??_access_permitted() is true for valid user mappings (subject to the
> 
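
For context on the question above about extra invalidation in the normal
unmapping paths: the consumer named in the commit message is the generic
ptep_clear_flush(), which only flushes when pte_accessible() says so.
Roughly (a paraphrase of mm/pgtable-generic.c, not a verbatim copy):

pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
		       pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t pte;

	pte = ptep_get_and_clear(mm, address, ptep);
	if (pte_accessible(mm, pte))	/* with this patch: any valid pte */
		flush_tlb_page(vma, address);

	return pte;
}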

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-24  1:13           ` Yu Zhao
@ 2020-11-24 14:31             ` Will Deacon
  2020-11-25 22:01             ` Minchan Kim
  1 sibling, 0 replies; 42+ messages in thread
From: Will Deacon @ 2020-11-24 14:31 UTC (permalink / raw)
  To: Yu Zhao
  Cc: kernel-team, Anshuman Khandual, Peter Zijlstra, Catalin Marinas,
	linux-kernel, linux-mm, Minchan Kim, Linus Torvalds,
	linux-arm-kernel

On Mon, Nov 23, 2020 at 06:13:34PM -0700, Yu Zhao wrote:
> On Mon, Nov 23, 2020 at 09:17:51PM +0000, Will Deacon wrote:
> > On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> > > On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > > > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > > > updating the page-tables for the current mm. However, since the mm is not
> > > > > > being freed, this can result in stale TLB entries on architectures which
> > > > > > elide 'fullmm' invalidation.
> > > > > > 
> > > > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > > > 
> > > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > > ---
> > > > > >  fs/proc/task_mmu.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index a76d339b5754..316af047f1aa 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > > > >  			count = -EINTR;
> > > > > >  			goto out_mm;
> > > > > >  		}
> > > > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > > > 
> > > > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > > > architectures other than ARM the opportunity to optimize based on the
> > > > > fact it's a full-mm flush?
> > > 
> > > I double checked my conclusion on patch 4, and aside from a couple
> > > of typos, it still seems correct after the weekend.
> > 
> > I still need to digest that, but I would prefer that we restore the
> > invalidation first, and then have a subsequent commit to relax it. I find
> > it hard to believe that the behaviour in mainline at the moment is deliberate.
> > 
> > That is, I'm not against optimising this, but I'd rather get it "obviously
> > correct" first and the current code is definitely not that.
> 
> I wouldn't mind having this patch and patch 4 if the invalidation they
> restore were in a correct state -- b3a81d0841a9 ("mm: fix KSM data
> corruption") isn't correct to start with.
> 
> It is complicated, so please bear with me. Let's study this by looking
> at examples this time.

Thanks for putting these together. If you're right, then it looks like it's
even worse than I thought :(

> > > > Only for the soft-dirty case, but I think TLB invalidation is required
> > > > there because we are write-protecting the entries and I don't see any
> > > > mechanism to handle lazy invalidation for that (compared with the aging
> > > > case, which is handled via pte_accessible()).
> > > 
> > > The lazy invalidation for that is done when we write-protect a page,
> > > not an individual PTE. When we do so, our decision is based on both
> > > the dirty bit and the writable bit on each PTE mapping this page. So
> > > we only need to make sure we don't lose both on a PTE. And we don't
> > > here.
> > 
> > Sorry, I don't follow what you're getting at here (page vs pte). Please can
> > you point me to the code you're referring to? The case I'm worried about is
> > code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
> > where !pte_write() and assuming (despite pte_dirty()) that there can't be
> > any concurrent modifications to the mapped page. Granted, I haven't found
> > anything doing that, but I could not convince myself that it would be a bug
> > to write such code, either.
> 
> Example 1: memory corruption is still possible with patch 4 & 6
> 
>   CPU0        CPU1        CPU2        CPU3
>   ----        ----        ----        ----
>   userspace                           page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               inc_tlb_flush_pending()
>               clean_record_pte()
>               pte_mkclean()

This path:      ^^^^^ looks a bit weird to me and I _think_ only happens
via some vmware DRM driver (i.e. the only caller of
clean_record_shared_mapping_range()). Are you sure that's operating on
pages that can be reclaimed? I have a feeling it might all be pinned.

>                           tlb_gather_mmu()
>                           [set mm_tlb_flush_pending()]
>                           clear_refs_write()
>                           pte_wrprotect()
> 
>                                       page_mkclean_one()
>                                       !pte_dirty() && !pte_write()
>                                       [true, no flush]
> 
>                                       write page to disk
> 
>   Write to page
>   [using stale PTE]
> 
>                                       drop clean page
>                                       [data integrity compromised]
> 
>               flush_tlb_range()
> 
>                           tlb_finish_mmu()
>                           [flush (with patch 4)]

Setting my earlier comment aside, I think a useful observation here
is that even with correct TLB invalidation, there is still a window
between modifying the page-table and flushing the TLB where another CPU
could see the updated page-table and incorrectly elide a flush. In these
cases we need to rely either on locking or use of tlb_flush_pending() to
ensure the correct behaviour.
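
As a sketch of that contract on the updater side (illustrative only: the
function below is made up, while inc/dec_tlb_flush_pending() and
flush_tlb_range() are the real helpers):

static void wrprotect_range_sketch(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;

	inc_tlb_flush_pending(mm);	/* advertise the flush before any change */

	/* ... walk [start, end) and pte_wrprotect() the present entries ... */

	flush_tlb_range(vma, start, end);
	dec_tlb_flush_pending(mm);	/* only once the flush has completed */
}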

> Example 2: why no flush when write-protecting is not a problem (after
> we fix the problem correctly by adding mm_tlb_flush_pending()).

So here you add an mm_tlb_flush_pending() check to the reclaim path
to resolve the race above.

> Case a:
> 
>   CPU0        CPU1        CPU2        CPU3
>   ----        ----        ----        ----
>   userspace                           page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               inc_tlb_flush_pending()
>               clean_record_pte()
>               pte_mkclean()
> 
>                           clear_refs_write()
>                           pte_wrprotect()
> 
>                                       page_mkclean_one()
>                                       !pte_dirty() && !pte_write() &&
>                                       !mm_tlb_flush_pending()
>                                       [false: flush]
> 
>                                       write page to disk
> 
>   Write to page
>   [page fault]
> 
>                                       drop clean page
>                                       [data integrity guaranteed]
> 
>               flush_tlb_range()
> 
> Case b:
> 
>   CPU0        CPU1        CPU2
>   ----        ----        ----
>   userspace               page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               clear_refs_write()
>               pte_wrprotect()
>               [pte_dirty() is false]
> 
>                           page_mkclean_one()
>                           !pte_dirty() && !pte_write() &&
>                           !mm_tlb_flush_pending()
>                           [true: no flush]
> 
>                           write page to disk
> 
>   Write to page
>   [h/w tries to set
>    the dirty bit
>    but sees write-
>    protected PTE,
>    page fault]

I agree with you for this example, but I think if the page writeback ran
on CPU 1 after clear_refs_write() then we could have a problem: the updated
pte could sit in the store buffer of CPU1 and the walker on CPU0 would
be able to set the dirty bit. TLB invalidation in clear_refs_write()
would prevent that.

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-20 20:40   ` Yu Zhao
  2020-11-23 18:35     ` Will Deacon
@ 2020-11-24 14:46     ` Peter Zijlstra
  1 sibling, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2020-11-24 14:46 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Anshuman Khandual, Will Deacon, Linus Torvalds, linux-kernel,
	linux-mm, Minchan Kim, Catalin Marinas, kernel-team,
	linux-arm-kernel

On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:

> It seems to me ARM's interpretation of tlb->fullmm is a special case,
> not the other way around.

I don't think ARM is special here, IIRC there were more architectures
that did that.


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling
  2020-11-24  1:13           ` Yu Zhao
  2020-11-24 14:31             ` Will Deacon
@ 2020-11-25 22:01             ` Minchan Kim
  1 sibling, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2020-11-25 22:01 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrea Arcangeli, Will Deacon, Anshuman Khandual, Peter Zijlstra,
	Linus Torvalds, Hugh Dickins, linux-kernel, linux-mm, Nadav Amit,
	Mel Gorman, Catalin Marinas, kernel-team, linux-arm-kernel

On Mon, Nov 23, 2020 at 06:13:34PM -0700, Yu Zhao wrote:
> On Mon, Nov 23, 2020 at 09:17:51PM +0000, Will Deacon wrote:
> > On Mon, Nov 23, 2020 at 01:04:03PM -0700, Yu Zhao wrote:
> > > On Mon, Nov 23, 2020 at 06:35:55PM +0000, Will Deacon wrote:
> > > > On Fri, Nov 20, 2020 at 01:40:05PM -0700, Yu Zhao wrote:
> > > > > On Fri, Nov 20, 2020 at 02:35:57PM +0000, Will Deacon wrote:
> > > > > > clear_refs_write() uses the 'fullmm' API for invalidating TLBs after
> > > > > > updating the page-tables for the current mm. However, since the mm is not
> > > > > > being freed, this can result in stale TLB entries on architectures which
> > > > > > elide 'fullmm' invalidation.
> > > > > > 
> > > > > > Ensure that TLB invalidation is performed after updating soft-dirty
> > > > > > entries via clear_refs_write() by using the non-fullmm API to MMU gather.
> > > > > > 
> > > > > > Signed-off-by: Will Deacon <will@kernel.org>
> > > > > > ---
> > > > > >  fs/proc/task_mmu.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index a76d339b5754..316af047f1aa 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > > @@ -1238,7 +1238,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> > > > > >  			count = -EINTR;
> > > > > >  			goto out_mm;
> > > > > >  		}
> > > > > > -		tlb_gather_mmu_fullmm(&tlb, mm);
> > > > > > +		tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
> > > > > 
> > > > > Let's assume my reply to patch 4 is wrong, and therefore we still need
> > > > > tlb_gather/finish_mmu() here. But then wouldn't this change deprive
> > > > > architectures other than ARM the opportunity to optimize based on the
> > > > > fact it's a full-mm flush?
> > > 
> > > I double checked my conclusion on patch 4, and aside from a couple
> > > of typos, it still seems correct after the weekend.
> > 
> > I still need to digest that, but I would prefer that we restore the
> > invalidation first, and then have a subsequent commit to relax it. I find
> > it hard to believe that the behaviour in mainline at the moment is deliberate.
> > 
> > That is, I'm not against optimising this, but I'd rather get it "obviously
> > correct" first and the current code is definitely not that.
> 
> I wouldn't mind having this patch and patch 4 if the invalidation they
> restore were in a correct state -- b3a81d0841a9 ("mm: fix KSM data
> corruption") isn't correct to start with.
> 
> It is complicated, so please bear with me. Let's study this by looking
> at examples this time.
> 
> > > > Only for the soft-dirty case, but I think TLB invalidation is required
> > > > there because we are write-protecting the entries and I don't see any
> > > > mechanism to handle lazy invalidation for that (compared with the aging
> > > > case, which is handled via pte_accessible()).
> > > 
> > > The lazy invalidation for that is done when we write-protect a page,
> > > not an individual PTE. When we do so, our decision is based on both
> > > the dirty bit and the writable bit on each PTE mapping this page. So
> > > we only need to make sure we don't lose both on a PTE. And we don't
> > > here.
> > 
> > Sorry, I don't follow what you're getting at here (page vs pte). Please can
> > you point me to the code you're referring to? The case I'm worried about is
> > code that holds sufficient locks (e.g. mmap_sem + ptl) finding an entry
> > where !pte_write() and assuming (despite pte_dirty()) that there can't be
> > any concurrent modifications to the mapped page. Granted, I haven't found
> > anything doing that, but I could not convince myself that it would be a bug
> > to write such code, either.
> 
> Example 1: memory corruption is still possible with patch 4 & 6
> 
>   CPU0        CPU1        CPU2        CPU3
>   ----        ----        ----        ----
>   userspace                           page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               inc_tlb_flush_pending()
>               clean_record_pte()
>               pte_mkclean()
> 
>                           tlb_gather_mmu()
>                           [set mm_tlb_flush_pending()]
>                           clear_refs_write()
>                           pte_wrprotect()
> 
>                                       page_mkclean_one()
>                                       !pte_dirty() && !pte_write()
>                                       [true, no flush]
> 
>                                       write page to disk
> 
>   Write to page
>   [using stale PTE]
> 
>                                       drop clean page
>                                       [data integrity compromised]
> 
>               flush_tlb_range()
> 
>                           tlb_finish_mmu()
>                           [flush (with patch 4)]
> 
> Example 2: why no flush when write-protecting is not a problem (after
> we fix the problem correctly by adding mm_tlb_flush_pending()).
> 
> Case a:
> 
>   CPU0        CPU1        CPU2        CPU3
>   ----        ----        ----        ----
>   userspace                           page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               inc_tlb_flush_pending()
>               clean_record_pte()
>               pte_mkclean()
> 
>                           clear_refs_write()
>                           pte_wrprotect()
> 
>                                       page_mkclean_one()
>                                       !pte_dirty() && !pte_write() &&
>                                       !mm_tlb_flush_pending()
>                                       [false: flush]
> 
>                                       write page to disk
> 
>   Write to page
>   [page fault]
> 
>                                       drop clean page
>                                       [data integrity guaranteed]
> 
>               flush_tlb_range()
> 
> Case b:
> 
>   CPU0        CPU1        CPU2
>   ----        ----        ----
>   userspace               page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               clear_refs_write()
>               pte_wrprotect()
>               [pte_dirty() is false]
> 
>                           page_mkclean_one()
>                           !pte_dirty() && !pte_write() &&
>                           !mm_tlb_flush_pending()
>                           [true: no flush]
> 
>                           write page to disk
> 
>   Write to page
>   [h/w tries to set
>    the dirty bit
>    but sees write-
>    protected PTE,
>    page fault]
> 
>                           drop clean page
>                           [data integrity guaranteed]
> 
> Case c:
> 
>   CPU0        CPU1        CPU2
>   ----        ----        ----
>   userspace               page writeback
> 
>   [cache writable
>    PTE in TLB]
> 
>               clear_refs_write()
>               pte_wrprotect()
>               [pte_dirty() is true]
> 
>                           page_mkclean_one()
>                           !pte_dirty() && !pte_write() &&
>                           !mm_tlb_flush_pending()
>                           [false: flush]
> 
>                           write page to disk
> 
>   Write to page
>   [page fault]
> 
>                           drop clean page
>                           [data integrity guaranteed]
> 
> > > > Furthermore, If we decide that we can relax the TLB invalidation
> > > > requirements here, then I'd much rather than was done deliberately, rather
> > > > than as an accidental side-effect of another commit (since I think the
> > > > current behaviour was a consequence of 7a30df49f63a).
> > > 
> > > Nope. tlb_gather/finish_mmu() should be added by b3a81d0841a9
>                                   ^^^^^^ shouldn't

I read all the examples Yu gave and I think they are all correct. Furthermore,
I agree with Yu that we didn't need the TLB gathering in clear_refs_write() in
the first place just to increase the TLB flush pending count, since MADV_FREE
already has it. However, I'd like to keep the final flushing logic in
clear_refs_write() rather than rely on luck, for better accuracy as well as
guarantees.

So, IMHO, Yu's points are all technically valid to me - we need to fix the
page writeback path. As for clear_refs_write(), Will's changes are still an
improvement (assuming Yu agrees that we need the TLB flush for accuracy, if
not for correctness), so they are still worth having.

Then, Yu, could you send the writeback path fix? Please CC Hugh, Mel,
Andrea and Nadav on the next patchset, since they are all experts in this
domain - see https://lkml.org/lkml/2015/4/15/565

> 
> Another typo, I apologize.
> 
> > > ("mm: fix KSM data corruption") in the first place.
> > 
> > Sure, but if you check out b3a81d0841a9 then you have a fullmm TLB
> > invalidation in tlb_finish_mmu(). 7a30df49f63a is what removed that, no?
> > 
> > Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state
  2020-11-23 18:41       ` Will Deacon
@ 2020-11-25 22:51         ` Minchan Kim
  0 siblings, 0 replies; 42+ messages in thread
From: Minchan Kim @ 2020-11-25 22:51 UTC (permalink / raw)
  To: Will Deacon
  Cc: kernel-team, Yu Zhao, Anshuman Khandual, Peter Zijlstra,
	Catalin Marinas, linux-kernel, linux-mm, Linus Torvalds,
	linux-arm-kernel

On Mon, Nov 23, 2020 at 06:41:14PM +0000, Will Deacon wrote:
> On Fri, Nov 20, 2020 at 07:55:14AM -0800, Minchan Kim wrote:
> > On Fri, Nov 20, 2020 at 04:00:23PM +0100, Peter Zijlstra wrote:
> > > On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > > > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > > > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > > > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > > > performed by clear_refs_write() in response to a write to
> > > > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > > > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > > > state we can end up with entries where pte_write() is false, yet a
> > > > writable mapping remains in the TLB.
> > > > 
> > > > Fix this by calling tlb_remove_tlb_entry() for each entry being
> > > > write-protected when clearing soft-dirty.
> > > > 
> > > 
> > > > @@ -1053,6 +1054,7 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> > > >  		ptent = pte_wrprotect(old_pte);
> > > >  		ptent = pte_clear_soft_dirty(ptent);
> > > >  		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
> > > > +		tlb_remove_tlb_entry(tlb, pte, addr);
> > > >  	} else if (is_swap_pte(ptent)) {
> > > >  		ptent = pte_swp_clear_soft_dirty(ptent);
> > > >  		set_pte_at(vma->vm_mm, addr, pte, ptent);
> > > 
> > > Oh!
> > > 
> > > Yesterday when you had me look at this code; I figured the sane thing
> > > to do was to make it look more like mprotect().
> > > 
> > > Why did you chose to make it work with mmu_gather instead? I'll grant
> > > you that it's probably the smaller patch, but I still think it's weird
> > > to use mmu_gather here.
> > 
> > I agree. The reason why clear_refs_write used the gather API was [1] and
> > seems like to overkill to me.
> 
> I don't see why it's overkill. Prior to that commit, it called
> flush_tlb_mm() directly.

The TLB gather was added to increase the TLB flush pending count for a
stability bug, not as a performance optimization (the commit never had any
numbers to support that, and didn't have the logic to feed each pte into the
gather), and it has now introduced a bug. So I consider it overkill, since it
added *unnecessary* complication from the beginning.

> 
> > We could just do like [inc|dec]_tlb_flush_pending with flush_tlb_mm at
> > right before dec_tlb_flush_pending instead of gather.
> > 
> > thought?
> 
> I'm not sure why this is better; it's different to the madvise() path, and
> will need special logic to avoid the flush in the case where we're just
> doing aging.

I thought it would be better to fix the bug first with a *simple* patch and
then do the optimization on top of it.
Anyway, following Yu's comments, we don't need the gather API, or even the
flush, if we give up the accuracy (though I would like to keep it).
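
For completeness, the simpler shape suggested earlier (a pending count
plus a single flush_tlb_mm(), with no mmu_gather) would look roughly like
the sketch below; the function name is made up and the page-table walk is
elided:

static void clear_refs_flush_sketch(struct mm_struct *mm)
{
	inc_tlb_flush_pending(mm);

	/* ... walk the mm, wrprotect and clear soft-dirty on each pte ... */

	flush_tlb_mm(mm);		/* one flush for the whole mm */
	dec_tlb_flush_pending(mm);	/* after the flush has completed */
}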

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2020-11-25 22:52 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-20 14:35 [PATCH 0/6] tlb: Fix access and (soft-)dirty bit management Will Deacon
2020-11-20 14:35 ` [PATCH 1/6] arm64: pgtable: Fix pte_accessible() Will Deacon
2020-11-20 16:03   ` Minchan Kim
2020-11-20 19:53   ` Yu Zhao
2020-11-23 13:27   ` Catalin Marinas
2020-11-24 10:02   ` Anshuman Khandual
2020-11-20 14:35 ` [PATCH 2/6] arm64: pgtable: Ensure dirty bit is preserved across pte_wrprotect() Will Deacon
2020-11-20 17:09   ` Minchan Kim
2020-11-23 14:31     ` Catalin Marinas
2020-11-23 14:22   ` Catalin Marinas
2020-11-20 14:35 ` [PATCH 3/6] tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu() Will Deacon
2020-11-20 17:20   ` Linus Torvalds
2020-11-23 16:48     ` Will Deacon
2020-11-20 14:35 ` [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state Will Deacon
2020-11-20 15:00   ` Peter Zijlstra
2020-11-20 15:09     ` Peter Zijlstra
2020-11-20 15:15     ` Will Deacon
2020-11-20 15:27       ` Peter Zijlstra
2020-11-23 18:23         ` Will Deacon
2020-11-20 15:55     ` Minchan Kim
2020-11-23 18:41       ` Will Deacon
2020-11-25 22:51         ` Minchan Kim
2020-11-20 20:22   ` Yu Zhao
2020-11-21  2:49     ` Yu Zhao
2020-11-23 19:21       ` Yu Zhao
2020-11-23 22:04       ` Will Deacon
2020-11-20 14:35 ` [PATCH 5/6] tlb: mmu_gather: Introduce tlb_gather_mmu_fullmm() Will Deacon
2020-11-20 17:22   ` Linus Torvalds
2020-11-20 17:31     ` Linus Torvalds
2020-11-23 16:48       ` Will Deacon
     [not found]   ` <20201122151158.GK2390@xsang-OptiPlex-9020>
2020-11-23 17:51     ` [tlb] e242a269fa: WARNING:at_mm/mmu_gather.c:#tlb_gather_mmu Will Deacon
2020-11-20 14:35 ` [PATCH 6/6] mm: proc: Avoid fullmm flush for young/dirty bit toggling Will Deacon
2020-11-20 17:41   ` Linus Torvalds
2020-11-20 17:45     ` Linus Torvalds
2020-11-20 20:40   ` Yu Zhao
2020-11-23 18:35     ` Will Deacon
2020-11-23 20:04       ` Yu Zhao
2020-11-23 21:17         ` Will Deacon
2020-11-24  1:13           ` Yu Zhao
2020-11-24 14:31             ` Will Deacon
2020-11-25 22:01             ` Minchan Kim
2020-11-24 14:46     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).