* [PATCH v2 0/5] mm: dirty/accessed pte optimisations
From: Nicholas Piggin @ 2018-10-16 13:13 UTC
  To: Andrew Morton
  Cc: Nicholas Piggin, Linus Torvalds, linux-mm, linux-arch,
	Linux Kernel Mailing List, ppc-dev, Ley Foon Tan

Since v1 I fixed the hang in nios2, split the fork patch into two
as Linus asked, and added hugetlb code for the "don't bother write
protecting already write-protected" patch.

Please consider this for more cooking in -mm.

Thanks,
Nick

Nicholas Piggin (5):
  nios2: update_mmu_cache clear the old entry from the TLB
  mm/cow: don't bother write protecting already write-protected huge
    pages
  mm/cow: optimise pte accessed bit handling in fork
  mm/cow: optimise pte dirty bit handling in fork
  mm: optimise pte dirty/accessed bit setting by demand based pte
    insertion

 arch/nios2/mm/cacheflush.c |  2 ++
 mm/huge_memory.c           | 24 ++++++++++++++++--------
 mm/hugetlb.c               |  2 +-
 mm/memory.c                | 19 +++++++++++--------
 mm/vmscan.c                |  8 ++++++++
 5 files changed, 38 insertions(+), 17 deletions(-)

-- 
2.18.0



* [PATCH v2 1/5] nios2: update_mmu_cache clear the old entry from the TLB
From: Nicholas Piggin @ 2018-10-16 13:13 UTC
  To: Andrew Morton
  Cc: Nicholas Piggin, Linus Torvalds, linux-mm, linux-arch,
	Linux Kernel Mailing List, ppc-dev, Ley Foon Tan

Fault paths like do_read_fault will install a Linux pte with the young
bit clear. The CPU will fault again because the TLB has not been
updated; this time a valid pte exists, so handle_pte_fault will just
set the young bit with ptep_set_access_flags, which flushes the TLB.

The TLB is flushed so the next attempt will go to the fast TLB handler
which loads the TLB with the new Linux pte. The access then proceeds.

This design is fragile in its dependence on the young bit being clear
after the initial Linux fault. A proposed core mm change that
immediately sets the young bit upon such a fault results in
ptep_set_access_flags not flushing the TLB, because it finds no change
to the pte. The spurious fault fix path only flushes the TLB if the
access was a store. If it was a load, the result is an infinite loop
of page faults.

This change adds a TLB flush in update_mmu_cache, which removes that
TLB entry upon the first fault. This will cause the fast TLB handler
to load the new pte and avoid the Linux page fault entirely.

Reviewed-by: Ley Foon Tan <ley.foon.tan@intel.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/nios2/mm/cacheflush.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/nios2/mm/cacheflush.c b/arch/nios2/mm/cacheflush.c
index 506f6e1c86d5..d58e7e80dc0d 100644
--- a/arch/nios2/mm/cacheflush.c
+++ b/arch/nios2/mm/cacheflush.c
@@ -204,6 +204,8 @@ void update_mmu_cache(struct vm_area_struct *vma,
 	struct page *page;
 	struct address_space *mapping;
 
+	flush_tlb_page(vma, address);
+
 	if (!pfn_valid(pfn))
 		return;
 
-- 
2.18.0



* [PATCH v2 2/5] mm/cow: don't bother write protecting already write-protected huge pages
From: Nicholas Piggin @ 2018-10-16 13:13 UTC
  To: Andrew Morton
  Cc: Nicholas Piggin, Linus Torvalds, linux-mm, linux-arch,
	Linux Kernel Mailing List, ppc-dev, Ley Foon Tan

This is the hugetlb / THP equivalent of 1b2de5d039c8 ("mm/cow: don't
bother write protecting already write-protected pages").
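
For comparison, the pte-level check added by that commit has this shape
(lightly paraphrased from copy_one_pte() in mm/memory.c); the diff
below applies the same idea at the pmd, pud and hugetlb levels:

	/*
	 * If it's a COW mapping, write protect it both in the parent
	 * and the child -- but only when the pte is writable to begin
	 * with, as there is otherwise nothing to downgrade.
	 */
	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
		ptep_set_wrprotect(src_mm, addr, src_pte);
		pte = pte_wrprotect(pte);
	}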

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/huge_memory.c | 14 ++++++++++----
 mm/hugetlb.c     |  2 +-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 58269f8ba7c4..0fb0e3025f98 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -973,8 +973,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 
-	pmdp_set_wrprotect(src_mm, addr, src_pmd);
-	pmd = pmd_mkold(pmd_wrprotect(pmd));
+	if (pmd_write(pmd)) {
+		pmdp_set_wrprotect(src_mm, addr, src_pmd);
+		pmd = pmd_wrprotect(pmd);
+	}
+	pmd = pmd_mkold(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
 	ret = 0;
@@ -1064,8 +1067,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		/* No huge zero pud yet */
 	}
 
-	pudp_set_wrprotect(src_mm, addr, src_pud);
-	pud = pud_mkold(pud_wrprotect(pud));
+	if (pud_write(pud)) {
+		pudp_set_wrprotect(src_mm, addr, src_pud);
+		pud = pud_wrprotect(pud);
+	}
+	pud = pud_mkold(pud);
 	set_pud_at(dst_mm, addr, dst_pud, pud);
 
 	ret = 0;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5c390f5a5207..54a4dcb6ee21 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3287,7 +3287,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 			set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
 		} else {
-			if (cow) {
+			if (cow && huge_pte_write(entry)) {
 				/*
 				 * No need to notify as we are downgrading page
 				 * table protection not changing it to point
-- 
2.18.0



* [PATCH v2 3/5] mm/cow: optimise pte accessed bit handling in fork
From: Nicholas Piggin @ 2018-10-16 13:13 UTC
  To: Andrew Morton
  Cc: Nicholas Piggin, Linus Torvalds, linux-mm, linux-arch,
	Linux Kernel Mailing List, ppc-dev, Ley Foon Tan

fork clears dirty/accessed bits from new ptes in the child. This logic
has existed since mapped page reclaim was done by scanning ptes, when
it may have been quite important. Today, with physically based pte
scanning, there is less reason to clear these bits, so this patch
avoids clearing the accessed bit in the child.

One accessed bit is treated much the same as many: the difference
today is that more than one referenced pte causes the page to be
activated, while exactly one causes it to be kept. This patch causes
pages shared by fork(2) to be more readily activated, but this
heuristic is very fuzzy anyway -- a page can be accessed by multiple
threads via a single pte and be just as important as one that is
accessed via multiple ptes, for example. In the end I don't believe
fork(2) is a significant enough driver of page reclaim behaviour for
this to matter much.
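
The heuristic in question is in page_check_references(); simplified
from mm/vmscan.c of this era (several cases elided), it looks like:

	referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
					  &vm_flags);
	referenced_page = TestClearPageReferenced(page);

	if (referenced_ptes) {
		if (PageSwapBacked(page))
			return PAGEREF_ACTIVATE;
		SetPageReferenced(page);
		if (referenced_page || referenced_ptes > 1)
			return PAGEREF_ACTIVATE;	/* used more than once */
		if (vm_flags & VM_EXEC)
			return PAGEREF_ACTIVATE;
		return PAGEREF_KEEP;	/* spare it another trip */
	}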

This and the following change eliminate a major source of the faults
that powerpc/radix takes in order to set dirty/accessed bits in ptes,
speeding up a fork/exit microbenchmark by about 5% on POWER9 (16600 ->
17500 fork/execs per second).

Skylake appears to have a micro-fault overhead too -- in a test which
allocates 4GB of anonymous memory, reads each page, then forks and
times the child reading a byte from each page, the first child pass
over the pages takes about 1000 cycles per page and the second pass
takes about 27 cycles (TLB miss). With no additional minor faults
measured during either child pass, and the page array well exceeding
TLB capacity, the large cost must come from micro-faults taken to set
the accessed bit.
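
A minimal sketch of such a test (all names and details here are mine,
not taken from the patch; the populating pass uses stores so the pages
are really instantiated rather than zero-page mapped):

	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/wait.h>

	#define TEST_SIZE (4UL << 30)	/* 4GB of anonymous memory */

	static double now(void)
	{
		struct timespec ts;
		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	int main(void)
	{
		size_t page = sysconf(_SC_PAGESIZE), i;
		volatile char *mem = mmap(NULL, TEST_SIZE,
					  PROT_READ | PROT_WRITE,
					  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		char sink = 0;

		if (mem == MAP_FAILED)
			return 1;
		for (i = 0; i < TEST_SIZE; i += page)	/* populate parent */
			mem[i] = 1;

		if (fork() == 0) {
			double t = now();
			for (i = 0; i < TEST_SIZE; i += page)
				sink += mem[i];		/* first child pass */
			printf("first pass:  %f s\n", now() - t);
			t = now();
			for (i = 0; i < TEST_SIZE; i += page)
				sink += mem[i];		/* second child pass */
			printf("second pass: %f s\n", now() - t);
			_exit(sink);
		}
		wait(NULL);
		return 0;
	}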

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/huge_memory.c | 2 --
 mm/memory.c      | 1 -
 mm/vmscan.c      | 8 ++++++++
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0fb0e3025f98..1f43265204d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -977,7 +977,6 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pmdp_set_wrprotect(src_mm, addr, src_pmd);
 		pmd = pmd_wrprotect(pmd);
 	}
-	pmd = pmd_mkold(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 
 	ret = 0;
@@ -1071,7 +1070,6 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		pudp_set_wrprotect(src_mm, addr, src_pud);
 		pud = pud_wrprotect(pud);
 	}
-	pud = pud_mkold(pud);
 	set_pud_at(dst_mm, addr, dst_pud, pud);
 
 	ret = 0;
diff --git a/mm/memory.c b/mm/memory.c
index c467102a5cbc..0387ee1e3582 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1033,7 +1033,6 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 */
 	if (vm_flags & VM_SHARED)
 		pte = pte_mkclean(pte);
-	pte = pte_mkold(pte);
 
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5ef7240cbcb..e72d5b3336a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1031,6 +1031,14 @@ static enum page_references page_check_references(struct page *page,
 		 * to look twice if a mapped file page is used more
 		 * than once.
 		 *
+		 * fork() will leave referenced bits set in child ptes
+		 * even though the pages have not been accessed, to avoid
+		 * micro-faults from setting accessed bits. This heuristic
+		 * is not perfectly accurate in other ways either --
+		 * multiple map/unmap in the same time window would be
+		 * treated as multiple references despite the same number
+		 * of actual memory accesses made by the program.
+		 *
 		 * Mark it and spare it for another trip around the
 		 * inactive list.  Another page table reference will
 		 * lead to its activation.
-- 
2.18.0



* [PATCH v2 4/5] mm/cow: optimise pte dirty bit handling in fork
From: Nicholas Piggin @ 2018-10-16 13:13 UTC
  To: Andrew Morton
  Cc: Nicholas Piggin, Linus Torvalds, linux-mm, linux-arch,
	Linux Kernel Mailing List, ppc-dev, Ley Foon Tan

fork clears dirty/accessed bits from new ptes in the child. This logic
has existed since mapped page reclaim was done by scanning ptes, when
it may have been quite important. Today, with physically based pte
scanning, there is less reason to clear these bits, so this patch
avoids clearing the dirty bit in the child.

Dirty bits are all tested and cleared together, and any dirty bit is
the same as many dirty bits, so from a correctness and writeback
bandwidth point-of-view it does not matter if the child gets a dirty
bit.

Dirty ptes are more costly to unmap because they require flushing
under the page table lock, but it is pretty rare to have a shared
dirty mapping that is copied on fork, so just simplify the code and
avoid this dirty clearing logic.
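
In terms of code, the net effect of this and the previous patch on the
pte copy path is simply (a sketch of the removed logic, not a verbatim
diff):

	/* before: */
	if (vm_flags & VM_SHARED)
		pte = pte_mkclean(pte);
	pte = pte_mkold(pte);

	/* after: neither bit is cleared; the child inherits the dirty
	 * and young bits exactly as they are in the parent. */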

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/memory.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0387ee1e3582..9e314339a0bd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1028,11 +1028,12 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	}
 
 	/*
-	 * If it's a shared mapping, mark it clean in
-	 * the child
+	 * Child inherits dirty and young bits from parent. There is no
+	 * point clearing them because any cleaning or aging has to walk
+	 * all ptes anyway, and it will notice the bits set in the parent.
+	 * Leaving them set avoids stalls and even page faults on CPUs that
+	 * handle these bits in software.
 	 */
-	if (vm_flags & VM_SHARED)
-		pte = pte_mkclean(pte);
 
 	page = vm_normal_page(vma, addr, pte);
 	if (page) {
-- 
2.18.0



* [PATCH v2 5/5] mm: optimise pte dirty/accessed bit setting by demand based pte insertion
From: Nicholas Piggin @ 2018-10-16 13:13 UTC
  To: Andrew Morton
  Cc: Nicholas Piggin, Linus Torvalds, linux-mm, linux-arch,
	Linux Kernel Mailing List, ppc-dev, Ley Foon Tan

Similarly to the previous patches, this optimises the handling of
dirty/accessed bits in ptes by setting the bits at pte insertion time,
when the type of access is already known, avoiding the cost of the
hardware setting them afterwards.
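
The pattern applied throughout the diff below is (sketch only;
write_fault stands in for the various per-path conditions such as
FAULT_FLAG_WRITE or vma->vm_flags & VM_WRITE):

	entry = mk_pte(page, vma->vm_page_prot);
	entry = pte_mkyoung(entry);	/* the faulting access uses the page */
	if (write_fault)		/* and a store will dirty it */
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	set_pte_at(mm, addr, ptep, entry);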

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 mm/huge_memory.c | 12 ++++++++----
 mm/memory.c      |  9 ++++++---
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1f43265204d4..38c2cd3b4879 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1197,6 +1197,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
 	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
 		pte_t entry;
 		entry = mk_pte(pages[i], vma->vm_page_prot);
+		entry = pte_mkyoung(entry);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		memcg = (void *)page_private(pages[i]);
 		set_page_private(pages[i], 0);
@@ -2067,7 +2068,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
-	bool young, write, soft_dirty, pmd_migration = false;
+	bool young, write, dirty, soft_dirty, pmd_migration = false;
 	unsigned long addr;
 	int i;
 
@@ -2145,7 +2146,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		page = pmd_page(old_pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
-	if (pmd_dirty(old_pmd))
+	dirty = pmd_dirty(old_pmd);
+	if (dirty)
 		SetPageDirty(page);
 	write = pmd_write(old_pmd);
 	young = pmd_young(old_pmd);
@@ -2176,8 +2178,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			entry = maybe_mkwrite(entry, vma);
 			if (!write)
 				entry = pte_wrprotect(entry);
-			if (!young)
-				entry = pte_mkold(entry);
+			if (young)
+				entry = pte_mkyoung(entry);
+			if (dirty)
+				entry = pte_mkdirty(entry);
 			if (soft_dirty)
 				entry = pte_mksoft_dirty(entry);
 		}
diff --git a/mm/memory.c b/mm/memory.c
index 9e314339a0bd..f907ea7a6303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1804,10 +1804,9 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 		entry = pte_mkspecial(pfn_t_pte(pfn, prot));
 
 out_mkwrite:
-	if (mkwrite) {
-		entry = pte_mkyoung(entry);
+	entry = pte_mkyoung(entry);
+	if (mkwrite)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	}
 
 	set_pte_at(mm, addr, pte, entry);
 	update_mmu_cache(vma, addr, pte); /* XXX: why not for insert_page? */
@@ -2534,6 +2533,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		}
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
+		entry = pte_mkyoung(entry);
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
@@ -3043,6 +3043,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
+	pte = pte_mkyoung(pte);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
@@ -3185,6 +3186,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	__SetPageUptodate(page);
 
 	entry = mk_pte(page, vma->vm_page_prot);
+	entry = pte_mkyoung(entry);
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
@@ -3453,6 +3455,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 
 	flush_icache_page(vma, page);
 	entry = mk_pte(page, vma->vm_page_prot);
+	entry = pte_mkyoung(entry);
 	if (write)
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 	/* copy-on-write page */
-- 
2.18.0

