linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/25] mm: Page fault accounting cleanups
@ 2020-06-15 22:15 Peter Xu
  2020-06-15 22:15 ` [PATCH 01/25] mm/um: Fix extra accounting for page fault retries Peter Xu
                   ` (25 more replies)
  0 siblings, 26 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds, Andrea Arcangeli

Gerald Schaefer reported an issue a week ago regarding incorrect page fault
accounting for retried page faults after commit 4064b9827063 ("mm: allow
VM_FAULT_RETRY for multiple times"):

  https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

While looking at that issue, I found a few other trivial problems related to
page fault accounting, namely:

  - Incorrect accounting

    - The main issue reported by Gerald: we did the accounting multiple times
      when the page fault was retried, while we should only do it once (for
      both perf event accounting and per-task accounting).

    - On many archs, a major fault is only accounted when either the first or
      the last attempt of the page fault is a major fault.  Ideally we should
      account the fault as a major fault as long as any of the retried
      attempts was a major fault, and keep that behavior consistent across
      archs.

  - Missing accounting of perf events (PERF_COUNT_SW_PAGE_FAULTS[_[MAJ|MIN]]).

  - Doing accounting while holding mmap_sem: not a big deal, but logically we
    can move it to after releasing mmap_sem because accounting does not need
    the lock.

This series tries to address all of them by first introducing
mm_fault_accounting(), which moves all the page fault accounting into common
code, and then calling it from the arch page fault handlers alongside
handle_mm_fault().  There are some special cases, e.g., the um arch does not
have "regs" available in its fault handler, so it keeps the manual statistics.
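
For illustration, the typical per-arch conversion then looks roughly like the
sketch below (sketch only: retry labels, fault flags, signal/error handling
and variable names all differ between archs):

  #include <linux/mm.h>
  #include <linux/sched.h>

  /* Illustrative sketch, not an actual arch fault handler. */
  static void example_fault_path(struct mm_struct *mm, struct pt_regs *regs,
                                 unsigned long address)
  {
          unsigned int flags = FAULT_FLAG_DEFAULT;
          vm_fault_t fault, major = 0;
          struct vm_area_struct *vma;

  retry:
          down_read(&mm->mmap_sem);
          vma = find_vma(mm, address);
          /* (vma validity and access checks omitted in this sketch) */

          fault = handle_mm_fault(vma, address, flags);
          /* Remember whether any attempt was a major fault */
          major |= fault & VM_FAULT_MAJOR;

          if (fault & VM_FAULT_RETRY) {
                  /* handle_mm_fault() already dropped mmap_sem for us */
                  flags |= FAULT_FLAG_TRIED;
                  goto retry;
          }

          up_read(&mm->mmap_sem);
          /* Account exactly once, covering all retried attempts */
          mm_fault_accounting(current, regs, address, major);
  }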

For each patch that touches a specific arch, I'm CCing the maintainers and the
arch list where one exists.  Note that I have only lightly tested this series,
on x86.

Please have a look, thanks.

Peter Xu (25):
  mm/um: Fix extra accounting for page fault retries
  mm: Introduce mm_fault_accounting()
  mm/alpha: Use mm_fault_accounting()
  mm/arc: Use mm_fault_accounting()
  mm/arm: Use mm_fault_accounting()
  mm/arm64: Use mm_fault_accounting()
  mm/csky: Use mm_fault_accounting()
  mm/hexagon: Use mm_fault_accounting()
  mm/ia64: Use mm_fault_accounting()
  mm/m68k: Use mm_fault_accounting()
  mm/microblaze: Use mm_fault_accounting()
  mm/mips: Use mm_fault_accounting()
  mm/nds32: Use mm_fault_accounting()
  mm/nios2: Use mm_fault_accounting()
  mm/openrisc: Use mm_fault_accounting()
  mm/parisc: Use mm_fault_accounting()
  mm/powerpc: Use mm_fault_accounting()
  mm/riscv: Use mm_fault_accounting()
  mm/s390: Use mm_fault_accounting()
  mm/sh: Use mm_fault_accounting()
  mm/sparc32: Use mm_fault_accounting()
  mm/sparc64: Use mm_fault_accounting()
  mm/unicore32: Use mm_fault_accounting()
  mm/x86: Use mm_fault_accounting()
  mm/xtensa: Use mm_fault_accounting()

 arch/alpha/mm/fault.c      |  9 ++++-----
 arch/arc/mm/fault.c        | 15 +++------------
 arch/arm/mm/fault.c        | 21 ++++-----------------
 arch/arm64/mm/fault.c      | 17 ++---------------
 arch/csky/mm/fault.c       | 13 +------------
 arch/hexagon/mm/vm_fault.c |  9 +++------
 arch/ia64/mm/fault.c       |  8 +++-----
 arch/m68k/mm/fault.c       | 13 +++----------
 arch/microblaze/mm/fault.c |  8 +++-----
 arch/mips/mm/fault.c       | 14 +++-----------
 arch/nds32/mm/fault.c      | 19 +++----------------
 arch/nios2/mm/fault.c      | 13 +++----------
 arch/openrisc/mm/fault.c   |  8 +++-----
 arch/parisc/mm/fault.c     |  8 +++-----
 arch/powerpc/mm/fault.c    | 13 ++++---------
 arch/riscv/mm/fault.c      | 21 +++------------------
 arch/s390/mm/fault.c       | 21 +++++----------------
 arch/sh/mm/fault.c         | 15 +++------------
 arch/sparc/mm/fault_32.c   | 16 +++-------------
 arch/sparc/mm/fault_64.c   | 16 ++++------------
 arch/um/kernel/trap.c      | 13 +++++++------
 arch/unicore32/mm/fault.c  | 15 +++++++--------
 arch/x86/mm/fault.c        | 10 +---------
 arch/xtensa/mm/fault.c     | 14 +++-----------
 include/linux/mm.h         |  2 ++
 mm/memory.c                | 18 ++++++++++++++++++
 26 files changed, 101 insertions(+), 248 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 01/25] mm/um: Fix extra accounting for page fault retries
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 02/25] mm: Introduce mm_fault_accounting() Peter Xu
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Jeff Dike, Richard Weinberger, Anton Ivanov,
	linux-um

When a page fault is retried, we should only do the accounting once rather
than once per retry.

CC: Jeff Dike <jdike@addtoit.com>
CC: Richard Weinberger <richard@nod.at>
CC: Anton Ivanov <anton.ivanov@cambridgegreys.com>
CC: linux-um@lists.infradead.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/um/kernel/trap.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 8f18cf56b3dd..d162168490d1 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -34,6 +34,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 	pte_t *pte;
 	int err = -EFAULT;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
+	vm_fault_t fault, major = 0;
 
 	*code_out = SEGV_MAPERR;
 
@@ -73,9 +74,8 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 	}
 
 	do {
-		vm_fault_t fault;
-
 		fault = handle_mm_fault(vma, address, flags);
+		major |= fault & VM_FAULT_MAJOR;
 
 		if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
 			goto out_nosemaphore;
@@ -92,10 +92,6 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 			BUG();
 		}
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			if (fault & VM_FAULT_MAJOR)
-				current->maj_flt++;
-			else
-				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
 				flags |= FAULT_FLAG_TRIED;
 
@@ -103,6 +99,11 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 			}
 		}
 
+		if (major)
+			current->maj_flt++;
+		else
+			current->min_flt++;
+
 		pgd = pgd_offset(mm, address);
 		p4d = p4d_offset(pgd, address);
 		pud = pud_offset(p4d, address);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 02/25] mm: Introduce mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
  2020-06-15 22:15 ` [PATCH 01/25] mm/um: Fix extra accounting for page fault retries Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:32   ` Linus Torvalds
  2020-06-15 22:15 ` [PATCH 03/25] mm/alpha: Use mm_fault_accounting() Peter Xu
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds, Andrea Arcangeli

Provide this helper for doing memory page fault accounting across archs.  It
can be defined unconditionally because perf_sw_event() is always defined, and
perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.
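
A typical call site in an arch page fault handler would look roughly like this
(sketch only; the per-arch patches later in the series show the real
conversions):

  /* "major" is the OR of VM_FAULT_MAJOR across all attempts of this fault */
  up_read(&mm->mmap_sem);
  mm_fault_accounting(current, regs, address, major);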

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3fe7371855c..b96db64125be 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1649,6 +1649,8 @@ void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int truncate_inode_page(struct address_space *mapping, struct page *page);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
 int invalidate_inode_page(struct page *page);
+void mm_fault_accounting(struct task_struct *task, struct pt_regs *regs,
+			 unsigned long address, bool major);
 
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index f703fe8c8346..c4f8319f0ed8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -71,6 +71,8 @@
 #include <linux/dax.h>
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/perf_event.h>
+#include <linux/ptrace.h>
 
 #include <trace/events/kmem.h>
 
@@ -4397,6 +4399,22 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
 
+void mm_fault_accounting(struct task_struct *task, struct pt_regs *regs,
+			 unsigned long address, bool major)
+{
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+	if (major) {
+		task->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
+			      regs, address);
+	} else {
+		task->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
+			      regs, address);
+	}
+}
+EXPORT_SYMBOL_GPL(mm_fault_accounting);
+
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
  * Allocate p4d page table.
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 03/25] mm/alpha: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
  2020-06-15 22:15 ` [PATCH 01/25] mm/um: Fix extra accounting for page fault retries Peter Xu
  2020-06-15 22:15 ` [PATCH 02/25] mm: Introduce mm_fault_accounting() Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 04/25] mm/arc: " Peter Xu
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Richard Henderson, Ivan Kokshaysky,
	Matt Turner, linux-alpha

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will now be accounted when
CONFIG_PERF_EVENTS is enabled.

CC: Richard Henderson <rth@twiddle.net>
CC: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
CC: Matt Turner <mattst88@gmail.com>
CC: linux-alpha@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/alpha/mm/fault.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index c2d7b6d7bac7..4f8632ddef25 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -88,7 +88,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	struct mm_struct *mm = current->mm;
 	const struct exception_table_entry *fixup;
 	int si_code = SEGV_MAPERR;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
@@ -149,6 +149,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	   make sure we exit gracefully rather than endlessly redo
 	   the fault.  */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -164,10 +165,6 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	}
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR)
-			current->maj_flt++;
-		else
-			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -182,6 +179,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 
 	up_read(&mm->mmap_sem);
 
+	mm_fault_accounting(current, regs, address, major);
+
 	return;
 
 	/* Something tried to access memory that isn't in our memory map.
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 04/25] mm/arc: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (2 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 03/25] mm/alpha: Use mm_fault_accounting() Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 05/25] mm/arm: " Peter Xu
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Vineet Gupta, linux-snps-arc

Use the new mm_fault_accounting() helper for page fault accounting.

The functional change here is that the whole page fault is now counted as a
major fault as long as any of the retried attempts was a major fault.
Previously only the last attempt was considered.

CC: Vineet Gupta <vgupta@synopsys.com>
CC: linux-snps-arc@lists.infradead.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arc/mm/fault.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 92b339c7adba..bc89d4b9c59d 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -72,6 +72,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	int sig, si_code = SEGV_MAPERR;
 	unsigned int write = 0, exec = 0, mask;
 	vm_fault_t fault = VM_FAULT_SIGSEGV;	/* handle_mm_fault() output */
+	vm_fault_t major = 0;
 	unsigned int flags;			/* handle_mm_fault() input */
 
 	/*
@@ -132,6 +133,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	}
 
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	/* Quick path to respond to signals */
 	if (fault_signal_pending(fault, regs)) {
@@ -156,20 +158,9 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	 * Major/minor page fault accounting
 	 * (in case of retry we only land here once)
 	 */
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	if (likely(!(fault & VM_FAULT_ERROR))) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-				      regs, address);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-				      regs, address);
-		}
-
 		/* Normal return path: fault Handled Gracefully */
+		mm_fault_accounting(tsk, regs, address, major);
 		return;
 	}
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 05/25] mm/arm: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (3 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 04/25] mm/arc: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 06/25] mm/arm64: " Peter Xu
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Russell King, Will Deacon, linux-arm-kernel

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Meanwhile, count the page fault as a major fault as long as any of the retried
attempts was a major fault.

CC: Russell King <linux@armlinux.org.uk>
CC: Will Deacon <will@kernel.org>
CC: linux-arm-kernel@lists.infradead.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm/mm/fault.c | 21 ++++-----------------
 1 file changed, 4 insertions(+), 17 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 2dd5c41cbb8d..92d4436e74da 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -240,7 +240,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	int sig, code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	if (kprobe_page_fault(regs, fsr))
@@ -290,6 +290,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	}
 
 	fault = __do_page_fault(mm, addr, fsr, flags, tsk);
+	major |= fault & VM_FAULT_MAJOR;
 
 	/* If we need to retry but a fatal signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
@@ -301,23 +302,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 		return 0;
 	}
 
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
-
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 	if (!(fault & VM_FAULT_ERROR) && flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-					regs, addr);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-					regs, addr);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
@@ -326,6 +311,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 
 	up_read(&mm->mmap_sem);
 
+	mm_fault_accounting(tsk, regs, addr, major);
+
 	/*
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
 	 */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 06/25] mm/arm64: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (4 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 05/25] mm/arm: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-16  7:43   ` Will Deacon
  2020-06-15 22:15 ` [PATCH 07/25] mm/csky: " Peter Xu
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Catalin Marinas, Will Deacon, linux-arm-kernel

Use the new mm_fault_accounting() helper for page fault accounting.

CC: Catalin Marinas <catalin.marinas@arm.com>
CC: Will Deacon <will@kernel.org>
CC: linux-arm-kernel@lists.infradead.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/arm64/mm/fault.c | 17 ++---------------
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c9cedc0432d2..09af7d7a60ec 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -484,8 +484,6 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 					 addr, esr, regs);
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
-
 	/*
 	 * As per x86, we may deadlock here. However, since the kernel only
 	 * validly references user space from well defined areas of the code,
@@ -535,20 +533,9 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 			      VM_FAULT_BADACCESS)))) {
 		/*
 		 * Major/minor page fault accounting is only done
-		 * once. If we go through a retry, it is extremely
-		 * likely that the page will be found in page cache at
-		 * that point.
+		 * once.
 		 */
-		if (major) {
-			current->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
-				      addr);
-		} else {
-			current->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
-				      addr);
-		}
-
+		mm_fault_accounting(current, regs, addr, major);
 		return 0;
 	}
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (5 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 06/25] mm/arm64: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-17  7:04   ` Guo Ren
  2020-06-15 22:15 ` [PATCH 08/25] mm/hexagon: " Peter Xu
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Guo Ren, linux-csky

Use the new mm_fault_accounting() helper for page fault accounting.
Also, move the accounting to after releasing mmap_sem.

CC: Guo Ren <guoren@kernel.org>
CC: linux-csky@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/csky/mm/fault.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
index 4e6dc68f3258..8f8d34d27eca 100644
--- a/arch/csky/mm/fault.c
+++ b/arch/csky/mm/fault.c
@@ -111,8 +111,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 		return;
 	}
 #endif
-
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	/*
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
@@ -160,17 +158,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 			goto bad_area;
 		BUG();
 	}
-	if (fault & VM_FAULT_MAJOR) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
-			      address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
-			      address);
-	}
-
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR);
 	return;
 
 	/*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 08/25] mm/hexagon: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (6 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 07/25] mm/csky: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 09/25] mm/ia64: " Peter Xu
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Brian Cain, linux-hexagon

Use the new mm_fault_accounting() helper for page fault accounting.

The perf events for page faults will now be accounted as well when
CONFIG_PERF_EVENTS is enabled.  Also, do the accounting after releasing
mmap_sem.

CC: Brian Cain <bcain@codeaurora.org>
CC: linux-hexagon@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/hexagon/mm/vm_fault.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 72334b26317a..5fbccfce6083 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -39,7 +39,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	struct mm_struct *mm = current->mm;
 	int si_signo;
 	int si_code = SEGV_MAPERR;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	const struct exception_table_entry *fixup;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
@@ -90,6 +90,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	}
 
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -97,17 +98,13 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	/* The most common case -- we are done. */
 	if (likely(!(fault & VM_FAULT_ERROR))) {
 		if (flags & FAULT_FLAG_ALLOW_RETRY) {
-			if (fault & VM_FAULT_MAJOR)
-				current->maj_flt++;
-			else
-				current->min_flt++;
 			if (fault & VM_FAULT_RETRY) {
 				flags |= FAULT_FLAG_TRIED;
 				goto retry;
 			}
 		}
-
 		up_read(&mm->mmap_sem);
+		mm_fault_accounting(current, regs, address, major);
 		return;
 	}
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 09/25] mm/ia64: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (7 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 08/25] mm/hexagon: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 10/25] mm/m68k: " Peter Xu
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds, Andrea Arcangeli

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will now be accounted when
CONFIG_PERF_EVENTS is enabled.  Do the accounting after releasing mmap_sem.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/ia64/mm/fault.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 30d0c1fca99e..1b7c1a537865 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -64,7 +64,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	struct vm_area_struct *vma, *prev_vma;
 	struct mm_struct *mm = current->mm;
 	unsigned long mask;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
@@ -140,6 +140,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	 * fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -162,10 +163,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	}
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR)
-			current->maj_flt++;
-		else
-			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -179,6 +176,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(current, regs, address, major);
 	return;
 
   check_expansion:
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 10/25] mm/m68k: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (8 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 09/25] mm/ia64: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 11/25] mm/microblaze: " Peter Xu
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Geert Uytterhoeven, linux-m68k

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will now be accounted when
CONFIG_PERF_EVENTS is enabled.  Do the accounting after releasing mmap_sem.

CC: Geert Uytterhoeven <geert@linux-m68k.org>
CC: linux-m68k@lists.linux-m68k.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/m68k/mm/fault.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 3bfb5c8ac3c7..6c6d6d77a1be 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -70,7 +70,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct * vma;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	pr_debug("do page fault:\nregs->sr=%#x, regs->pc=%#lx, address=%#lx, %ld, %p\n",
@@ -136,6 +136,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 	pr_debug("handle_mm_fault returns %x\n", fault);
 
 	if (fault_signal_pending(fault, regs))
@@ -151,16 +152,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		BUG();
 	}
 
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR)
-			current->maj_flt++;
-		else
-			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -175,6 +167,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(current, regs, address, major);
 	return 0;
 
 /*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 11/25] mm/microblaze: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (9 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 10/25] mm/m68k: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 12/25] mm/mips: " Peter Xu
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Michal Simek

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will now be accounted when
CONFIG_PERF_EVENTS is enabled.

CC: Michal Simek <monstr@monstr.eu>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/microblaze/mm/fault.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 3248141f8ed5..db0f3b71c3fe 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -90,7 +90,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct mm_struct *mm = current->mm;
 	int code = SEGV_MAPERR;
 	int is_write = error_code & ESR_S;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	regs->ear = address;
@@ -216,6 +216,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -231,10 +232,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	}
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (unlikely(fault & VM_FAULT_MAJOR))
-			current->maj_flt++;
-		else
-			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -249,6 +246,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(current, regs, address, major);
 
 	/*
 	 * keep track of tlb+htab misses that are good addrs but
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 12/25] mm/mips: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (10 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 11/25] mm/microblaze: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 13/25] mm/nds32: " Peter Xu
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Thomas Bogendoerfer, linux-mips

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
While at it, move the accounting outside of mmap_sem, since the lock is not
needed for it.

CC: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
CC: linux-mips@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/mips/mm/fault.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index f8d62cd83b36..0b937fb12614 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -43,7 +43,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	struct mm_struct *mm = tsk->mm;
 	const int field = sizeof(unsigned long) * 2;
 	int si_code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10);
@@ -153,11 +153,11 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		if (fault & VM_FAULT_OOM)
 			goto out_of_memory;
@@ -168,15 +168,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 		BUG();
 	}
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-						  regs, address);
-			tsk->maj_flt++;
-		} else {
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-						  regs, address);
-			tsk->min_flt++;
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -191,6 +182,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, address, major);
 	return;
 
 /*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 13/25] mm/nds32: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (11 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 12/25] mm/mips: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-17  1:05   ` Greentime Hu
  2020-06-15 22:15 ` [PATCH 14/25] mm/nios2: " Peter Xu
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Nick Hu, Greentime Hu, Vincent Chen

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
While at it, move the accounting to after releasing mmap_sem.

CC: Nick Hu <nickhu@andestech.com>
CC: Greentime Hu <green.hu@gmail.com>
CC: Vincent Chen <deanbo422@gmail.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/nds32/mm/fault.c | 19 +++----------------
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
index f331e533edc2..11537c03ab1d 100644
--- a/arch/nds32/mm/fault.c
+++ b/arch/nds32/mm/fault.c
@@ -78,7 +78,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	int si_code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int mask = VM_ACCESS_FLAGS;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
@@ -208,6 +208,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	 */
 
 	fault = handle_mm_fault(vma, addr, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	/*
 	 * If we need to retry but a fatal signal is pending, handle the
@@ -229,22 +230,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 			goto bad_area;
 	}
 
-	/*
-	 * Major/minor page fault accounting is only done on the initial
-	 * attempt. If we go through a retry, it is extremely likely that the
-	 * page will be found in page cache at that point.
-	 */
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
-				      1, regs, addr);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
-				      1, regs, addr);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -257,6 +243,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, addr, major);
 	return;
 
 	/*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 14/25] mm/nios2: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (12 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 13/25] mm/nds32: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 15/25] mm/openrisc: " Peter Xu
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Ley Foon Tan

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will now be accounted when
CONFIG_PERF_EVENTS is enabled.

CC: Ley Foon Tan <ley.foon.tan@intel.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/nios2/mm/fault.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c
index ec9d8a9c426f..48617960a07e 100644
--- a/arch/nios2/mm/fault.c
+++ b/arch/nios2/mm/fault.c
@@ -46,7 +46,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
 	int code = SEGV_MAPERR;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	cause >>= 2;
@@ -132,6 +132,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -146,16 +147,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 		BUG();
 	}
 
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR)
-			current->maj_flt++;
-		else
-			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -170,6 +162,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(current, regs, address, major);
 	return;
 
 /*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 15/25] mm/openrisc: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (13 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 14/25] mm/nios2: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-16 18:11   ` Stafford Horne
  2020-06-15 22:15 ` [PATCH 16/25] mm/parisc: " Peter Xu
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, openrisc

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will now be accounted when
CONFIG_PERF_EVENTS is enabled.

CC: Jonas Bonn <jonas@southpole.se>
CC: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
CC: Stafford Horne <shorne@gmail.com>
CC: openrisc@lists.librecores.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/openrisc/mm/fault.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 8af1cc78c4fb..594195ae8cdb 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -49,7 +49,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	int si_code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
@@ -160,6 +160,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -176,10 +177,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
 		/*RGD modeled on Cris */
-		if (fault & VM_FAULT_MAJOR)
-			tsk->maj_flt++;
-		else
-			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -193,6 +190,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, address, major);
 	return;
 
 	/*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 16/25] mm/parisc: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (14 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 15/25] mm/openrisc: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:15 ` [PATCH 17/25] mm/powerpc: " Peter Xu
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, James E . J . Bottomley, Helge Deller,
	linux-parisc

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will now be accounted when
CONFIG_PERF_EVENTS is enabled.

CC: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
CC: Helge Deller <deller@gmx.de>
CC: linux-parisc@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/parisc/mm/fault.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 86e8c848f3d7..eab1ee8d18c6 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -263,7 +263,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	unsigned long acc_type;
-	vm_fault_t fault = 0;
+	vm_fault_t fault = 0, major = 0;
 	unsigned int flags;
 
 	if (faulthandler_disabled())
@@ -303,6 +303,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	 */
 
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -323,10 +324,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 		BUG();
 	}
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR)
-			current->maj_flt++;
-		else
-			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			/*
 			 * No need to up_read(&mm->mmap_sem) as we would
@@ -338,6 +335,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 		}
 	}
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(current, regs, address, major);
 	return;
 
 check_expansion:
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 17/25] mm/powerpc: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (15 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 16/25] mm/parisc: " Peter Xu
@ 2020-06-15 22:15 ` Peter Xu
  2020-06-15 22:16 ` [PATCH 18/25] mm/riscv: " Peter Xu
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, linuxppc-dev

Use the new mm_fault_accounting() helper for page fault accounting.

cmo_account_page_fault() is special.  Keep that.

CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/powerpc/mm/fault.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 84af6c8eecf7..6043b639ae42 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -481,8 +481,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (!arch_irq_disabled_regs(regs))
 		local_irq_enable();
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	if (error_code & DSISR_KEYFAULT)
 		return bad_key_fault_exception(regs, address,
 					       get_mm_addr_key(mm, address));
@@ -604,14 +602,11 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	/*
 	 * Major/minor page fault accounting.
 	 */
-	if (major) {
-		current->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+	if (major)
 		cmo_account_page_fault();
-	} else {
-		current->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-	}
+
+	mm_fault_accounting(current, regs, address, major);
+
 	return 0;
 }
 NOKPROBE_SYMBOL(__do_page_fault);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 18/25] mm/riscv: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (16 preceding siblings ...)
  2020-06-15 22:15 ` [PATCH 17/25] mm/powerpc: " Peter Xu
@ 2020-06-15 22:16 ` Peter Xu
  2020-06-18 23:49   ` Palmer Dabbelt
  2020-06-15 22:23 ` [PATCH 19/25] mm/s390: " Peter Xu
                   ` (7 subsequent siblings)
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:16 UTC (permalink / raw)
  To: linux-kernel
  Cc: Gerald Schaefer, Andrew Morton, peterx, Linus Torvalds,
	Andrea Arcangeli, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	linux-riscv

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: Paul Walmsley <paul.walmsley@sifive.com>
CC: Palmer Dabbelt <palmer@dabbelt.com>
CC: Albert Ou <aou@eecs.berkeley.edu>
CC: linux-riscv@lists.infradead.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/riscv/mm/fault.c | 21 +++------------------
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index be84e32adc4c..9262338614d1 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -30,7 +30,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	unsigned long addr, cause;
-	unsigned int flags = FAULT_FLAG_DEFAULT;
+	unsigned int flags = FAULT_FLAG_DEFAULT, major = 0;
 	int code = SEGV_MAPERR;
 	vm_fault_t fault;
 
@@ -65,9 +65,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
-
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
-
 retry:
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, addr);
@@ -111,6 +108,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, addr, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	/*
 	 * If we need to retry but a fatal signal is pending, handle the
@@ -128,21 +126,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 		BUG();
 	}
 
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
-				      1, regs, addr);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
-				      1, regs, addr);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -156,6 +140,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, addr, major);
 	return;
 
 	/*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (17 preceding siblings ...)
  2020-06-15 22:16 ` [PATCH 18/25] mm/riscv: " Peter Xu
@ 2020-06-15 22:23 ` Peter Xu
  2020-06-16 15:59   ` Alexander Gordeev
  2020-06-15 22:23 ` [PATCH 20/25] mm/sh: " Peter Xu
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Xu, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, linux-s390

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: Vasily Gorbik <gor@linux.ibm.com>
CC: Christian Borntraeger <borntraeger@de.ibm.com>
CC: linux-s390@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/mm/fault.c | 21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index dedc28be27ab..8ca207635b59 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -392,7 +392,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	unsigned long trans_exc_code;
 	unsigned long address;
 	unsigned int flags;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 
 	tsk = current;
 	/*
@@ -428,7 +428,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	}
 
 	address = trans_exc_code & __FAIL_ADDR_MASK;
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	flags = FAULT_FLAG_DEFAULT;
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
@@ -480,6 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 	if (fault_signal_pending(fault, regs)) {
 		fault = VM_FAULT_SIGNAL;
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
@@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	if (unlikely(fault & VM_FAULT_ERROR))
 		goto out_up;
 
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-				      regs, address);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-				      regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
 			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
@@ -519,6 +505,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 			goto retry;
 		}
 	}
+
+	mm_fault_accounting(tsk, regs, address, major);
+
 	if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
 		address =  __gmap_link(gmap, current->thread.gmap_addr,
 				       address);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 20/25] mm/sh: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (18 preceding siblings ...)
  2020-06-15 22:23 ` [PATCH 19/25] mm/s390: " Peter Xu
@ 2020-06-15 22:23 ` Peter Xu
  2020-07-20 21:25   ` Rich Felker
  2020-06-15 22:23 ` [PATCH 21/25] mm/sparc32: " Peter Xu
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Xu, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Yoshinori Sato, Rich Felker, linux-sh

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: Yoshinori Sato <ysato@users.sourceforge.jp>
CC: Rich Felker <dalias@libc.org>
CC: linux-sh@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/sh/mm/fault.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 5f23d7907597..06b232973488 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -379,7 +379,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
@@ -412,8 +412,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	if ((regs->sr & SR_IMASK) != SR_IMASK)
 		local_irq_enable();
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	/*
 	 * If we're in an interrupt, have no user context or are running
 	 * with pagefaults disabled then we must not take the fault:
@@ -465,21 +463,13 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
 		if (mm_fault_error(regs, error_code, address, fault))
 			return;
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-				      regs, address);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-				      regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -493,4 +483,5 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, address, major);
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 21/25] mm/sparc32: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (19 preceding siblings ...)
  2020-06-15 22:23 ` [PATCH 20/25] mm/sh: " Peter Xu
@ 2020-06-15 22:23 ` Peter Xu
  2020-06-15 22:23 ` [PATCH 22/25] mm/sparc64: " Peter Xu
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Xu, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, David S . Miller, sparclinux

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: David S. Miller <davem@davemloft.net>
CC: sparclinux@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/sparc/mm/fault_32.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index f6e0e601f857..299e6e241a1c 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -167,7 +167,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	unsigned long g2;
 	int from_user = !(regs->psr & PSR_PS);
 	int code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	if (text_fault)
@@ -192,9 +192,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	 */
 	if (pagefault_disabled() || !mm)
 		goto no_context;
-
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 retry:
 	down_read(&mm->mmap_sem);
 
@@ -236,6 +233,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -251,15 +249,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	}
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			current->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
-				      1, regs, address);
-		} else {
-			current->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
-				      1, regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -273,6 +262,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(current, regs, address, major);
 	return;
 
 	/*
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 22/25] mm/sparc64: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (20 preceding siblings ...)
  2020-06-15 22:23 ` [PATCH 21/25] mm/sparc32: " Peter Xu
@ 2020-06-15 22:23 ` Peter Xu
  2020-06-15 22:23 ` [PATCH 23/25] mm/unicore32: " Peter Xu
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Xu, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, David S . Miller, sparclinux

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: David S. Miller <davem@davemloft.net>
CC: sparclinux@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/sparc/mm/fault_64.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index c0c0dd471b6b..61be0a0d79c6 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -269,7 +269,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	struct vm_area_struct *vma;
 	unsigned int insn = 0;
 	int si_code, fault_code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned long address, mm_rss;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
@@ -317,8 +317,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	if (faulthandler_disabled() || !mm)
 		goto intr_or_no_mm;
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		if ((regs->tstate & TSTATE_PRIV) &&
 		    !search_exception_tables(regs->tpc)) {
@@ -424,6 +422,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	}
 
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		goto exit_exception;
@@ -439,15 +438,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	}
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			current->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
-				      1, regs, address);
-		} else {
-			current->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
-				      1, regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -461,6 +451,8 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	}
 	up_read(&mm->mmap_sem);
 
+	mm_fault_accounting(current, regs, address, major);
+
 	mm_rss = get_mm_rss(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE)
 	mm_rss -= (mm->context.thp_pte_count * (HPAGE_SIZE / PAGE_SIZE));
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 23/25] mm/unicore32: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (21 preceding siblings ...)
  2020-06-15 22:23 ` [PATCH 22/25] mm/sparc64: " Peter Xu
@ 2020-06-15 22:23 ` Peter Xu
  2020-06-15 22:23 ` [PATCH 24/25] mm/x86: " Peter Xu
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Xu, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Guan Xuetao

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.
Also, the perf events for page faults will now be accounted when
CONFIG_PERF_EVENTS is enabled.

CC: Guan Xuetao <gxt@pku.edu.cn>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/unicore32/mm/fault.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 3022104aa613..240b38e81ed6 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -201,7 +201,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	int sig, code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
@@ -245,6 +245,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	}
 
 	fault = __do_pf(mm, addr, fsr, flags, tsk);
+	major |= fault & VM_FAULT_MAJOR;
 
 	/* If we need to retry but a fatal signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
@@ -254,10 +255,6 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 		return 0;
 
 	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
-		if (fault & VM_FAULT_MAJOR)
-			tsk->maj_flt++;
-		else
-			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
@@ -267,11 +264,13 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Handle the "normal" case first - VM_FAULT_MAJOR
+	 * Handle the "normal" case first
 	 */
-	if (likely(!(fault &
-	       (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+			      VM_FAULT_BADACCESS)))) {
+		mm_fault_accounting(tsk, regs, addr, major);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 24/25] mm/x86: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (22 preceding siblings ...)
  2020-06-15 22:23 ` [PATCH 23/25] mm/unicore32: " Peter Xu
@ 2020-06-15 22:23 ` Peter Xu
  2020-06-15 22:23 ` [PATCH 25/25] mm/xtensa: " Peter Xu
  2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
  25 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Xu, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H . Peter Anvin

Use the new mm_fault_accounting() helper for page fault accounting.

CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: x86@kernel.org
CC: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/mm/fault.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a51df516b87b..28635b4e324c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1365,8 +1365,6 @@ void do_user_addr_fault(struct pt_regs *regs,
 			local_irq_enable();
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	if (hw_error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (hw_error_code & X86_PF_INSTR)
@@ -1493,13 +1491,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * Major/minor page fault accounting. If any of the events
 	 * returned VM_FAULT_MAJOR, we account it as a major fault.
 	 */
-	if (major) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-	}
+	mm_fault_accounting(tsk, regs, address, major);
 
 	check_v8086_mode(regs, address, tsk);
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 25/25] mm/xtensa: Use mm_fault_accounting()
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (23 preceding siblings ...)
  2020-06-15 22:23 ` [PATCH 24/25] mm/x86: " Peter Xu
@ 2020-06-15 22:23 ` Peter Xu
  2020-06-15 23:13   ` Max Filippov
  2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
  25 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 22:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Xu, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Chris Zankel, Max Filippov, linux-xtensa

Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is retried.

CC: Chris Zankel <chris@zankel.net>
CC: Max Filippov <jcmvbkbc@gmail.com>
CC: linux-xtensa@linux-xtensa.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/xtensa/mm/fault.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index e7172bd53ced..511a2c5f9857 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -42,7 +42,7 @@ void do_page_fault(struct pt_regs *regs)
 	int code;
 
 	int is_write, is_exec;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	code = SEGV_MAPERR;
@@ -109,6 +109,7 @@ void do_page_fault(struct pt_regs *regs)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -123,10 +124,6 @@ void do_page_fault(struct pt_regs *regs)
 		BUG();
 	}
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR)
-			current->maj_flt++;
-		else
-			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -140,12 +137,7 @@ void do_page_fault(struct pt_regs *regs)
 	}
 
 	up_read(&mm->mmap_sem);
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-	if (flags & VM_FAULT_MAJOR)
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	else
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-
+	mm_fault_accounting(current, regs, address, major);
 	return;
 
 	/* Something tried to access memory that isn't in our memory map..
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/25] mm: Introduce mm_fault_accounting()
  2020-06-15 22:15 ` [PATCH 02/25] mm: Introduce mm_fault_accounting() Peter Xu
@ 2020-06-15 22:32   ` Linus Torvalds
  2020-06-15 23:19     ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2020-06-15 22:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Andrea Arcangeli

On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>
> Provide this helper for doing memory page fault accounting across archs.  It
> can be defined unconditionally because perf_sw_event() is always defined, and
> perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.

Well, the downside is that now it forces a separate I$ miss and all
those extra arguments because it's an out-of-line function and the
compiler won't see that they all go away.

Yeah, maybe some day maybe we'll have LTO and these kinds of things
will not matter. And maybe they already don't. But it seems kind of
sad to basically force non-optimal code generation from this series.

Why would you export the symbol, btw? Page fault handling is never a module.

              Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 25/25] mm/xtensa: Use mm_fault_accounting()
  2020-06-15 22:23 ` [PATCH 25/25] mm/xtensa: " Peter Xu
@ 2020-06-15 23:13   ` Max Filippov
  0 siblings, 0 replies; 60+ messages in thread
From: Max Filippov @ 2020-06-15 23:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: LKML, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Chris Zankel,
	open list:TENSILICA XTENSA PORT (xtensa)

On Mon, Jun 15, 2020 at 3:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Chris Zankel <chris@zankel.net>
> CC: Max Filippov <jcmvbkbc@gmail.com>
> CC: linux-xtensa@linux-xtensa.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/xtensa/mm/fault.c | 14 +++-----------
>  1 file changed, 3 insertions(+), 11 deletions(-)

Acked-by: Max Filippov <jcmvbkbc@gmail.com>

-- 
Thanks.
-- Max

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/25] mm: Introduce mm_fault_accounting()
  2020-06-15 22:32   ` Linus Torvalds
@ 2020-06-15 23:19     ` Peter Xu
  2020-06-16 19:00       ` Andrew Morton
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-15 23:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Andrea Arcangeli

On Mon, Jun 15, 2020 at 03:32:40PM -0700, Linus Torvalds wrote:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Provide this helper for doing memory page fault accounting across archs.  It
> > can be defined unconditionally because perf_sw_event() is always defined, and
> > perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.
> 
> Well, the downside is that now it forces a separate I$ miss and all
> those extra arguments because it's an out-of-line function and the
> compiler won't see that they all go away.
> 
> Yeah, maybe some day maybe we'll have LTO and these kinds of things
> will not matter. And maybe they already don't. But it seems kind of
> sad to basically force non-optimal code generation from this series.

I tried to make it static inline first in linux/mm.h, however that would need
linux/mm.h to include linux/perf_event.h, which seems to create a circular
header dependency.  I verified that the current code at least generates
inlined calls on x86 (no mm_fault_accounting() in "objdump -t vmlinux") with
gcc10.

Another alternative is to make it a macro; it's just that I feel the function
definition is a bit cleaner.  Any further suggestions are welcome too.
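
(For readers without patch 02 in front of them: judging from the arch code that
this series removes and from how the callers invoke it, the helper presumably
looks roughly like the sketch below.  The exact signature and body live in
patch 02, so treat this as an approximation rather than the real thing:

    void mm_fault_accounting(struct task_struct *task, struct pt_regs *regs,
                             unsigned long address, bool major)
    {
            /* one count per completed fault, plus the major/minor split */
            perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

            if (major) {
                    task->maj_flt++;
                    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
            } else {
                    task->min_flt++;
                    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
            }
    }

i.e. essentially what the per-arch blocks removed throughout the series used to
do inline, with the PERF_COUNT_SW_PAGE_FAULTS count folded in as well.)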

> 
> Why would you export the symbol, btw? Page fault handling is never a module.

I followed handle_mm_fault(), which is exported too, since potentially
mm_fault_accounting() should always be called in the same context as
handle_mm_fault().  Or do you prefer me to drop it?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/25] mm/arm64: Use mm_fault_accounting()
  2020-06-15 22:15 ` [PATCH 06/25] mm/arm64: " Peter Xu
@ 2020-06-16  7:43   ` Will Deacon
  2020-06-16 15:59     ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Will Deacon @ 2020-06-16  7:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Catalin Marinas, linux-arm-kernel

On Mon, Jun 15, 2020 at 06:15:48PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
> 
> CC: Catalin Marinas <catalin.marinas@arm.com>
> CC: Will Deacon <will@kernel.org>
> CC: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/arm64/mm/fault.c | 17 ++---------------
>  1 file changed, 2 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index c9cedc0432d2..09af7d7a60ec 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -484,8 +484,6 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>  					 addr, esr, regs);
>  	}
>  
> -	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> -
>  	/*
>  	 * As per x86, we may deadlock here. However, since the kernel only
>  	 * validly references user space from well defined areas of the code,
> @@ -535,20 +533,9 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>  			      VM_FAULT_BADACCESS)))) {
>  		/*
>  		 * Major/minor page fault accounting is only done
> -		 * once. If we go through a retry, it is extremely
> -		 * likely that the page will be found in page cache at
> -		 * that point.
> +		 * once.
>  		 */
> -		if (major) {
> -			current->maj_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
> -				      addr);
> -		} else {
> -			current->min_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
> -				      addr);
> -		}
> -
> +		mm_fault_accounting(current, regs, address, major);

Please can you explain why it's ok to move the PERF_COUNT_SW_PAGE_FAULTS
update like this? Seems like a user-visible change to me, so some
justification would really help.

Will

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 06/25] mm/arm64: Use mm_fault_accounting()
  2020-06-16  7:43   ` Will Deacon
@ 2020-06-16 15:59     ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-16 15:59 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Catalin Marinas, linux-arm-kernel

Hi, Will,

On Tue, Jun 16, 2020 at 08:43:08AM +0100, Will Deacon wrote:
> Please can you explain why it's ok to move the PERF_COUNT_SW_PAGE_FAULTS
> update like this? Seems like a user-visible change to me, so some
> justification would really help.

Indeed this could be a functional change for PERF_COUNT_SW_PAGE_FAULTS on some
archs; e.g., for arm64, PERF_COUNT_SW_PAGE_FAULTS previously would also include
accounting of severe errors where we go into the "no_context" path.  However,
if you look at the other archs, it's not always true; for example, the xtensa
arch only accounts the correctly handled faults (arch/xtensa/mm/fault.c).

After I thought about this, I don't think it's extremely useful (or please
correct me if I missed something important) to use PERF_COUNT_SW_PAGE_FAULTS
for fatal error accounting among all those correct ones.  After all they are
really extremely rare cases, and even if we got a sigbus for a process, we'll
normally get something dumped in dmesg, so if we really want to capture the
error cases there should always be a better way (because by following things
like dmesg we can not only know how many error faults were triggered, but also
the details of the errors).

IOW, my understanding is that users of PERF_COUNT_SW_PAGE_FAULTS want to trap
normal/correct page faults, and don't really care about rare errors.

Then when I went back to think about PERF_COUNT_SW_PAGE_FAULTS, it's really about:

  A=PERF_COUNT_SW_PAGE_FAULTS 
  B=PERF_COUNT_SW_PAGE_FAULTS_MAJ
  C=PERF_COUNT_SW_PAGE_FAULTS_MIN

And:

  A=B+C

If that's the case (which is simple and clear), it would also be really helpful
to unify this definition across all the architectures; then it'll be easier for
us to provide some helper like mm_fault_accounting() so that the accounting
can be managed in the general code rather than in arch-specific ways.
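
(As a concrete illustration of that relationship: once the definition is
unified, the three standard perf software event aliases should simply line up,
e.g. with something like

    perf stat -e page-faults,minor-faults,major-faults -p <pid>

one would expect the "page-faults" count to equal the sum of the other two.
The command is only an illustration of the intended A=B+C invariant.)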

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-15 22:23 ` [PATCH 19/25] mm/s390: " Peter Xu
@ 2020-06-16 15:59   ` Alexander Gordeev
  2020-06-16 16:35     ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Alexander Gordeev @ 2020-06-16 15:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, linux-s390

On Mon, Jun 15, 2020 at 06:23:02PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
> 
> Avoid doing page fault accounting multiple times if the page fault is retried.
> 
> CC: Heiko Carstens <heiko.carstens@de.ibm.com>
> CC: Vasily Gorbik <gor@linux.ibm.com>
> CC: Christian Borntraeger <borntraeger@de.ibm.com>
> CC: linux-s390@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/s390/mm/fault.c | 21 +++++----------------
>  1 file changed, 5 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index dedc28be27ab..8ca207635b59 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -392,7 +392,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	unsigned long trans_exc_code;
>  	unsigned long address;
>  	unsigned int flags;
> -	vm_fault_t fault;
> +	vm_fault_t fault, major = 0;
> 
>  	tsk = current;
>  	/*
> @@ -428,7 +428,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	}
> 
>  	address = trans_exc_code & __FAIL_ADDR_MASK;
> -	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  	flags = FAULT_FLAG_DEFAULT;
>  	if (user_mode(regs))
>  		flags |= FAULT_FLAG_USER;
> @@ -480,6 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	 * the fault.
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
> +	major |= fault & VM_FAULT_MAJOR;
>  	if (fault_signal_pending(fault, regs)) {
>  		fault = VM_FAULT_SIGNAL;
>  		if (flags & FAULT_FLAG_RETRY_NOWAIT)
> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	if (unlikely(fault & VM_FAULT_ERROR))
>  		goto out_up;
> 
> -	/*
> -	 * Major/minor page fault accounting is only done on the
> -	 * initial attempt. If we go through a retry, it is extremely
> -	 * likely that the page will be found in page cache at that point.
> -	 */
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> -		if (fault & VM_FAULT_MAJOR) {
> -			tsk->maj_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> -				      regs, address);
> -		} else {
> -			tsk->min_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> -				      regs, address);
> -		}
>  		if (fault & VM_FAULT_RETRY) {
>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {

Seems like the call to mm_fault_accounting() will be missed if
we entered here with the FAULT_FLAG_RETRY_NOWAIT flag set, since it
jumps to "out_up"...

> @@ -519,6 +505,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  			goto retry;
>  		}
>  	}
> +
> +	mm_fault_accounting(tsk, regs, address, major);
> +
>  	if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
>  		address =  __gmap_link(gmap, current->thread.gmap_addr,
>  				       address);
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-16 15:59   ` Alexander Gordeev
@ 2020-06-16 16:35     ` Peter Xu
  2020-06-17  6:19       ` Christian Borntraeger
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-16 16:35 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, linux-s390

Hi, Alexander,

On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> > @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >  	if (unlikely(fault & VM_FAULT_ERROR))
> >  		goto out_up;
> > 
> > -	/*
> > -	 * Major/minor page fault accounting is only done on the
> > -	 * initial attempt. If we go through a retry, it is extremely
> > -	 * likely that the page will be found in page cache at that point.
> > -	 */
> >  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > -		if (fault & VM_FAULT_MAJOR) {
> > -			tsk->maj_flt++;
> > -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> > -				      regs, address);
> > -		} else {
> > -			tsk->min_flt++;
> > -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> > -				      regs, address);
> > -		}
> >  		if (fault & VM_FAULT_RETRY) {
> >  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> 
> Seems like the call to mm_fault_accounting() will be missed if
> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> jumps to "out_up"...

This is true as a functional change.  However, that also means that we've got a
VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
than handled correctly (for instance, due to some try_lock failing during the
fault process).

To me, that case should not be counted as a page fault at all?  Or we might get
the same duplicated accounting when the page fault is retried from a higher stack.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 15/25] mm/openrisc: Use mm_fault_accounting()
  2020-06-15 22:15 ` [PATCH 15/25] mm/openrisc: " Peter Xu
@ 2020-06-16 18:11   ` Stafford Horne
  0 siblings, 0 replies; 60+ messages in thread
From: Stafford Horne @ 2020-06-16 18:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Jonas Bonn, Stefan Kristiansson, openrisc

On Mon, Jun 15, 2020 at 06:15:57PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
> 
> Avoid doing page fault accounting multiple times if the page fault is retried.
> Also, the perf events for page faults will be accounted too when the config has
> CONFIG_PERF_EVENTS defined.
> 
> CC: Jonas Bonn <jonas@southpole.se>
> CC: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
> CC: Stafford Horne <shorne@gmail.com>
> CC: openrisc@lists.librecores.org
> Signed-off-by: Peter Xu <peterx@redhat.com>

Acked-by: Stafford Horne <shorne@gmail.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
                   ` (24 preceding siblings ...)
  2020-06-15 22:23 ` [PATCH 25/25] mm/xtensa: " Peter Xu
@ 2020-06-16 18:55 ` Linus Torvalds
  2020-06-16 21:03   ` Peter Xu
  2020-06-17  0:55   ` Michael Ellerman
  25 siblings, 2 replies; 60+ messages in thread
From: Linus Torvalds @ 2020-06-16 18:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Andrea Arcangeli, openrisc, linux-arch, Alexander Gordeev,
	linux-s390, Will Deacon, Catalin Marinas, Linux ARM

[-- Attachment #1: Type: text/plain, Size: 2788 bytes --]

On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>
> This series tries to address all of them by introducing mm_fault_accounting()
> first, so that we move all the page fault accounting into the common code base,
> then call it properly from arch pf handlers just like handle_mm_fault().

Hmm.

So having looked at this a bit more, I'd actually like to go even
further, and just get rid of the per-architecture code _entirely_.

Here's a straw-man patch to the generic code - the idea is mostly laid
out in the comment that I'm just quoting here directly too:

        /*
         * Do accounting in the common code, to avoid unnecessary
         * architecture differences or duplicated code.
         *
         * We arbitrarily make the rules be:
         *
         *  - faults that never even got here (because the address
         *    wasn't valid). That includes arch_vma_access_permitted()
         *    failing above.
         *
         *    So this is expressly not a "this many hardware page
         *    faults" counter. Use the hw profiling for that.
         *
         *  - incomplete faults (ie RETRY) do not count (see above).
         *    They will only count once completed.
         *
         *  - the fault counts as a "major" fault when the final
         *    successful fault is VM_FAULT_MAJOR, or if it was a
         *    retry (which implies that we couldn't handle it
         *    immediately previously).
         *
         *  - if the fault is done for GUP, regs will be NULL and
         *    no accounting will be done (but you _could_ pass in
         *    your own regs and it would be accounted to the thread
         *    doing the fault, not to the target!)
         */

the code itself in the patch is

 (a) pretty trivial and self-evident

 (b) INCOMPLETE

that (b) is worth noting: this patch won't compile on its own. It
intentionally leaves all the users without the new 'regs' argument,
because you obviously simply need to remove all the code that
currently tries to do any accounting.

Comments?

This is a bigger change, but I think it might be worth it to _really_
consolidate the major/minor logic.

One detail worth noting: I do wonder if we should put the

    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);

just in the arch code at the top of the fault handling, and consider
it entirely unrelated to the major/minor fault handling. The
major/minor faults fundamentally are about successes. But the plain
PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
things that never even get to this point at all.

I'm not convinced it's useful to have three SW events that are defined
to be A=B+C.

But this does *not* do that part. It's really just an RFC patch.

                    Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 2918 bytes --]

 include/linux/mm.h |  3 ++-
 mm/memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..e82a604339c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1648,7 +1648,8 @@ int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags);
+			unsigned long address, unsigned int flags,
+			struct pt_regs *);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index dc7f3543b1fd..f15056151fb1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -72,6 +72,7 @@
 #include <linux/oom.h>
 #include <linux/numa.h>
 
+#include <linux/perf_event.h>
 #include <trace/events/kmem.h>
 
 #include <asm/io.h>
@@ -4353,7 +4354,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
-		unsigned int flags)
+		unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
 
@@ -4394,6 +4395,49 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 
+	if (ret & VM_FAULT_RETRY)
+		return ret;
+
+	/*
+	 * Do accounting in the common code, to avoid unnecessary
+	 * architecture differences or duplicated code.
+	 *
+	 * We arbitrarily make the rules be:
+	 *
+	 *  - faults that never even got here (because the address
+	 *    wasn't valid). That includes arch_vma_access_permitted()
+	 *    failing above.
+	 *
+	 *    So this is expressly not a "this many hardware page
+	 *    faults" counter. Use the hw profiling for that.
+	 *
+	 *  - incomplete faults (ie RETRY) do not count (see above).
+	 *    They will only count once completed.
+	 *
+	 *  - the fault counts as a "major" fault when the final
+	 *    successful fault is VM_FAULT_MAJOR, or if it was a
+	 *    retry (which implies that we couldn't handle it
+	 *    immediately previously).
+	 *
+	 *  - if the fault is done for GUP, regs will be NULL and
+	 *    no accounting will be done (but you _could_ pass in
+	 *    your own regs and it would be accounted to the thread
+	 *    doing the fault, not to the target!)
+	 */
+
+	if (!regs)
+		return ret;
+
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
+	if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
+		current->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+	} else {
+		current->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
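
To make the intended conversion concrete, a converted arch caller would
presumably end up looking roughly like the sketch below (hypothetical -- the
RFC deliberately leaves every caller unconverted):

	fault = handle_mm_fault(vma, address, flags, regs);

	if (fault_signal_pending(fault, regs))
		return;

	if (fault & VM_FAULT_RETRY) {
		flags |= FAULT_FLAG_TRIED;
		goto retry;	/* accounted only once, when the fault completes */
	}

	up_read(&mm->mmap_sem);
	/* no per-arch maj_flt/min_flt or perf_sw_event() calls needed here */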

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/25] mm: Introduce mm_fault_accounting()
  2020-06-15 23:19     ` Peter Xu
@ 2020-06-16 19:00       ` Andrew Morton
  2020-06-17 16:26         ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2020-06-16 19:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linus Torvalds, Linux Kernel Mailing List, Gerald Schaefer,
	Andrea Arcangeli

On Mon, 15 Jun 2020 19:19:17 -0400 Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jun 15, 2020 at 03:32:40PM -0700, Linus Torvalds wrote:
> > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Provide this helper for doing memory page fault accounting across archs.  It
> > > can be defined unconditionally because perf_sw_event() is always defined, and
> > > perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.
> > 
> > Well, the downside is that now it forces a separate I$ miss and all
> > those extra arguments because it's an out-of-line function and the
> > compiler won't see that they all go away.
> > 
> > Yeah, maybe some day maybe we'll have LTO and these kinds of things
> > will not matter. And maybe they already don't. But it seems kind of
> > sad to basically force non-optimal code generation from this series.
> 
> I tried to make it static inline firstly in linux/mm.h, however it'll need to
> have linux/mm.h include linux/perf_event.h which seems to have created a loop
> dependency of headers.  I verified current code will at least generate inlined
> functions too for x86 (no mm_fault_accounting() in "objdump -t vmlinux") with
> gcc10.
> 
> Another alternative is to make it a macro, it's just that I feel the function
> definition is a bit cleaner.  Any further suggestions welcomed too.

Could create a new header file mm_fault.h which includes mm.h and
perf_event.h.  A later cleanup could move other fault-related things
into that header and add the appropriate inclusions into files which
use these things.

btw, I think mm_account_fault() might be a better name for this function.

And some (kerneldoc) documentation would be nice.  Although this
function is pretty self-evident.

> > 
> > Why would you export the symbol, btw? Page fault handling is never a module.
> 
> I followed handle_mm_fault() which is exported too, since potentially
> mm_fault_accounting() should always be called in the same context of
> handle_mm_fault().  Or do you prefer me to drop it?

Let's not add an unneeded export.  If someone for some reason needs it
later, it can be added then.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
@ 2020-06-16 21:03   ` Peter Xu
  2020-06-17  0:55   ` Michael Ellerman
  1 sibling, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-16 21:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Andrea Arcangeli, openrisc, linux-arch, Alexander Gordeev,
	linux-s390, Will Deacon, Catalin Marinas, Linux ARM

On Tue, Jun 16, 2020 at 11:55:17AM -0700, Linus Torvalds wrote:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > This series tries to address all of them by introducing mm_fault_accounting()
> > first, so that we move all the page fault accounting into the common code base,
> > then call it properly from arch pf handlers just like handle_mm_fault().
> 
> Hmm.
> 
> So having looked at this a bit more, I'd actually like to go even
> further, and just get rid of the per-architecture code _entirely_.
> 
> Here's a straw-man patch to the generic code - the idea is mostly laid
> out in the comment that I'm just quoting here directly too:
> 
>         /*
>          * Do accounting in the common code, to avoid unnecessary
>          * architecture differences or duplicated code.
>          *
>          * We arbitrarily make the rules be:
>          *
>          *  - faults that never even got here (because the address
>          *    wasn't valid). That includes arch_vma_access_permitted()
>          *    failing above.
>          *
>          *    So this is expressly not a "this many hardware page
>          *    faults" counter. Use the hw profiling for that.
>          *
>          *  - incomplete faults (ie RETRY) do not count (see above).
>          *    They will only count once completed.
>          *
>          *  - the fault counts as a "major" fault when the final
>          *    successful fault is VM_FAULT_MAJOR, or if it was a
>          *    retry (which implies that we couldn't handle it
>          *    immediately previously).
>          *
>          *  - if the fault is done for GUP, regs will be NULL and
>          *    no accounting will be done (but you _could_ pass in
>          *    your own regs and it would be accounted to the thread
>          *    doing the fault, not to the target!)
>          */
> 
> the code itself in the patch is
> 
>  (a) pretty trivial and self-evident
> 
>  (b) INCOMPLETE
> 
> that (b) is worth noting: this patch won't compile on its own. It
> intentionally leaves all the users without the new 'regs' argument,
> because you obviously simply need to remove all the code that
> currently tries to do any accounting.
> 
> Comments?

Looks clean to me.  The definition of "major faults" will slightly change even
for those archs that already did the "major |= fault & MAJOR" operation before,
but at least I can't see anything bad about that either...

To make things easier, we can use the 1st patch to introduce this change, but
pass regs==NULL at the callers so it never triggers this accounting.  Then we
can still use one patch for each arch to do the final conversions.

> 
> This is a bigger change, but I think it might be worth it to _really_
> consolidate the major/minor logic.
> 
> One detail worth noting: I do wonder if we should put the
> 
>     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> 
> just in the arch code at the top of the fault handling, and consider
> it entirely unrelated to the major/minor fault handling. The
> major/minor faults fundamentally are about successes. But the plain
> PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> things that never even get to this point at all.
> 
> I'm not convinced it's useful to have three SW events that are defined
> to be A=B+C.

IMHO it's still common to have "total" statistics in software even if each of
the subsets is accounted separately.  Here it's just a bit special because
there are only two elements, so the addition is straightforward.  It seems to
be a trade-off between accounting erroneous faults too, or making it cleaner by
putting them all together but counting only successful page faults.  I slightly
prefer the latter because I failed to find much use for keeping error fault
accounting, but no strong opinions..

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
  2020-06-16 21:03   ` Peter Xu
@ 2020-06-17  0:55   ` Michael Ellerman
  2020-06-17  8:04     ` Will Deacon
  1 sibling, 1 reply; 60+ messages in thread
From: Michael Ellerman @ 2020-06-17  0:55 UTC (permalink / raw)
  To: Linus Torvalds, Peter Xu
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Andrea Arcangeli, openrisc, linux-arch, Alexander Gordeev,
	linux-s390, Will Deacon, Catalin Marinas, Linux ARM

Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>> This series tries to address all of them by introducing mm_fault_accounting()
>> first, so that we move all the page fault accounting into the common code base,
>> then call it properly from arch pf handlers just like handle_mm_fault().
>
> Hmm.
>
> So having looked at this a bit more, I'd actually like to go even
> further, and just get rid of the per-architecture code _entirely_.

<snip>

> One detail worth noting: I do wonder if we should put the
>
>     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>
> just in the arch code at the top of the fault handling, and consider
> it entirely unrelated to the major/minor fault handling. The
> major/minor faults fundamentally are about successes. But the plain
> PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> things that never even get to this point at all.

Yeah I think we should keep it in the arch code at roughly the top.

If it's moved to the end you could have a process spinning taking bad
page faults (and fixing them up), and see no sign of it from the perf
page fault counters.
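
(Schematically, the two placements being compared are, at the top of the arch
handler:

	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);  /* every attempt */
	...
	fault = handle_mm_fault(vma, addr, flags);

versus at the end, in the common code, which only ever sees faults that were
actually handled:

	fault = handle_mm_fault(vma, addr, flags);
	...
	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);  /* handled only */

and the spinning-on-bad-faults case above only shows up with the former.)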

cheers

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 13/25] mm/nds32: Use mm_fault_accounting()
  2020-06-15 22:15 ` [PATCH 13/25] mm/nds32: " Peter Xu
@ 2020-06-17  1:05   ` Greentime Hu
  0 siblings, 0 replies; 60+ messages in thread
From: Greentime Hu @ 2020-06-17  1:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, Nick Hu, Vincent Chen

Peter Xu <peterx@redhat.com> 於 2020年6月16日 週二 上午6:16寫道:
>
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
> While at it, move the accounting after releasing mmap_sem.
>
> CC: Nick Hu <nickhu@andestech.com>
> CC: Greentime Hu <green.hu@gmail.com>
> CC: Vincent Chen <deanbo422@gmail.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/nds32/mm/fault.c | 19 +++----------------
>  1 file changed, 3 insertions(+), 16 deletions(-)
>
> diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
> index f331e533edc2..11537c03ab1d 100644
> --- a/arch/nds32/mm/fault.c
> +++ b/arch/nds32/mm/fault.c
> @@ -78,7 +78,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
>         struct mm_struct *mm;
>         struct vm_area_struct *vma;
>         int si_code;
> -       vm_fault_t fault;
> +       vm_fault_t fault, major = 0;
>         unsigned int mask = VM_ACCESS_FLAGS;
>         unsigned int flags = FAULT_FLAG_DEFAULT;
>
> @@ -208,6 +208,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
>          */
>
>         fault = handle_mm_fault(vma, addr, flags);
> +       major |= fault & VM_FAULT_MAJOR;
>
>         /*
>          * If we need to retry but a fatal signal is pending, handle the
> @@ -229,22 +230,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
>                         goto bad_area;
>         }
>
> -       /*
> -        * Major/minor page fault accounting is only done on the initial
> -        * attempt. If we go through a retry, it is extremely likely that the
> -        * page will be found in page cache at that point.
> -        */
> -       perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>         if (flags & FAULT_FLAG_ALLOW_RETRY) {
> -               if (fault & VM_FAULT_MAJOR) {
> -                       tsk->maj_flt++;
> -                       perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
> -                                     1, regs, addr);
> -               } else {
> -                       tsk->min_flt++;
> -                       perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
> -                                     1, regs, addr);
> -               }
>                 if (fault & VM_FAULT_RETRY) {
>                         flags |= FAULT_FLAG_TRIED;
>
> @@ -257,6 +243,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
>         }
>
>         up_read(&mm->mmap_sem);
> +       mm_fault_accounting(tsk, regs, addr, major);
>         return;
>
>         /*
> --
Hi Peter,

Thank you. :)
Acked-by: Greentime Hu <green.hu@gmail.com>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-16 16:35     ` Peter Xu
@ 2020-06-17  6:19       ` Christian Borntraeger
  2020-06-17 16:06         ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Christian Borntraeger @ 2020-06-17  6:19 UTC (permalink / raw)
  To: Peter Xu, Alexander Gordeev
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Heiko Carstens, Vasily Gorbik, linux-s390



On 16.06.20 18:35, Peter Xu wrote:
> Hi, Alexander,
> 
> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>  	if (unlikely(fault & VM_FAULT_ERROR))
>>>  		goto out_up;
>>>
>>> -	/*
>>> -	 * Major/minor page fault accounting is only done on the
>>> -	 * initial attempt. If we go through a retry, it is extremely
>>> -	 * likely that the page will be found in page cache at that point.
>>> -	 */
>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>> -		if (fault & VM_FAULT_MAJOR) {
>>> -			tsk->maj_flt++;
>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
>>> -				      regs, address);
>>> -		} else {
>>> -			tsk->min_flt++;
>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
>>> -				      regs, address);
>>> -		}
>>>  		if (fault & VM_FAULT_RETRY) {
>>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>>
>> Seems like the call to mm_fault_accounting() will be missed if
>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
>> jumps to "out_up"...
> 
> This is true as a functional change.  However that also means that we've got a
> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> than handled correctly (for instance, due to some try_lock failed during the
> fault process).
> 
> To me, that case should not be counted as a page fault at all?  Or we might get
> the same duplicated accounting when the page fault retried from a higher stack.
> 
> Thanks

This case below (the one with the gmap) is the KVM case for doing a so-called
pseudo page fault to our guests (we notify our guests about major host page
faults and let them reschedule to something else instead of halting the vcpu).
This is being resolved with either gup or fixup_user_fault asynchronously by
KVM code (this can also be sync when the guest does not match some conditions).
We do not change the counters in that code as far as I can tell, so we should
continue to do it here.

(see arch/s390/kvm/kvm-s390.c
static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
{
[...]
        } else if (current->thread.gmap_pfault) {
                trace_kvm_s390_major_guest_pfault(vcpu);
                current->thread.gmap_pfault = 0;
                if (kvm_arch_setup_async_pf(vcpu))
                        return 0;
                return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
        }

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-15 22:15 ` [PATCH 07/25] mm/csky: " Peter Xu
@ 2020-06-17  7:04   ` Guo Ren
  2020-06-17 15:49     ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Guo Ren @ 2020-06-17  7:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, linux-csky

Hi Peter,

On Tue, Jun 16, 2020 at 6:16 AM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new mm_fault_accounting() helper for page fault accounting.
> Also, move the accounting to be after release of mmap_sem.
>
> CC: Guo Ren <guoren@kernel.org>
> CC: linux-csky@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/csky/mm/fault.c | 13 +------------
>  1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
> index 4e6dc68f3258..8f8d34d27eca 100644
> --- a/arch/csky/mm/fault.c
> +++ b/arch/csky/mm/fault.c
> @@ -111,8 +111,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
>                 return;
>         }
>  #endif
> -
> -       perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>         /*
>          * If we're in an interrupt or have no user
>          * context, we must not take the fault..
> @@ -160,17 +158,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
>                         goto bad_area;
>                 BUG();
>         }
> -       if (fault & VM_FAULT_MAJOR) {
> -               tsk->maj_flt++;
> -               perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
> -                             address);
> -       } else {
> -               tsk->min_flt++;
> -               perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
> -                             address);
> -       }
> -
>         up_read(&mm->mmap_sem);
> +       mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR);
>         return;
>
>         /*
> --
> 2.26.2
>
I notice that you moved it outside of the mm->mmap_sem region; should all archs
follow this rule?  Can you give me a clarification and put it into the
commit log?
Thx
-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-17  0:55   ` Michael Ellerman
@ 2020-06-17  8:04     ` Will Deacon
  2020-06-17 16:10       ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Will Deacon @ 2020-06-17  8:04 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Linus Torvalds, Peter Xu, Linux Kernel Mailing List,
	Gerald Schaefer, Andrew Morton, Andrea Arcangeli, openrisc,
	linux-arch, Alexander Gordeev, linux-s390, Catalin Marinas,
	Linux ARM

On Wed, Jun 17, 2020 at 10:55:14AM +1000, Michael Ellerman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> >> This series tries to address all of them by introducing mm_fault_accounting()
> >> first, so that we move all the page fault accounting into the common code base,
> >> then call it properly from arch pf handlers just like handle_mm_fault().
> >
> > Hmm.
> >
> > So having looked at this a bit more, I'd actually like to go even
> > further, and just get rid of the per-architecture code _entirely_.
> 
> <snip>
> 
> > One detail worth noting: I do wonder if we should put the
> >
> >     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> >
> > just in the arch code at the top of the fault handling, and consider
> > it entirely unrelated to the major/minor fault handling. The
> > major/minor faults fundamentally are about successes. But the plain
> > PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> > things that never even get to this point at all.
> 
> Yeah I think we should keep it in the arch code at roughly the top.

I agree. It's a nice idea to consolidate the code, but I don't see that
it's really possible for PERF_COUNT_SW_PAGE_FAULTS without significantly
changing the semantics (and in a potentially less useful way). Of course,
moving more of do_page_fault() out of the arch code would be great, but
that's a much bigger effort.

> If it's moved to the end you could have a process spinning taking bad
> page faults (and fixing them up), and see no sign of it from the perf
> page fault counters.

The current arm64 behaviour is that we record PERF_COUNT_SW_PAGE_FAULTS
if _all_ of the following are true:

  1. The fault isn't handled by kprobes
  2. The pagefault handler is enabled
  3. We have an mm (current->mm)
  4. The fault isn't an unexpected kernel fault on a user address (we oops
     and kill the task in this case)

Which loosely corresponds to "we took a fault on a user address that it
looks like we can handle".

That said, I'm happy to tweak this if it brings us into line with other
architectures.

Will

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-17  7:04   ` Guo Ren
@ 2020-06-17 15:49     ` Peter Xu
  2020-06-17 17:53       ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-17 15:49 UTC (permalink / raw)
  To: Guo Ren
  Cc: Linux Kernel Mailing List, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, linux-csky

On Wed, Jun 17, 2020 at 03:04:49PM +0800, Guo Ren wrote:
> Hi Peter,

Hi, Guo,

> 
> On Tue, Jun 16, 2020 at 6:16 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Use the new mm_fault_accounting() helper for page fault accounting.
> > Also, move the accounting to be after release of mmap_sem.
> >
> > CC: Guo Ren <guoren@kernel.org>
> > CC: linux-csky@vger.kernel.org
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/csky/mm/fault.c | 13 +------------
> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
> > index 4e6dc68f3258..8f8d34d27eca 100644
> > --- a/arch/csky/mm/fault.c
> > +++ b/arch/csky/mm/fault.c
> > @@ -111,8 +111,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
> >                 return;
> >         }
> >  #endif
> > -
> > -       perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> >         /*
> >          * If we're in an interrupt or have no user
> >          * context, we must not take the fault..
> > @@ -160,17 +158,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
> >                         goto bad_area;
> >                 BUG();
> >         }
> > -       if (fault & VM_FAULT_MAJOR) {
> > -               tsk->maj_flt++;
> > -               perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
> > -                             address);
> > -       } else {
> > -               tsk->min_flt++;
> > -               perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
> > -                             address);
> > -       }
> > -
> >         up_read(&mm->mmap_sem);
> > +       mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR);
> >         return;
> >
> >         /*
> > --
> > 2.26.2
> >
> I notice that you moved it outside of the mm->mmap_sem region; should all archs
> follow this rule?  Can you give me a clarification and put it into the
> commit log?

I don't think it's a must, but mmap_sem should not be required, at least
judging by the current code.  E.g., do_user_addr_fault() on x86 does the
accounting without mmap_sem even before this series.

The perf events should be thread-safe on their own.  Frankly speaking I'm not
very certain about my understanding of the per-task counters, because iiuc we
can also try to get user pages remotely for a thread in parallel with a page
fault happening on the same thread, so it seems to me that the per-task pf
counters can be accessed on different cores simultaneously.  However otoh that
seems to be very rare, and it's still acceptable to me as a trade-off to avoid
the overhead of locks or atomic ops on the counters.  I'd be glad to be
corrected if I missed anything important here...
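
(For reference, the per-task counters in question are plain fields, roughly:

	/* in struct task_struct, include/linux/sched.h */
	unsigned long min_flt;
	unsigned long maj_flt;

so a concurrent tsk->maj_flt++ from two contexts is a non-atomic
read-modify-write and could occasionally lose an increment; best-effort
counting seems to be the accepted trade-off here.)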

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-17  6:19       ` Christian Borntraeger
@ 2020-06-17 16:06         ` Peter Xu
  2020-06-17 16:14           ` Christian Borntraeger
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-17 16:06 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alexander Gordeev, linux-kernel, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	linux-s390

Hi, Christian,

On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
> 
> 
> On 16.06.20 18:35, Peter Xu wrote:
> > Hi, Alexander,
> > 
> > On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>>  	if (unlikely(fault & VM_FAULT_ERROR))
> >>>  		goto out_up;
> >>>
> >>> -	/*
> >>> -	 * Major/minor page fault accounting is only done on the
> >>> -	 * initial attempt. If we go through a retry, it is extremely
> >>> -	 * likely that the page will be found in page cache at that point.
> >>> -	 */
> >>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>> -		if (fault & VM_FAULT_MAJOR) {
> >>> -			tsk->maj_flt++;
> >>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>> -				      regs, address);
> >>> -		} else {
> >>> -			tsk->min_flt++;
> >>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>> -				      regs, address);
> >>> -		}
> >>>  		if (fault & VM_FAULT_RETRY) {
> >>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {

[1]

> >>
> >> Seems like the call to mm_fault_accounting() will be missed if
> >> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >> jumps to "out_up"...
> > 
> > This is true as a functional change.  However that also means that we've got a
> > VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> > than handled correctly (for instance, due to some try_lock failed during the
> > fault process).
> > 
> > To me, that case should not be counted as a page fault at all?  Or we might get
> > the same duplicated accounting when the page fault retried from a higher stack.
> > 
> > Thanks
> 
> This case below (the one with the gmap) is the KVM case for doing a so called
> pseudo page fault to our guests. (we notify our guests about major host page
> faults and let it reschedule to something else instead of halting the vcpu).
> This is being resolved with either gup or fixup_user_fault asynchronously by
> KVM code (this can also be sync when the guest does not match some conditions)
> We do not change the counters in that code as far as I can tell so we should
> continue to do it here.
> 
> (see arch/s390/kvm/kvm-s390.c
> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> {
> [...]
>         } else if (current->thread.gmap_pfault) {
>                 trace_kvm_s390_major_guest_pfault(vcpu);
>                 current->thread.gmap_pfault = 0;
>                 if (kvm_arch_setup_async_pf(vcpu))
>                         return 0;
>                 return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
>         }

Please correct me if I'm wrong... but I still think what this patch does is the
right thing to do.

Note again that IMHO reaching [1] above means the page fault was not handled
successfully and we need to fall back to the KVM async page fault, so we
shouldn't increment the accounting until the fault is finally handled.  That
final accounting should be done in the async pf path, in the gup code where the
page fault is actually handled:

  kvm_arch_fault_in_page
    gmap_fault
      fixup_user_fault

Where in fixup_user_fault() we have:

	if (tsk) {
		if (major)
			tsk->maj_flt++;
		else
			tsk->min_flt++;
	}
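
For more context, fixup_user_fault() as a whole looks roughly like below
(simplified sketch, permission checks and error paths trimmed); note that it
already accumulates the major flag across retries and only accounts once the
fault has been handled for good, which is the behavior this series wants
everywhere:

int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
		     unsigned long address, unsigned int fault_flags,
		     bool *unlocked)
{
	struct vm_area_struct *vma;
	vm_fault_t ret, major = 0;

	if (unlocked)
		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

retry:
	vma = find_extend_vma(mm, address);
	if (!vma || address < vma->vm_start)
		return -EFAULT;

	ret = handle_mm_fault(vma, address, fault_flags);
	major |= ret & VM_FAULT_MAJOR;

	if (ret & VM_FAULT_RETRY) {
		down_read(&mm->mmap_sem);
		*unlocked = true;
		fault_flags |= FAULT_FLAG_TRIED;
		goto retry;
	}

	/* Account once, after the final attempt */
	if (tsk) {
		if (major)
			tsk->maj_flt++;
		else
			tsk->min_flt++;
	}
	return 0;
}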

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 00/25] mm: Page fault accounting cleanups
  2020-06-17  8:04     ` Will Deacon
@ 2020-06-17 16:10       ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-17 16:10 UTC (permalink / raw)
  To: Will Deacon
  Cc: Michael Ellerman, Linus Torvalds, Linux Kernel Mailing List,
	Gerald Schaefer, Andrew Morton, Andrea Arcangeli, openrisc,
	linux-arch, Alexander Gordeev, linux-s390, Catalin Marinas,
	Linux ARM

On Wed, Jun 17, 2020 at 09:04:06AM +0100, Will Deacon wrote:
> On Wed, Jun 17, 2020 at 10:55:14AM +1000, Michael Ellerman wrote:
> > Linus Torvalds <torvalds@linux-foundation.org> writes:
> > > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> > >> This series tries to address all of them by introducing mm_fault_accounting()
> > >> first, so that we move all the page fault accounting into the common code base,
> > >> then call it properly from arch pf handlers just like handle_mm_fault().
> > >
> > > Hmm.
> > >
> > > So having looked at this a bit more, I'd actually like to go even
> > > further, and just get rid of the per-architecture code _entirely_.
> > 
> > <snip>
> > 
> > > One detail worth noting: I do wonder if we should put the
> > >
> > >     perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> > >
> > > just in the arch code at the top of the fault handling, and consider
> > > it entirely unrelated to the major/minor fault handling. The
> > > major/minor faults fundamnetally are about successes. But the plain
> > > PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> > > things that never even get to this point at all.
> > 
> > Yeah I think we should keep it in the arch code at roughly the top.
> 
> I agree. It's a nice idea to consolidate the code, but I don't see that
> it's really possible for PERF_COUNT_SW_PAGE_FAULTS without significantly
> changing the semantics (and in a potentially less useful way).  Of course,
> moving more of do_page_fault() out of the arch code would be great, but
> that's a much bigger effort.
> 
> > If it's moved to the end you could have a process spinning taking bad
> > page faults (and fixing them up), and see no sign of it from the perf
> > page fault counters.
> 
> The current arm64 behaviour is that we record PERF_COUNT_SW_PAGE_FAULTS
> if _all_ of the following are true:
> 
>   1. The fault isn't handled by kprobes
>   2. The pagefault handler is enabled
>   3. We have an mm (current->mm)
>   4. The fault isn't an unexpected kernel fault on a user address (we oops
>      and kill the task in this case)
> 
> Which loosely corresponds to "we took a fault on a user address that it
> looks like we can handle".
> 
> That said, I'm happy to tweak this if it brings us into line with other
> architectures.

I see.  I'll keep the semantics for PERF_COUNT_SW_PAGE_FAULTS in the next
version.  Thanks for all the comments!
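
For concreteness, the plan is roughly as below -- a sketch of the intended v2
shape only, not the actual posted patch: PERF_COUNT_SW_PAGE_FAULTS stays early
in the arch handler (so faults that fail or retry are still counted), while
the major/minor accounting happens exactly once after the final attempt:

	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
retry:
	down_read(&mm->mmap_sem);
	/* ... find and check the vma ... */
	fault = handle_mm_fault(vma, addr, flags);
	major |= fault & VM_FAULT_MAJOR;

	if ((flags & FAULT_FLAG_ALLOW_RETRY) && (fault & VM_FAULT_RETRY)) {
		/* mmap_sem was already released by handle_mm_fault() */
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}

	up_read(&mm->mmap_sem);
	/* per-task counters and maj/min perf events, accounted once */
	mm_fault_accounting(tsk, regs, addr, major);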

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-17 16:06         ` Peter Xu
@ 2020-06-17 16:14           ` Christian Borntraeger
  2020-06-17 16:44             ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Christian Borntraeger @ 2020-06-17 16:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Alexander Gordeev, linux-kernel, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	linux-s390



On 17.06.20 18:06, Peter Xu wrote:
> Hi, Christian,
> 
> On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
>>
>>
>> On 16.06.20 18:35, Peter Xu wrote:
>>> Hi, Alexander,
>>>
>>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
>>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>>>  	if (unlikely(fault & VM_FAULT_ERROR))
>>>>>  		goto out_up;
>>>>>
>>>>> -	/*
>>>>> -	 * Major/minor page fault accounting is only done on the
>>>>> -	 * initial attempt. If we go through a retry, it is extremely
>>>>> -	 * likely that the page will be found in page cache at that point.
>>>>> -	 */
>>>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>>>> -		if (fault & VM_FAULT_MAJOR) {
>>>>> -			tsk->maj_flt++;
>>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
>>>>> -				      regs, address);
>>>>> -		} else {
>>>>> -			tsk->min_flt++;
>>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
>>>>> -				      regs, address);
>>>>> -		}
>>>>>  		if (fault & VM_FAULT_RETRY) {
>>>>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>>>>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> 
> [1]
> 
>>>>
>>>> Seems like the call to mm_fault_accounting() will be missed if
>>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
>>>> jumps to "out_up"...
>>>
>>> This is true as a functional change.  However that also means that we've got a
>>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
>>> than handled correctly (for instance, due to some try_lock failed during the
>>> fault process).
>>>
>>> To me, that case should not be counted as a page fault at all?  Or we might get
>>> the same duplicated accounting when the page fault retried from a higher stack.
>>>
>>> Thanks
>>
>> This case below (the one with the gmap) is the KVM case for doing a so called
>> pseudo page fault to our guests. (we notify our guests about major host page
>> faults and let it reschedule to something else instead of halting the vcpu).
>> This is being resolved with either gup or fixup_user_fault asynchronously by
>> KVM code (this can also be sync when the guest does not match some conditions)
>> We do not change the counters in that code as far as I can tell so we should
>> continue to do it here.
>>
>> (see arch/s390/kvm/kvm-s390.c
>> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
>> {
>> [...]
>>         } else if (current->thread.gmap_pfault) {
>>                 trace_kvm_s390_major_guest_pfault(vcpu);
>>                 current->thread.gmap_pfault = 0;
>>                 if (kvm_arch_setup_async_pf(vcpu))
>>                         return 0;
>>                 return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
>>         }
> 
> Please correct me if I'm wrong... but I still think what this patch does is the
> right thing to do.
> 
> Note again that IMHO reaching [1] above means the page fault was not handled
> successfully and we need to fall back to the KVM async page fault, so we
> shouldn't increment the accounting until the fault is finally handled.  That
> final accounting should be done in the async pf path, in the gup code where the
> page fault is actually handled:
> 
>   kvm_arch_fault_in_page
>     gmap_fault
>       fixup_user_fault
> 
> Where in fixup_user_fault() we have:
> 
> 	if (tsk) {
> 		if (major)
> 			tsk->maj_flt++;
> 		else
> 			tsk->min_flt++;
> 	}
> 

Right, that case does work.  It's the case where we do not inject a pseudo
page fault and instead fall back to synchronous fault-in.
What about the other case:

kvm_setup_async_pf
	->workqueue
		async_pf_execute
			get_user_pages_remote

Does get_user_pages_remote() do the accounting as well?  I can't see that.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 02/25] mm: Introduce mm_fault_accounting()
  2020-06-16 19:00       ` Andrew Morton
@ 2020-06-17 16:26         ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-17 16:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Linux Kernel Mailing List, Gerald Schaefer,
	Andrea Arcangeli

On Tue, Jun 16, 2020 at 12:00:37PM -0700, Andrew Morton wrote:
> On Mon, 15 Jun 2020 19:19:17 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jun 15, 2020 at 03:32:40PM -0700, Linus Torvalds wrote:
> > > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Provide this helper for doing memory page fault accounting across archs.  It
> > > > can be defined unconditionally because perf_sw_event() is always defined, and
> > > > perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.
> > > 
> > > Well, the downside is that now it forces a separate I$ miss and all
> > > those extra arguments because it's a out-of-line function and the
> > > compiler won't see that they all go away.
> > > 
> > > Yeah, maybe some day maybe we'll have LTO and these kinds of things
> > > will not matter. And maybe they already don't. But it seems kind of
> > > sad to basically force non-optimal code generation from this series.
> > 
> > I tried to make it static inline firstly in linux/mm.h, however it'll need to
> > have linux/mm.h include linux/perf_event.h which seems to have created a loop
> > dependency of headers.  I verified current code will at least generate inlined
> > functions too for x86 (no mm_fault_accounting() in "objdump -t vmlinux") with
> > gcc10.
> > 
> > Another alternative is to make it a macro, it's just that I feel the function
> > definition is a bit cleaner.  Any further suggestions welcomed too.
> 
> Could create a new header file mm_fault.h which includes mm.h and
> perf_event.h.  A later cleanup could move other fault-related things
> into that header and add the appropriate inclusions into files which
> use these things.
> 
> btw, I think mm_account_fault() might be a better name for this function.
> 
> And some (kerneldoc) documentation would be nice.  Although this
> function is pretty self-evident.
> 
> > > 
> > > Why would you export the symbol, btw? Page fault handling is never a module.
> > 
> > I followed handle_mm_fault() which is exported too, since potentially
> > mm_fault_accounting() should always be called in the same context of
> > handle_mm_fault().  Or do you prefer me to drop it?
> 
> Let's not add an unneeded export.  If someone for some reason needs it
> later, it can be added then.

I plan to take the approach that Linus suggested, probably with
mm_account_fault() declared as static inline in memory.c.  I'll remember to add
some kerneldoc too.

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 19/25] mm/s390: Use mm_fault_accounting()
  2020-06-17 16:14           ` Christian Borntraeger
@ 2020-06-17 16:44             ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-17 16:44 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alexander Gordeev, linux-kernel, Gerald Schaefer, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli, Heiko Carstens, Vasily Gorbik,
	linux-s390

On Wed, Jun 17, 2020 at 06:14:52PM +0200, Christian Borntraeger wrote:
> 
> 
> On 17.06.20 18:06, Peter Xu wrote:
> > Hi, Christian,
> > 
> > On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 16.06.20 18:35, Peter Xu wrote:
> >>> Hi, Alexander,
> >>>
> >>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>>>>  	if (unlikely(fault & VM_FAULT_ERROR))
> >>>>>  		goto out_up;
> >>>>>
> >>>>> -	/*
> >>>>> -	 * Major/minor page fault accounting is only done on the
> >>>>> -	 * initial attempt. If we go through a retry, it is extremely
> >>>>> -	 * likely that the page will be found in page cache at that point.
> >>>>> -	 */
> >>>>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>>>> -		if (fault & VM_FAULT_MAJOR) {
> >>>>> -			tsk->maj_flt++;
> >>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>>>> -				      regs, address);
> >>>>> -		} else {
> >>>>> -			tsk->min_flt++;
> >>>>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>>>> -				      regs, address);
> >>>>> -		}
> >>>>>  		if (fault & VM_FAULT_RETRY) {
> >>>>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>>>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> > 
> > [1]
> > 
> >>>>
> >>>> Seems like the call to mm_fault_accounting() will be missed if
> >>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >>>> jumps to "out_up"...
> >>>
> >>> This is true as a functional change.  However that also means that we've got a
> >>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> >>> than handled correctly (for instance, due to some try_lock failed during the
> >>> fault process).
> >>>
> >>> To me, that case should not be counted as a page fault at all?  Or we might get
> >>> the same duplicated accounting when the page fault retried from a higher stack.
> >>>
> >>> Thanks
> >>
> >> This case below (the one with the gmap) is the KVM case for doing a so called
> >> pseudo page fault to our guests. (we notify our guests about major host page
> >> faults and let it reschedule to something else instead of halting the vcpu).
> >> This is being resolved with either gup or fixup_user_fault asynchronously by
> >> KVM code (this can also be sync when the guest does not match some conditions)
> >> We do not change the counters in that code as far as I can tell so we should
> >> continue to do it here.
> >>
> >> (see arch/s390/kvm/kvm-s390.c
> >> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> >> {
> >> [...]
> >>         } else if (current->thread.gmap_pfault) {
> >>                 trace_kvm_s390_major_guest_pfault(vcpu);
> >>                 current->thread.gmap_pfault = 0;
> >>                 if (kvm_arch_setup_async_pf(vcpu))
> >>                         return 0;
> >>                 return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
> >>         }
> > 
> > Please correct me if I'm wrong... but I still think what this patch does is the
> > right thing to do.
> > 
> > Note again that IMHO reaching [1] above means the page fault was not handled
> > successfully and we need to fall back to the KVM async page fault, so we
> > shouldn't increment the accounting until the fault is finally handled.  That
> > final accounting should be done in the async pf path, in the gup code where the
> > page fault is actually handled:
> > 
> >   kvm_arch_fault_in_page
> >     gmap_fault
> >       fixup_user_fault
> > 
> > Where in fixup_user_fault() we have:
> > 
> > 	if (tsk) {
> > 		if (major)
> > 			tsk->maj_flt++;
> > 		else
> > 			tsk->min_flt++;
> > 	}
> > 
> 
> Right, that case does work.  It's the case where we do not inject a pseudo
> page fault and instead fall back to synchronous fault-in.
> What about the other case:
> 
> kvm_setup_async_pf
> 	->workqueue
> 		async_pf_execute
> 			get_user_pages_remote
> 
> Does get_user_pages_remote() do the accounting as well?  I can't see that.
> 

It should, with:

  get_user_pages_remote
    __get_user_pages_remote
      __get_user_pages_locked
        __get_user_pages
          faultin_page

Where the accounting is done in faultin_page().

We probably need to also move the accounting in faultin_page() to be after the
retry check too, but that should be irrelevant to this patch.
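
For reference, the relevant part of faultin_page() currently looks roughly
like below (simplified, error handling trimmed) -- the per-task accounting
sits before the VM_FAULT_RETRY check, i.e. it also triggers for attempts that
will be retried, which is the bit that would need moving:

	ret = handle_mm_fault(vma, address, fault_flags);

	if (tsk) {
		if (ret & VM_FAULT_MAJOR)
			tsk->maj_flt++;
		else
			tsk->min_flt++;
	}

	if (ret & VM_FAULT_RETRY)
		return -EBUSY;	/* caller will retry the fault */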

Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-17 15:49     ` Peter Xu
@ 2020-06-17 17:53       ` Linus Torvalds
  2020-06-17 19:58         ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2020-06-17 17:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Guo Ren, Linux Kernel Mailing List, Gerald Schaefer,
	Andrew Morton, Andrea Arcangeli, linux-csky

On Wed, Jun 17, 2020 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>
> I don't think it's a must, but mmap_sem should not be required at least by
> observing current code.  E.g., do_user_addr_fault() of x86 does the accounting
> without mmap_sem even before this series.

All the accounting should be per-thread and not need any locking.

Which is why a remote GUP should never account to the remote mm - not
only isn't there an unambiguous thread to account to (an mm can share
many threads), but it would require locking not just for the remote
update, but for all normal page faults.

                 Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-17 17:53       ` Linus Torvalds
@ 2020-06-17 19:58         ` Peter Xu
  2020-06-17 20:15           ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-17 19:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guo Ren, Linux Kernel Mailing List, Gerald Schaefer,
	Andrew Morton, Andrea Arcangeli, linux-csky

On Wed, Jun 17, 2020 at 10:53:23AM -0700, Linus Torvalds wrote:
> On Wed, Jun 17, 2020 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > I don't think it's a must, but mmap_sem should not be required at least by
> > observing current code.  E.g., do_user_addr_fault() of x86 does the accounting
> > without mmap_sem even before this series.
> 
> All the accounting should be per-thread and not need any locking.
> 
> Which is why a remote GUP should never account to the remote mm - not
> only isn't there an unambiguous thread to account to (an mm can share
> many threads), but it would require locking not just for the remote
> update, but for all normal page faults.

But currently remote GUP will still do the page fault accounting on the remote
task_struct, am I right?  E.g., when the get_user_pages_remote() is called with
"tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
accounting for that remote task/thread?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-17 19:58         ` Peter Xu
@ 2020-06-17 20:15           ` Linus Torvalds
  2020-06-18 14:38             ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2020-06-17 20:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Guo Ren, Linux Kernel Mailing List, Gerald Schaefer,
	Andrew Morton, Andrea Arcangeli, linux-csky

On Wed, Jun 17, 2020 at 12:58 PM Peter Xu <peterx@redhat.com> wrote:
>
> But currently remote GUP will still do the page fault accounting on the remote
> task_struct, am I right?  E.g., when the get_user_pages_remote() is called with
> "tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
> accounting for that remote task/thread?

Well, that would be a data race and fundamentally buggy.

It would be ok with something like ptrace (which only works when the
target is quiescent), but is completely wrong otherwise.

I guess it works fine in practice, and it's only statistics so even if
you were to have a data race it doesn't much matter, but it's
definitely conceptually very very wrong.

The fault stats should be about who does the fault (they are about the
_thread_) not about who the fault is done to (which is about the
_mm_).

Allocating the fault data to somebody else sounds frankly silly and
stupid to me, exactly because it's (a) racy and (b) not even
conceptually correct. The other thread literally _isn't_ doing a major
page fault, for crissake!

Now, there are some actual per-mm statistics too (the rss stuff etc),
and it's fundamentally harder exactly because of the shared data. See
the mm_counter stuff etc. Those are not about who does something, they
are about the resulting MM state.
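
(For illustration: those per-mm counters need atomic updates exactly because
the mm is shared between threads -- roughly, ignoring the SPLIT_RSS_COUNTING
batching, the shape in include/linux/mm.h is:

static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
	atomic_long_add(value, &mm->rss_stat.count[member]);
}

...whereas maj_flt/min_flt are plain non-atomic increments on the faulting
thread.)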

            Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-17 20:15           ` Linus Torvalds
@ 2020-06-18 14:38             ` Peter Xu
  2020-06-18 17:15               ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-18 14:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guo Ren, Linux Kernel Mailing List, Gerald Schaefer,
	Andrew Morton, Andrea Arcangeli, linux-csky

On Wed, Jun 17, 2020 at 01:15:31PM -0700, Linus Torvalds wrote:
> On Wed, Jun 17, 2020 at 12:58 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > But currently remote GUP will still do the page fault accounting on the remote
> > task_struct, am I right?  E.g., when the get_user_pages_remote() is called with
> > "tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
> > accounting for that remote task/thread?
> 
> Well, that would be a data race and fundamentally buggy.
> 
> It would be ok with something like ptrace (which only works when the
> target is quiescent), but is completely wrong otherwise.
> 
> I guess it works fine in practice, and it's only statistics so even if
> you were to have a data race it doesn't much matter, but it's
> definitely conceptually very very wrong.
> 
> The fault stats should be about who does the fault (they are about the
> _thread_) not about who the fault is done to (which is about the
> _mm_).
> 
> Allocating the fault data to somebody else sounds frankly silly and
> stupid to me, exactly because it's (a) racy and (b) not even
> conceptually correct. The other thread literally _isn't_ doing a major
> page fault, for crissake!
> 
> Now, there are some actual per-mm statistics too (the rss stuff etc),
> and it's fundamentally harder exactly because of the shared data. See
> the mm_counter stuff etc. Those are not about who does something, they
> are about the resulting MM state.

I see.  How about I move another small step further to cover GUP (to be
explicit, faultin_page() and fixup_user_fault())?

GUP needs the per-task accounting, but not the perf events.  We can do that by
slightly changing the new approach into:

        bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);

        if (major)
                current->maj_flt++;
        else
                current->min_flt++;

        if (!regs)
                return ret;

        if (major)
                perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
        else
                perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);

With the change we will always account onto the current task.  There are also
GUP calls that are not accounted at all today, e.g., get_user_pages_remote()
with tsk==NULL; for those calls, the page faults would now be accounted onto
the current task as well.

Another major benefit I can see (besides a completely cleaned up accounting) is
that we can remove the task_struct pointer from the GUP code entirely, because
AFAICT it's only used for this pf accounting, in either faultin_page() or
fixup_user_fault(), and accounting to a task other than current there now seems
questionable in itself...
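
Hypothetically, that could look like the below -- just an illustration of the
direction, not an actual patch, and the final prototypes may well differ:

/* Accounting always goes to current, so no tsk parameter needed: */
long get_user_pages_remote(struct mm_struct *mm, unsigned long start,
			   unsigned long nr_pages, unsigned int gup_flags,
			   struct page **pages, struct vm_area_struct **vmas,
			   int *locked);

int fixup_user_fault(struct mm_struct *mm, unsigned long address,
		     unsigned int fault_flags, bool *unlocked);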

Any opinions?

Thanks,

--
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-18 14:38             ` Peter Xu
@ 2020-06-18 17:15               ` Linus Torvalds
  2020-06-18 21:24                 ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Linus Torvalds @ 2020-06-18 17:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Guo Ren, Linux Kernel Mailing List, Gerald Schaefer,
	Andrew Morton, Andrea Arcangeli, linux-csky

On Thu, Jun 18, 2020 at 7:38 AM Peter Xu <peterx@redhat.com> wrote:
>
> GUP needs the per-task accounting, but not the perf events.  We can do that by
> slightly changing the new approach into:
>
>         bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
>
>         if (major)
>                 current->maj_flt++;
>         else
>                 current->min_flt++;
>
>         if (!regs)
>                 return ret;
>
>         if (major)
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
>         else
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);

Ack, I think this is the right thing to do.

No normal situation will ever notice the difference, with remote
accesses being as rare and specialized as they are. But being able to
remove the otherwise unused 'tsk' parameter sounds like the right
thing to do too.

It might be worth adding a comment about why.

Also, honestly, how about we remove the 'major' variable entirely, and
instead make the code be something like

        unsigned long *flt;
        int event_type;
        ...

        /* Major fault */
        if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
                flt = &current->maj_flt;
                event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
        } else {
                flt = &current->min_flt;
                event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
        }
        (*flt)++;
        if (regs)
                perf_sw_event(event_type, 1, regs, address);

instead. Less source code duplication, and I bet it improves code
generation too.

             Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-18 17:15               ` Linus Torvalds
@ 2020-06-18 21:24                 ` Peter Xu
  2020-06-18 22:28                   ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-18 21:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guo Ren, Linux Kernel Mailing List, Gerald Schaefer,
	Andrew Morton, Andrea Arcangeli, linux-csky

On Thu, Jun 18, 2020 at 10:15:50AM -0700, Linus Torvalds wrote:
> On Thu, Jun 18, 2020 at 7:38 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > GUP needs the per-task accounting, but not the perf events.  We can do that by
> > slightly changing the new approach into:
> >
> >         bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
> >
> >         if (major)
> >                 current->maj_flt++;
> >         else
> >                 current->min_flt++;
> >
> >         if (!regs)
> >                 return ret;
> >
> >         if (major)
> >                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
> >         else
> >                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
> 
> Ack, I think this is the right thing to do.
> 
> No normal situation will ever notice the difference, with remote
> accesses being as rare and specialized as they are. But being able to
> remove the otherwise unused 'tsk' parameter sounds like the right
> thing to do too.
> 
> It might be worth adding a comment about why.
> 
> Also, honestly, how about we remove the 'major' variable entirely, and
> instead make the code be something like
> 
>         unsigned long *flt;
>         int event_type;
>         ...
> 
>         /* Major fault */
>         if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
>                 flt = &current->maj_flt;
>                 event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
>         } else {
>                 flt = &current->min_flt;
>                 event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
>         }
>         (*flt)++;
>         if (regs)
>                 perf_sw_event(event_type, 1, regs, address);
> 
> instead. Less source code duplication, and I bet it improves code
> generation too.

Will do.  Thanks!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-18 21:24                 ` Peter Xu
@ 2020-06-18 22:28                   ` Peter Xu
  2020-06-18 22:59                     ` Linus Torvalds
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Xu @ 2020-06-18 22:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Guo Ren, Linux Kernel Mailing List, Gerald Schaefer,
	Andrew Morton, Andrea Arcangeli, linux-csky

On Thu, Jun 18, 2020 at 05:24:30PM -0400, Peter Xu wrote:
> >         /* Major fault */
> >         if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
> >                 flt = &current->maj_flt;
> >                 event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
> >         } else {
> >                 flt = &current->min_flt;
> >                 event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
> >         }
> >         (*flt)++;
> >         if (regs)
> >                 perf_sw_event(event_type, 1, regs, address);

Sadly, this line seems to fail compilation:

  CC      mm/memory.o
In file included from ././include/linux/compiler_types.h:68,
                 from <command-line>:                      
./arch/x86/include/asm/jump_label.h: In function ‘handle_mm_fault’:
./include/linux/compiler-gcc.h:120:38: warning: ‘asm’ operand 0 probably does not match constraints
  120 | #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
      |                                      ^~~
./arch/x86/include/asm/jump_label.h:25:2: note: in expansion of macro ‘asm_volatile_goto’
   25 |  asm_volatile_goto("1:" 
      |  ^~~~~~~~~~~~~~~~~
./include/linux/compiler-gcc.h:120:38: error: impossible constraint in ‘asm’
  120 | #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
      |                                      ^~~            
./arch/x86/include/asm/jump_label.h:25:2: note: in expansion of macro ‘asm_volatile_goto’
   25 |  asm_volatile_goto("1:"                                                                                                
      |  ^~~~~~~~~~~~~~~~~
make[1]: *** [scripts/Makefile.build:267: mm/memory.o] Error 1
make: *** [Makefile:1729: mm] Error 2

Frankly speaking I have no solid understanding of what the error is about... But
my gut feeling is that it's about the static keys, where perf_sw_event() may
only support compile-time-constant event types (otherwise we wouldn't know
whether to patch the instruction with a no-op or a jump?).
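
For reference, perf_sw_event() is roughly shaped like below (a simplified
sketch of include/linux/perf_event.h; details may differ): the event id
indexes a static-key array, so the jump-label asm wants a compile-time
constant, and a variable event_type can't work there.

static __always_inline void
perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
{
	/* perf_swevent_enabled[] is an array of static keys */
	if (static_key_false(&perf_swevent_enabled[event_id]))
		__perf_sw_event(event_id, nr, regs, addr);
}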

I'll go back to the simple version for now, until I find a solution...

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 07/25] mm/csky: Use mm_fault_accounting()
  2020-06-18 22:28                   ` Peter Xu
@ 2020-06-18 22:59                     ` Linus Torvalds
  0 siblings, 0 replies; 60+ messages in thread
From: Linus Torvalds @ 2020-06-18 22:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Guo Ren, Linux Kernel Mailing List, Gerald Schaefer,
	Andrew Morton, Andrea Arcangeli, linux-csky

On Thu, Jun 18, 2020 at 3:28 PM Peter Xu <peterx@redhat.com> wrote:
>
> > >         if (regs)
> > >                 perf_sw_event(event_type, 1, regs, address);
>
> Sadly, this line seems to fail the compilation:

Yeah, I should have known that. We require a constant event ID,
because it uses the magical static keys.

So never mind. The perf_sw_event() calls can't be shared for two different IDs.

           Linus

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 18/25] mm/riscv: Use mm_fault_accounting()
  2020-06-15 22:16 ` [PATCH 18/25] mm/riscv: " Peter Xu
@ 2020-06-18 23:49   ` Palmer Dabbelt
  2020-06-19  0:12     ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Palmer Dabbelt @ 2020-06-18 23:49 UTC (permalink / raw)
  To: peterx
  Cc: linux-kernel, gerald.schaefer, akpm, peterx, Linus Torvalds,
	aarcange, Paul Walmsley, aou, linux-riscv

On Mon, 15 Jun 2020 15:16:00 PDT (-0700), peterx@redhat.com wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Paul Walmsley <paul.walmsley@sifive.com>
> CC: Palmer Dabbelt <palmer@dabbelt.com>
> CC: Albert Ou <aou@eecs.berkeley.edu>
> CC: linux-riscv@lists.infradead.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/riscv/mm/fault.c | 21 +++------------------
>  1 file changed, 3 insertions(+), 18 deletions(-)
>
> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> index be84e32adc4c..9262338614d1 100644
> --- a/arch/riscv/mm/fault.c
> +++ b/arch/riscv/mm/fault.c
> @@ -30,7 +30,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>  	struct vm_area_struct *vma;
>  	struct mm_struct *mm;
>  	unsigned long addr, cause;
> -	unsigned int flags = FAULT_FLAG_DEFAULT;
> +	unsigned int flags = FAULT_FLAG_DEFAULT, major = 0;
>  	int code = SEGV_MAPERR;
>  	vm_fault_t fault;
>
> @@ -65,9 +65,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>
>  	if (user_mode(regs))
>  		flags |= FAULT_FLAG_USER;
> -
> -	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> -
>  retry:
>  	down_read(&mm->mmap_sem);
>  	vma = find_vma(mm, addr);
> @@ -111,6 +108,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>  	 * the fault.
>  	 */
>  	fault = handle_mm_fault(vma, addr, flags);
> +	major |= fault & VM_FAULT_MAJOR;
>
>  	/*
>  	 * If we need to retry but a fatal signal is pending, handle the
> @@ -128,21 +126,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>  		BUG();
>  	}
>
> -	/*
> -	 * Major/minor page fault accounting is only done on the
> -	 * initial attempt. If we go through a retry, it is extremely
> -	 * likely that the page will be found in page cache at that point.
> -	 */
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> -		if (fault & VM_FAULT_MAJOR) {
> -			tsk->maj_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
> -				      1, regs, addr);
> -		} else {
> -			tsk->min_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
> -				      1, regs, addr);
> -		}
>  		if (fault & VM_FAULT_RETRY) {
>  			flags |= FAULT_FLAG_TRIED;
>
> @@ -156,6 +140,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>  	}
>
>  	up_read(&mm->mmap_sem);
> +	mm_fault_accounting(tsk, regs, addr, major);
>  	return;
>
>  	/*

AFAICT this changes the behavior of the perf event: it used to count any fault,
whereas now it only counts those that complete successfully.  If everyone else
is doing it that way then I'm happy to change us over, but this definitely
isn't just avoiding retries.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 18/25] mm/riscv: Use mm_fault_accounting()
  2020-06-18 23:49   ` Palmer Dabbelt
@ 2020-06-19  0:12     ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-06-19  0:12 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: linux-kernel, gerald.schaefer, akpm, Linus Torvalds, aarcange,
	Paul Walmsley, aou, linux-riscv

On Thu, Jun 18, 2020 at 04:49:23PM -0700, Palmer Dabbelt wrote:
> AFAICT this changes the behavior of the perf event: it used to count any fault,
> whereas now it only counts those that complete successfully.  If everyone else
> is doing it that way then I'm happy to change us over, but this definitely
> isn't just avoiding retries.

Right, I'm preparing v2 to keep the old behavior of PERF_COUNT_SW_PAGE_FAULTS,
but with a nicer approach.

Thanks for looking!

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 20/25] mm/sh: Use mm_fault_accounting()
  2020-06-15 22:23 ` [PATCH 20/25] mm/sh: " Peter Xu
@ 2020-07-20 21:25   ` Rich Felker
  2020-07-20 22:05     ` Peter Xu
  0 siblings, 1 reply; 60+ messages in thread
From: Rich Felker @ 2020-07-20 21:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Yoshinori Sato, linux-sh

On Mon, Jun 15, 2020 at 06:23:06PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
> 
> Avoid doing page fault accounting multiple times if the page fault is retried.
> 
> CC: Yoshinori Sato <ysato@users.sourceforge.jp>
> CC: Rich Felker <dalias@libc.org>
> CC: linux-sh@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/sh/mm/fault.c | 15 +++------------
>  1 file changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
> index 5f23d7907597..06b232973488 100644
> --- a/arch/sh/mm/fault.c
> +++ b/arch/sh/mm/fault.c
> @@ -379,7 +379,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
>  	struct task_struct *tsk;
>  	struct mm_struct *mm;
>  	struct vm_area_struct * vma;
> -	vm_fault_t fault;
> +	vm_fault_t fault, major = 0;
>  	unsigned int flags = FAULT_FLAG_DEFAULT;
>  
>  	tsk = current;
> @@ -412,8 +412,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
>  	if ((regs->sr & SR_IMASK) != SR_IMASK)
>  		local_irq_enable();
>  
> -	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> -
>  	/*
>  	 * If we're in an interrupt, have no user context or are running
>  	 * with pagefaults disabled then we must not take the fault:
> @@ -465,21 +463,13 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
>  	 * the fault.
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
> +	major |= fault & VM_FAULT_MAJOR;
>  
>  	if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
>  		if (mm_fault_error(regs, error_code, address, fault))
>  			return;
>  
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> -		if (fault & VM_FAULT_MAJOR) {
> -			tsk->maj_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> -				      regs, address);
> -		} else {
> -			tsk->min_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> -				      regs, address);
> -		}
>  		if (fault & VM_FAULT_RETRY) {
>  			flags |= FAULT_FLAG_TRIED;
>  
> @@ -493,4 +483,5 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
>  	}
>  
>  	up_read(&mm->mmap_sem);
> +	mm_fault_accounting(tsk, regs, address, major);
>  }
> -- 
> 2.26.2

What's the status on the rest of this series? Do you need any action
from my side (arch/sh) at this time?

Rich

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 20/25] mm/sh: Use mm_fault_accounting()
  2020-07-20 21:25   ` Rich Felker
@ 2020-07-20 22:05     ` Peter Xu
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Xu @ 2020-07-20 22:05 UTC (permalink / raw)
  To: Rich Felker
  Cc: linux-kernel, Gerald Schaefer, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli, Yoshinori Sato, linux-sh

Hi, Rich,

On Mon, Jul 20, 2020 at 05:25:24PM -0400, Rich Felker wrote:
> What's the status on the rest of this series? Do you need any action
> from my side (arch/sh) at this time?

This series should have been queued by Andrew in -next.  So iiuc no further
action is needed.

Thanks for asking,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2020-07-20 22:06 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-15 22:15 [PATCH 00/25] mm: Page fault accounting cleanups Peter Xu
2020-06-15 22:15 ` [PATCH 01/25] mm/um: Fix extra accounting for page fault retries Peter Xu
2020-06-15 22:15 ` [PATCH 02/25] mm: Introduce mm_fault_accounting() Peter Xu
2020-06-15 22:32   ` Linus Torvalds
2020-06-15 23:19     ` Peter Xu
2020-06-16 19:00       ` Andrew Morton
2020-06-17 16:26         ` Peter Xu
2020-06-15 22:15 ` [PATCH 03/25] mm/alpha: Use mm_fault_accounting() Peter Xu
2020-06-15 22:15 ` [PATCH 04/25] mm/arc: " Peter Xu
2020-06-15 22:15 ` [PATCH 05/25] mm/arm: " Peter Xu
2020-06-15 22:15 ` [PATCH 06/25] mm/arm64: " Peter Xu
2020-06-16  7:43   ` Will Deacon
2020-06-16 15:59     ` Peter Xu
2020-06-15 22:15 ` [PATCH 07/25] mm/csky: " Peter Xu
2020-06-17  7:04   ` Guo Ren
2020-06-17 15:49     ` Peter Xu
2020-06-17 17:53       ` Linus Torvalds
2020-06-17 19:58         ` Peter Xu
2020-06-17 20:15           ` Linus Torvalds
2020-06-18 14:38             ` Peter Xu
2020-06-18 17:15               ` Linus Torvalds
2020-06-18 21:24                 ` Peter Xu
2020-06-18 22:28                   ` Peter Xu
2020-06-18 22:59                     ` Linus Torvalds
2020-06-15 22:15 ` [PATCH 08/25] mm/hexagon: " Peter Xu
2020-06-15 22:15 ` [PATCH 09/25] mm/ia64: " Peter Xu
2020-06-15 22:15 ` [PATCH 10/25] mm/m68k: " Peter Xu
2020-06-15 22:15 ` [PATCH 11/25] mm/microblaze: " Peter Xu
2020-06-15 22:15 ` [PATCH 12/25] mm/mips: " Peter Xu
2020-06-15 22:15 ` [PATCH 13/25] mm/nds32: " Peter Xu
2020-06-17  1:05   ` Greentime Hu
2020-06-15 22:15 ` [PATCH 14/25] mm/nios2: " Peter Xu
2020-06-15 22:15 ` [PATCH 15/25] mm/openrisc: " Peter Xu
2020-06-16 18:11   ` Stafford Horne
2020-06-15 22:15 ` [PATCH 16/25] mm/parisc: " Peter Xu
2020-06-15 22:15 ` [PATCH 17/25] mm/powerpc: " Peter Xu
2020-06-15 22:16 ` [PATCH 18/25] mm/riscv: " Peter Xu
2020-06-18 23:49   ` Palmer Dabbelt
2020-06-19  0:12     ` Peter Xu
2020-06-15 22:23 ` [PATCH 19/25] mm/s390: " Peter Xu
2020-06-16 15:59   ` Alexander Gordeev
2020-06-16 16:35     ` Peter Xu
2020-06-17  6:19       ` Christian Borntraeger
2020-06-17 16:06         ` Peter Xu
2020-06-17 16:14           ` Christian Borntraeger
2020-06-17 16:44             ` Peter Xu
2020-06-15 22:23 ` [PATCH 20/25] mm/sh: " Peter Xu
2020-07-20 21:25   ` Rich Felker
2020-07-20 22:05     ` Peter Xu
2020-06-15 22:23 ` [PATCH 21/25] mm/sparc32: " Peter Xu
2020-06-15 22:23 ` [PATCH 22/25] mm/sparc64: " Peter Xu
2020-06-15 22:23 ` [PATCH 23/25] mm/unicore32: " Peter Xu
2020-06-15 22:23 ` [PATCH 24/25] mm/x86: " Peter Xu
2020-06-15 22:23 ` [PATCH 25/25] mm/xtensa: " Peter Xu
2020-06-15 23:13   ` Max Filippov
2020-06-16 18:55 ` [PATCH 00/25] mm: Page fault accounting cleanups Linus Torvalds
2020-06-16 21:03   ` Peter Xu
2020-06-17  0:55   ` Michael Ellerman
2020-06-17  8:04     ` Will Deacon
2020-06-17 16:10       ` Peter Xu
