Re: [PATCH 00/25] mm: Page fault accounting cleanups

From: Linus Torvalds <torvalds@linux-foundation.org>
To: Peter Xu <peterx@redhat.com>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Gerald Schaefer <gerald.schaefer@de.ibm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	openrisc@lists.librecores.org,
	linux-arch <linux-arch@vger.kernel.org>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	linux-s390 <linux-s390@vger.kernel.org>,
	Will Deacon <will@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Linux ARM <linux-arm-kernel@lists.infradead.org>
Subject: Re: [PATCH 00/25] mm: Page fault accounting cleanups
Date: Tue, 16 Jun 2020 11:55:17 -0700	[thread overview]
Message-ID: <CAHk-=wiTjaXHu+uxMi0xCZQOm4KVr0MucECAK=Zm4p4YZZ1XEg@mail.gmail.com> (raw)
In-Reply-To: <20200615221607.7764-1-peterx@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 2788 bytes --]

On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>
> This series tries to address all of them by introducing mm_fault_accounting()
> first, so that we move all the page fault accounting into the common code base,
> then call it properly from arch pf handlers just like handle_mm_fault().

Hmm.

So having looked at this a bit more, I'd actually like to go even
further, and just get rid of the per-architecture code _entirely_.

Here's a straw-man patch to the generic code - the idea is mostly laid
out in the comment that I'm just quoting here directly too:

        /*
         * Do accounting in the common code, to avoid unnecessary
         * architecture differences or duplicated code.
         *
         * We arbitrarily make the rules be:
         *
         *  - faults that never even got here (because the address
         *    wasn't valid). That includes arch_vma_access_permitted()
         *    failing above.
         *
         *    So this is expressly not a "this many hardware page
         *    faults" counter. Use the hw profiling for that.
         *
         *  - incomplete faults (ie RETRY) do not count (see above).
         *    They will only count once completed.
         *
         *  - the fault counts as a "major" fault when the final
         *    successful fault is VM_FAULT_MAJOR, or if it was a
         *    retry (which implies that we couldn't handle it
         *    immediately previously).
         *
         *  - if the fault is done for GUP, regs wil be NULL and
         *    no accounting will be done (but you _could_ pass in
         *    your own regs and it would be accounted to the thread
         *    doing the fault, not to the target!)
         */

the code itself in the patch is

 (a) pretty trivial and self-evident

 (b) INCOMPLETE

that (b) is worth noting: this patch won't compile on its own. It
intentionally leaves all the users without the new 'regs' argument,
because you obviously simply need to remove all the code that
currently tries to do any accounting.

Comments?

This is a bigger change, but I think it might be worth it to _really_
consolidate the major/minor logic.

One detail worth noting: I do wonder if we should put the

    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);

just in the arch code at the top of the fault handling, and consider
it entirely unrelated to the major/minor fault handling. The
major/minor faults fundamnetally are about successes. But the plain
PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
things that never even get to this point at all.

I'm not convinced it's useful to have three SW events that are defined
to be A=B+C.

But this does *not* do that part. It' sreally just an RFC patch.

                    Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 2918 bytes --]

 include/linux/mm.h |  3 ++-
 mm/memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..e82a604339c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1648,7 +1648,8 @@ int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags);
+			unsigned long address, unsigned int flags,
+			struct pt_regs *);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index dc7f3543b1fd..f15056151fb1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -72,6 +72,7 @@
 #include <linux/oom.h>
 #include <linux/numa.h>
 
+#include <linux/perf_event.h>
 #include <trace/events/kmem.h>
 
 #include <asm/io.h>
@@ -4353,7 +4354,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
-		unsigned int flags)
+		unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
 
@@ -4394,6 +4395,49 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 
+	if (ret & VM_FAULT_RETRY)
+		return ret;
+
+	/*
+	 * Do accounting in the common code, to avoid unnecessary
+	 * architecture differences or duplicated code.
+	 *
+	 * We arbitrarily make the rules be:
+	 *
+	 *  - faults that never even got here (because the address
+	 *    wasn't valid). That includes arch_vma_access_permitted()
+	 *    failing above.
+	 *
+	 *    So this is expressly not a "this many hardware page
+	 *    faults" counter. Use the hw profiling for that.
+	 *
+	 *  - incomplete faults (ie RETRY) do not count (see above).
+	 *    They will only count once completed.
+	 *
+	 *  - the fault counts as a "major" fault when the final
+	 *    successful fault is VM_FAULT_MAJOR, or if it was a
+	 *    retry (which implies that we couldn't handle it
+	 *    immediately previously).
+	 *
+	 *  - if the fault is done for GUP, regs wil be NULL and
+	 *    no accounting will be done (but you _could_ pass in
+	 *    your own regs and it would be accounted to the thread
+	 *    doing the fault, not to the target!)
+	 */
+
+	if (!regs)
+		return ret;
+
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
+	if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
+		current->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+	} else {
+		current->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);