Gerald Schaefer reported an issue a week ago regarding incorrect page fault
accounting for retried page faults after commit 4064b9827063 ("mm: allow
VM_FAULT_RETRY for multiple times"):

https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

While looking at that, I found some other small issues related to page
fault accounting as well:

  - Incorrect accounting:

    - The major issue reported by Gerald: when a page fault is retried, we
      do the accounting multiple times, while we should only do it once
      (for both perf event accounting and per-task accounting).

    - In many archs, a fault is only accounted as a major fault if the
      first attempt (or, in others, the last attempt) is a major fault.
      Ideally we should account the page fault as a major fault as long as
      any of the retried attempts is a major fault, and keep that behavior
      consistent across archs.

  - Missing accounting of the perf events
    (PERF_COUNT_SW_PAGE_FAULTS[_[MAJ|MIN]]) in some archs.

  - Doing the accounting with mmap_sem held: not a big deal, but logically
    we can move it to after releasing mmap_sem, because the accounting does
    not need the lock.

This series tries to address all of these by first introducing
mm_fault_accounting(), so that all the page fault accounting moves into
the common code base, and then calling it properly from the arch page
fault handlers, just like handle_mm_fault().  There are some special
cases, e.g., the um arch does not have "regs" in its fault handler, so it
keeps its manual per-task accounting.

For each patch that fixes a specific arch, I'm CCing the maintainers and
the arch list if there is one.  Besides that, I have only lightly tested
this series on x86.

Please have a look, thanks.
Peter Xu (25):
  mm/um: Fix extra accounting for page fault retries
  mm: Introduce mm_fault_accounting()
  mm/alpha: Use mm_fault_accounting()
  mm/arc: Use mm_fault_accounting()
  mm/arm: Use mm_fault_accounting()
  mm/arm64: Use mm_fault_accounting()
  mm/csky: Use mm_fault_accounting()
  mm/hexagon: Use mm_fault_accounting()
  mm/ia64: Use mm_fault_accounting()
  mm/m68k: Use mm_fault_accounting()
  mm/microblaze: Use mm_fault_accounting()
  mm/mips: Use mm_fault_accounting()
  mm/nds32: Use mm_fault_accounting()
  mm/nios2: Use mm_fault_accounting()
  mm/openrisc: Use mm_fault_accounting()
  mm/parisc: Use mm_fault_accounting()
  mm/powerpc: Use mm_fault_accounting()
  mm/riscv: Use mm_fault_accounting()
  mm/s390: Use mm_fault_accounting()
  mm/sh: Use mm_fault_accounting()
  mm/sparc32: Use mm_fault_accounting()
  mm/sparc64: Use mm_fault_accounting()
  mm/unicore32: Use mm_fault_accounting()
  mm/x86: Use mm_fault_accounting()
  mm/xtensa: Use mm_fault_accounting()

 arch/alpha/mm/fault.c      |  9 ++++-----
 arch/arc/mm/fault.c        | 15 +++------------
 arch/arm/mm/fault.c        | 21 ++++-----------------
 arch/arm64/mm/fault.c      | 17 ++---------------
 arch/csky/mm/fault.c       | 13 +------------
 arch/hexagon/mm/vm_fault.c |  9 +++------
 arch/ia64/mm/fault.c       |  8 +++-----
 arch/m68k/mm/fault.c       | 13 +++----------
 arch/microblaze/mm/fault.c |  8 +++-----
 arch/mips/mm/fault.c       | 14 +++-----------
 arch/nds32/mm/fault.c      | 19 +++----------------
 arch/nios2/mm/fault.c      | 13 +++----------
 arch/openrisc/mm/fault.c   |  8 +++-----
 arch/parisc/mm/fault.c     |  8 +++-----
 arch/powerpc/mm/fault.c    | 13 ++++---------
 arch/riscv/mm/fault.c      | 21 +++------------------
 arch/s390/mm/fault.c       | 21 +++++----------------
 arch/sh/mm/fault.c         | 15 +++------------
 arch/sparc/mm/fault_32.c   | 16 +++-------------
 arch/sparc/mm/fault_64.c   | 16 ++++------------
 arch/um/kernel/trap.c      | 13 +++++++------
 arch/unicore32/mm/fault.c  | 15 +++++++--------
 arch/x86/mm/fault.c        | 10 +---------
 arch/xtensa/mm/fault.c     | 14 +++-----------
 include/linux/mm.h         |  2 ++
 mm/memory.c                | 18 ++++++++++++++++++

 26 files changed, 101 insertions(+), 248 deletions(-)

--
2.26.2
When a page fault is retried, we should only do the accounting once, rather than once for each retry attempt. CC: Jeff Dike <jdike@addtoit.com> CC: Richard Weinberger <richard@nod.at> CC: Anton Ivanov <anton.ivanov@cambridgegreys.com> CC: linux-um@lists.infradead.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/um/kernel/trap.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index 8f18cf56b3dd..d162168490d1 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -34,6 +34,7 @@ int handle_page_fault(unsigned long address, unsigned long ip, pte_t *pte; int err = -EFAULT; unsigned int flags = FAULT_FLAG_DEFAULT; + vm_fault_t fault, major = 0; *code_out = SEGV_MAPERR; @@ -73,9 +74,8 @@ int handle_page_fault(unsigned long address, unsigned long ip, } do { - vm_fault_t fault; - fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) goto out_nosemaphore; @@ -92,10 +92,6 @@ int handle_page_fault(unsigned long address, unsigned long ip, BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -103,6 +99,11 @@ int handle_page_fault(unsigned long address, unsigned long ip, } } + if (major) + current->maj_flt++; + else + current->min_flt++; + pgd = pgd_offset(mm, address); p4d = p4d_offset(pgd, address); pud = pud_offset(p4d, address); -- 2.26.2
Provide this helper for doing memory page fault accounting across archs. It can be defined unconditionally because perf_sw_event() is always defined, and perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS. Signed-off-by: Peter Xu <peterx@redhat.com> --- include/linux/mm.h | 2 ++ mm/memory.c | 18 ++++++++++++++++++ 2 files changed, 20 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index f3fe7371855c..b96db64125be 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1649,6 +1649,8 @@ void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end); int truncate_inode_page(struct address_space *mapping, struct page *page); int generic_error_remove_page(struct address_space *mapping, struct page *page); int invalidate_inode_page(struct page *page); +void mm_fault_accounting(struct task_struct *task, struct pt_regs *regs, + unsigned long address, bool major); #ifdef CONFIG_MMU extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma, diff --git a/mm/memory.c b/mm/memory.c index f703fe8c8346..c4f8319f0ed8 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -71,6 +71,8 @@ #include <linux/dax.h> #include <linux/oom.h> #include <linux/numa.h> +#include <linux/perf_event.h> +#include <linux/ptrace.h> #include <trace/events/kmem.h> @@ -4397,6 +4399,22 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL_GPL(handle_mm_fault); +void mm_fault_accounting(struct task_struct *task, struct pt_regs *regs, + unsigned long address, bool major) +{ + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + if (major) { + task->maj_flt++; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, + regs, address); + } else { + task->min_flt++; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, + regs, address); + } +} +EXPORT_SYMBOL_GPL(mm_fault_accounting); + #ifndef __PAGETABLE_P4D_FOLDED /* * Allocate p4d page table. -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. Also, the perf events for page faults will be accounted too when the config has CONFIG_PERF_EVENTS defined. CC: Richard Henderson <rth@twiddle.net> CC: Ivan Kokshaysky <ink@jurassic.park.msu.ru> CC: Matt Turner <mattst88@gmail.com> CC: linux-alpha@vger.kernel.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/alpha/mm/fault.c | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index c2d7b6d7bac7..4f8632ddef25 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -88,7 +88,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, struct mm_struct *mm = current->mm; const struct exception_table_entry *fixup; int si_code = SEGV_MAPERR; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int flags = FAULT_FLAG_DEFAULT; /* As of EV6, a load into $31/$f31 is a prefetch, and never faults @@ -149,6 +149,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, make sure we exit gracefully rather than endlessly redo the fault. */ fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if (fault_signal_pending(fault, regs)) return; @@ -164,10 +165,6 @@ do_page_fault(unsigned long address, unsigned long mmcsr, } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -182,6 +179,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr, up_read(&mm->mmap_sem); + mm_fault_accounting(current, regs, address, major); + return; /* Something tried to access memory that isn't in our memory map. -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. The functional change here is that we now take the whole page fault as a major fault as long as any of the retried attempts is a major fault. Previously we only considered the last attempt. CC: Vineet Gupta <vgupta@synopsys.com> CC: linux-snps-arc@lists.infradead.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/arc/mm/fault.c | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c index 92b339c7adba..bc89d4b9c59d 100644 --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -72,6 +72,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) int sig, si_code = SEGV_MAPERR; unsigned int write = 0, exec = 0, mask; vm_fault_t fault = VM_FAULT_SIGSEGV; /* handle_mm_fault() output */ + vm_fault_t major = 0; unsigned int flags; /* handle_mm_fault() input */ /* @@ -132,6 +133,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) } fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -156,20 +158,9 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) * Major/minor page fault accounting * (in case of retry we only land here once) */ - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (likely(!(fault & VM_FAULT_ERROR))) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - } - /* Normal return path: fault Handled Gracefully */ + mm_fault_accounting(tsk, regs, address, major); return; } -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. Meanwhile, take the page fault as a major fault as long as any of the retried attempts is a major fault. CC: Russell King <linux@armlinux.org.uk> CC: Will Deacon <will@kernel.org> CC: linux-arm-kernel@lists.infradead.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/arm/mm/fault.c | 21 ++++----------------- 1 file changed, 4 insertions(+), 17 deletions(-) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index 2dd5c41cbb8d..92d4436e74da 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -240,7 +240,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) struct task_struct *tsk; struct mm_struct *mm; int sig, code; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int flags = FAULT_FLAG_DEFAULT; if (kprobe_page_fault(regs, fsr)) @@ -290,6 +290,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) } fault = __do_page_fault(mm, addr, fsr, flags, tsk); + major |= fault & VM_FAULT_MAJOR; /* If we need to retry but a fatal signal is pending, handle the * signal first. We do not need to release the mmap_sem because @@ -301,23 +302,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) return 0; } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point.
- */ - - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); if (!(fault & VM_FAULT_ERROR) && flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, addr); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, addr); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; goto retry; @@ -326,6 +311,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) up_read(&mm->mmap_sem); + mm_fault_accounting(tsk, regs, addr, major); + /* * Handle the "normal" case first - VM_FAULT_MAJOR */ -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. CC: Catalin Marinas <catalin.marinas@arm.com> CC: Will Deacon <will@kernel.org> CC: linux-arm-kernel@lists.infradead.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/arm64/mm/fault.c | 17 ++--------------- 1 file changed, 2 insertions(+), 15 deletions(-) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index c9cedc0432d2..09af7d7a60ec 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -484,8 +484,6 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, addr, esr, regs); } - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); - /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, @@ -535,20 +533,9 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, VM_FAULT_BADACCESS)))) { /* * Major/minor page fault accounting is only done - once. If we go through a retry, it is extremely - likely that the page will be found in page cache at - that point. + once. */ - if (major) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, - addr); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, - addr); - } - + mm_fault_accounting(current, regs, addr, major); return 0; } -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Also, move the accounting to be after release of mmap_sem. CC: Guo Ren <guoren@kernel.org> CC: linux-csky@vger.kernel.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/csky/mm/fault.c | 13 +------------ 1 file changed, 1 insertion(+), 12 deletions(-) diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c index 4e6dc68f3258..8f8d34d27eca 100644 --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -111,8 +111,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, return; } #endif - - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); /* * If we're in an interrupt or have no user * context, we must not take the fault.. @@ -160,17 +158,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write, goto bad_area; BUG(); } - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, - address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, - address); - } - up_read(&mm->mmap_sem); + mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR); return; /* -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. The perf event accounting will be included too if possible. Also, do the accounting after releasing the mmap_sem. CC: Brian Cain <bcain@codeaurora.org> CC: linux-hexagon@vger.kernel.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/hexagon/mm/vm_fault.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index 72334b26317a..5fbccfce6083 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -39,7 +39,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) struct mm_struct *mm = current->mm; int si_signo; int si_code = SEGV_MAPERR; - vm_fault_t fault; + vm_fault_t fault, major = 0; const struct exception_table_entry *fixup; unsigned int flags = FAULT_FLAG_DEFAULT; @@ -90,6 +90,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) } fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if (fault_signal_pending(fault, regs)) return; @@ -97,17 +98,13 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; goto retry; } } - up_read(&mm->mmap_sem); + mm_fault_accounting(current, regs, address, major); return; } -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. Also, the perf events for page faults will be accounted too when the config has CONFIG_PERF_EVENTS defined. Do the accounting after releasing mmap_sem. Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/ia64/mm/fault.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 30d0c1fca99e..1b7c1a537865 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -64,7 +64,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re struct vm_area_struct *vma, *prev_vma; struct mm_struct *mm = current->mm; unsigned long mask; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int flags = FAULT_FLAG_DEFAULT; mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT) @@ -140,6 +140,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re * fault. */ fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if (fault_signal_pending(fault, regs)) return; @@ -162,10 +163,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -179,6 +176,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re } up_read(&mm->mmap_sem); + mm_fault_accounting(current, regs, address, major); return; check_expansion: -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. Also, the perf events for page faults will be accounted too when the config has CONFIG_PERF_EVENTS defined. Do the accounting without mmap_sem. CC: Geert Uytterhoeven <geert@linux-m68k.org> CC: linux-m68k@lists.linux-m68k.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/m68k/mm/fault.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 3bfb5c8ac3c7..6c6d6d77a1be 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -70,7 +70,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, { struct mm_struct *mm = current->mm; struct vm_area_struct * vma; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int flags = FAULT_FLAG_DEFAULT; pr_debug("do page fault:\nregs->sr=%#x, regs->pc=%#lx, address=%#lx, %ld, %p\n", @@ -136,6 +136,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, */ fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; pr_debug("handle_mm_fault returns %x\n", fault); if (fault_signal_pending(fault, regs)) @@ -151,16 +152,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, BUG(); } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -175,6 +167,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, } up_read(&mm->mmap_sem); + mm_fault_accounting(current, regs, address, major); return 0; /* -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. Also, the perf events for page faults will be accounted too when the config has CONFIG_PERF_EVENTS defined. CC: Michal Simek <monstr@monstr.eu> Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/microblaze/mm/fault.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index 3248141f8ed5..db0f3b71c3fe 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -90,7 +90,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, struct mm_struct *mm = current->mm; int code = SEGV_MAPERR; int is_write = error_code & ESR_S; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int flags = FAULT_FLAG_DEFAULT; regs->ear = address; @@ -216,6 +216,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, * the fault. */ fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if (fault_signal_pending(fault, regs)) return; @@ -231,10 +232,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (unlikely(fault & VM_FAULT_MAJOR)) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -249,6 +246,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, } up_read(&mm->mmap_sem); + mm_fault_accounting(current, regs, address, major); /* * keep track of tlb+htab misses that are good addrs but -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. While at it, move the accounting outside of mmap_sem, because it does not need the lock. CC: Thomas Bogendoerfer <tsbogend@alpha.franken.de> CC: linux-mips@vger.kernel.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/mips/mm/fault.c | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index f8d62cd83b36..0b937fb12614 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -43,7 +43,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, struct mm_struct *mm = tsk->mm; const int field = sizeof(unsigned long) * 2; int si_code; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int flags = FAULT_FLAG_DEFAULT; static DEFINE_RATELIMIT_STATE(ratelimit_state, 5 * HZ, 10); @@ -153,11 +153,11 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, * the fault. */ fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if (fault_signal_pending(fault, regs)) return; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -168,15 +168,6 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - tsk->maj_flt++; - } else { - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - tsk->min_flt++; - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -191,6 +182,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, } up_read(&mm->mmap_sem); + mm_fault_accounting(tsk, regs, address, major); return; /* -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. While at it, move the accounting to after releasing mmap_sem. CC: Nick Hu <nickhu@andestech.com> CC: Greentime Hu <green.hu@gmail.com> CC: Vincent Chen <deanbo422@gmail.com> Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/nds32/mm/fault.c | 19 +++---------------- 1 file changed, 3 insertions(+), 16 deletions(-) diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c index f331e533edc2..11537c03ab1d 100644 --- a/arch/nds32/mm/fault.c +++ b/arch/nds32/mm/fault.c @@ -78,7 +78,7 @@ void do_page_fault(unsigned long entry, unsigned long addr, struct mm_struct *mm; struct vm_area_struct *vma; int si_code; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int mask = VM_ACCESS_FLAGS; unsigned int flags = FAULT_FLAG_DEFAULT; @@ -208,6 +208,7 @@ void do_page_fault(unsigned long entry, unsigned long addr, */ fault = handle_mm_fault(vma, addr, flags); + major |= fault & VM_FAULT_MAJOR; /* * If we need to retry but a fatal signal is pending, handle the @@ -229,22 +230,7 @@ void do_page_fault(unsigned long entry, unsigned long addr, goto bad_area; } - /* - * Major/minor page fault accounting is only done on the initial - * attempt. If we go through a retry, it is extremely likely that the - * page will be found in page cache at that point. - */ - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, addr); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, addr); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -257,6 +243,7 @@ void do_page_fault(unsigned long entry, unsigned long addr, } up_read(&mm->mmap_sem); + mm_fault_accounting(tsk, regs, addr, major); return; /* -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. Also, the perf events for page faults will be accounted too when the config has CONFIG_PERF_EVENTS defined. CC: Ley Foon Tan <ley.foon.tan@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/nios2/mm/fault.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c index ec9d8a9c426f..48617960a07e 100644 --- a/arch/nios2/mm/fault.c +++ b/arch/nios2/mm/fault.c @@ -46,7 +46,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, struct task_struct *tsk = current; struct mm_struct *mm = tsk->mm; int code = SEGV_MAPERR; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int flags = FAULT_FLAG_DEFAULT; cause >>= 2; @@ -132,6 +132,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, * the fault. */ fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if (fault_signal_pending(fault, regs)) return; @@ -146,16 +147,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, BUG(); } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -170,6 +162,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, } up_read(&mm->mmap_sem); + mm_fault_accounting(current, regs, address, major); return; /* -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. Also, the perf events for page faults will be accounted too when the config has CONFIG_PERF_EVENTS defined. CC: Jonas Bonn <jonas@southpole.se> CC: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> CC: Stafford Horne <shorne@gmail.com> CC: openrisc@lists.librecores.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/openrisc/mm/fault.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index 8af1cc78c4fb..594195ae8cdb 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -49,7 +49,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, struct mm_struct *mm; struct vm_area_struct *vma; int si_code; - vm_fault_t fault; + vm_fault_t fault, major = 0; unsigned int flags = FAULT_FLAG_DEFAULT; tsk = current; @@ -160,6 +160,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, */ fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if (fault_signal_pending(fault, regs)) return; @@ -176,10 +177,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, if (flags & FAULT_FLAG_ALLOW_RETRY) { /*RGD modeled on Cris */ - if (fault & VM_FAULT_MAJOR) - tsk->maj_flt++; - else - tsk->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -193,6 +190,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, } up_read(&mm->mmap_sem); + mm_fault_accounting(tsk, regs, address, major); return; /* -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. Avoid doing page fault accounting multiple times if the page fault is retried. Also, the perf events for page faults will be accounted too when the config has CONFIG_PERF_EVENTS defined. CC: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> CC: Helge Deller <deller@gmx.de> CC: linux-parisc@vger.kernel.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/parisc/mm/fault.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 86e8c848f3d7..eab1ee8d18c6 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -263,7 +263,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, struct task_struct *tsk; struct mm_struct *mm; unsigned long acc_type; - vm_fault_t fault = 0; + vm_fault_t fault = 0, major = 0; unsigned int flags; if (faulthandler_disabled()) @@ -303,6 +303,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, */ fault = handle_mm_fault(vma, address, flags); + major |= fault & VM_FAULT_MAJOR; if (fault_signal_pending(fault, regs)) return; @@ -323,10 +324,6 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { /* * No need to up_read(&mm->mmap_sem) as we would @@ -338,6 +335,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, } } up_read(&mm->mmap_sem); + mm_fault_accounting(current, regs, address, major); return; check_expansion: -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting. cmo_account_page_fault() is special. Keep that. CC: Michael Ellerman <mpe@ellerman.id.au> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: Paul Mackerras <paulus@samba.org> CC: linuxppc-dev@lists.ozlabs.org Signed-off-by: Peter Xu <peterx@redhat.com> --- arch/powerpc/mm/fault.c | 13 ++++--------- 1 file changed, 4 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 84af6c8eecf7..6043b639ae42 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -481,8 +481,6 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address, if (!arch_irq_disabled_regs(regs)) local_irq_enable(); - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); - if (error_code & DSISR_KEYFAULT) return bad_key_fault_exception(regs, address, get_mm_addr_key(mm, address)); @@ -604,14 +602,11 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address, /* * Major/minor page fault accounting. */ - if (major) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); + if (major) cmo_account_page_fault(); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - } + + mm_fault_accounting(current, regs, address, major); + return 0; } NOKPROBE_SYMBOL(__do_page_fault); -- 2.26.2
Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is
retried.

CC: Paul Walmsley <paul.walmsley@sifive.com>
CC: Palmer Dabbelt <palmer@dabbelt.com>
CC: Albert Ou <aou@eecs.berkeley.edu>
CC: linux-riscv@lists.infradead.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/riscv/mm/fault.c | 21 +++------------------
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index be84e32adc4c..9262338614d1 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -30,7 +30,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	unsigned long addr, cause;
-	unsigned int flags = FAULT_FLAG_DEFAULT;
+	unsigned int flags = FAULT_FLAG_DEFAULT, major = 0;
 	int code = SEGV_MAPERR;
 	vm_fault_t fault;
 
@@ -65,9 +65,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
-
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
-
 retry:
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, addr);
@@ -111,6 +108,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, addr, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	/*
 	 * If we need to retry but a fatal signal is pending, handle the
@@ -128,21 +126,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 		BUG();
 	}
 
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
-				      1, regs, addr);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
-				      1, regs, addr);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -156,6 +140,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, addr, major);
 	return;
 
 	/*
-- 
2.26.2
Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is
retried.

CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: Vasily Gorbik <gor@linux.ibm.com>
CC: Christian Borntraeger <borntraeger@de.ibm.com>
CC: linux-s390@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/s390/mm/fault.c | 21 +++++----------------
 1 file changed, 5 insertions(+), 16 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index dedc28be27ab..8ca207635b59 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -392,7 +392,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	unsigned long trans_exc_code;
 	unsigned long address;
 	unsigned int flags;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 
 	tsk = current;
 	/*
@@ -428,7 +428,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	}
 
 	address = trans_exc_code & __FAIL_ADDR_MASK;
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	flags = FAULT_FLAG_DEFAULT;
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
@@ -480,6 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 	if (fault_signal_pending(fault, regs)) {
 		fault = VM_FAULT_SIGNAL;
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
@@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	if (unlikely(fault & VM_FAULT_ERROR))
 		goto out_up;
 
-	/*
-	 * Major/minor page fault accounting is only done on the
-	 * initial attempt. If we go through a retry, it is extremely
-	 * likely that the page will be found in page cache at that point.
-	 */
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-				      regs, address);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-				      regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
 			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {
@@ -519,6 +505,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 			goto retry;
 		}
 	}
+
+	mm_fault_accounting(tsk, regs, address, major);
+
 	if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
 		address = __gmap_link(gmap, current->thread.gmap_addr,
 				      address);
-- 
2.26.2
Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is
retried.

CC: Yoshinori Sato <ysato@users.sourceforge.jp>
CC: Rich Felker <dalias@libc.org>
CC: linux-sh@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/sh/mm/fault.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 5f23d7907597..06b232973488 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -379,7 +379,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	struct vm_area_struct * vma;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
@@ -412,8 +412,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	if ((regs->sr & SR_IMASK) != SR_IMASK)
 		local_irq_enable();
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	/*
 	 * If we're in an interrupt, have no user context or are running
 	 * with pagefaults disabled then we must not take the fault:
@@ -465,21 +463,13 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
 		if (mm_fault_error(regs, error_code, address, fault))
 			return;
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			tsk->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
-				      regs, address);
-		} else {
-			tsk->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
-				      regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -493,4 +483,5 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, address, major);
 }
-- 
2.26.2
Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is
retried.

CC: David S. Miller <davem@davemloft.net>
CC: sparclinux@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/sparc/mm/fault_32.c | 16 +++------------
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index f6e0e601f857..299e6e241a1c 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -167,7 +167,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	unsigned long g2;
 	int from_user = !(regs->psr & PSR_PS);
 	int code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	if (text_fault)
@@ -192,9 +192,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	 */
 	if (pagefault_disabled() || !mm)
 		goto no_context;
-
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 retry:
 	down_read(&mm->mmap_sem);
 
@@ -236,6 +233,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -251,15 +249,6 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	}
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			current->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
-				      1, regs, address);
-		} else {
-			current->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
-				      1, regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -273,6 +262,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	}
 
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(current, regs, address, major);
 	return;
 
 	/*
-- 
2.26.2
Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is
retried.

CC: David S. Miller <davem@davemloft.net>
CC: sparclinux@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/sparc/mm/fault_64.c | 16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index c0c0dd471b6b..61be0a0d79c6 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -269,7 +269,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	struct vm_area_struct *vma;
 	unsigned int insn = 0;
 	int si_code, fault_code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned long address, mm_rss;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
@@ -317,8 +317,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	if (faulthandler_disabled() || !mm)
 		goto intr_or_no_mm;
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		if ((regs->tstate & TSTATE_PRIV) &&
 		    !search_exception_tables(regs->tpc)) {
@@ -424,6 +422,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	}
 
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		goto exit_exception;
@@ -439,15 +438,6 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	}
 
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR) {
-			current->maj_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
-				      1, regs, address);
-		} else {
-			current->min_flt++;
-			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
-				      1, regs, address);
-		}
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -461,6 +451,8 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	}
 	up_read(&mm->mmap_sem);
 
+	mm_fault_accounting(current, regs, address, major);
+
 	mm_rss = get_mm_rss(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE)
 	mm_rss -= (mm->context.thp_pte_count * (HPAGE_SIZE / PAGE_SIZE));
-- 
2.26.2
Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is
retried.  Also, the perf events for page faults will be accounted too
when the config has CONFIG_PERF_EVENTS defined.

CC: Guan Xuetao <gxt@pku.edu.cn>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/unicore32/mm/fault.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 3022104aa613..240b38e81ed6 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -201,7 +201,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	int sig, code;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	tsk = current;
@@ -245,6 +245,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	}
 
 	fault = __do_pf(mm, addr, fsr, flags, tsk);
+	major |= fault & VM_FAULT_MAJOR;
 
 	/* If we need to retry but a fatal signal is pending, handle the
 	 * signal first. We do not need to release the mmap_sem because
@@ -254,10 +255,6 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 		return 0;
 
 	if (!(fault & VM_FAULT_ERROR) && (flags & FAULT_FLAG_ALLOW_RETRY)) {
-		if (fault & VM_FAULT_MAJOR)
-			tsk->maj_flt++;
-		else
-			tsk->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 			goto retry;
@@ -267,11 +264,13 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Handle the "normal" case first - VM_FAULT_MAJOR
+	 * Handle the "normal" case first
 	 */
-	if (likely(!(fault &
-	       (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
+	if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP |
+			      VM_FAULT_BADACCESS)))) {
+		mm_fault_accounting(tsk, regs, addr, major);
 		return 0;
+	}
 
 	/*
 	 * If we are in kernel mode at this point, we
-- 
2.26.2
Use the new mm_fault_accounting() helper for page fault accounting.

CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: x86@kernel.org
CC: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/x86/mm/fault.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index a51df516b87b..28635b4e324c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1365,8 +1365,6 @@ void do_user_addr_fault(struct pt_regs *regs,
 		local_irq_enable();
 	}
 
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
 	if (hw_error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 	if (hw_error_code & X86_PF_INSTR)
@@ -1493,13 +1491,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * Major/minor page fault accounting. If any of the events
 	 * returned VM_FAULT_MAJOR, we account it as a major fault.
 	 */
-	if (major) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-	}
+	mm_fault_accounting(tsk, regs, address, major);
 
 	check_v8086_mode(regs, address, tsk);
 }
-- 
2.26.2
Use the new mm_fault_accounting() helper for page fault accounting.

Avoid doing page fault accounting multiple times if the page fault is
retried.

CC: Chris Zankel <chris@zankel.net>
CC: Max Filippov <jcmvbkbc@gmail.com>
CC: linux-xtensa@linux-xtensa.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/xtensa/mm/fault.c | 14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index e7172bd53ced..511a2c5f9857 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -42,7 +42,7 @@ void do_page_fault(struct pt_regs *regs)
 	int code;
 
 	int is_write, is_exec;
-	vm_fault_t fault;
+	vm_fault_t fault, major = 0;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 
 	code = SEGV_MAPERR;
@@ -109,6 +109,7 @@ void do_page_fault(struct pt_regs *regs)
 	 * the fault.
 	 */
 	fault = handle_mm_fault(vma, address, flags);
+	major |= fault & VM_FAULT_MAJOR;
 
 	if (fault_signal_pending(fault, regs))
 		return;
@@ -123,10 +124,6 @@ void do_page_fault(struct pt_regs *regs)
 		BUG();
 	}
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
-		if (fault & VM_FAULT_MAJOR)
-			current->maj_flt++;
-		else
-			current->min_flt++;
 		if (fault & VM_FAULT_RETRY) {
 			flags |= FAULT_FLAG_TRIED;
 
@@ -140,12 +137,7 @@ void do_page_fault(struct pt_regs *regs)
 	}
 
 	up_read(&mm->mmap_sem);
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-	if (flags & VM_FAULT_MAJOR)
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
-	else
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
-
+	mm_fault_accounting(current, regs, address, major);
 	return;
 
 	/* Something tried to access memory that isn't in our memory map..
-- 
2.26.2
On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>
> Provide this helper for doing memory page fault accounting across archs. It
> can be defined unconditionally because perf_sw_event() is always defined, and
> perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.
Well, the downside is that now it forces a separate I$ miss and all
those extra arguments because it's an out-of-line function and the
compiler won't see that they all go away.
Yeah, maybe some day maybe we'll have LTO and these kinds of things
will not matter. And maybe they already don't. But it seems kind of
sad to basically force non-optimal code generation from this series.
Why would you export the symbol, btw? Page fault handling is never a module.
Linus
On Mon, Jun 15, 2020 at 3:23 PM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Chris Zankel <chris@zankel.net>
> CC: Max Filippov <jcmvbkbc@gmail.com>
> CC: linux-xtensa@linux-xtensa.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> arch/xtensa/mm/fault.c | 14 +++-----------
> 1 file changed, 3 insertions(+), 11 deletions(-)
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
--
Thanks.
-- Max
On Mon, Jun 15, 2020 at 03:32:40PM -0700, Linus Torvalds wrote:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Provide this helper for doing memory page fault accounting across archs.  It
> > can be defined unconditionally because perf_sw_event() is always defined, and
> > perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.
>
> Well, the downside is that now it forces a separate I$ miss and all
> those extra arguments because it's a out-of-line function and the
> compiler won't see that they all go away.
>
> Yeah, maybe some day maybe we'll have LTO and these kinds of things
> will not matter. And maybe they already don't. But it seems kind of
> sad to basically force non-optimal code generation from this series.

I tried to make it static inline firstly in linux/mm.h, however it'll need to
have linux/mm.h include linux/perf_event.h which seems to have created a loop
dependency of headers.  I verified current code will at least generate inlined
functions too for x86 (no mm_fault_accounting() in "objdump -t vmlinux") with
gcc10.

Another alternative is to make it a macro, it's just that I feel the function
definition is a bit cleaner.  Any further suggestions welcomed too.

> Why would you export the symbol, btw? Page fault handling is never a module.

I followed handle_mm_fault() which is exported too, since potentially
mm_fault_accounting() should always be called in the same context of
handle_mm_fault().  Or do you prefer me to drop it?

Thanks,

-- 
Peter Xu
On Mon, Jun 15, 2020 at 06:15:48PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> CC: Catalin Marinas <catalin.marinas@arm.com>
> CC: Will Deacon <will@kernel.org>
> CC: linux-arm-kernel@lists.infradead.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> arch/arm64/mm/fault.c | 17 ++---------------
> 1 file changed, 2 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index c9cedc0432d2..09af7d7a60ec 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -484,8 +484,6 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
> addr, esr, regs);
> }
>
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> -
> /*
> * As per x86, we may deadlock here. However, since the kernel only
> * validly references user space from well defined areas of the code,
> @@ -535,20 +533,9 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
> VM_FAULT_BADACCESS)))) {
> /*
> * Major/minor page fault accounting is only done
> - * once. If we go through a retry, it is extremely
> - * likely that the page will be found in page cache at
> - * that point.
> + * once.
> */
> - if (major) {
> - current->maj_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
> - addr);
> - } else {
> - current->min_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
> - addr);
> - }
> -
> + mm_fault_accounting(current, regs, address, major);
Please can you explain why it's ok to move the PERF_COUNT_SW_PAGE_FAULTS
update like this? Seems like a user-visible change to me, so some
justification would really help.
Will
Hi, Will,
On Tue, Jun 16, 2020 at 08:43:08AM +0100, Will Deacon wrote:
> Please can you explain why it's ok to move the PERF_COUNT_SW_PAGE_FAULTS
> update like this? Seems like a user-visible change to me, so some
> justification would really help.
Indeed this could be a functional change for PERF_COUNT_SW_PAGE_FAULTS on some
archs, e.g., for arm64, PERF_COUNT_SW_PAGE_FAULTS previously will also contain
accounting of severe errors where we go into the "no_context" path. However if
you see the other archs, it's not always true, for example, the xtensa arch
only accounts the correctly handled faults (arch/xtensa/mm/fault.c).
After I thought about this, I don't think it's extremely useful (or please
correct me if I missed something important) to use PERF_COUNT_SW_PAGE_FAULTS
for fatal error accountings among all those correct ones. After all they are
really extremely rare cases, and even if we got a sigbus for a process, we'll
normally get something dumped in dmesg so if we really want to capture the
error cases there should always be a better way (because by following things
like dmesg we can not only know how many error faults triggered, but also on
the details of the errors).
IOW, my understanding of users of PERF_COUNT_SW_PAGE_FAULTS is that they want
to trap normal/correct page faults, not really care about rare errors.
Then when I went back to think PERF_COUNT_SW_PAGE_FAULTS, it's really about:
A=PERF_COUNT_SW_PAGE_FAULTS
B=PERF_COUNT_SW_PAGE_FAULTS_MAJ
C=PERF_COUNT_SW_PAGE_FAULTS_MIN
And:
A=B+C
If that's the case (which is simple and clear), it's really helpful too that we
unify this definition across all the architectures, then it'll also be easier
for us to provide some helper like mm_fault_accounting() so that the accounting
can be managed in the general code rather than in arch-specific ways.
Thanks,
--
Peter Xu
On Mon, Jun 15, 2020 at 06:23:02PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Heiko Carstens <heiko.carstens@de.ibm.com>
> CC: Vasily Gorbik <gor@linux.ibm.com>
> CC: Christian Borntraeger <borntraeger@de.ibm.com>
> CC: linux-s390@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/s390/mm/fault.c | 21 +++++----------------
>  1 file changed, 5 insertions(+), 16 deletions(-)
>
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index dedc28be27ab..8ca207635b59 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -392,7 +392,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	unsigned long trans_exc_code;
>  	unsigned long address;
>  	unsigned int flags;
> -	vm_fault_t fault;
> +	vm_fault_t fault, major = 0;
>
>  	tsk = current;
>  	/*
> @@ -428,7 +428,6 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	}
>
>  	address = trans_exc_code & __FAIL_ADDR_MASK;
> -	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  	flags = FAULT_FLAG_DEFAULT;
>  	if (user_mode(regs))
>  		flags |= FAULT_FLAG_USER;
> @@ -480,6 +479,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	 * the fault.
>  	 */
>  	fault = handle_mm_fault(vma, address, flags);
> +	major |= fault & VM_FAULT_MAJOR;
>  	if (fault_signal_pending(fault, regs)) {
>  		fault = VM_FAULT_SIGNAL;
>  		if (flags & FAULT_FLAG_RETRY_NOWAIT)
> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  	if (unlikely(fault & VM_FAULT_ERROR))
>  		goto out_up;
>
> -	/*
> -	 * Major/minor page fault accounting is only done on the
> -	 * initial attempt. If we go through a retry, it is extremely
> -	 * likely that the page will be found in page cache at that point.
> -	 */
>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> -		if (fault & VM_FAULT_MAJOR) {
> -			tsk->maj_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> -				      regs, address);
> -		} else {
> -			tsk->min_flt++;
> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> -				      regs, address);
> -		}
>  		if (fault & VM_FAULT_RETRY) {
>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {

Seems like the call to mm_fault_accounting() will be missed if
we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
jumps to "out_up"...

> @@ -519,6 +505,9 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>  			goto retry;
>  		}
>  	}
> +
> +	mm_fault_accounting(tsk, regs, address, major);
> +
>  	if (IS_ENABLED(CONFIG_PGSTE) && gmap) {
>  		address = __gmap_link(gmap, current->thread.gmap_addr,
>  				      address);
> -- 
> 2.26.2
>
Hi, Alexander,
On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> > @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> > if (unlikely(fault & VM_FAULT_ERROR))
> > goto out_up;
> >
> > - /*
> > - * Major/minor page fault accounting is only done on the
> > - * initial attempt. If we go through a retry, it is extremely
> > - * likely that the page will be found in page cache at that point.
> > - */
> > if (flags & FAULT_FLAG_ALLOW_RETRY) {
> > - if (fault & VM_FAULT_MAJOR) {
> > - tsk->maj_flt++;
> > - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> > - regs, address);
> > - } else {
> > - tsk->min_flt++;
> > - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> > - regs, address);
> > - }
> > if (fault & VM_FAULT_RETRY) {
> > if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> > (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>
> Seems like the call to mm_fault_accounting() will be missed if
> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> jumps to "out_up"...
This is true as a functional change. However that also means that we've got a
VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
than handled correctly (for instance, due to some try_lock failed during the
fault process).
To me, that case should not be counted as a page fault at all? Or we might get
the same duplicated accounting when the page fault retried from a higher stack.
Thanks,
--
Peter Xu
On Mon, Jun 15, 2020 at 06:15:57PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
> Also, the perf events for page faults will be accounted too when the config has
> CONFIG_PERF_EVENTS defined.
>
> CC: Jonas Bonn <jonas@southpole.se>
> CC: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
> CC: Stafford Horne <shorne@gmail.com>
> CC: openrisc@lists.librecores.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Stafford Horne <shorne@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 2788 bytes --]

On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>
> This series tries to address all of them by introducing mm_fault_accounting()
> first, so that we move all the page fault accounting into the common code base,
> then call it properly from arch pf handlers just like handle_mm_fault().

Hmm. So having looked at this a bit more, I'd actually like to go even
further, and just get rid of the per-architecture code _entirely_.

Here's a straw-man patch to the generic code - the idea is mostly laid
out in the comment that I'm just quoting here directly too:

        /*
         * Do accounting in the common code, to avoid unnecessary
         * architecture differences or duplicated code.
         *
         * We arbitrarily make the rules be:
         *
         *  - faults that never even got here (because the address
         *    wasn't valid). That includes arch_vma_access_permitted()
         *    failing above.
         *
         *    So this is expressly not a "this many hardware page
         *    faults" counter. Use the hw profiling for that.
         *
         *  - incomplete faults (ie RETRY) do not count (see above).
         *    They will only count once completed.
         *
         *  - the fault counts as a "major" fault when the final
         *    successful fault is VM_FAULT_MAJOR, or if it was a
         *    retry (which implies that we couldn't handle it
         *    immediately previously).
         *
         *  - if the fault is done for GUP, regs will be NULL and
         *    no accounting will be done (but you _could_ pass in
         *    your own regs and it would be accounted to the thread
         *    doing the fault, not to the target!)
         */

the code itself in the patch is

 (a) pretty trivial and self-evident

 (b) INCOMPLETE

that (b) is worth noting: this patch won't compile on its own. It
intentionally leaves all the users without the new 'regs' argument,
because you obviously simply need to remove all the code that
currently tries to do any accounting.

Comments? This is a bigger change, but I think it might be worth it to
_really_ consolidate the major/minor logic.

One detail worth noting: I do wonder if we should put the

        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);

just in the arch code at the top of the fault handling, and consider
it entirely unrelated to the major/minor fault handling. The
major/minor faults fundamentally are about successes. But the plain
PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
things that never even get to this point at all.

I'm not convinced it's useful to have three SW events that are defined
to be A=B+C.

But this does *not* do that part. It's really just an RFC patch.

             Linus

[-- Attachment #2: patch --]
[-- Type: application/octet-stream, Size: 2918 bytes --]

 include/linux/mm.h |  3 ++-
 mm/memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..e82a604339c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1648,7 +1648,8 @@ int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags);
+			unsigned long address, unsigned int flags,
+			struct pt_regs *);
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git a/mm/memory.c b/mm/memory.c
index dc7f3543b1fd..f15056151fb1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -72,6 +72,7 @@
 #include <linux/oom.h>
 #include <linux/numa.h>
+#include <linux/perf_event.h>
 #include <trace/events/kmem.h>
 
 #include <asm/io.h>
@@ -4353,7 +4354,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
-		unsigned int flags)
+		unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
 
@@ -4394,6 +4395,49 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		mem_cgroup_oom_synchronize(false);
 	}
 
+	if (ret & VM_FAULT_RETRY)
+		return ret;
+
+	/*
+	 * Do accounting in the common code, to avoid unnecessary
+	 * architecture differences or duplicated code.
+	 *
+	 * We arbitrarily make the rules be:
+	 *
+	 *  - faults that never even got here (because the address
+	 *    wasn't valid). That includes arch_vma_access_permitted()
+	 *    failing above.
+	 *
+	 *    So this is expressly not a "this many hardware page
+	 *    faults" counter. Use the hw profiling for that.
+	 *
+	 *  - incomplete faults (ie RETRY) do not count (see above).
+	 *    They will only count once completed.
+	 *
+	 *  - the fault counts as a "major" fault when the final
+	 *    successful fault is VM_FAULT_MAJOR, or if it was a
+	 *    retry (which implies that we couldn't handle it
+	 *    immediately previously).
+	 *
+	 *  - if the fault is done for GUP, regs will be NULL and
+	 *    no accounting will be done (but you _could_ pass in
+	 *    your own regs and it would be accounted to the thread
+	 *    doing the fault, not to the target!)
+	 */
+
+	if (!regs)
+		return ret;
+
+	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
+	if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
+		current->maj_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
+	} else {
+		current->min_flt++;
+		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
+	}
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(handle_mm_fault);
On Mon, 15 Jun 2020 19:19:17 -0400 Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jun 15, 2020 at 03:32:40PM -0700, Linus Torvalds wrote:
> > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Provide this helper for doing memory page fault accounting across archs. It
> > > can be defined unconditionally because perf_sw_event() is always defined, and
> > > perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.
> >
> > Well, the downside is that now it forces a separate I$ miss and all
> > those extra arguments because it's a out-of-line function and the
> > compiler won't see that they all go away.
> >
> > Yeah, maybe some day maybe we'll have LTO and these kinds of things
> > will not matter. And maybe they already don't. But it seems kind of
> > sad to basically force non-optimal code generation from this series.
>
> I tried to make it static inline firstly in linux/mm.h, however it'll need to
> have linux/mm.h include linux/perf_event.h which seems to have created a loop
> dependency of headers.  I verified current code will at least generate inlined
> functions too for x86 (no mm_fault_accounting() in "objdump -t vmlinux") with
> gcc10.
>
> Another alternative is to make it a macro, it's just that I feel the function
> definition is a bit cleaner.  Any further suggestions welcomed too.

Could create a new header file mm_fault.h which includes mm.h and
perf_event.h.  A later cleanup could move other fault-related things
into that header and add the appropriate inclusions into files which
use these things.

btw, I think mm_account_fault() might be a better name for this function.

And some (kerneldoc) documentation would be nice.  Although this
function is pretty self-evident.

> >
> > Why would you export the symbol, btw? Page fault handling is never a module.
>
> I followed handle_mm_fault() which is exported too, since potentially
> mm_fault_accounting() should always be called in the same context of
> handle_mm_fault().  Or do you prefer me to drop it?

Let's not add an unneeded export.  If someone for some reason needs it
later, it can be added then.
On Tue, Jun 16, 2020 at 11:55:17AM -0700, Linus Torvalds wrote:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > This series tries to address all of them by introducing mm_fault_accounting()
> > first, so that we move all the page fault accounting into the common code base,
> > then call it properly from arch pf handlers just like handle_mm_fault().
>
> Hmm.
>
> So having looked at this a bit more, I'd actually like to go even
> further, and just get rid of the per-architecture code _entirely_.
>
> Here's a straw-man patch to the generic code - the idea is mostly laid
> out in the comment that I'm just quoting here directly too:
>
>         /*
>          * Do accounting in the common code, to avoid unnecessary
>          * architecture differences or duplicated code.
>          *
>          * We arbitrarily make the rules be:
>          *
>          *  - faults that never even got here (because the address
>          *    wasn't valid). That includes arch_vma_access_permitted()
>          *    failing above.
>          *
>          *    So this is expressly not a "this many hardware page
>          *    faults" counter. Use the hw profiling for that.
>          *
>          *  - incomplete faults (ie RETRY) do not count (see above).
>          *    They will only count once completed.
>          *
>          *  - the fault counts as a "major" fault when the final
>          *    successful fault is VM_FAULT_MAJOR, or if it was a
>          *    retry (which implies that we couldn't handle it
>          *    immediately previously).
>          *
>          *  - if the fault is done for GUP, regs will be NULL and
>          *    no accounting will be done (but you _could_ pass in
>          *    your own regs and it would be accounted to the thread
>          *    doing the fault, not to the target!)
>          */
>
> the code itself in the patch is
>
>  (a) pretty trivial and self-evident
>
>  (b) INCOMPLETE
>
> that (b) is worth noting: this patch won't compile on its own. It
> intentionally leaves all the users without the new 'regs' argument,
> because you obviously simply need to remove all the code that
> currently tries to do any accounting.
>
> Comments?

Looks clean to me.

The definition of "major faults" will slightly change even for those who have
done the "major |= fault & MAJOR" operations before, but at least I can't see
anything bad about that either...

To make things easier, we can use the 1st patch to introduce this change, but
pass regs==NULL at the callers so the accounting is never triggered.  Then we
can still use one patch for each arch to do the final conversions.

> This is a bigger change, but I think it might be worth it to _really_
> consolidate the major/minor logic.
>
> One detail worth noting: I do wonder if we should put the
>
>         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>
> just in the arch code at the top of the fault handling, and consider
> it entirely unrelated to the major/minor fault handling. The
> major/minor faults fundamentally are about successes. But the plain
> PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> things that never even get to this point at all.
>
> I'm not convinced it's useful to have three SW events that are defined
> to be A=B+C.

IMHO it's still common to have a "total" statistic in software even if each of
the subsets is accounted separately.  Here it's just a bit special because
there are only two elements, so the addition is straightforward.

It seems a trade-off between accounting erroneous faults and making the code
cleaner by putting everything together but counting only successful page
faults.  I slightly prefer the latter because I failed to find great
usefulness in keeping error fault accountings, but no strong opinions..

Thanks,

--
Peter Xu
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
>> This series tries to address all of them by introducing mm_fault_accounting()
>> first, so that we move all the page fault accounting into the common code base,
>> then call it properly from arch pf handlers just like handle_mm_fault().
>
> Hmm.
>
> So having looked at this a bit more, I'd actually like to go even
> further, and just get rid of the per-architecture code _entirely_.

<snip>

> One detail worth noting: I do wonder if we should put the
>
>         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
>
> just in the arch code at the top of the fault handling, and consider
> it entirely unrelated to the major/minor fault handling. The
> major/minor faults fundamentally are about successes. But the plain
> PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> things that never even get to this point at all.

Yeah I think we should keep it in the arch code at roughly the top.

If it's moved to the end you could have a process spinning taking bad
page faults (and fixing them up), and see no sign of it from the perf
page fault counters.

cheers
Peter Xu <peterx@redhat.com> wrote on Tue, Jun 16, 2020 at 6:16 AM:
>
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
> While at it, move the accounting after releasing mmap_sem.
>
> CC: Nick Hu <nickhu@andestech.com>
> CC: Greentime Hu <green.hu@gmail.com>
> CC: Vincent Chen <deanbo422@gmail.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> arch/nds32/mm/fault.c | 19 +++----------------
> 1 file changed, 3 insertions(+), 16 deletions(-)
>
> diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c
> index f331e533edc2..11537c03ab1d 100644
> --- a/arch/nds32/mm/fault.c
> +++ b/arch/nds32/mm/fault.c
> @@ -78,7 +78,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
> struct mm_struct *mm;
> struct vm_area_struct *vma;
> int si_code;
> - vm_fault_t fault;
> + vm_fault_t fault, major = 0;
> unsigned int mask = VM_ACCESS_FLAGS;
> unsigned int flags = FAULT_FLAG_DEFAULT;
>
> @@ -208,6 +208,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
> */
>
> fault = handle_mm_fault(vma, addr, flags);
> + major |= fault & VM_FAULT_MAJOR;
>
> /*
> * If we need to retry but a fatal signal is pending, handle the
> @@ -229,22 +230,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
> goto bad_area;
> }
>
> - /*
> - * Major/minor page fault accounting is only done on the initial
> - * attempt. If we go through a retry, it is extremely likely that the
> - * page will be found in page cache at that point.
> - */
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> - if (fault & VM_FAULT_MAJOR) {
> - tsk->maj_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
> - 1, regs, addr);
> - } else {
> - tsk->min_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
> - 1, regs, addr);
> - }
> if (fault & VM_FAULT_RETRY) {
> flags |= FAULT_FLAG_TRIED;
>
> @@ -257,6 +243,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
> }
>
> up_read(&mm->mmap_sem);
> + mm_fault_accounting(tsk, regs, addr, major);
> return;
>
> /*
> --
Hi Peter,
Thank you. :)
Acked-by: Greentime Hu <green.hu@gmail.com>
On 16.06.20 18:35, Peter Xu wrote:
> Hi, Alexander,
>
> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>> if (unlikely(fault & VM_FAULT_ERROR))
>>> goto out_up;
>>>
>>> - /*
>>> - * Major/minor page fault accounting is only done on the
>>> - * initial attempt. If we go through a retry, it is extremely
>>> - * likely that the page will be found in page cache at that point.
>>> - */
>>> if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>> - if (fault & VM_FAULT_MAJOR) {
>>> - tsk->maj_flt++;
>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
>>> - regs, address);
>>> - } else {
>>> - tsk->min_flt++;
>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
>>> - regs, address);
>>> - }
>>> if (fault & VM_FAULT_RETRY) {
>>> if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>>> (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>>
>> Seems like the call to mm_fault_accounting() will be missed if
>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
>> jumps to "out_up"...
>
> This is true as a functional change. However that also means that we've got a
> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> than handled correctly (for instance, due to some try_lock failed during the
> fault process).
>
> To me, that case should not be counted as a page fault at all? Or we might get
> the same duplicated accounting when the page fault retried from a higher stack.
>
> Thanks
This case below (the one with the gmap) is the KVM case for doing a so called
pseudo page fault to our guests. (we notify our guests about major host page
faults and let it reschedule to something else instead of halting the vcpu).
This is being resolved with either gup or fixup_user_fault asynchronously by
KVM code (this can also be sync when the guest does not match some conditions)
We do not change the counters in that code as far as I can tell so we should
continue to do it here.
(see arch/s390/kvm/kvm-s390.c
static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
{
[...]
} else if (current->thread.gmap_pfault) {
trace_kvm_s390_major_guest_pfault(vcpu);
current->thread.gmap_pfault = 0;
if (kvm_arch_setup_async_pf(vcpu))
return 0;
return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
}
Hi Peter,

On Tue, Jun 16, 2020 at 6:16 AM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new mm_fault_accounting() helper for page fault accounting.
> Also, move the accounting to be after release of mmap_sem.
>
> CC: Guo Ren <guoren@kernel.org>
> CC: linux-csky@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/csky/mm/fault.c | 13 +------------
>  1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
> index 4e6dc68f3258..8f8d34d27eca 100644
> --- a/arch/csky/mm/fault.c
> +++ b/arch/csky/mm/fault.c
> @@ -111,8 +111,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
>  		return;
>  	}
>  #endif
> -
> -	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>  	/*
>  	 * If we're in an interrupt or have no user
>  	 * context, we must not take the fault..
> @@ -160,17 +158,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
>  			goto bad_area;
>  		BUG();
>  	}
> -	if (fault & VM_FAULT_MAJOR) {
> -		tsk->maj_flt++;
> -		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
> -			      address);
> -	} else {
> -		tsk->min_flt++;
> -		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
> -			      address);
> -	}
> -
>  	up_read(&mm->mmap_sem);
> +	mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR);
>  	return;
>
>  /*
> --
> 2.26.2
>

I notice that you move it out of mm->mmap_sem's area; should all archs follow
that rule?  Can you give me a clarification and put it into the commit log?

Thx

--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/
On Wed, Jun 17, 2020 at 10:55:14AM +1000, Michael Ellerman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> >> This series tries to address all of them by introducing mm_fault_accounting()
> >> first, so that we move all the page fault accounting into the common code base,
> >> then call it properly from arch pf handlers just like handle_mm_fault().
> >
> > Hmm.
> >
> > So having looked at this a bit more, I'd actually like to go even
> > further, and just get rid of the per-architecture code _entirely_.
>
> <snip>
>
> > One detail worth noting: I do wonder if we should put the
> >
> >         perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> >
> > just in the arch code at the top of the fault handling, and consider
> > it entirely unrelated to the major/minor fault handling. The
> > major/minor faults fundamentally are about successes. But the plain
> > PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> > things that never even get to this point at all.
>
> Yeah I think we should keep it in the arch code at roughly the top.

I agree.  It's a nice idea to consolidate the code, but I don't see that
it's really possible for PERF_COUNT_SW_PAGE_FAULTS without significantly
changing the semantics (and in a potentially less useful way).  Of course,
moving more of do_page_fault() out of the arch code would be great, but
that's a much bigger effort.

> If it's moved to the end you could have a process spinning taking bad
> page faults (and fixing them up), and see no sign of it from the perf
> page fault counters.

The current arm64 behaviour is that we record PERF_COUNT_SW_PAGE_FAULTS
if _all_ of the following are true:

  1. The fault isn't handled by kprobes
  2. The pagefault handler is enabled
  3. We have an mm (current->mm)
  4. The fault isn't an unexpected kernel fault on a user address (we oops
     and kill the task in this case)

Which loosely corresponds to "we took a fault on a user address that it
looks like we can handle".

That said, I'm happy to tweak this if it brings us into line with other
architectures.

Will
On Wed, Jun 17, 2020 at 03:04:49PM +0800, Guo Ren wrote:
> Hi Peter,

Hi, Guo,

> On Tue, Jun 16, 2020 at 6:16 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Use the new mm_fault_accounting() helper for page fault accounting.
> > Also, move the accounting to be after release of mmap_sem.
> >
> > CC: Guo Ren <guoren@kernel.org>
> > CC: linux-csky@vger.kernel.org
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/csky/mm/fault.c | 13 +------------
> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
> > index 4e6dc68f3258..8f8d34d27eca 100644
> > --- a/arch/csky/mm/fault.c
> > +++ b/arch/csky/mm/fault.c
> > @@ -111,8 +111,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
> >  		return;
> >  	}
> >  #endif
> > -
> > -	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> >  	/*
> >  	 * If we're in an interrupt or have no user
> >  	 * context, we must not take the fault..
> > @@ -160,17 +158,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
> >  			goto bad_area;
> >  		BUG();
> >  	}
> > -	if (fault & VM_FAULT_MAJOR) {
> > -		tsk->maj_flt++;
> > -		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
> > -			      address);
> > -	} else {
> > -		tsk->min_flt++;
> > -		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
> > -			      address);
> > -	}
> > -
> >  	up_read(&mm->mmap_sem);
> > +	mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR);
> >  	return;
> >
> >  /*
> > --
> > 2.26.2
> >
> I notice that you move it out of mm->mmap_sem's area; should all archs
> follow that rule?  Can you give me a clarification and put it into the
> commit log?

I don't think it's a must, but mmap_sem should not be required at least by
observing the current code.  E.g., do_user_addr_fault() of x86 does the
accounting without mmap_sem even before this series.  The perf events should
be thread safe on their own.

Frankly speaking, I'm not very certain about my understanding of the per-task
counters, because iiuc we can also try to get user pages remotely of a thread
in parallel with a page fault happening on the same thread; then it seems to
me that the per-task pf counters can be accessed on different cores
simultaneously.  However, otoh that seems to be very rare, and it's still
acceptable to me as a trade-off to avoid the overhead of locks or atomic ops
on the counters.  I'd be glad to be corrected if I missed anything important
here...

Thanks,

--
Peter Xu
Hi, Christian,

On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
>
>
> On 16.06.20 18:35, Peter Xu wrote:
> > Hi, Alexander,
> >
> > On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>>  	if (unlikely(fault & VM_FAULT_ERROR))
> >>>  		goto out_up;
> >>>
> >>> -	/*
> >>> -	 * Major/minor page fault accounting is only done on the
> >>> -	 * initial attempt. If we go through a retry, it is extremely
> >>> -	 * likely that the page will be found in page cache at that point.
> >>> -	 */
> >>>  	if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>> -		if (fault & VM_FAULT_MAJOR) {
> >>> -			tsk->maj_flt++;
> >>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>> -				      regs, address);
> >>> -		} else {
> >>> -			tsk->min_flt++;
> >>> -			perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>> -				      regs, address);
> >>> -		}
> >>>  		if (fault & VM_FAULT_RETRY) {
> >>>  			if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>>  			    (flags & FAULT_FLAG_RETRY_NOWAIT)) {

[1]

> >>
> >> Seems like the call to mm_fault_accounting() will be missed if
> >> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >> jumps to "out_up"...
> >
> > This is true as a functional change.  However that also means that we've got a
> > VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> > than handled correctly (for instance, due to some try_lock failed during the
> > fault process).
> >
> > To me, that case should not be counted as a page fault at all?  Or we might get
> > the same duplicated accounting when the page fault retried from a higher stack.
> >
> > Thanks
>
> This case below (the one with the gmap) is the KVM case for doing a so called
> pseudo page fault to our guests. (we notify our guests about major host page
> faults and let it reschedule to something else instead of halting the vcpu).
> This is being resolved with either gup or fixup_user_fault asynchronously by
> KVM code (this can also be sync when the guest does not match some conditions)
> We do not change the counters in that code as far as I can tell so we should
> continue to do it here.
>
> (see arch/s390/kvm/kvm-s390.c
> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> {
> [...]
> 	} else if (current->thread.gmap_pfault) {
> 		trace_kvm_s390_major_guest_pfault(vcpu);
> 		current->thread.gmap_pfault = 0;
> 		if (kvm_arch_setup_async_pf(vcpu))
> 			return 0;
> 		return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
> 	}

Please correct me if I'm wrong... but I still think what this patch does is the
right thing to do.

Note again that IMHO when reached [1] above it means the page fault is not
handled correctly so we need to fallback to KVM async page fault, then we
shouldn't increment the accountings until it's finally handled correctly.  That
final accounting should be done in the async pf path in gup code where the page
fault is handled:

  kvm_arch_fault_in_page
    gmap_fault
      fixup_user_fault

Where in fixup_user_fault() we have:

	if (tsk) {
		if (major)
			tsk->maj_flt++;
		else
			tsk->min_flt++;
	}

Thanks,

--
Peter Xu
On Wed, Jun 17, 2020 at 09:04:06AM +0100, Will Deacon wrote:
> On Wed, Jun 17, 2020 at 10:55:14AM +1000, Michael Ellerman wrote:
> > Linus Torvalds <torvalds@linux-foundation.org> writes:
> > > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> > >> This series tries to address all of them by introducing mm_fault_accounting()
> > >> first, so that we move all the page fault accounting into the common code base,
> > >> then call it properly from arch pf handlers just like handle_mm_fault().
> > >
> > > Hmm.
> > >
> > > So having looked at this a bit more, I'd actually like to go even
> > > further, and just get rid of the per-architecture code _entirely_.
> >
> > <snip>
> >
> > > One detail worth noting: I do wonder if we should put the
> > >
> > > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> > >
> > > just in the arch code at the top of the fault handling, and consider
> > > it entirely unrelated to the major/minor fault handling. The
> > > major/minor faults fundamentally are about successes. But the plain
> > > PERF_COUNT_SW_PAGE_FAULTS could be about things that fail, including
> > > things that never even get to this point at all.
> >
> > Yeah I think we should keep it in the arch code at roughly the top.
>
> I agree. It's a nice idea to consolidate the code, but I don't see that
> it's really possible for PERF_COUNT_SW_PAGE_FAULTS without significantly
> changing the semantics (and in a potentially less useful way). Of course,
> moving more of do_page_fault() out of the arch code would be great, but
> that's a much bigger effort.
>
> > If it's moved to the end you could have a process spinning taking bad
> > page faults (and fixing them up), and see no sign of it from the perf
> > page fault counters.
>
> The current arm64 behaviour is that we record PERF_COUNT_SW_PAGE_FAULTS
> if _all_ of the following are true:
>
> 1. The fault isn't handled by kprobes
> 2. The pagefault handler is enabled
> 3. We have an mm (current->mm)
> 4. The fault isn't an unexpected kernel fault on a user address (we oops
> and kill the task in this case)
>
> Which loosely corresponds to "we took a fault on a user address that it
> looks like we can handle".
>
> That said, I'm happy to tweak this if it brings us into line with other
> architectures.
I see. I'll keep the semantics for PERF_COUNT_SW_PAGE_FAULTS in the next
version. Thanks for all the comments!
--
Peter Xu
On 17.06.20 18:06, Peter Xu wrote:
> Hi, Christian,
>
> On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
>>
>>
>> On 16.06.20 18:35, Peter Xu wrote:
>>> Hi, Alexander,
>>>
>>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
>>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>>>>> if (unlikely(fault & VM_FAULT_ERROR))
>>>>> goto out_up;
>>>>>
>>>>> - /*
>>>>> - * Major/minor page fault accounting is only done on the
>>>>> - * initial attempt. If we go through a retry, it is extremely
>>>>> - * likely that the page will be found in page cache at that point.
>>>>> - */
>>>>> if (flags & FAULT_FLAG_ALLOW_RETRY) {
>>>>> - if (fault & VM_FAULT_MAJOR) {
>>>>> - tsk->maj_flt++;
>>>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
>>>>> - regs, address);
>>>>> - } else {
>>>>> - tsk->min_flt++;
>>>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
>>>>> - regs, address);
>>>>> - }
>>>>> if (fault & VM_FAULT_RETRY) {
>>>>> if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
>>>>> (flags & FAULT_FLAG_RETRY_NOWAIT)) {
>
> [1]
>
>>>>
>>>> Seems like the call to mm_fault_accounting() will be missed if
>>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
>>>> jumps to "out_up"...
>>>
>>> This is true as a functional change. However that also means that we've got a
>>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
>>> than handled correctly (for instance, due to some try_lock failed during the
>>> fault process).
>>>
>>> To me, that case should not be counted as a page fault at all? Or we might get
>>> the same duplicated accounting when the page fault retried from a higher stack.
>>>
>>> Thanks
>>
>> This case below (the one with the gmap) is the KVM case for doing a so called
>> pseudo page fault to our guests. (we notify our guests about major host page
>> faults and let it reschedule to something else instead of halting the vcpu).
>> This is being resolved with either gup or fixup_user_fault asynchronously by
>> KVM code (this can also be sync when the guest does not match some conditions)
>> We do not change the counters in that code as far as I can tell so we should
>> continue to do it here.
>>
>> (see arch/s390/kvm/kvm-s390.c
>> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
>> {
>> [...]
>> } else if (current->thread.gmap_pfault) {
>> trace_kvm_s390_major_guest_pfault(vcpu);
>> current->thread.gmap_pfault = 0;
>> if (kvm_arch_setup_async_pf(vcpu))
>> return 0;
>> return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
>> }
>
> Please correct me if I'm wrong... but I still think what this patch does is the
> right thing to do.
>
> Note again that IMHO when reached [1] above it means the page fault is not
> handled correctly so we need to fallback to KVM async page fault, then we
> shouldn't increment the accountings until it's finally handled correctly. That
> final accounting should be done in the async pf path in gup code where the page
> fault is handled:
>
> kvm_arch_fault_in_page
> gmap_fault
> fixup_user_fault
>
> Where in fixup_user_fault() we have:
>
> if (tsk) {
> if (major)
> tsk->maj_flt++;
> else
> tsk->min_flt++;
> }
>
Right, that case does work. It's the case where we do not inject a pseudo pagefault and
instead fall back to synchronous fault-in.
What is about the other case:
kvm_setup_async_pf
->workqueue
async_pf_execute
get_user_pages_remote
Does get_user_pages_remote do the accounting as well? I can't see that.
On Tue, Jun 16, 2020 at 12:00:37PM -0700, Andrew Morton wrote:
> On Mon, 15 Jun 2020 19:19:17 -0400 Peter Xu <peterx@redhat.com> wrote:
>
> > On Mon, Jun 15, 2020 at 03:32:40PM -0700, Linus Torvalds wrote:
> > > On Mon, Jun 15, 2020 at 3:16 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Provide this helper for doing memory page fault accounting across archs. It
> > > > can be defined unconditionally because perf_sw_event() is always defined, and
> > > > perf_sw_event() will be a no-op if !CONFIG_PERF_EVENTS.
> > >
> > > Well, the downside is that now it forces a separate I$ miss and all
> > > those extra arguments because it's a out-of-line function and the
> > > compiler won't see that they all go away.
> > >
> > > Yeah, maybe some day maybe we'll have LTO and these kinds of things
> > > will not matter. And maybe they already don't. But it seems kind of
> > > sad to basically force non-optimal code generation from this series.
> >
> > I tried to make it static inline firstly in linux/mm.h, however it'll need to
> > have linux/mm.h include linux/perf_event.h which seems to have created a loop
> > dependency of headers. I verified current code will at least generate inlined
> > functions too for x86 (no mm_fault_accounting() in "objdump -t vmlinux") with
> > gcc10.
> >
> > Another alternative is to make it a macro, it's just that I feel the function
> > definition is a bit cleaner. Any further suggestions welcomed too.
>
> Could create a new header file mm_fault.h which includes mm.h and
> perf_event.h. A later cleanup could move other fault-related things
> into that header and add the appropriate inclusions into files which
> use these things.
>
> btw, I think mm_account_fault() might be a better name for this function.
>
> And some (kerneldoc) documentation would be nice. Although this
> function is pretty self-evident.
>
> > >
> > > Why would you export the symbol, btw? Page fault handling is never a module.
> >
> > I followed handle_mm_fault() which is exported too, since potentially
> > mm_fault_accounting() should always be called in the same context of
> > handle_mm_fault(). Or do you prefer me to drop it?
>
> Let's not add an unneeded export. If someone for some reason needs it
> later, it can be added then.
I plan to take the approach that Linus suggested, probably with
mm_account_fault() declared as static inline in memory.c. I'll remember to add
some kerneldoc too.
Thanks!
--
Peter Xu
On Wed, Jun 17, 2020 at 06:14:52PM +0200, Christian Borntraeger wrote:
>
>
> On 17.06.20 18:06, Peter Xu wrote:
> > Hi, Christian,
> >
> > On Wed, Jun 17, 2020 at 08:19:29AM +0200, Christian Borntraeger wrote:
> >>
> >>
> >> On 16.06.20 18:35, Peter Xu wrote:
> >>> Hi, Alexander,
> >>>
> >>> On Tue, Jun 16, 2020 at 05:59:33PM +0200, Alexander Gordeev wrote:
> >>>>> @@ -489,21 +489,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
> >>>>> if (unlikely(fault & VM_FAULT_ERROR))
> >>>>> goto out_up;
> >>>>>
> >>>>> - /*
> >>>>> - * Major/minor page fault accounting is only done on the
> >>>>> - * initial attempt. If we go through a retry, it is extremely
> >>>>> - * likely that the page will be found in page cache at that point.
> >>>>> - */
> >>>>> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> >>>>> - if (fault & VM_FAULT_MAJOR) {
> >>>>> - tsk->maj_flt++;
> >>>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> >>>>> - regs, address);
> >>>>> - } else {
> >>>>> - tsk->min_flt++;
> >>>>> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> >>>>> - regs, address);
> >>>>> - }
> >>>>> if (fault & VM_FAULT_RETRY) {
> >>>>> if (IS_ENABLED(CONFIG_PGSTE) && gmap &&
> >>>>> (flags & FAULT_FLAG_RETRY_NOWAIT)) {
> >
> > [1]
> >
> >>>>
> >>>> Seems like the call to mm_fault_accounting() will be missed if
> >>>> we entered here with FAULT_FLAG_RETRY_NOWAIT flag set, since it
> >>>> jumps to "out_up"...
> >>>
> >>> This is true as a functional change. However that also means that we've got a
> >>> VM_FAULT_RETRY, which hints that this fault has been requested to retry rather
> >>> than handled correctly (for instance, due to some try_lock failed during the
> >>> fault process).
> >>>
> >>> To me, that case should not be counted as a page fault at all? Or we might get
> >>> the same duplicated accounting when the page fault retried from a higher stack.
> >>>
> >>> Thanks
> >>
> >> This case below (the one with the gmap) is the KVM case for doing a so called
> >> pseudo page fault to our guests. (we notify our guests about major host page
> >> faults and let it reschedule to something else instead of halting the vcpu).
> >> This is being resolved with either gup or fixup_user_fault asynchronously by
> >> KVM code (this can also be sync when the guest does not match some conditions)
> >> We do not change the counters in that code as far as I can tell so we should
> >> continue to do it here.
> >>
> >> (see arch/s390/kvm/kvm-s390.c
> >> static int vcpu_post_run(struct kvm_vcpu *vcpu, int exit_reason)
> >> {
> >> [...]
> >> } else if (current->thread.gmap_pfault) {
> >> trace_kvm_s390_major_guest_pfault(vcpu);
> >> current->thread.gmap_pfault = 0;
> >> if (kvm_arch_setup_async_pf(vcpu))
> >> return 0;
> >> return kvm_arch_fault_in_page(vcpu, current->thread.gmap_addr, 1);
> >> }
> >
> > Please correct me if I'm wrong... but I still think what this patch does is the
> > right thing to do.
> >
> > Note again that IMHO when reached [1] above it means the page fault is not
> > handled correctly so we need to fallback to KVM async page fault, then we
> > shouldn't increment the accountings until it's finally handled correctly. That
> > final accounting should be done in the async pf path in gup code where the page
> > fault is handled:
> >
> > kvm_arch_fault_in_page
> >   gmap_fault
> >     fixup_user_fault
> >
> > Where in fixup_user_fault() we have:
> >
> > 	if (tsk) {
> > 		if (major)
> > 			tsk->maj_flt++;
> > 		else
> > 			tsk->min_flt++;
> > 	}
> >
>
> Right, that case does work. It's the case where we do not inject a pseudo page
> fault and instead fall back to synchronous fault-in.
> What about the other case:
>
> kvm_setup_async_pf
>   -> workqueue
>     async_pf_execute
>       get_user_pages_remote
>
> Does get_user_pages_remote() do the accounting as well? I can't see that.
>
It should, with:

  get_user_pages_remote
    __get_user_pages_remote
      __get_user_pages_locked
        __get_user_pages
          faultin_page

Where the accounting is done in faultin_page().

We probably need to also move the accounting in faultin_page() to be after the
retry check too, but that should be irrelevant to this patch.
Thanks!
--
Peter Xu
On Wed, Jun 17, 2020 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>
> I don't think it's a must, but mmap_sem should not be required at least by
> observing current code. E.g., do_user_addr_fault() of x86 does the accounting
> without mmap_sem even before this series.
All the accounting should be per-thread and not need any locking.
Which is why a remote GUP should never account to the remote mm - not
only isn't there an unambiguous thread to account to (an mm can share
many threads), but it would require locking not just for the remote
update, but for all normal page faults.
Linus
On Wed, Jun 17, 2020 at 10:53:23AM -0700, Linus Torvalds wrote:
> On Wed, Jun 17, 2020 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > I don't think it's a must, but mmap_sem should not be required at least by
> > observing current code. E.g., do_user_addr_fault() of x86 does the accounting
> > without mmap_sem even before this series.
>
> All the accounting should be per-thread and not need any locking.
>
> Which is why a remote GUP should never account to the remote mm - not
> only isn't there an unambiguous thread to account to (an mm can share
> many threads), but it would require locking not just for the remote
> update, but for all normal page faults.
But currently remote GUP will still do the page fault accounting on the remote
task_struct, am I right? E.g., when the get_user_pages_remote() is called with
"tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
accounting for that remote task/thread?
Thanks,
--
Peter Xu
On Wed, Jun 17, 2020 at 12:58 PM Peter Xu <peterx@redhat.com> wrote:
>
> But currently remote GUP will still do the page fault accounting on the remote
> task_struct, am I right? E.g., when the get_user_pages_remote() is called with
> "tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
> accounting for that remote task/thread?
Well, that would be a data race and fundamentally buggy.
It would be ok with something like ptrace (which only works when the
target is quiescent), but is completely wrong otherwise.
I guess it works fine in practice, and it's only statistics so even if
you were to have a data race it doesn't much matter, but it's
definitely conceptually very very wrong.
The fault stats should be about who does the fault (they are about the
_thread_) not about who the fault is done to (which is about the
_mm_).
Allocating the fault data to somebody else sounds frankly silly and
stupid to me, exactly because it's (a) racy and (b) not even
conceptually correct. The other thread literally _isn't_ doing a major
page fault, for crissake!
Now, there are some actual per-mm statistics too (the rss stuff etc),
and it's fundamentally harder exactly because of the shared data. See
the mm_counter stuff etc. Those are not about who does something, they
are about the resulting MM state.
Linus
On Wed, Jun 17, 2020 at 01:15:31PM -0700, Linus Torvalds wrote:
> On Wed, Jun 17, 2020 at 12:58 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > But currently remote GUP will still do the page fault accounting on the remote
> > task_struct, am I right? E.g., when the get_user_pages_remote() is called with
> > "tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
> > accounting for that remote task/thread?
>
> Well, that would be a data race and fundamentally buggy.
>
> It would be ok with something like ptrace (which only works when the
> target is quiescent), but is completely wrong otherwise.
>
> I guess it works fine in practice, and it's only statistics so even if
> you were to have a data race it doesn't much matter, but it's
> definitely conceptually very very wrong.
>
> The fault stats should be about who does the fault (they are about the
> _thread_) not about who the fault is done to (which is about the
> _mm_).
>
> Allocating the fault data to somebody else sounds frankly silly and
> stupid to me, exactly because it's (a) racy and (b) not even
> conceptually correct. The other thread literally _isn't_ doing a major
> page fault, for crissake!
>
> Now, there are some actual per-mm statistics too (the rss stuff etc),
> and it's fundamentally harder exactly because of the shared data. See
> the mm_counter stuff etc. Those are not about who does something, they
> are about the resulting MM state.
I see. How about I move another small step further to cover GUP (to be
explicit, faultin_page() and fixup_user_fault())?
GUP needs the per-task accounting, but not the perf events. We can do that by
slightly changing the new approach into:
	bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);

	if (major)
		current->maj_flt++;
	else
		current->min_flt++;

	if (!regs)
		return ret;

	if (major)
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
	else
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
With the change we will always account the fault to the current task. There are
also GUP calls that are not accounted at all today, e.g., get_user_pages_remote()
with tsk==NULL; for these calls, the page faults will now be accounted to the
current task as well.
Another major benefit I can see (besides completely cleaned-up accounting) is
that we can remove the task_struct pointer from the whole GUP code, because
AFAICT it's only used for this page fault accounting, either in faultin_page()
or fixup_user_fault(), and not using current there seems questionable in
itself...
Any opinions?
Thanks,
--
Peter Xu
On Thu, Jun 18, 2020 at 7:38 AM Peter Xu <peterx@redhat.com> wrote:
>
> GUP needs the per-task accounting, but not the perf events. We can do that by
> slightly changing the new approach into:
>
> 	bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
>
> 	if (major)
> 		current->maj_flt++;
> 	else
> 		current->min_flt++;
>
> 	if (!regs)
> 		return ret;
>
> 	if (major)
> 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
> 	else
> 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
Ack, I think this is the right thing to do.
No normal situation will ever notice the difference, with remote
accesses being as rare and specialized as they are. But being able to
remove the otherwise unused 'tsk' parameter sounds like the right
thing to do too.
It might be worth adding a comment about why.
Also, honestly, how about we remove the 'major' variable entirely, and
instead make the code be something like
	unsigned long *flt;
	int event_type;
	...

	/* Major fault */
	if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
		flt = &current->maj_flt;
		event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
	} else {
		flt = &current->min_flt;
		event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
	}
	(*flt)++;
	if (regs)
		perf_sw_event(event_type, 1, regs, address);
instead. Less source code duplication, and I bet it improves code
generation too.
Linus
On Thu, Jun 18, 2020 at 10:15:50AM -0700, Linus Torvalds wrote:
> On Thu, Jun 18, 2020 at 7:38 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > GUP needs the per-task accounting, but not the perf events. We can do that by
> > slightly changing the new approach into:
> >
> > bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
> >
> > if (major)
> > current->maj_flt++;
> > else
> > current->min_flt++;
> >
> > if (!regs)
> > return ret;
> >
> > if (major)
> > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
> > else
> > perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
>
> Ack, I think this is the right thing to do.
>
> No normal situation will ever notice the difference, with remote
> accesses being as rare and specialized as they are. But being able to
> remove the otherwise unused 'tsk' parameter sounds like the right
> thing to do too.
>
> It might be worth adding a comment about why.
>
> Also, honestly, how about we remove the 'major' variable entirely, and
> instead make the code be something like
>
> 	unsigned long *flt;
> 	int event_type;
> 	...
>
> 	/* Major fault */
> 	if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
> 		flt = &current->maj_flt;
> 		event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
> 	} else {
> 		flt = &current->min_flt;
> 		event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
> 	}
> 	(*flt)++;
> 	if (regs)
> 		perf_sw_event(event_type, 1, regs, address);
>
> instead. Less source code duplication, and I bet it improves code
> generation too.
Will do. Thanks!
--
Peter Xu
On Thu, Jun 18, 2020 at 05:24:30PM -0400, Peter Xu wrote:
> > 	/* Major fault */
> > 	if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
> > 		flt = &current->maj_flt;
> > 		event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
> > 	} else {
> > 		flt = &current->min_flt;
> > 		event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
> > 	}
> > 	(*flt)++;
> > 	if (regs)
> > 		perf_sw_event(event_type, 1, regs, address);
Sadly, this line seems to fail the compilation:
CC mm/memory.o
In file included from ././include/linux/compiler_types.h:68,
from <command-line>:
./arch/x86/include/asm/jump_label.h: In function ‘handle_mm_fault’:
./include/linux/compiler-gcc.h:120:38: warning: ‘asm’ operand 0 probably does not match constraints
120 | #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
| ^~~
./arch/x86/include/asm/jump_label.h:25:2: note: in expansion of macro ‘asm_volatile_goto’
25 | asm_volatile_goto("1:"
| ^~~~~~~~~~~~~~~~~
./include/linux/compiler-gcc.h:120:38: error: impossible constraint in ‘asm’
120 | #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
| ^~~
./arch/x86/include/asm/jump_label.h:25:2: note: in expansion of macro ‘asm_volatile_goto’
25 | asm_volatile_goto("1:"
| ^~~~~~~~~~~~~~~~~
make[1]: *** [scripts/Makefile.build:267: mm/memory.o] Error 1
make: *** [Makefile:1729: mm] Error 2
Frankly speaking I have no solid understanding of what the error is about... But
my gut feeling is that it's about the static keys, where perf_sw_event() may
only support constant event types (otherwise we won't be able to know whether to
patch the instruction with a no-op or a jump?).
I'll go back to the simple version for now, until I know a solution.
Thanks,
--
Peter Xu
On Thu, Jun 18, 2020 at 3:28 PM Peter Xu <peterx@redhat.com> wrote:
>
> > > if (regs)
> > > perf_sw_event(event_type, 1, regs, address);
>
> Sadly, this line seems to fail the compilation:
Yeah, I should have known that. We require a constant event ID,
because it uses the magical static keys.
So never mind. The perf_sw_event() calls can't be shared for two different IDs.
Linus
On Mon, 15 Jun 2020 15:16:00 PDT (-0700), peterx@redhat.com wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Paul Walmsley <paul.walmsley@sifive.com>
> CC: Palmer Dabbelt <palmer@dabbelt.com>
> CC: Albert Ou <aou@eecs.berkeley.edu>
> CC: linux-riscv@lists.infradead.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> arch/riscv/mm/fault.c | 21 +++------------------
> 1 file changed, 3 insertions(+), 18 deletions(-)
>
> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> index be84e32adc4c..9262338614d1 100644
> --- a/arch/riscv/mm/fault.c
> +++ b/arch/riscv/mm/fault.c
> @@ -30,7 +30,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
> struct vm_area_struct *vma;
> struct mm_struct *mm;
> unsigned long addr, cause;
> - unsigned int flags = FAULT_FLAG_DEFAULT;
> + unsigned int flags = FAULT_FLAG_DEFAULT, major = 0;
> int code = SEGV_MAPERR;
> vm_fault_t fault;
>
> @@ -65,9 +65,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>
> if (user_mode(regs))
> flags |= FAULT_FLAG_USER;
> -
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
> -
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, addr);
> @@ -111,6 +108,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
> * the fault.
> */
> fault = handle_mm_fault(vma, addr, flags);
> + major |= fault & VM_FAULT_MAJOR;
>
> /*
> * If we need to retry but a fatal signal is pending, handle the
> @@ -128,21 +126,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
> BUG();
> }
>
> - /*
> - * Major/minor page fault accounting is only done on the
> - * initial attempt. If we go through a retry, it is extremely
> - * likely that the page will be found in page cache at that point.
> - */
> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> - if (fault & VM_FAULT_MAJOR) {
> - tsk->maj_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
> - 1, regs, addr);
> - } else {
> - tsk->min_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
> - 1, regs, addr);
> - }
> if (fault & VM_FAULT_RETRY) {
> flags |= FAULT_FLAG_TRIED;
>
> @@ -156,6 +140,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
> }
>
> up_read(&mm->mmap_sem);
> + mm_fault_accounting(tsk, regs, addr, major);
> return;
>
> /*
AFAICT this changes the behavior of the perf event: it used to count any fault,
whereas now it only counts those that succeed. If everyone else
is doing it that way then I'm happy to change us over, but this definitely
isn't just avoiding retries.
On Thu, Jun 18, 2020 at 04:49:23PM -0700, Palmer Dabbelt wrote:
> AFAICT this changes the behavior of the perf event: it used to count any fault,
> whereas now it only counts those that succeed. If everyone else
> is doing it that way then I'm happy to change us over, but this definitely
> isn't just avoiding retries.
Right, I'm preparing v2 to keep the old behavior of PERF_COUNT_SW_PAGE_FAULTS,
with a nicer approach.
Thanks for looking!
--
Peter Xu
On Mon, Jun 15, 2020 at 06:23:06PM -0400, Peter Xu wrote:
> Use the new mm_fault_accounting() helper for page fault accounting.
>
> Avoid doing page fault accounting multiple times if the page fault is retried.
>
> CC: Yoshinori Sato <ysato@users.sourceforge.jp>
> CC: Rich Felker <dalias@libc.org>
> CC: linux-sh@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> arch/sh/mm/fault.c | 15 +++------------
> 1 file changed, 3 insertions(+), 12 deletions(-)
>
> diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
> index 5f23d7907597..06b232973488 100644
> --- a/arch/sh/mm/fault.c
> +++ b/arch/sh/mm/fault.c
> @@ -379,7 +379,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> struct task_struct *tsk;
> struct mm_struct *mm;
> struct vm_area_struct * vma;
> - vm_fault_t fault;
> + vm_fault_t fault, major = 0;
> unsigned int flags = FAULT_FLAG_DEFAULT;
>
> tsk = current;
> @@ -412,8 +412,6 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> if ((regs->sr & SR_IMASK) != SR_IMASK)
> local_irq_enable();
>
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> -
> /*
> * If we're in an interrupt, have no user context or are running
> * with pagefaults disabled then we must not take the fault:
> @@ -465,21 +463,13 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> * the fault.
> */
> fault = handle_mm_fault(vma, address, flags);
> + major |= fault & VM_FAULT_MAJOR;
>
> if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR)))
> if (mm_fault_error(regs, error_code, address, fault))
> return;
>
> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> - if (fault & VM_FAULT_MAJOR) {
> - tsk->maj_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
> - regs, address);
> - } else {
> - tsk->min_flt++;
> - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
> - regs, address);
> - }
> if (fault & VM_FAULT_RETRY) {
> flags |= FAULT_FLAG_TRIED;
>
> @@ -493,4 +483,5 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> }
>
> up_read(&mm->mmap_sem);
> + mm_fault_accounting(tsk, regs, address, major);
> }
> --
> 2.26.2
What's the status on the rest of this series? Do you need any action
from my side (arch/sh) at this time?
Rich
Hi, Rich,
On Mon, Jul 20, 2020 at 05:25:24PM -0400, Rich Felker wrote:
> What's the status on the rest of this series? Do you need any action
> from my side (arch/sh) at this time?
This series should have been queued by Andrew in -next. So iiuc no further
action is needed.
Thanks for asking,
--
Peter Xu