[07/25] mm/csky: Use mm_fault_accounting()
diff mbox series

Message ID 20200615221607.7764-8-peterx@redhat.com
State New
Headers show
Series
  • mm: Page fault accounting cleanups
Related show

Commit Message

Peter Xu June 15, 2020, 10:15 p.m. UTC
Use the new mm_fault_accounting() helper for page fault accounting.
Also, move the accounting to be after release of mmap_sem.

CC: Guo Ren <guoren@kernel.org>
CC: linux-csky@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/csky/mm/fault.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

Comments

Guo Ren June 17, 2020, 7:04 a.m. UTC | #1
Hi Peter,

On Tue, Jun 16, 2020 at 6:16 AM Peter Xu <peterx@redhat.com> wrote:
>
> Use the new mm_fault_accounting() helper for page fault accounting.
> Also, move the accounting to be after release of mmap_sem.
>
> CC: Guo Ren <guoren@kernel.org>
> CC: linux-csky@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/csky/mm/fault.c | 13 +------------
>  1 file changed, 1 insertion(+), 12 deletions(-)
>
> diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
> index 4e6dc68f3258..8f8d34d27eca 100644
> --- a/arch/csky/mm/fault.c
> +++ b/arch/csky/mm/fault.c
> @@ -111,8 +111,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
>                 return;
>         }
>  #endif
> -
> -       perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>         /*
>          * If we're in an interrupt or have no user
>          * context, we must not take the fault..
> @@ -160,17 +158,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
>                         goto bad_area;
>                 BUG();
>         }
> -       if (fault & VM_FAULT_MAJOR) {
> -               tsk->maj_flt++;
> -               perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
> -                             address);
> -       } else {
> -               tsk->min_flt++;
> -               perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
> -                             address);
> -       }
> -
>         up_read(&mm->mmap_sem);
> +       mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR);
>         return;
>
>         /*
> --
> 2.26.2
>
I notice that you move it out of mm->mmap_sem's area, all archs should
follow the rule ? Can you give me a clarification and put it into de
commit log ?
Thx
Peter Xu June 17, 2020, 3:49 p.m. UTC | #2
On Wed, Jun 17, 2020 at 03:04:49PM +0800, Guo Ren wrote:
> Hi Peter,

Hi, Guo,

> 
> On Tue, Jun 16, 2020 at 6:16 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Use the new mm_fault_accounting() helper for page fault accounting.
> > Also, move the accounting to be after release of mmap_sem.
> >
> > CC: Guo Ren <guoren@kernel.org>
> > CC: linux-csky@vger.kernel.org
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/csky/mm/fault.c | 13 +------------
> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
> > index 4e6dc68f3258..8f8d34d27eca 100644
> > --- a/arch/csky/mm/fault.c
> > +++ b/arch/csky/mm/fault.c
> > @@ -111,8 +111,6 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
> >                 return;
> >         }
> >  #endif
> > -
> > -       perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> >         /*
> >          * If we're in an interrupt or have no user
> >          * context, we must not take the fault..
> > @@ -160,17 +158,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
> >                         goto bad_area;
> >                 BUG();
> >         }
> > -       if (fault & VM_FAULT_MAJOR) {
> > -               tsk->maj_flt++;
> > -               perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
> > -                             address);
> > -       } else {
> > -               tsk->min_flt++;
> > -               perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
> > -                             address);
> > -       }
> > -
> >         up_read(&mm->mmap_sem);
> > +       mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR);
> >         return;
> >
> >         /*
> > --
> > 2.26.2
> >
> I notice that you move it out of mm->mmap_sem's area, all archs should
> follow the rule ? Can you give me a clarification and put it into de
> commit log ?

I don't think it's a must, but mmap_sem should not be required at least by
observing current code.  E.g., do_user_addr_fault() of x86 does the accounting
without mmap_sem even before this series.

The perf events should be thread safe on its own.  Frankly speaking I'm not
very certain about my understanding on the per task counters, because iiuc we
can also try to get user pages remotely of a thread in parallel with the page
fault happening on the same thread, then it seems to me that the per task pf
counters can be accessed on different cores simultaneously.  However otoh that
seems to be very rare, and it's still acceptable to me as a trade off to avoid
overhead of locks or atomic ops on the counters.  I'd be glad to be corrected
if I missed anything important here...

Thanks,
Linus Torvalds June 17, 2020, 5:53 p.m. UTC | #3
On Wed, Jun 17, 2020 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
>
> I don't think it's a must, but mmap_sem should not be required at least by
> observing current code.  E.g., do_user_addr_fault() of x86 does the accounting
> without mmap_sem even before this series.

All the accounting should be per-thread and not need any locking.

Which is why a remote GUP should never account to the remote mm - not
only isn't there an unambiguous thread to account to (an mm can share
many threads), but it would require locking not just for the remote
update, but for all normal page faults.

                 Linus
Peter Xu June 17, 2020, 7:58 p.m. UTC | #4
On Wed, Jun 17, 2020 at 10:53:23AM -0700, Linus Torvalds wrote:
> On Wed, Jun 17, 2020 at 8:49 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > I don't think it's a must, but mmap_sem should not be required at least by
> > observing current code.  E.g., do_user_addr_fault() of x86 does the accounting
> > without mmap_sem even before this series.
> 
> All the accounting should be per-thread and not need any locking.
> 
> Which is why a remote GUP should never account to the remote mm - not
> only isn't there an unambiguous thread to account to (an mm can share
> many threads), but it would require locking not just for the remote
> update, but for all normal page faults.

But currently remote GUP will still do the page fault accounting on the remote
task_struct, am I right?  E.g., when the get_user_pages_remote() is called with
"tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
accounting for that remote task/thread?

Thanks,
Linus Torvalds June 17, 2020, 8:15 p.m. UTC | #5
On Wed, Jun 17, 2020 at 12:58 PM Peter Xu <peterx@redhat.com> wrote:
>
> But currently remote GUP will still do the page fault accounting on the remote
> task_struct, am I right?  E.g., when the get_user_pages_remote() is called with
> "tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
> accounting for that remote task/thread?

Well, that would be a data race and fundamentally buggy.

It would be ok with something like ptrace (which only works when the
target is quiescent), but is completely wrong otherwise.

I guess it works fine in practice, and it's only statistics so even if
you were to have a data race it doesn't much matter, but it's
definitely conceptually very very wrong.

The fault stats should be about who does the fault (they are about the
_thread_) not about who the fault is done to (which is about the
_mm_).

Allocating the fault data to somebody else sounds frankly silly and
stupid to me, exactly because it's (a) racy and (b) not even
conceptually correct. The other thread literally _isn't_ doing a major
page fault, for crissake!

Now, there are some actual per-mm statistics too (the rss stuff etc),
and it's fundamentally harder exactly because of the shared data. See
the mm_counter stuff etc. Those are not about who does soemthing, they
are about the resulting MM state.

            Linus
Peter Xu June 18, 2020, 2:38 p.m. UTC | #6
On Wed, Jun 17, 2020 at 01:15:31PM -0700, Linus Torvalds wrote:
> On Wed, Jun 17, 2020 at 12:58 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > But currently remote GUP will still do the page fault accounting on the remote
> > task_struct, am I right?  E.g., when the get_user_pages_remote() is called with
> > "tsk != current", it seems the faultin_page() will still do maj_flt/min_flt
> > accounting for that remote task/thread?
> 
> Well, that would be a data race and fundamentally buggy.
> 
> It would be ok with something like ptrace (which only works when the
> target is quiescent), but is completely wrong otherwise.
> 
> I guess it works fine in practice, and it's only statistics so even if
> you were to have a data race it doesn't much matter, but it's
> definitely conceptually very very wrong.
> 
> The fault stats should be about who does the fault (they are about the
> _thread_) not about who the fault is done to (which is about the
> _mm_).
> 
> Allocating the fault data to somebody else sounds frankly silly and
> stupid to me, exactly because it's (a) racy and (b) not even
> conceptually correct. The other thread literally _isn't_ doing a major
> page fault, for crissake!
> 
> Now, there are some actual per-mm statistics too (the rss stuff etc),
> and it's fundamentally harder exactly because of the shared data. See
> the mm_counter stuff etc. Those are not about who does soemthing, they
> are about the resulting MM state.

I see.  How about I move another small step further to cover GUP (to be
explicit, faultin_page() and fixup_user_fault())?

GUP needs the per-task accounting, but not the perf events.  We can do that by
slightly changing the new approach into:

        bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);

        if (major)
                current->maj_flt++;
        else
                current->min_flt++;

        if (!regs)
                return ret;

        if (major)
                perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
        else
                perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);

With the change we will always account that onto current task.  Since there're
GUP calls that are not even accounted at all, e.g., get_user_pages_remote()
with tsk==NULL: for these calls, we also account these page faults onto current
task.

Another major benefit I can see (besides a completely cleaned up accounting) is
that we can remove the task_struct pointer in the whole GUP code, because
AFAICT that's only used for this pf accounting either in faultin_page() or
fixup_user_fault(), which seems questionable itself now to not use current...

Any opinions?

Thanks,

--
Peter Xu
Linus Torvalds June 18, 2020, 5:15 p.m. UTC | #7
On Thu, Jun 18, 2020 at 7:38 AM Peter Xu <peterx@redhat.com> wrote:
>
> GUP needs the per-task accounting, but not the perf events.  We can do that by
> slightly changing the new approach into:
>
>         bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
>
>         if (major)
>                 current->maj_flt++;
>         else
>                 current->min_flt++;
>
>         if (!regs)
>                 return ret;
>
>         if (major)
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
>         else
>                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);

Ack, I think this is the right thing to do.

No normal situation will ever notice the difference, with remote
accesses being as rare and specialized as they are. But being able to
remote the otherwise unused 'tsk' parameter sounds like the right
thing to do too.

It might be worth adding a comment about why.

Also, honestly, how about we remove the 'major' variable entirely, and
instead make the code be something like

        unsigned long *flt;
        int event_type;
        ...

        /* Major fault */
        if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
                flt = &current->maj_flt;
                event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
        } else {
                flt = &current->min_flt;
                event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
        }
        *flt++;
        if (regs)
                perf_sw_event(event_type, 1, regs, address);

instead. Less source code duplication, and I bet it improves code
generation too.

             Linus
Peter Xu June 18, 2020, 9:24 p.m. UTC | #8
On Thu, Jun 18, 2020 at 10:15:50AM -0700, Linus Torvalds wrote:
> On Thu, Jun 18, 2020 at 7:38 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > GUP needs the per-task accounting, but not the perf events.  We can do that by
> > slightly changing the new approach into:
> >
> >         bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
> >
> >         if (major)
> >                 current->maj_flt++;
> >         else
> >                 current->min_flt++;
> >
> >         if (!regs)
> >                 return ret;
> >
> >         if (major)
> >                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
> >         else
> >                 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
> 
> Ack, I think this is the right thing to do.
> 
> No normal situation will ever notice the difference, with remote
> accesses being as rare and specialized as they are. But being able to
> remote the otherwise unused 'tsk' parameter sounds like the right
> thing to do too.
> 
> It might be worth adding a comment about why.
> 
> Also, honestly, how about we remove the 'major' variable entirely, and
> instead make the code be something like
> 
>         unsigned long *flt;
>         int event_type;
>         ...
> 
>         /* Major fault */
>         if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
>                 flt = &current->maj_flt;
>                 event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
>         } else {
>                 flt = &current->min_flt;
>                 event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
>         }
>         *flt++;
>         if (regs)
>                 perf_sw_event(event_type, 1, regs, address);
> 
> instead. Less source code duplication, and I bet it improves code
> generation too.

Will do.  Thanks!
Peter Xu June 18, 2020, 10:28 p.m. UTC | #9
On Thu, Jun 18, 2020 at 05:24:30PM -0400, Peter Xu wrote:
> >         /* Major fault */
> >         if ((ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED)) {
> >                 flt = &current->maj_flt;
> >                 event_type = PERF_COUNT_SW_PAGE_FAULTS_MAJ;
> >         } else {
> >                 flt = &current->min_flt;
> >                 event_type = PERF_COUNT_SW_PAGE_FAULTS_MIN;
> >         }
> >         *flt++;
> >         if (regs)
> >                 perf_sw_event(event_type, 1, regs, address);

Sadly, this line seems to fail the compilation:

  CC      mm/memory.o
In file included from ././include/linux/compiler_types.h:68,
                 from <command-line>:                      
./arch/x86/include/asm/jump_label.h: In function ‘handle_mm_fault’:
./include/linux/compiler-gcc.h:120:38: warning: ‘asm’ operand 0 probably does not match constraints
  120 | #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
      |                                      ^~~
./arch/x86/include/asm/jump_label.h:25:2: note: in expansion of macro ‘asm_volatile_goto’
   25 |  asm_volatile_goto("1:" 
      |  ^~~~~~~~~~~~~~~~~
./include/linux/compiler-gcc.h:120:38: error: impossible constraint in ‘asm’
  120 | #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
      |                                      ^~~            
./arch/x86/include/asm/jump_label.h:25:2: note: in expansion of macro ‘asm_volatile_goto’
   25 |  asm_volatile_goto("1:"                                                                                                
      |  ^~~~~~~~~~~~~~~~~
make[1]: *** [scripts/Makefile.build:267: mm/memory.o] Error 1
make: *** [Makefile:1729: mm] Error 2

Frankly speaking I have no solid understanding on what's the error about... But
my gut feeling is that it's about the static keys, where perf_sw_event() may
only support static event types (otherwise we won't be able to know whether to
patch the instruction with no-op or a jump?).

I'll go back to the simple version for now, until I know a solution..

Thanks,
Linus Torvalds June 18, 2020, 10:59 p.m. UTC | #10
On Thu, Jun 18, 2020 at 3:28 PM Peter Xu <peterx@redhat.com> wrote:
>
> > >         if (regs)
> > >                 perf_sw_event(event_type, 1, regs, address);
>
> Sadly, this line seems to fail the compilation:

Yeah, I should have known that. We require a constant event ID,
because it uses the magical static keys.

So never mind. The perf_sw_event() calls can't be shared for two different ID's.

           Linus

Patch
diff mbox series

diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c
index 4e6dc68f3258..8f8d34d27eca 100644
--- a/arch/csky/mm/fault.c
+++ b/arch/csky/mm/fault.c
@@ -111,8 +111,6 @@  asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 		return;
 	}
 #endif
-
-	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	/*
 	 * If we're in an interrupt or have no user
 	 * context, we must not take the fault..
@@ -160,17 +158,8 @@  asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 			goto bad_area;
 		BUG();
 	}
-	if (fault & VM_FAULT_MAJOR) {
-		tsk->maj_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs,
-			      address);
-	} else {
-		tsk->min_flt++;
-		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs,
-			      address);
-	}
-
 	up_read(&mm->mmap_sem);
+	mm_fault_accounting(tsk, regs, address, fault & VM_FAULT_MAJOR);
 	return;
 
 	/*