* aim7 performance regression by commit 5a50508 report from LKP
@ 2013-01-29  8:25 Yuanhan Liu
  2013-01-29  8:44 ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Yuanhan Liu @ 2013-01-29  8:25 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Huang Ying, Linus Torvalds, David Howells, linux-kernel, lkp

Hi,

We at LKP (Linux Kernel Performance, a project that runs performance
benchmark testing on the Linux kernel to find, and help fix,
performance regressions) found a 10%-20% performance regression in
the aim7 benchmark, introduced by commit 5a50508: "mm/rmap: Convert
the struct anon_vma::mutex to an rwsem".

The regression is easy to reproduce with the workfile.shared workload
of the aim7 benchmark [0]. Moreover, the drop becomes much more
obvious when fork_test is set as the only workload: performance drops
2-4 times on our test boxes.

Here is a workfile that shows an obvious performance drop:

----------- workfile ----------
FILESIZE: 1M
POOLSIZE: 10M
10 fork_test
-------------------------------

Commit 5a50508 just converts every mutex_lock() to down_write(),
which, in my opinion, is semantically equivalent. Yet the results
differ greatly on the above workload.

I tried to figure out the root cause, running some tests and
collecting some performance counters. For example, here is some
/usr/bin/time output:

commit d28d433 (good one, the commit before 5a50508):
-------------------------
  User time (seconds): 76.12
  System time (seconds): 4387.35
  Percent of CPU this job got: 599%
  Voluntary context switches: 191800723
  Involuntary context switches: 41312548

commit 5a50508 (bad one):
------------------------
  User time (seconds): 98.20
  System time (seconds): 5623.07
  Percent of CPU this job got: 373%
  Voluntary context switches: 615474680
  Involuntary context switches: 20388913


From this output you can see:
 1. The CPUs are much more idle at commit 5a50508 than at d28d433
    (373% vs 599%).

 2. Commit 5a50508 involves far more voluntary context switches than
    d28d433 (615474680 vs 191800723, about 3.2 times as many).


The performance drop and the above data puzzled me a lot, as I had
assumed an rwsem write lock and a mutex behave the same.

I then compared the mutex and rwsem implementations, and found a key
difference which I guess is the root cause:

It is about the condition on which we exit the for(;;) loop; more
precisely, how we detect that we have acquired the lock.

mutex side
==========
In __mutex_lock_common():

        for (;;) {
                if (atomic_xchg(&lock->count, -1) == 1)
                        break;
                /* elided: mark the task sleeping and call schedule() */
        }
        ....

In __mutex_unlock_common_slowpath():

        atomic_set(&lock->count, 1);
        wake_up_process(waiter->task);


rwsem side
==========
In rwsem_down_failed_common() (the lock side):

        waiter.task = current;  /* waiter lives on the sleeper's stack */
        ....

        for (;;) {
                if (!waiter.task)
                        break;
                /* elided: schedule() until the releaser clears waiter.task */
        }

In __rwsem_do_wake() (the unlock side):

        list_del(&waiter->list);
        tsk = waiter->task;
        smp_mb();
        waiter->task = NULL;
        wake_up_process(tsk);


Summary
=======
In a word: a mutex uses lock->count to detect whether the lock can be
taken, while an rwsem checks waiter->task. That is the key difference.

Say processes A and B try to take the mutex, in that order.

A gets it first, and B blocks (it goes to sleep by voluntarily calling
schedule()). After a while, A releases the lock by setting lock->count
to 1 and calling wake_up_process() to wake up B. But before B actually
wakes up, another process C tries to take the mutex, and C finds the
following condition true:

        if (atomic_xchg(&lock->count, -1) == 1)
                break;

Then C takes the lock, bypassing B. In such a case, one schedule round
trip is saved.

With rwsem, by contrast, there is no such bypass: all processes are
queued on the wait list in order, and they are woken up, and granted
the lock, in order.

And that is why you see more voluntary context switches with the
rwsem write lock than with the mutex.
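
To make the difference concrete, here is a toy userspace model of the
two wake-up protocols. It is a sketch only, with made-up names:
sched_yield() stands in for schedule(), and all wait-list bookkeeping
is elided.

#include <sched.h>
#include <stdatomic.h>

/* Mutex-style: ownership lives in a shared counter (1 = unlocked).
 * The releaser merely reopens the lock, so whoever runs first wins
 * the xchg; a late arrival can take the lock before the woken waiter
 * ever gets the CPU. */
struct count_lock { atomic_int count; };  /* 1 = unlocked, -1 = locked */

static void count_lock_acquire(struct count_lock *l)
{
        while (atomic_exchange(&l->count, -1) != 1)
                sched_yield();            /* kernel: schedule() */
}

static void count_lock_release(struct count_lock *l)
{
        atomic_store(&l->count, 1);       /* up for grabs by anybody */
        /* kernel: also wake_up_process(first waiter), but the wakee
         * must still win the xchg against any newcomer */
}

/* Rwsem-style: the releaser hands ownership directly to one known
 * waiter by clearing its task field; a newcomer cannot acquire the
 * lock until that designated waiter has actually run. */
struct handoff_slot { atomic_uintptr_t task; };  /* 0 = we own the lock */

static void handoff_acquire(struct handoff_slot *self)
{
        while (atomic_load(&self->task) != 0)
                sched_yield();            /* kernel: schedule() */
        /* the releaser already made us the owner; nothing to steal */
}

static void handoff_release(struct handoff_slot *next)
{
        atomic_store(&next->task, 0);     /* kernel: waiter->task = NULL */
        /* kernel: wake_up_process(tsk) */
}

In the count-based model every release is a free-for-all, so a task
that is already running often wins without anybody having to sleep;
in the handoff model the winner is fixed at release time, so a
sleeping waiter always costs a full schedule round trip.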

"This can't explain '1. CPU is much idler at commit 5a50508 than
d28d433 (373% vs 599%)' well", added by Huang Ying. He suggests to look
this issue from another side: who owns the lock.

As stated above, with a mutex anybody has a chance to own the lock
once somebody releases it. With an rwsem write lock, however, only
one task can own the lock next, and that task is known in advance:
the one at the head of the wait list. That would result in more
contention in the rwsem write-lock case, especially if the task that
_will_ own the lock is not running now.

He also suggested gathering some scheduler data, so I sampled
/proc/sched_debug. The following lists the .nr_running field for all
CPUs:

commit d28d433(good one):
            (sampled every 2 seconds; only a few samples are shown)
------------------------------------------------------------------------------------->
cpu 00:   0   0   ... 102 103  99 110 105 100 104 108 106  98 103 103 ....  98  40  38 
cpu 01:   0   0   ... 103 103  99 110 106 100 104 108 106  98 103 103 ....  98  40  38 
cpu 02:   0   0   ... 106 112 114 104 109 112 107 105 110 109 104 102 .... 109  41  42 
cpu 03:   0   0   ... 106 112 114 104 109 113 107 106 110 109 104 102 .... 109  41  42 
cpu 04:   0   0   ...  95 106 102  98 102 102 102 109 105  93  91 103 ....  98  39  38 
cpu 05:   0   0   ...  95 107 102  98 102 102 103 109 105  93  91 103 ....  99  39  38 
cpu 06:   0   0   ... 106 109 115 111 108 108 102 100 107  99 101 103 .... 110  42  44 
cpu 07:   0   0   ... 107 109 116 111 109 108 102 100 107  99 102 103 .... 110  42  44 
cpu 08:   0   0   ... 106 114 108 104 107 102 108  98 102  96  97  98 ....  99  39  40 
cpu 09:   0   0   ... 106 114 108 104 107 102 108  98 102  96  97  98 ....  99  39  40 
cpu 10:   0   0   ... 111 106 107 109 106 113 111 108 110 101 100 108 .... 104  40  39 
cpu 11:   0   0   ... 111 106 107 109 106 113 112 108 110 101 100 108 .... 104  40  39 
cpu 12:   0   1   ...  96 107 109 109 108 103 105  98 100  93  92  98 ....  98  41  42 
cpu 13:   0   1   ...  96 108 109 109 108 103 105  98 100  93  92  98 ....  98  41  42 
cpu 14:   0   0   ... 102 113 108 102 105 101 105 106 108 101 103 111 .... 113  42  43 
cpu 15:   0   0   ... 102 113 108 102 106 101 105 106 109 101 103 111 .... 114  42  43 

commit 5a50508(bad one):
            (sampled every 2 seconds; only a few samples are shown)
------------------------------------------------------------------------------------->
cpu 00:   0   0   ...   0   1   0   1   0   1   0   0   0   0   2   0 ....   0   0   0 
cpu 01:   0   0   ...   0   2   0   0   0   1   0   0   1   0   2   0 ....   0   0   1 
cpu 02:   0   0   ...   0   0   0   0   0   0   0   1   1   0   0   0 ....   0   1   2 
cpu 03:   0   0   ...   0   0   0   0   0   0   1   1   1   0   0   0 ....   1   1   0 
cpu 04:   0   0   ...   0   0   0   0   0   0   1   0   1   0   0   1 ....   0   0   2 
cpu 05:   0   0   ...   0   0   0   0   0   0   0   1   1   0   0   1 ....   0   0   2 
cpu 06:   0   1   ...   0   1   1   0   0   1   3   0   2   1   0   0 ....   1   1   0 
cpu 07:   0   1   ...   0   1   0   0   0   0   3   0   2   1   0   0 ....   1   1   0 
cpu 08:   0   0   ...   0   0   2   1   0   1   0   0   0   0   0   1 ....   0   0   0 
cpu 09:   0   0   ...   0   1   0   1   0   1   0   1   0   0   0   1 ....   0   0   0 
cpu 10:   0   0   ...   0   0   2   1   0   0   1   1   0   1   0   1 ....   0   0   0 
cpu 11:   0   0   ...   0   0   2   1   0   0   1   0   0   1   0   1 ....   0   0   0 
cpu 12:   0   0   ...   2   2   0   1   2   0   1   0   0   0   0   2 ....   1   0   2 
cpu 13:   0   0   ...   2   2   1   1   2   0   1   0   0   0   0   1 ....   2   0   2 
cpu 14:   0   0   ...   0   0   2   1   0   1   0   0   0   0   0   1 ....   0   1   1 
cpu 15:   0   0   ...   0   0   1   1   0   0   2   1   0   0   0   1 ....   0   1   0 

As you can see, the difference is huge: at commit d28d433 all CPUs
are always busy, while at commit 5a50508 about half of the CPUs are
idle. That matches the /usr/bin/time result well.


NOTE: this is just a report to make you aware that we caught such a
regression, together with what we think may be causing it. We are not
asking for the commit to be reverted, as we know it aims to make
rmap_walk_anon() and try_to_unmap_anon() more scalable.


[0]: http://aimbench.cvs.sourceforge.net/viewvc/aimbench/caldera/suite7/workfile.shared?view=markup

--
Linux Kernel Performance (LKP) project        Open Source Technology Center
                                                          Intel Corporation


* Re: aim7 performance regression by commit 5a50508 report from LKP
  2013-01-29  8:25 aim7 performance regression by commit 5a50508 report from LKP Yuanhan Liu
@ 2013-01-29  8:44 ` Ingo Molnar
  2013-01-29  9:06   ` Yuanhan Liu
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2013-01-29  8:44 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Huang Ying, Linus Torvalds, David Howells, linux-kernel, lkp


* Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:

> [...]

Very nice measurements and analysis, thanks!

> As stated above, with a mutex anybody has a chance to own the
> lock once somebody releases it. With an rwsem write lock,
> however, only one task can own the lock next, and that task is
> known in advance: the one at the head of the wait list. That
> would result in more contention in the rwsem write-lock case,
> especially if the task that _will_ own the lock is not running
> now.

I think we should allow lock-steal between rwsem writers - that 
will not hurt fairness as most rwsem fairness concerns relate to 
reader vs. writer fairness.
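
Roughly something like this in the write-lock slowpath, perhaps (just
a sketch; rwsem_try_write_lock() is a hypothetical helper that
atomically claims the sem when neither readers nor a writer hold it):

/* Sketch: instead of sleeping until __rwsem_do_wake() hands over the
 * lock via waiter->task, a queued writer retries the acquisition
 * itself each time it is woken, so any running writer can take the
 * lock, not just the one at the head of the wait list. */
for (;;) {
        raw_spin_lock_irq(&sem->wait_lock);
        if (rwsem_try_write_lock(sem)) {  /* hypothetical helper */
                list_del(&waiter.list);
                raw_spin_unlock_irq(&sem->wait_lock);
                break;                    /* got (possibly stole) it */
        }
        set_current_state(TASK_UNINTERRUPTIBLE);
        raw_spin_unlock_irq(&sem->wait_lock);
        schedule();                       /* retry after the next release */
}
__set_current_state(TASK_RUNNING);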

Am I correct to assume that all relevant users in this workload 
are down_write() users?

You can see the type of lock use in:

   perf record -g
   perf report

I bet that allowing rwsem writer lock-steal would improve other 
workloads as well.

Thanks,

	Ingo


* Re: aim7 performance regression by commit 5a50508 report from LKP
  2013-01-29  8:44 ` Ingo Molnar
@ 2013-01-29  9:06   ` Yuanhan Liu
  2013-01-29  9:12     ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Yuanhan Liu @ 2013-01-29  9:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Huang Ying, Linus Torvalds, David Howells, linux-kernel, lkp

On Tue, Jan 29, 2013 at 09:44:00AM +0100, Ingo Molnar wrote:
> 
> * Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> 
> > [...]
> 
> Very nice measurements and analysis, thanks!
> 
> > As stated above, with a mutex anybody has a chance to own the
> > lock once somebody releases it. With an rwsem write lock,
> > however, only one task can own the lock next, and that task is
> > known in advance: the one at the head of the wait list. That
> > would result in more contention in the rwsem write-lock case,
> > especially if the task that _will_ own the lock is not running
> > now.
> 
> I think we should allow lock-steal between rwsem writers - that 
> will not hurt fairness as most rwsem fairness concerns relate to 
> reader vs. writer fairness.

Agreed, and I'm sure this will improve performance and may make this
performance regression go away. 

David, is that OK with you? If so, I may give it a try.

> 
> Am I correct to assume that all relevant users in this workload 
> are down_write() users?

Yes, as commit 5a50508 just converts all the mutexes to down_write().

Thanks.

	--yliu
> 
> You can see the type of lock use in:
> 
>    perf record -g
>    perf report
> 
> I bet that allowing rwsem writer lock-steal would improve other 
> workloads as well.


* Re: aim7 performance regression by commit 5a50508 report from LKP
  2013-01-29  9:06   ` Yuanhan Liu
@ 2013-01-29  9:12     ` Ingo Molnar
  2013-01-30  9:34       ` Yuanhan Liu
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2013-01-29  9:12 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Huang Ying, Linus Torvalds, David Howells, linux-kernel, lkp


* Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:

> On Tue, Jan 29, 2013 at 09:44:00AM +0100, Ingo Molnar wrote:
> > 
> > * Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> > 
> > > [...]
> > 
> > Very nice measurements and analysis, thanks!
> > 
> > > As stated above, with a mutex anybody has a chance to own the
> > > lock once somebody releases it. With an rwsem write lock,
> > > however, only one task can own the lock next, and that task is
> > > known in advance: the one at the head of the wait list. That
> > > would result in more contention in the rwsem write-lock case,
> > > especially if the task that _will_ own the lock is not running
> > > now.
> > 
> > I think we should allow lock-steal between rwsem writers - that 
> > will not hurt fairness as most rwsem fairness concerns relate to 
> > reader vs. writer fairness.
> 
> Agreed, and I'm sure this will improve performance and may 
> make this performance regression go away.
> 
> David, is that OK with you? If so, I may give it a try.

I'm not David but please try it :-)

Making rwsem behavior and scalability similar to mutexes would 
have numerous advantages.

> > Am I correct to assume that all relevant users in this 
> > workload are down_write() users?
> 
> Yes, as commit 5a50508 just converts all the mutexes to down_write().

A second track of inquiry would be to see whether any of the key 
usage sites could be converted to down_read() or whether the 
lock hold times could be reduced drastically - but I doubt 
that's really possible on such heavily forking workloads.

Thanks,

	Ingo


* Re: aim7 performance regression by commit 5a50508 report from LKP
  2013-01-29  9:12     ` Ingo Molnar
@ 2013-01-30  9:34       ` Yuanhan Liu
  2013-01-31 11:22         ` Ingo Molnar
  0 siblings, 1 reply; 7+ messages in thread
From: Yuanhan Liu @ 2013-01-30  9:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Huang Ying, Linus Torvalds, David Howells, linux-kernel, lkp

On Tue, Jan 29, 2013 at 10:12:45AM +0100, Ingo Molnar wrote:
> 
> * Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> 
> > On Tue, Jan 29, 2013 at 09:44:00AM +0100, Ingo Molnar wrote:
> > > 
> > > * Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> > > 
> > > > [...]
> > > 
> > > Very nice measurements and analysis, thanks!
> > > 
> > > > As stated above, with a mutex anybody has a chance to own the
> > > > lock once somebody releases it. With an rwsem write lock,
> > > > however, only one task can own the lock next, and that task
> > > > is known in advance: the one at the head of the wait list.
> > > > That would result in more contention in the rwsem write-lock
> > > > case, especially if the task that _will_ own the lock is not
> > > > running now.
> > > 
> > > I think we should allow lock-steal between rwsem writers - that 
> > > will not hurt fairness as most rwsem fairness concerns relate to 
> > > reader vs. writer fairness.
> > 
> > Agreed, and I'm sure this will improve performance and may 
> > make this performance regression go away.
> > 
> > David, is that OK with you? If so, I may give it a try.
> 
> I'm not David but please try it :-)
> 
> Making rwsem behavior and scalability similar to mutexes would 
> have numerous advantages.
> 
> > > Am I correct to assume that all relevant users in this 
> > > workload are down_write() users?
> > 
> > Yes, as commit 5a50508 just converts all the mutexes to down_write().
> 
> A second track of inquiry would be to see whether any of the key 
> usage sites could be converted to down_read()

I tried that before, and it seems I didn't find one.

Well, I did find an irrelevant one: vma_lock_anon_vma() in
validate_mm() in mm/mmap.c. That function does real work only if
CONFIG_DEBUG_VM_RB is set, and there is no vma_lock_anon_vma_read()
or anything similar, so I guess it may not be worth converting.

> or whether the 
> lock hold times could be reduced drastically

I also found one, but it doesn't sound like it would reduce lock
hold times drastically:
   vma_lock_anon_vma() seems to cover too much code in
   expand_upwards()/expand_downwards().

Well, again, shrinking its coverage would be quite a tiny
optimization.
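
Just to illustrate the idea, something like this (a sketch only, not
a tested patch; whether acct_stack_growth() is really safe to run
outside the anon_vma lock is an unproven assumption):

/*
 * Sketch (mm/mmap.c): do the limit/accounting checks before taking
 * the anon_vma lock, so that the lock covers just the vma update.
 * The accounting charge would also need undoing if the re-check
 * under the lock fails.
 */
int expand_upwards(struct vm_area_struct *vma, unsigned long address)
{
        unsigned long size, grow;
        int error;

        if (!(vma->vm_flags & VM_GROWSUP))
                return -EFAULT;
        if (unlikely(anon_vma_prepare(vma)))
                return -ENOMEM;

        /* guard against wrapping around to address 0 */
        if (address < PAGE_ALIGN(address + 4))
                address = PAGE_ALIGN(address + 4);
        else
                return -ENOMEM;

        if (address <= vma->vm_end)     /* already expanded by a racer */
                return 0;

        size = address - vma->vm_start;
        grow = (address - vma->vm_end) >> PAGE_SHIFT;

        error = acct_stack_growth(vma, size, grow);   /* now unlocked */
        if (error)
                return error;

        vma_lock_anon_vma(vma);         /* held only across the update */
        if (address > vma->vm_end) {    /* re-check under the lock */
                anon_vma_interval_tree_pre_update_vma(vma);
                vma->vm_end = address;
                anon_vma_interval_tree_post_update_vma(vma);
                perf_event_mmap(vma);
        }
        vma_unlock_anon_vma(vma);
        return error;
}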

Thanks.

	--yliu



> - but I doubt 
> that's really possible on such heavily forking workloads.
> 
> Thanks,
> 
> 	Ingo


* Re: aim7 performance regression by commit 5a50508 report from LKP
  2013-01-30  9:34       ` Yuanhan Liu
@ 2013-01-31 11:22         ` Ingo Molnar
  2013-02-01 10:53           ` Yuanhan Liu
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2013-01-31 11:22 UTC (permalink / raw)
  To: Yuanhan Liu; +Cc: Huang Ying, Linus Torvalds, David Howells, linux-kernel, lkp


* Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:

> > or whether the lock hold times could be reduced drastically
> 
> I also found one, but it doesn't sound like it would reduce
> lock hold times drastically:
>
>    vma_lock_anon_vma() seems to cover too much code in
>    expand_upwards()/expand_downwards().
> 
> Well, again, shrinking its coverage would be quite a tiny
> optimization.

The fundamental problem is I think that there's a single anon 
vma lock for the whole workload, right?

Is that really fundamentally needed, could that be spread out 
perhaps?

( But first we want to see how much of the regression we can fix 
  in an equivalent-locking-patterns fashion - improving locking
  is a separate matter. )

Thanks,

	Ingo


* Re: aim7 performance regression by commit 5a50508 report from LKP
  2013-01-31 11:22         ` Ingo Molnar
@ 2013-02-01 10:53           ` Yuanhan Liu
  0 siblings, 0 replies; 7+ messages in thread
From: Yuanhan Liu @ 2013-02-01 10:53 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Huang Ying, Linus Torvalds, David Howells, linux-kernel, lkp

On Thu, Jan 31, 2013 at 12:22:52PM +0100, Ingo Molnar wrote:
> 
> * Yuanhan Liu <yuanhan.liu@linux.intel.com> wrote:
> 
> > > or whether the lock hold times could be reduced drastically
> > 
> > I also found one, but it doesn't sound like it would reduce
> > lock hold times drastically:
> >
> >    vma_lock_anon_vma() seems to cover too much code in
> >    expand_upwards()/expand_downwards().
> > 
> > Well, again, shrinking its coverage would be quite a tiny
> > optimization.
> 
> The fundamental problem is I think that there's a single anon 
> vma lock for the whole workload, right?

Yes.
> 
> Is that really fundamentally needed, could that be spread out 
> perhaps?

I'm digging into this. 

Thanks.

	--yliu
> 
> ( But first we want to see how much of the regression we can fix 
>   in an equivalent-locking-patterns fashion - improving locking
>   is a separate matter. )
> 
> Thanks,
> 
> 	Ingo
