From: Tim Chen <tim.c.chen@linux.intel.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Alex Shi <alex.shi@linaro.org>, Andi Kleen <andi@firstfloor.org>,
	Michel Lespinasse <walken@google.com>,
	Davidlohr Bueso <davidlohr.bueso@hp.com>,
	Matthew R Wilcox <matthew.r.wilcox@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Rik van Riel <riel@redhat.com>,
	Peter Hurley <peter@hurleysoftware.com>,
	"Paul E.McKenney" <paulmck@linux.vnet.ibm.com>,
	Jason Low <jason.low2@hp.com>, Waiman Long <Waiman.Long@hp.com>,
	linux-kernel@vger.kernel.org, linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v8 0/9] rwsem performance optimizations
Date: Tue, 15 Oct 2013 17:09:16 -0700
Message-ID: <1381882156.11046.178.camel@schen9-DESK>
In-Reply-To: <20131010075444.GD17990@gmail.com>

On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> > The throughput of pure mmap with mutex is below vs pure mmap is below:
> > 
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads        vanilla 	all rwsem    	without optspin
> > 				patches
> > 1               3.0%    	-1.0%   	-1.7%
> > 5               7.2%    	-26.8%  	5.5%
> > 10              5.2%    	-10.6%  	22.1%
> > 20              6.8%    	16.4%   	12.5%
> > 40              -0.2%   	32.7%   	0.0%
> > 
> > So with mutex, the vanilla kernel and the one without optspin both run 
> > faster.  This is consistent with what Peter reported.  With optspin, the 
> > picture is more mixed, with lower throughput at low to moderate number 
> > of threads and higher throughput with high number of threads.
> 
> So, going back to your orignal table:
> 
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads        vanilla all     without optspin
> > 1               3.0%    -1.0%   -1.7%
> > 5               7.2%    -26.8%  5.5%
> > 10              5.2%    -10.6%  22.1%
> > 20              6.8%    16.4%   12.5%
> > 40              -0.2%   32.7%   0.0%
> >
> > In general, vanilla and no-optspin case perform better with 
> > pthread-mutex.  For the case with optspin, mmap with pthread-mutex is 
> > worse at low to moderate contention and better at high contention.
> 
> it appears that 'without optspin' appears to be a pretty good choice - if 
> it wasn't for that '1 thread' number, which, if I correctly assume is the 
> uncontended case, is one of the most common usecases ...
> 
> How can the single-threaded case get slower? None of the patches should 
> really cause noticeable overhead in the non-contended case. That looks 
> weird.
> 
> It would also be nice to see the 2, 3, 4 thread numbers - those are the 
> most common contention scenarios in practice - where do we see the first 
> improvement in performance?
> 
> Also, it would be nice to include a noise/sttdev figure, it's really hard 
> to tell whether -1.7% is statistically significant.

Ingo,

I think that the optimistic spin changes to rwsem should improve
performance for real workloads after all.

In my previous tests, I was doing mmap followed immediately by
munmap without doing anything to the memory.  No real workload
behaves that way, and it is not the scenario that we
should optimize for.  A much better approximation of
real usage is to do mmap, then touch the mapped memory,
and then munmap.

This changes the dynamics of the rwsem: we are now dominated
by read acquisitions of mmap_sem due to the page faults, instead
of having only write acquisitions from mmap.  In this case, any delay
in write acquisitions is costly because it blocks
a lot of readers.  This is where optimistic spinning on
write acquisitions of mmap_sem can provide a very significant boost
to the throughput.

I changed the test case to the following, which writes to
the mmapped memory:

#include <assert.h>
#include <sys/mman.h>

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

/* Loops forever, counting completed mmap/touch/munmap cycles. */
void testcase(unsigned long long *iterations)
{
        int i;

        while (1) {
                char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                assert(c != MAP_FAILED);
                /* Write to the mapping so that its pages are faulted in. */
                for (i = 0; i < MEMSIZE; i += 8) {
                        c[i] = 0xa;
                }
                munmap(c, MEMSIZE);

                (*iterations)++;
        }
}

I compared the throughput of the complete rwsem
patchset against vanilla, and against the case where I take out the
optimistic spin patch.  I have increased the run time
by 10x from my previous experiments and did 10 runs for
each case.  The standard deviation is ~1.5%, so any change
under 1.5% is not statistically significant.
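
As a rough illustration of this kind of noise check, the sketch below
treats a difference between two sets of runs as noise unless it exceeds
one sample standard deviation of the baseline runs.  The function names
are hypothetical, and the one-sigma threshold is a crude reading of the
numbers above, not a formal significance test.  The comparison is done
on squared values so that no square root is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *x, size_t n)
{
        double sum = 0.0;

        for (size_t i = 0; i < n; i++)
                sum += x[i];
        return sum / (double)n;
}

/* Unbiased sample variance (divides by n - 1); needs at least 2 samples. */
static double sample_variance(const double *x, size_t n)
{
        double m, acc = 0.0;

        assert(n > 1);
        m = mean(x, n);
        for (size_t i = 0; i < n; i++)
                acc += (x[i] - m) * (x[i] - m);
        return acc / (double)(n - 1);
}

/*
 * Hypothetical check: returns 1 if the difference between the mean of
 * the patched runs and the mean of the baseline runs is larger than
 * one standard deviation of the baseline runs, 0 otherwise.
 */
int change_exceeds_noise(const double *base, size_t nb,
                         const double *patched, size_t np)
{
        double diff = mean(patched, np) - mean(base, nb);

        return diff * diff > sample_variance(base, nb);
}
```

With 10 throughput samples per kernel, base would hold the vanilla runs
and patched the runs of the kernel under test.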

% change in throughput vs the vanilla kernel.
Threads	all	No-optspin
1	+0.4%	-0.1%
2	+2.0%	+0.2%
3	+1.1%	+1.5%
4	-0.5%	-1.4%
5	-0.1%	-0.1%
10	+2.2%	-1.2%
20	+237.3%	-2.3%
40	+548.1%	+0.3%

For 1 to 5 threads, we have essentially
the same performance as the vanilla kernel.
With the full patchset, we get a boost in throughput of 237%
for 20 threads and 548% for 40 threads.  When we take out
the optimistic spin patch, throughput is mostly similar to
the vanilla kernel for this test.
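
To put the percentages in terms of a throughput multiple: +237.3% and
+548.1% correspond to roughly 3.4x and 6.5x the vanilla throughput.
A small sketch of the arithmetic, with hypothetical helper names:

```c
/* Percent change of `patched` relative to `baseline`. */
double percent_change(double baseline, double patched)
{
        return 100.0 * (patched - baseline) / baseline;
}

/* Throughput multiple implied by a percent change: 1 + pct/100. */
double speedup_factor(double pct)
{
        return 1.0 + pct / 100.0;
}
```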

When I looked at the profile of the vanilla
kernel for the 40-thread case, I saw that 80% of
CPU time was spent contending for the spin lock of the rwsem
wait queue, in rwsem_down_read_failed from the page fault path.
With the rwsem patchset including optimistic spin applied,
this lock contention went down to only 2% of CPU time.

When I tested the case where we acquire a mutex in
user space before mmap, I got the following data versus
the vanilla kernel.  There is little contention on mmap_sem
acquisition in this case.

Threads	all	No-optspin
1	+0.8%	-1.2%
2	+1.0%	-0.5%
3	+1.8%	+0.2%
4	+1.5%	-0.4%
5	+1.1%	+0.4%
10	+1.5%	-0.3%
20	+1.4%	-0.2%
40	+1.3%	+0.4%

Thanks.

Tim




