From: Tim Chen <tim.c.chen@linux.intel.com>
To: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>,
    Andrew Morton <akpm@linux-foundation.org>,
    Linus Torvalds <torvalds@linux-foundation.org>,
    Andrea Arcangeli <aarcange@redhat.com>,
    Alex Shi <alex.shi@linaro.org>,
    Andi Kleen <andi@firstfloor.org>,
    Michel Lespinasse <walken@google.com>,
    Davidlohr Bueso <davidlohr.bueso@hp.com>,
    Matthew R Wilcox <matthew.r.wilcox@intel.com>,
    Dave Hansen <dave.hansen@intel.com>,
    Peter Zijlstra <a.p.zijlstra@chello.nl>,
    Rik van Riel <riel@redhat.com>,
    Peter Hurley <peter@hurleysoftware.com>,
    "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>,
    Jason Low <jason.low2@hp.com>,
    Waiman Long <Waiman.Long@hp.com>,
    linux-kernel@vger.kernel.org,
    linux-mm <linux-mm@kvack.org>
Subject: Re: [PATCH v8 0/9] rwsem performance optimizations
Date: Wed, 16 Oct 2013 11:28:34 -0700
Message-ID: <1381948114.11046.194.camel@schen9-DESK>
In-Reply-To: <20131016065526.GB22509@gmail.com>

On Wed, 2013-10-16 at 08:55 +0200, Ingo Molnar wrote:
> * Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> > On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> > > * Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > >
> > > > The throughput of mmap with mutex vs pure mmap is below:
> > > >
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads    vanilla    all rwsem    without optspin
> > > >                        patches
> > > > 1            3.0%       -1.0%        -1.7%
> > > > 5            7.2%      -26.8%         5.5%
> > > > 10           5.2%      -10.6%        22.1%
> > > > 20           6.8%       16.4%        12.5%
> > > > 40          -0.2%       32.7%         0.0%
> > > >
> > > > So with mutex, the vanilla kernel and the one without optspin both run
> > > > faster.  This is consistent with what Peter reported.  With optspin, the
> > > > picture is more mixed, with lower throughput at low to moderate number
> > > > of threads and higher throughput with high number of threads.
> > >
> > > So, going back to your original table:
> > >
> > > > % change in performance of the mmap with pthread-mutex vs pure mmap
> > > > #threads    vanilla    all          without optspin
> > > > 1            3.0%       -1.0%        -1.7%
> > > > 5            7.2%      -26.8%         5.5%
> > > > 10           5.2%      -10.6%        22.1%
> > > > 20           6.8%       16.4%        12.5%
> > > > 40          -0.2%       32.7%         0.0%
> > > >
> > > > In general, vanilla and no-optspin case perform better with
> > > > pthread-mutex.  For the case with optspin, mmap with pthread-mutex is
> > > > worse at low to moderate contention and better at high contention.
> > >
> > > it appears that 'without optspin' is a pretty good choice - if
> > > it wasn't for that '1 thread' number, which, if I correctly assume is the
> > > uncontended case, is one of the most common usecases ...
> > >
> > > How can the single-threaded case get slower? None of the patches should
> > > really cause noticeable overhead in the non-contended case. That looks
> > > weird.
> > >
> > > It would also be nice to see the 2, 3, 4 thread numbers - those are the
> > > most common contention scenarios in practice - where do we see the first
> > > improvement in performance?
> > >
> > > Also, it would be nice to include a noise/stddev figure, it's really hard
> > > to tell whether -1.7% is statistically significant.
> >
> > Ingo,
> >
> > I think that the optimistic spin changes to rwsem should enhance
> > performance for real workloads after all.
> >
> > In my previous tests, I was doing mmap followed immediately by
> > munmap without doing anything to the memory.  No real workload
> > will behave that way and it is not the scenario that we
> > should optimize for.  A much better approximation of
> > real usage will be doing mmap, then touching
> > the mmapped memory, followed by munmap.
>
> That's why I asked for a working testcase to be posted ;-) Not just
> pseudocode - send the real .c thing please.

I was using a modified version of Anton's will-it-scale test.  I'll try
to port the tests to perf bench to make it easier for other people to
run the tests.

>
> > This changes the dynamics of the rwsem as we are now dominated by read
> > acquisitions of mmap sem due to the page faults, instead of having only
> > write acquisitions from mmap. [...]
>
> Absolutely, the page fault read case is the #1 optimization target of
> rwsems.
>
> > [...] In this case, any delay in write acquisitions will be costly as we
> > will be blocking a lot of readers.  This is where optimistic spinning on
> > write acquisitions of mmap sem can provide a very significant boost to
> > the throughput.
> >
> > I changed the test case to the following with writes to
> > the mmapped memory:
> >
> > #define MEMSIZE (1 * 1024 * 1024)
> >
> > char *testcase_description = "Anonymous memory mmap/munmap of 1MB";
> >
> > void testcase(unsigned long long *iterations)
> > {
> >         int i;
> >
> >         while (1) {
> >                 char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
> >                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
> >                 assert(c != MAP_FAILED);
> >                 for (i=0; i<MEMSIZE; i+=8) {
> >                         c[i] = 0xa;
> >                 }
> >                 munmap(c, MEMSIZE);
> >
> >                 (*iterations)++;
> >         }
> > }
>
> It would be _really_ nice to stick this into tools/perf/bench/ as:
>
>         perf bench mem pagefaults
>
> or so, with a number of parallelism and workload patterns. See
> tools/perf/bench/numa.c for a couple of workload generators - although
> those are not page fault intense.
>
> So that future generations can run all these tests too and such.

Okay, will do.

>
> > I compared the throughput where I have the complete rwsem patchset
> > against vanilla and the case where I take out the optimistic spin patch.
> > I have increased the run time by 10x from my previous experiments and do
> > 10 runs for each case.  The standard deviation is ~1.5% so any changes
> > under 1.5% are not statistically significant.
> >
> > % change in throughput vs the vanilla kernel.
> > Threads    all        No-optspin
> > 1          +0.4%      -0.1%
> > 2          +2.0%      +0.2%
> > 3          +1.1%      +1.5%
> > 4          -0.5%      -1.4%
> > 5          -0.1%      -0.1%
> > 10         +2.2%      -1.2%
> > 20         +237.3%    -2.3%
> > 40         +548.1%    +0.3%
>
> The tail is impressive. The early parts are important as well, but it's
> really hard to tell the significance of the early portion without having
> a stddev column.

Here's the data with sdv column:

n     all        sdv     No-optspin   sdv
1     +0.4%      0.9%    -0.1%        0.8%
2     +2.0%      0.8%    +0.2%        1.2%
3     +1.1%      0.8%    +1.5%        0.6%
4     -0.5%      0.9%    -1.4%        1.1%
5     -0.1%      1.1%    -0.1%        1.1%
10    +2.2%      0.8%    -1.2%        1.0%
20    +237.3%    0.7%    -2.3%        1.3%
40    +548.1%    0.8%    +0.3%        1.2%

> ( "perf stat --repeat N" will give you stddev output, in handy percentage
>   form. )
>
> > Now when I test the case where we acquire mutex in the
> > user space before mmap, I got the following data versus
> > vanilla kernel.  There's little contention on mmap sem
> > acquisition in this case.
> >
> > n     all      No-optspin
> > 1     +0.8%    -1.2%
> > 2     +1.0%    -0.5%
> > 3     +1.8%    +0.2%
> > 4     +1.5%    -0.4%
> > 5     +1.1%    +0.4%
> > 10    +1.5%    -0.3%
> > 20    +1.4%    -0.2%
> > 40    +1.3%    +0.4%

Adding std-dev to above data:

n     all      sdv     No-optspin   sdv
1     +0.8%    1.0%    -1.2%        1.2%
2     +1.0%    1.0%    -0.5%        1.0%
3     +1.8%    0.7%    +0.2%        0.8%
4     +1.5%    0.8%    -0.4%        0.7%
5     +1.1%    1.1%    +0.4%        0.3%
10    +1.5%    0.7%    -0.3%        0.7%
20    +1.4%    0.8%    -0.2%        1.0%
40    +1.3%    0.7%    +0.4%        0.5%

> >
> > Thanks.
>
> A bit hard to see as there's no comparison _between_ the pthread_mutex and
> plain-parallel versions. No contention isn't a great result if performance
> suffers because it's all serialized.

Now the data for the pthread-mutex vs plain-parallel vanilla testcase,
with std-dev:

n     vanilla    sdv     Rwsem-all   sdv     No-optspin   sdv
1     +0.5%      0.9%    +1.4%       0.9%    -0.7%        1.0%
2     -39.3%     1.0%    -38.7%      1.1%    -39.6%       1.1%
3     -52.6%     1.2%    -51.8%      0.7%    -52.5%       0.7%
4     -59.8%     0.8%    -59.2%      1.0%    -59.9%       0.9%
5     -63.5%     1.4%    -63.1%      1.4%    -63.4%       1.0%
10    -66.1%     1.3%    -65.6%      1.3%    -66.2%       1.3%
20    +178.3%    0.9%    +182.3%     1.0%    +177.7%      1.1%
40    +604.8%    1.1%    +614.0%     1.0%    +607.9%      0.9%

The version with the full rwsem patchset performs best across the
thread counts.  Serialization actually hurts for smaller numbers of
threads, even for the current vanilla kernel.

I'll rerun the tests once I have ported them to perf bench.  It may
take me a couple of days.

Thanks.

Tim
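
For anyone who wants to try the workload without the will-it-scale
harness, here is a rough standalone sketch of the mmap/touch/munmap
loop quoted above, with a bounded iteration count and an optional
user-space mutex.  The names worker_arg, run_workload and DEMO_MAIN are
made up for this sketch and are not part of will-it-scale or perf
bench.  It also assumes the mutex is held across the whole
mmap/touch/munmap iteration; the thread only says the mutex is taken
"before mmap", so the actual test may scope the lock differently.

```c
#define _DEFAULT_SOURCE		/* MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MEMSIZE (1 * 1024 * 1024)

/* Hypothetical per-thread state, invented for this sketch. */
struct worker_arg {
	unsigned long long iterations;	/* in: loops to run, out: loops done */
	pthread_mutex_t *lock;		/* NULL means plain-parallel */
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;
	unsigned long long todo = arg->iterations, done = 0;
	size_t i;

	while (done < todo) {
		/* Assumption: the mutex covers the whole iteration. */
		if (arg->lock)
			pthread_mutex_lock(arg->lock);

		char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (c == MAP_FAILED)
			abort();
		/* Touch the mapping so every page takes a fault. */
		for (i = 0; i < MEMSIZE; i += 8)
			c[i] = 0xa;
		munmap(c, MEMSIZE);

		if (arg->lock)
			pthread_mutex_unlock(arg->lock);
		done++;
	}
	arg->iterations = done;
	return NULL;
}

/* Run nthreads workers for iters loops each; return total loops done. */
unsigned long long run_workload(int nthreads, unsigned long long iters,
				int use_mutex)
{
	pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	pthread_t *tids = calloc(nthreads, sizeof(*tids));
	struct worker_arg *args = calloc(nthreads, sizeof(*args));
	unsigned long long total = 0;
	int i;

	if (!tids || !args)
		abort();
	for (i = 0; i < nthreads; i++) {
		args[i].iterations = iters;
		args[i].lock = use_mutex ? &lock : NULL;
		if (pthread_create(&tids[i], NULL, worker, &args[i]) != 0)
			abort();
	}
	for (i = 0; i < nthreads; i++) {
		pthread_join(tids[i], NULL);
		total += args[i].iterations;
	}
	free(tids);
	free(args);
	return total;
}

#ifdef DEMO_MAIN
int main(void)
{
	printf("plain-parallel: %llu iterations\n", run_workload(4, 100, 0));
	printf("pthread-mutex:  %llu iterations\n", run_workload(4, 100, 1));
	return 0;
}
#endif
```

Building with something like "cc -O2 -pthread -DDEMO_MAIN" gives a
small program that runs both variants once.  It only counts completed
iterations; to get throughput, time each run_workload() call
externally, for example with "perf stat --repeat N" as suggested
above, which also reports the run-to-run deviation.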