linux-kernel.vger.kernel.org archive mirror
* Re: Question about cacheline bouncing with percpu-rwsem and rcu-sync
       [not found]   ` <CAEXW_YReo2juN8A3CF+CKv8PcN_cH23gYWkLfkOJQqignyx85g@mail.gmail.com>
@ 2019-06-09  0:24     ` Joel Fernandes
  2019-06-09 12:22       ` Paul E. McKenney
  0 siblings, 1 reply; 3+ messages in thread
From: Joel Fernandes @ 2019-06-09  0:24 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Oleg Nesterov, Eric Dumazet, rcu, LKML

On Fri, May 31, 2019 at 10:43 AM Joel Fernandes <joel@joelfernandes.org> wrote:
[snip]
> >
> > Either way, it would be good for you to just try it.  Create a kernel
> > module or similar that hammers on percpu_down_read() and percpu_up_read(),
> > and empirically check the scalability on a largish system.  Then compare
> > this to down_read() and up_read().
>
> Will do! thanks.

I created a test for this and the results are quite amazing: the test
just stresses read lock/unlock for rwsem vs percpu-rwsem.
The test was conducted on a dual-socket Intel x86_64 machine with 14
cores per socket.

Test runs 10,000,000 loops of rwsem vs percpu-rwsem:
https://github.com/joelagnel/linux-kernel/commit/8fe968116bd887592301179a53b7b3200db84424
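
In case it helps to see the shape of the test without clicking through,
here is a rough sketch of the approach (illustrative only; the module
name, parameters, and bookkeeping below are made up and differ from the
actual code in the commit above):

/*
 * Rough illustrative sketch only -- not the actual test module in the
 * commit linked above.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/kthread.h>
#include <linux/rwsem.h>
#include <linux/percpu-rwsem.h>
#include <linux/ktime.h>
#include <linux/delay.h>
#include <linux/atomic.h>

static int nthreads = 4;
module_param(nthreads, int, 0444);
static long nloops = 10 * 1000 * 1000;
module_param(nloops, long, 0444);
static bool use_percpu;			/* false: rwsem, true: percpu-rwsem */
module_param(use_percpu, bool, 0444);

static DECLARE_RWSEM(test_rwsem);
DEFINE_STATIC_PERCPU_RWSEM(test_percpu_rwsem);
static atomic_t threads_done = ATOMIC_INIT(0);

/* Each reader thread just hammers read lock/unlock nloops times. */
static int reader_fn(void *unused)
{
	long i;

	for (i = 0; i < nloops; i++) {
		if (use_percpu) {
			percpu_down_read(&test_percpu_rwsem);
			percpu_up_read(&test_percpu_rwsem);
		} else {
			down_read(&test_rwsem);
			up_read(&test_rwsem);
		}
	}
	atomic_inc(&threads_done);
	return 0;
}

static int __init rwsem_stress_init(void)
{
	ktime_t start = ktime_get();
	int i;

	for (i = 0; i < nthreads; i++)
		kthread_run(reader_fn, NULL, "rwsem_stress/%d", i);

	/* Crude wait for all readers; fine for a throwaway test. */
	while (atomic_read(&threads_done) < nthreads)
		msleep(100);

	pr_info("rwsem_stress: %s, %d threads x %ld loops: %lld us\n",
		use_percpu ? "percpu-rwsem" : "rwsem", nthreads, nloops,
		ktime_us_delta(ktime_get(), start));
	return 0;
}
module_init(rwsem_stress_init);

static void __exit rwsem_stress_exit(void) { }
module_exit(rwsem_stress_exit);
MODULE_LICENSE("GPL");

The general idea is to load it once per configuration, varying nthreads
and the lock type, and compare the printed completion times.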

Graphs/Results here:
https://docs.google.com/spreadsheets/d/1cbVLNK8tzTZNTr-EDGDC0T0cnFCdFK3wg2Foj5-Ll9s/edit?usp=sharing

For the rwsem case, the completion time of the test goes up somewhat
exponentially with the number of threads, whereas for percpu-rwsem it
stays the same. I could add this data to some of the documentation as
well.

Thanks!

 - Joel


* Re: Question about cacheline bouncing with percpu-rwsem and rcu-sync
  2019-06-09  0:24     ` Question about cacheline bouncing with percpu-rwsem and rcu-sync Joel Fernandes
@ 2019-06-09 12:22       ` Paul E. McKenney
  2019-06-09 21:25         ` Joel Fernandes
  0 siblings, 1 reply; 3+ messages in thread
From: Paul E. McKenney @ 2019-06-09 12:22 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Oleg Nesterov, Eric Dumazet, rcu, LKML

On Sat, Jun 08, 2019 at 08:24:36PM -0400, Joel Fernandes wrote:
> On Fri, May 31, 2019 at 10:43 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> [snip]
> > >
> > > Either way, it would be good for you to just try it.  Create a kernel
> > > module or similar that hammers on percpu_down_read() and percpu_up_read(),
> > > and empirically check the scalability on a largish system.  Then compare
> > > this to down_read() and up_read().
> >
> > Will do! thanks.
> 
> I created a test for this and the results are quite amazing: the test
> just stresses read lock/unlock for rwsem vs percpu-rwsem.
> The test was conducted on a dual-socket Intel x86_64 machine with 14
> cores per socket.
> 
> Test runs 10,000,000 loops of rwsem vs percpu-rwsem:
> https://github.com/joelagnel/linux-kernel/commit/8fe968116bd887592301179a53b7b3200db84424

Interesting location, but looks functional.  ;-)

> Graphs/Results here:
> https://docs.google.com/spreadsheets/d/1cbVLNK8tzTZNTr-EDGDC0T0cnFCdFK3wg2Foj5-Ll9s/edit?usp=sharing
> 
> For the rwsem case, the completion time of the test goes up somewhat
> exponentially with the number of threads, whereas for percpu-rwsem it
> stays the same. I could add this data to some of the documentation as
> well.

Actually, the completion time looks to be pretty close to linear in the
number of CPUs.  Which is still really bad, don't get me wrong.

Thank you for doing this, and it might be good to have some documentation
on this.  In perfbook, I use counters to make this point, and perhaps
I need to emphasize more that it also applies to other algorithms,
including locking.  Me, I learned this lesson from a logic analyzer
back in the very early 1990s.  This was back in the days before on-CPU
caches when a logic analyzer could actually tell you something about
the detailed execution.  ;-)

The key point is that you can often closely approximate the performance
of synchronization algorithms by counting the number of cache misses and
the number of CPUs competing for each cache line.
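
For example, a crude back-of-envelope model of the read-side stress
test above (purely illustrative, ignoring queueing and hardware
details) would be:

	T_rwsem(N)         ~= nloops * misses_per_op * line_transfer_cost * N
	T_percpu_rwsem(N)  ~= nloops * local_counter_cost        (flat in N)

Each down_read()/up_read() on a plain rwsem must gain exclusive
ownership of the cache line holding the shared count, and only one CPU
can own that line at a time, so the N competing CPUs end up serializing
on it and the completion time grows roughly linearly with N.  The
percpu-rwsem reader fast path touches only a per-CPU counter, so there
is no shared line to fight over and the curve stays flat.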

If you want to get the microbenchmark test code itself upstream,
one approach might be to have a kernel/locking/lockperf.c similar to
kernel/rcu/rcuperf.c.
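
To give a feel for it, a purely hypothetical skeleton (none of these
names exist in the tree; the real thing would want the timed reader
kthreads, stats, and shutdown handling that rcuperf has) might look
like:

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/string.h>
#include <linux/rwsem.h>
#include <linux/percpu-rwsem.h>

/* One ops table per lock flavor, in the style of rcuperf/locktorture. */
struct lock_perf_ops {
	const char *name;
	void (*readlock)(void);
	void (*readunlock)(void);
};

static DECLARE_RWSEM(perf_rwsem);
DEFINE_STATIC_PERCPU_RWSEM(perf_percpu_rwsem);

static void rwsem_perf_readlock(void)		{ down_read(&perf_rwsem); }
static void rwsem_perf_readunlock(void)		{ up_read(&perf_rwsem); }
static void pcpu_rwsem_perf_readlock(void)	{ percpu_down_read(&perf_percpu_rwsem); }
static void pcpu_rwsem_perf_readunlock(void)	{ percpu_up_read(&perf_percpu_rwsem); }

static struct lock_perf_ops rwsem_ops = {
	.name		= "rwsem",
	.readlock	= rwsem_perf_readlock,
	.readunlock	= rwsem_perf_readunlock,
};

static struct lock_perf_ops percpu_rwsem_ops = {
	.name		= "percpu-rwsem",
	.readlock	= pcpu_rwsem_perf_readlock,
	.readunlock	= pcpu_rwsem_perf_readunlock,
};

static struct lock_perf_ops *perf_ops[] = { &rwsem_ops, &percpu_rwsem_ops };

/* A perf_type parameter would select the ops table; the timed reader
 * loop would then call ops->readlock()/ops->readunlock(). */
static char *perf_type = "rwsem";
module_param(perf_type, charp, 0444);
MODULE_PARM_DESC(perf_type, "Lock type to measure (rwsem, percpu-rwsem)");

static int __init lock_perf_init(void)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(perf_ops); i++) {
		if (!strcmp(perf_type, perf_ops[i]->name)) {
			pr_info("lockperf: measuring %s readers\n",
				perf_ops[i]->name);
			/* ... spawn and time the reader kthreads here ... */
			return 0;
		}
	}
	pr_err("lockperf: unknown perf_type %s\n", perf_type);
	return -EINVAL;
}
module_init(lock_perf_init);
MODULE_LICENSE("GPL");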

Thoughts?

							Thanx, Paul



* Re: Question about cacheline bouncing with percpu-rwsem and rcu-sync
  2019-06-09 12:22       ` Paul E. McKenney
@ 2019-06-09 21:25         ` Joel Fernandes
  0 siblings, 0 replies; 3+ messages in thread
From: Joel Fernandes @ 2019-06-09 21:25 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Oleg Nesterov, Eric Dumazet, rcu, LKML

On Sun, Jun 09, 2019 at 05:22:26AM -0700, Paul E. McKenney wrote:
> On Sat, Jun 08, 2019 at 08:24:36PM -0400, Joel Fernandes wrote:
> > On Fri, May 31, 2019 at 10:43 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > [snip]
> > > >
> > > > Either way, it would be good for you to just try it.  Create a kernel
> > > > module or similar that hammers on percpu_down_read() and percpu_up_read(),
> > > > and empirically check the scalability on a largish system.  Then compare
> > > > this to down_read() and up_read().
> > >
> > > Will do! thanks.
> > 
> > I created a test for this and the results are quite amazing: the test
> > just stresses read lock/unlock for rwsem vs percpu-rwsem.
> > The test was conducted on a dual-socket Intel x86_64 machine with 14
> > cores per socket.
> > 
> > Test runs 10,000,000 loops of rwsem vs percpu-rwsem:
> > https://github.com/joelagnel/linux-kernel/commit/8fe968116bd887592301179a53b7b3200db84424
> 
> Interesting location, but looks functional.  ;-)
> 
> > Graphs/Results here:
> > https://docs.google.com/spreadsheets/d/1cbVLNK8tzTZNTr-EDGDC0T0cnFCdFK3wg2Foj5-Ll9s/edit?usp=sharing
> > 
> > For the rwsem case, the completion time of the test goes up somewhat
> > exponentially with the number of threads, whereas for percpu-rwsem it
> > stays the same. I could add this data to some of the documentation as
> > well.
> 
> Actually, the completion time looks to be pretty close to linear in the
> number of CPUs.  Which is still really bad, don't get me wrong.

Sure, yes, on second thought it is more linear than exponential :)

> Thank you for doing this, and it might be good to have some documentation
> on this.  In perfbook, I use counters to make this point, and perhaps
> I need to emphasize more that it also applies to other algorithms,
> including locking.  Me, I learned this lesson from a logic analyzer
> back in the very early 1990s.  This was back in the days before on-CPU
> caches when a logic analyzer could actually tell you something about
> the detailed execution.  ;-)
> 
> The key point is that you can often closely approximate the performance
> of synchronization algorithms by counting the number of cache misses and
> the number of CPUs competing for each cache line.

Cool, thanks for that insight. It has been some years since I used a logic
analyzer for bus protocol debugging, but those are fun!

> If you want to get the microbenchmark test code itself upstream,
> one approach might be to have a kernel/locking/lockperf.c similar to
> kernel/rcu/rcuperf.c.
> Thoughts?

That sounds great to me; there are no other locking performance tests in the
kernel. There are locking API selftests at boot (DEBUG_LOCKING_API_SELFTESTS),
which just check whether lockdep catches locking issues, and there is
locktorture, but I believe none of these test lock performance.

I think a lockperf.c could also test other things about locking mechanisms,
such as how they perform when the owner of the lock is currently running vs.
sleeping while another thread is trying to acquire it, etc. What do you think?
I can add this to my to-do list. Right now I'm working on the list-RCU lockdep
checking I started earlier [1] and want to post another series soon.

Thanks a lot,

- Joel

[1] https://lkml.org/lkml/2019/6/1/495
    https://lore.kernel.org/patchwork/patch/1082846/
> 
> 							Thanx, Paul
> 


Thread overview: 3+ messages
     [not found] <CAEXW_YTzUsT8xCD=vkSR=mT+L7ot7tCESTWYVqNt_3SQeVDUEA@mail.gmail.com>
     [not found] ` <20190531135051.GL28207@linux.ibm.com>
     [not found]   ` <CAEXW_YReo2juN8A3CF+CKv8PcN_cH23gYWkLfkOJQqignyx85g@mail.gmail.com>
2019-06-09  0:24     ` Question about cacheline bouncing with percpu-rwsem and rcu-sync Joel Fernandes
2019-06-09 12:22       ` Paul E. McKenney
2019-06-09 21:25         ` Joel Fernandes
