* A question on RCU vs. preempt-RCU
From: Tejun Heo @ 2013-06-16  2:36 UTC
  To: Paul E. McKenney, Rusty Russell, Kent Overstreet
  Cc: linux-kernel, Linus Torvalds, Andrew Morton

Hello, guys.

Kent recently implemented a generic percpu reference counter.  It's
scheduled to be merged in the coming merge window, and part of the
cgroup refcnting is already converted to it.

 https://git.kernel.org/cgit/linux/kernel/git/tj/percpu.git/tree/include/linux/percpu-refcount.h?h=for-3.11

 https://git.kernel.org/cgit/linux/kernel/git/tj/percpu.git/tree/lib/percpu-refcount.c?h=for-3.11

It's essentially a generalized form of module refcnting but uses
regular RCU instead of toggling preemption for local atomicity.
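
For reference, the get-side fast path boils down to roughly the
following (simplified from the header linked above):

	rcu_read_lock();

	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
	if (likely(REF_STATUS(pcpu_count) == PCPU_REF_PTR))
		__this_cpu_inc(*pcpu_count);	/* percpu fast path */
	else
		atomic_inc(&ref->count);	/* after percpu_ref_kill() */

	rcu_read_unlock();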

I've been running some performance tests with different preemption
levels and, with CONFIG_PREEMPT, the percpu ref can be slower by
around 10%, or in the most contrived worst case maybe even close to
20%, compared to a simple atomic_t on a single CPU (when hit by
multiple CPUs concurrently, it of course destroys atomic_t).  Most of
the slowdown seems to come from the preemptible tree-RCU calls, and
there no longer seems to be a way to opt out of that RCU
implementation when CONFIG_PREEMPT.

For most use cases, the trade-off should be fine.  With any kind of
cross-CPU traffic, which there usually will be, it should be an easy
win for the percpu-refcount even when CONFIG_PREEMPT; however, I've
been looking to replace the module ref with the generic one, and the
performance degradation there has a low but real possibility of being
noticeable in some edge use cases.

We can convert the percpu-refcount to use preempt_disable/enable()
paired with call_rcu_sched(), but IIUC that would have latency
implications on the callback-processing side, right?  Given that
module ref killing would be very low-frequency, it shouldn't
contribute a significant number of callbacks, but I'd like to avoid
providing two separate implementations if at all possible.
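
Concretely, the read side would then look something like the
following, with percpu_ref_kill() switched from call_rcu() to
call_rcu_sched() (just a sketch of the idea, not tested):

	rcu_read_lock_sched();	/* effectively preempt_disable() */

	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
	if (likely(REF_STATUS(pcpu_count) == PCPU_REF_PTR))
		__this_cpu_inc(*pcpu_count);
	else
		atomic_inc(&ref->count);

	rcu_read_unlock_sched();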

So, what would be the right thing to do here?  How bad would
converting percpu-refcount to sched-RCU by default be?  Would the
extra overhead on module ref be acceptable when CONFIG_PREEMPT?
What do you guys think?

Thanks!

-- 
tejun


* Re: A question on RCU vs. preempt-RCU
From: Rusty Russell @ 2013-06-16  6:46 UTC
  To: Tejun Heo, Paul E. McKenney, Kent Overstreet
  Cc: linux-kernel, Linus Torvalds, Andrew Morton

Tejun Heo <tj@kernel.org> writes:
> I've been running some performance tests with different preemption
> levels and, with CONFIG_PREEMPT, the percpu ref can be slower by
> around 10%, or in the most contrived worst case maybe even close to
> 20%, compared to a simple atomic_t on a single CPU (when hit by
> multiple CPUs concurrently, it of course destroys atomic_t).  Most of
> the slowdown seems to come from the preemptible tree-RCU calls, and
> there no longer seems to be a way to opt out of that RCU
> implementation when CONFIG_PREEMPT.
>
> For most use cases, the trade-off should be fine.  With any kind of
> cross-CPU traffic, which there usually will be, it should be an easy
> win for the percpu-refcount even when CONFIG_PREEMPT; however, I've
> been looking to replace the module ref with the generic one, and the
> performance degradation there has a low but real possibility of being
> noticeable in some edge use cases.

I'm confused: is it actually 10% slower than the existing module
refcount code, or 10% slower than atomic inc?

> We can convert the percpu-refcount to use preempt_disable/enable()
> paired with call_rcu_sched(), but IIUC that would have latency
> implications on the callback-processing side, right?  Given that
> module ref killing would be very low-frequency, it shouldn't
> contribute a significant number of callbacks, but I'd like to avoid
> providing two separate implementations if at all possible.
>
> So, what would be the right thing to do here?  How bad would
> converting percpu-refcount to sched-RCU by default be?  Would the
> extra overhead on module ref be acceptable when CONFIG_PREEMPT?
> What do you guys think?

CONFIG_PREEMPT, now with more preempt!  Sure, that has a cost, but
you're arguably fixing a bug.

If we want to improve CONFIG_PREEMPT performance, we can probably use a
trick I wanted to try long ago:

1) Use a per-cpu counter rather than a per-task counter for preempt.
2) Lay out preempt_counter so it covers NR_CPUS pages, one per page.
3) When you want to preempt a CPU and counter isn't zero, make the page RO.
4) Handle preemption enable in the fault handler.

Then there's no branch in preempt_enable().
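
Very roughly, with all names made up and hand-waving past the arch
details, the fast path would then be just (sketch only, never
compiled):

	/* One page-sized, page-aligned counter per CPU. */
	struct preempt_page {
		int count;
	} __aligned(PAGE_SIZE);

	static struct preempt_page preempt_pages[NR_CPUS];

	static inline void preempt_enable(void)
	{
		/*
		 * Unconditional decrement, no need_resched() check:
		 * anyone wanting to preempt this CPU write-protects
		 * its page, so the store below faults and the fault
		 * handler does the reschedule for us.
		 */
		preempt_pages[raw_smp_processor_id()].count--;
	}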

At a glance, the same trick could apply to t->rcu_read_unlock_special,
but I'd have to offload that to my RCU coprocessor.  Paul? :)

Cheers,
Rusty.


* Re: A question on RCU vs. preempt-RCU
From: Paul E. McKenney @ 2013-06-16 14:13 UTC
  To: Tejun Heo
  Cc: Rusty Russell, Kent Overstreet, linux-kernel, Linus Torvalds,
	Andrew Morton

On Sat, Jun 15, 2013 at 07:36:11PM -0700, Tejun Heo wrote:
> Hello, guys.
> 
> Kent recently implemented a generic percpu reference counter.  It's
> scheduled to be merged in the coming merge window and some part of
> cgroup refcnting is already converted to it.
> 
>  https://git.kernel.org/cgit/linux/kernel/git/tj/percpu.git/tree/include/linux/percpu-refcount.h?h=for-3.11
> 
>  https://git.kernel.org/cgit/linux/kernel/git/tj/percpu.git/tree/lib/percpu-refcount.c?h=for-3.11
> 
> It's essentially a generalized form of module refcnting but uses
> regular RCU instead of toggling preemption for local atomicity.
> 
> I've been running some performance tests with different preemption
> levels and, with CONFIG_PREEMPT, the percpu ref can be slower by
> around 10%, or in the most contrived worst case maybe even close to
> 20%, compared to a simple atomic_t on a single CPU (when hit by
> multiple CPUs concurrently, it of course destroys atomic_t).  Most of
> the slowdown seems to come from the preemptible tree-RCU calls, and
> there no longer seems to be a way to opt out of that RCU
> implementation when CONFIG_PREEMPT.

CONFIG_TREE_PREEMPT_RCU does have an increment, decrement (sort of),
and check in its rcu_read_lock() and rcu_read_unlock(), which will
add overhead that might well be noticeable compared to CONFIG_TREE_RCU's
zero-code implementation of rcu_read_lock() and rcu_read_unlock().
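
To illustrate: CONFIG_TREE_RCU's rcu_read_lock() is just
preempt_disable(), itself a no-op without CONFIG_PREEMPT, while
CONFIG_TREE_PREEMPT_RCU does roughly the following (simplified -- the
real rcu_read_unlock() is considerably more careful about nesting and
memory ordering):

	void __rcu_read_lock(void)
	{
		current->rcu_read_lock_nesting++;	/* increment */
		barrier();
	}

	void __rcu_read_unlock(void)
	{
		struct task_struct *t = current;

		barrier();
		--t->rcu_read_lock_nesting;		/* decrement */
		if (t->rcu_read_lock_nesting == 0 &&	/* check */
		    unlikely(ACCESS_ONCE(t->rcu_read_unlock_special)))
			rcu_read_unlock_special(t);
	}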

> For most use cases, the trade-off should be fine.  With any kind of
> cross-CPU traffic, which there usually will be, it should be an easy
> win for the percpu-refcount even when CONFIG_PREEMPT; however, I've
> been looking to replace the module ref with the generic one, and the
> performance degradation there has a low but real possibility of being
> noticeable in some edge use cases.
> 
> We can convert the percpu-refcount to use preempt_disable/enable()
> paired with call_rcu_sched(), but IIUC that would have latency
> implications on the callback-processing side, right?  Given that
> module ref killing would be very low-frequency, it shouldn't
> contribute a significant number of callbacks, but I'd like to avoid
> providing two separate implementations if at all possible.

The main source of longer latency from preempt_disable/enable()
(or rcu_read_{,un}lock_sched()) will be on the read side.
The callback-processing is very nearly identical.

> So, what would be the right thing to do here?  How bad would
> converting percpu-refcount to sched-RCU by default be?  Would the
> extra overhead on module ref be acceptable when CONFIG_PREEMPT?
> What do you guys think?

The big question is "how long are the RCU read-side critical sections?"
My guess is that module references can have arbitrarily long lifetimes,
which would argue strongly against use of RCU-sched.  But if the lifetimes
are always short (say, sub-microsecond), then RCU-sched should be fine.

							Thanx, Paul



* Re: A question on RCU vs. preempt-RCU
From: Tejun Heo @ 2013-06-16 21:40 UTC
  To: Paul E. McKenney
  Cc: Rusty Russell, Kent Overstreet, linux-kernel, Linus Torvalds,
	Andrew Morton

Hello, Paul.

On Sun, Jun 16, 2013 at 07:13:35AM -0700, Paul E. McKenney wrote:
> CONFIG_TREE_PREEMPT_RCU does have an increment, decrement (sort of),
> and check in its rcu_read_lock() and rcu_read_unlock(), which will
> add overhead that might well be noticeable compared to CONFIG_TREE_RCU's
> zero-code implementation of rcu_read_lock() and rcu_read_unlock().

Yeah, I should have added one more data point.  I was testing
atomic_t vs. percpu-ref, saw the overhead, and was worrying that it
would show a regression against a preempt_disable/enable() based
implementation.

Just ran some tests, and the preempt_disable/enable() based
implementation is about 18% faster than the rcu_read_lock/unlock()
based one.

Compared to atomic_t, in a horribly contrived test case, normal RCU
would be slower by around 20% while the preemption-based one would be
slower by 7.5%.

> The main source of longer latency from preempt_disable/enable()
> (or rcu_read_{,un}lock_sched()) will be on the read side.
> The callback-processing is very nearly identical.

Ah, right.  I was completely confused there.  The goal of
CONFIG_TREE_PREEMPT_RCU is to allow preemption inside RCU read-side
critical sections.  I knew that at one point and completely forgot
about it, so using the preemption-based one is fine as long as the
critical sections are short.

> The big question is "how long are the RCU read-side critical sections?"

Extremely short.  It's gonna be like five instructions.

> My guess is that module references can have arbitrarily long lifetimes,

Preemption is disabled only while the refcnt operations are actually
going on.

> which would argue strongly against use of RCU-sched.  But if the lifetimes
> are always short (say, sub-microsecond), then RCU-sched should be fine.

So, RCU-sched it is.

Thanks a lot for the help!

-- 
tejun


* Re: A question on RCU vs. preempt-RCU
From: Tejun Heo @ 2013-06-17 18:20 UTC
  To: Rusty Russell
  Cc: Paul E. McKenney, Kent Overstreet, linux-kernel, Linus Torvalds,
	Andrew Morton

Hello, Rusty.

On Sun, Jun 16, 2013 at 04:16:15PM +0930, Rusty Russell wrote:
> > For most use cases, the trade-off should be fine.  With any kind of
> > cross-CPU traffic, which there usually will be, it should be an easy
> > win for the percpu-refcount even when CONFIG_PREEMPT; however, I've
> > been looking to replace the module ref with the generic one, and the
> > performance degradation there has a low but real possibility of being
> > noticeable in some edge use cases.
> 
> I'm confused: is it actually 10% slower than the existing module
> refcount code, or 10% slower than atomic inc?

Heh, sorry about the confusion.  I was comparing percpu_ref to
atomic_t and then worrying about the RCU flipping overhead, as it
definitely seemed higher than flipping preemption.  As I wrote in a
reply to Paul, if I compare percpu-ref with normal RCU against
RCU-sched, the performance difference is around 18% in favor of
RCU-sched.

> CONFIG_PREEMPT, now with more preempt!  Sure, that has a cost, but
> you're arguably fixing a bug.

It seems that using RCU-sched is the right flavor for percpu_ref.  In
theory, we shouldn't see any performance degradation when converting
the module ref to percpu_ref.

> If we want to improve CONFIG_PREEMPT performance, we can probably use a
> trick I wanted to try long ago:

So, this is a slight digression.

> 1) Use a per-cpu counter rather than a per-task counter for preempt.
> 2) Lay out preempt_counter so it covers NR_CPUS pages, one per page.
> 3) When you want to preempt a CPU and counter isn't zero, make the page RO.
> 4) Handle preemption enable in the fault handler.
> 
> Then there's no branch in preempt_enable().

But yeah, interesting trick.  We'll be doing IPIs, flushing TLBs and
taking faults until the counter hits zero.  It'll all depend on the
frequency of preemption, but given that branches don't tend to be too
expensive on modern processors, maybe it'd be a bit too hairy for a
possibly marginal gain?

Thanks.

-- 
tejun


* Re: A question on RCU vs. preempt-RCU
From: Rusty Russell @ 2013-06-18  5:21 UTC
  To: Tejun Heo
  Cc: Paul E. McKenney, Kent Overstreet, linux-kernel, Linus Torvalds,
	Andrew Morton

Tejun Heo <tj@kernel.org> writes:
> But yeah, interesting trick.  We'll be doing IPIs, flushing TLBs and
> taking faults until the counter hits zero.  It'll all depend on the
> frequency of preemption, but given that branches don't tend to be too
> expensive on modern processors, maybe it'd be a bit too hairy for a
> possibly marginal gain?

Yeah, I'm not convinced either, but I am hoping some enthusiast will run
with the idea in hope of Fame and Glory :)

Cheers,
Rusty.


* Re: A question on RCU vs. preempt-RCU
From: Steven Rostedt @ 2013-06-20  3:23 UTC
  To: Rusty Russell
  Cc: Tejun Heo, Paul E. McKenney, Kent Overstreet, linux-kernel,
	Linus Torvalds, Andrew Morton

On Tue, Jun 18, 2013 at 02:51:13PM +0930, Rusty Russell wrote:
> Tejun Heo <tj@kernel.org> writes:
> > But yeah, interesting trick.  We'll be doing IPIs, flushing TLBs and
> > taking faults until the counter hits zero.  It'll all depend on the
> > frequency of preemption, but given that branches don't tend to be too
> > expensive on modern processors, maybe it'd be a bit too hairy for a
> > possibly marginal gain?
> 
> Yeah, I'm not convinced either, but I am hoping some enthusiast will run
> with the idea in hope of Fame and Glory :)

This actually looks like a trick I think would be fun to implement.
But as for it being beneficial... the preempt count is used quite
often, and the fact that this page would constantly occupy a TLB entry
would have a bigger impact on general performance than saving the
lousy branch.

-- Steve


