* per-cpu operation madness vs validation
From: Peter Zijlstra @ 2011-07-26 21:06 UTC (permalink / raw)
  To: Tejun Heo, Christoph Lameter
  Cc: Linus Torvalds, Thomas Gleixner, Paul E. McKenney, linux-kernel,
	Ingo Molnar

Hi all,

so recently our per-cpu ops have exploded.. who can say (without
looking) what the difference is between percpu_read() and
this_cpu_read() ?

Now Thomas recently did a fresh -rt and ran into the fun problem of
trying to reconstruct the requirements on a lot of the per-cpu grub
we've grown over the past few releases.

Does it require preempt-disable, bh disable, irq-disable, is it perhaps
covered by a lock, what!? I'm sure Thomas can point you to a few gems if
you're interested in specifics.

Now the reason is of course that -rt changes some things; even when
using migrate_disable() multiple processes might try to access the
per-cpu variable, needing serialization.

The idea was to do something similar to rcu-lockdep, where each
rcu_dereference() was paired with a conditional that expresses the exact
conditions under which that dereference was good -- typically either
when holding the rcu_read_lock() or the lock used to serialize the
modifying side of things.
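
For reference, the rcu-lockdep pairing looks roughly like this, and a
per-cpu analogue would express the same kind of condition. Sketch only:
this_cpu_read_check() does not exist, and the variable and lock names
are made up.

/* existing rcu-lockdep style: the access documents its own protection */
p = rcu_dereference_check(gp, lockdep_is_held(&gp_lock));

/* hypothetical per-cpu analogue: the condition states what serializes us */
#define this_cpu_read_check(var, cond)		\
({						\
	WARN_ON_ONCE(!(cond));			\
	__this_cpu_read(var);			\
})

val = this_cpu_read_check(my_counter,
			  lockdep_is_held(&my_lock) || irqs_disabled());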

Even for mainline such validation is useful, suppose you intended the
variable to only be accessed by task context, and thus disabling
preemption is plenty to fully serialize things. However unbeknown to the
original author someone adds usage from IRQ context, and voila things
start breaking.

The point of course is, how are we going to go about doing this, I'm
sure adding a proper conditional to each and every per-cpu op is going
to be a herculean task, and most of it utterly boring.

One of the things we could do is create a lock type that doesn't actually
generate any lock code on mainline (but does on -rt) but is instead used
to annotate the serialization requirements of various of these things.
If such a lock isn't IRQ-safe but is used from IRQ context, lockdep will
complain, etc.




* Re: per-cpu operation madness vs validation
From: Christoph Lameter @ 2011-07-26 21:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar

On Tue, 26 Jul 2011, Peter Zijlstra wrote:

> Hi all,
>
> so recently our per-cpu ops have exploded.. who can say (without
> looking) what the difference is between percpu_read() and
> this_cpu_read() ?
>
> Now Thomas recently did a fresh -rt and ran into the fun problem of
> trying to reconstruct the requirements on a lot of the per-cpu grub
> we've grown over the past few releases.
>
> Does it require preempt-disable, bh disable, irq-disable, is it perhaps
> covered by a lock, what!? I'm sure Thomas can point you to a few gems if
> you're interested in specifics.

The this_cpu stuff attempts to make it more consistent.

__this_cpu_xx stuff is to be run in preempt-disabled contexts or in
contexts where interrupts are disabled.

this_cpu_xx stuff is for contexts where preemption is possible.

irqsafe_cpu_xx stuff is for contexts where access is also possible from
an interrupt context.

Those all fold to the same operation on x86. The x86 cpu operations with
segment prefix are interrupt and preempt safe. The context issues come
into play in fallback scenarios where other architectures do not have
instructions that can perform the same things as x86.
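
Roughly, the generic fallbacks differ only in how much context they mask;
simplified sketch (not the literal include/linux/percpu.h source):

/* caller guarantees it cannot be preempted */
#define __this_cpu_add(pcp, val)			\
do {							\
	*this_cpu_ptr(&(pcp)) += (val);			\
} while (0)

/* safe to call with preemption enabled */
#define this_cpu_add(pcp, val)				\
do {							\
	preempt_disable();				\
	*this_cpu_ptr(&(pcp)) += (val);			\
	preempt_enable();				\
} while (0)

/* safe against interrupts as well */
#define irqsafe_cpu_add(pcp, val)			\
do {							\
	unsigned long flags;				\
	local_irq_save(flags);				\
	*this_cpu_ptr(&(pcp)) += (val);			\
	local_irq_restore(flags);			\
} while (0)

/* on x86 all three emit a single "addl %1, %%gs:%0" */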

> Now the reason is of course that -rt changes some things; even when
> using migrate_disable() multiple processes might try to access the
> per-cpu variable, needing serialization.

On x86 these operations are rt safe by the pure fact that all these
operations are the same interrupt safe instruction. The difficulties come in for other
architectures. In those cases preemption or interrupts need to be
disabled.

> Even for mainline such validation is useful, suppose you intended the
> variable to only be accessed by task context, and thus disabling
> preemption is plenty to fully serialize things. However unbeknown to the
> original author someone adds usage from IRQ context, and voila things
> start breaking.

Right.

> The point of course is, how are we going to go about doing this, I'm
> sure adding a proper conditional to each and every per-cpu op is going
> to be a herculean task, and most of it utterly boring.

Basically we need to track the contexts in which references to a certain
per cpu variable are established. Then constraints can be enforced. For
example:

1. A variable used in an interrupt context with __this_cpu ops requires
irqsafe_cpu_xxx when used in a context where interrupts are enabled.

2. A variable used in a non-preemptible context with __this_cpu ops
requires this_cpu ops in a preemptible context.
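
Nothing does this today; a rough sketch of what such a check could look
like under a debug option (all names here are hypothetical):

/* per-variable debug state, hypothetical CONFIG_DEBUG_PERCPU_CONTEXT */
enum pcp_ctx { PCP_CTX_TASK, PCP_CTX_SOFTIRQ, PCP_CTX_IRQ };

struct pcp_debug {
	enum pcp_ctx strongest;		/* strongest context seen so far */
};

static inline void pcp_check(struct pcp_debug *d)
{
	enum pcp_ctx cur = in_irq() ? PCP_CTX_IRQ :
			   in_softirq() ? PCP_CTX_SOFTIRQ : PCP_CTX_TASK;

	if (cur > d->strongest)
		d->strongest = cur;

	/* rule 1: data also used from irq context must not be modified
	   with plain __this_cpu ops while interrupts are on */
	WARN_ON_ONCE(d->strongest == PCP_CTX_IRQ && !irqs_disabled());

	/* rule 2: __this_cpu ops require a non-preemptible caller */
	WARN_ON_ONCE(preemptible());
}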



* Re: per-cpu operation madness vs validation
From: Peter Zijlstra @ 2011-07-27 12:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar

On Tue, 2011-07-26 at 16:32 -0500, Christoph Lameter wrote:

> Those all fold to the same operation on x86. The x86 cpu operations with
> segment prefix are interrupt and preempt safe. The context issues come
> into play in fallback scenarios where other architectures do not have
> instructions that can perform the same things as x86.

Right, but that doesn't help us much if anything.

> > Now the reason is of course that -rt changes some things; even when
> > using migrate_disable() multiple processes might try to access the
> > per-cpu variable, needing serialization.
> 
> On x86 these operations are rt safe by the pure fact that all these
> operations are the same interrupt safe instruction.

The number of instructions has nothing whatsoever to do with things.
And while the individual ops might be preempt/irq/nmi safe, that doesn't
say anything at all about the piece of code they're used in.

>  The difficulties come in for other
> architectures. In those cases preemption or interrupts need to be
> disabled.

Irrelevant.

> > The point of course is, how are we going to go about doing this, I'm
> > sure adding a proper conditional to each and every per-cpu op is going
> > to be a herculean task, and most of it utterly boring.
> 
> Basically we need to track the contexts in which references to a certain
> per cpu variable are established. Then constraints can be enforced. For
> example:
> 
> 1. A variable used in an interrupt context with __this_cpu ops requires
> irqsafe_cpu_xxx when used in a context where interrupts are enabled.
> 
> 2. A variable used in a non-preemptible context with __this_cpu ops
> requires this_cpu ops in a preemptible context.

And doesn't solve the larger issue of multiple per-cpu variables forming
a consistent piece of data.

Suppose there's two per-cpu variables, a and b, and we have an invariant
that says that a-b := 5. Then the following code:

preempt_disable();
__this_cpu_inc(a);
__this_cpu_inc(b);
preempt_enable();

is only correct when used from task context, an IRQ user can easily
observe our invariant failing and none of your above validations will
catch this.

Now move to -rt where the above will likely end up looking something
like:

migrate_disable();
__this_cpu_inc(a);
__this_cpu_inc(b);
migrate_enable();

and everybody can see the invariant failing.

Now clearly my example is very artificial but not far fetched at all,
there's lots of code like this, and -rt gets to try and unravel all
this. Basically preempt_disable()/local_bh_disable()/local_irq_disable()
are the next BKL, there is no context whatsoever to reconstruct what
invariants certain pieces of code expect.

Hence my suggestion to do something like:

struct foo {
	percpu_lock_t lock;
	int a;
	int b;
}

DEFINE_PER_CPU(struct foo, foo);

percpu_lock(&foo.lock);
__this_cpu_inc(foo.a);
__this_cpu_inc(foo.b);
percpu_unlock(&foo.lock);

That would get us (aside from a shitload of work to make it so):

 - clear boundaries of where the data structure atomicity lies
 - validation, for if the above piece of code were also run from IRQ
   context we could get lockdep complaining about IRQ-unsafe locks used
   from IRQ context.

Now for !-rt percpu_lock will not emit more than
preempt_disable/local_bh_disable/local_irq_disable, depending on what
variant is used, and the data type percpu_lock_t would be empty (except
when enabling lockdep of course).
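
Concretely, for !-rt the whole thing could collapse to something like this
(sketch only; the lockdep hookup is elided, and on -rt percpu_lock_t would
instead carry a real per-cpu lock):

typedef struct {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lockdep_map dep_map;	/* annotation only, no storage otherwise */
#endif
} percpu_lock_t;

static inline void percpu_lock(percpu_lock_t *l)
{
	preempt_disable();
	/* with lockdep: lock_acquire() on l->dep_map, so use of an
	   IRQ-unsafe instance from IRQ context gets flagged, as with spinlocks */
}

static inline void percpu_unlock(percpu_lock_t *l)
{
	/* with lockdep: lock_release() on l->dep_map */
	preempt_enable();
}

The _bh and _irq variants would map to local_bh_disable()/local_irq_disable()
in the same way.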

Possibly we could reduce all this percpu madness back to one form
(__this_cpu_*) and require that when used a lock of the percpu_lock_t is
taken.




* Re: per-cpu operation madness vs validation
From: Peter Zijlstra @ 2011-07-27 13:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar

On Wed, 2011-07-27 at 14:11 +0200, Peter Zijlstra wrote:

> Hence my suggestion to do something like:
> 
> struct foo {
> 	percpu_lock_t lock;
> 	int a;
> 	int b;
> }
> 
> DEFINE_PER_CPU(struct foo, foo);
> 
> percpu_lock(&foo.lock);
> __this_cpu_inc(foo.a);
> __this_cpu_inc(foo.b);
> percpu_unlock(&foo.lock);
> 
> That would get us (aside from a shitload of work to make it so):
> 
>  - clear boundaries of where the data structure atomicity lies
>  - validation, for if the above piece of code were also run from IRQ
>    context we could get lockdep complaining about IRQ-unsafe locks used
>    from IRQ context.
> 
> Now for !-rt percpu_lock will not emit more than
> preempt_disable/local_bh_disable/local_irq_disable, depending on what
> variant is used, and the data type percpu_lock_t would be empty (except
> when enabling lockdep of course).
> 
> Possibly we could reduce all this percpu madness back to one form
> (__this_cpu_*) and require that when used a lock of the percpu_lock_t is
> taken.

get_cpu_var()/put_cpu_var() were supposed to provide such delineation as
well, but you've been actively destroying things like that with the
recent per-cpu work.

Also, I think we can mostly deprecate preempt_disable, local_bh_disable
and local_irq_disable when we have percpu_lock_t, or is local_lock_t a
better name?


* Re: per-cpu operation madness vs validation
From: Christoph Lameter @ 2011-07-27 14:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar, eric.dumazet

On Wed, 27 Jul 2011, Peter Zijlstra wrote:

> > On x86 these operations are rt safe by the pure fact that all these
> > operations are the same interrupt safe instruction.
>
> The number of instructions has nothing whatsoever to do with things.

Of course it has. If it's one instruction then that instruction is by
design atomic vs interrupts and preemption. No interrupt disabling or
preemption disabling is necessary to ensure that the RMV operation is
safe from interrupts or preemption.

> And while the individual ops might be preempt/irq/nmi safe, that doesn't
> say anything at all about the piece of code they're used in.

Sure only the individual RMV operation is safe. If there are multiple
operations then there are the usual issues.

> And doesn't solve the larger issue of multiple per-cpu variables forming
> a consistent piece of data.

That is a different issue from single atomic operations and AFAICT that
was what we discussed here.

> Suppose there's two per-cpu variables, a and b, and we have an invariant
> that says that a-b := 5. Then the following code:
>
> preempt_disable();
> __this_cpu_inc(a);
> __this_cpu_inc(b);
> preempt_enable();
>
> is only correct when used from task context, an IRQ user can easily
> observe our invariant failing and none of your above validations will
> catch this.

Well yes these are multiple operations that have a sort of atomicity
requirement together. You want both variables to be updated together and
therefore the individual RMV operations won't do the trick. The algorithm
would have to be changed to take advantage of the this_cpu ops or you will
have to take the penalty of higher synchronization costs.

> Now move to -rt where the above will likely end up looking something
> like:
>
> migrate_disable();
> __this_cpu_inc(a);
> __this_cpu_inc(b);
> migrate_enable();
>
> and everybody can see the invariant failing.

Obviously. The this_cpu ops etc. are not intended to address large sections
that access per cpu data and want to make sure that others see the changes
in an atomic fashion.

> Now clearly my example is very artificial but not far fetched at all,

Sure but these issues require a different way of structuring the
algorithm. That is why I have introduced longer cmpxchg operations.

What you can do is what the slub fastpath does in order to allow updates
without caring about preemption or interrupts.

1. You get a reference to the current per cpu data.
2. You obtain a transaction id that is cpu specific and operation specific.
3. Speculatively access an object and work with the information of that
object. Do not change anything.
4. Calculate the next transaction id.
5. Replace pointer and transaction id with a new object pointer and new
transaction id. If there is no match restart at 1.

Otherwise you are done. Any interrupt or preemption section that also does
operations on the critical data would increment the transaction id and
cause a retry of the section.

So now you have a per cpu access that is fully immune from interrupt and
preemption issues.
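
For the record, the pattern is roughly the one below, modelled on the slub
fastpath; the struct, field and helper names are invented for the example,
and this_cpu_cmpxchg_double() needs the two words adjacent and suitably
aligned (with an irq-disabling fallback on architectures lacking a
double-word cmpxchg):

struct item;						/* whatever the data is */
struct item *item_prepare_update(struct item *old);	/* hypothetical */

struct item_pcpu {
	struct item	*obj;	/* current per-cpu object */
	unsigned long	tid;	/* cpu- and operation-specific transaction id */
};

DEFINE_PER_CPU_ALIGNED(struct item_pcpu, item_pcpu);

static void item_update(void)
{
	struct item *old, *new;
	unsigned long tid;

	do {
		/* 1+2: snapshot the object pointer and the transaction id */
		tid = __this_cpu_read(item_pcpu.tid);
		old = __this_cpu_read(item_pcpu.obj);

		/* 3: speculative work, nothing visible is modified yet */
		new = item_prepare_update(old);

		/* 4+5: commit both words atomically; if an interrupt or a
		   preempting task touched this cpu's data in between, the
		   tid no longer matches and we retry */
	} while (!this_cpu_cmpxchg_double(item_pcpu.obj, item_pcpu.tid,
					  old, tid,
					  new, tid + 1));
}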

> Hence my suggestion to do something like:
>
> struct foo {
> 	percpu_lock_t lock;
> 	int a;
> 	int b;
> }
>
> DEFINE_PER_CPU(struct foo, foo);
>
> percpu_lock(&foo.lock);
> __this_cpu_inc(foo.a);
> __this_cpu_inc(foo.b);
> percpu_unlock(&foo.lock);

This is too expensive for my taste. If you adopt the above transaction id
scheme then you will have rt performance on per cpu sections that is
competitive with a non rt kernel. The cost involved is that the per cpu
sections have to be done differently. But there is a win both for RT and
non RT kernels.

> Possibly we could reduce all this percpu madness back to one form
> (__this_cpu_*) and require that when used a lock of the percpu_lock_t is
> taken.

percpu sections are very performance critical components of the kernel.
Taking locks could be avoided with serialization through transaction ids
f.e.



* Re: per-cpu operation madness vs validation
From: Christoph Lameter @ 2011-07-27 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar

On Wed, 27 Jul 2011, Peter Zijlstra wrote:

> > Possibly we could reduce all this percpu madness back to one form
> > (__this_cpu_*) and require that when used a lock of the percpu_lock_t is
> > taken.
>
> get_cpu_var()/put_cpu_var() were supposed to provide such delineation as
> well, but you've been actively destroying things like that with the
> recent per-cpu work.

The per cpu work is *not* focused on sections that access per cpu data, so
how could it destroy that? Nothing is changed there so far. The this_cpu
ops are introducing per cpu atomic operations that are safe and cheap
regardless of the execution context. The primary initial motivation was
incrementing per cpu counters without having to disable interrupts
and/or preemption and it grew from there.

The atomic per cpu operations (like the this_cpu_cmpxchg) allow the
construction of cheap sections that would satisfy your goals too (with
some work and the use of transaction ids instead of locking). See the slub
fastpath f.e.



* Re: per-cpu operation madness vs validation
From: Peter Zijlstra @ 2011-07-27 16:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar, eric.dumazet

On Wed, 2011-07-27 at 09:16 -0500, Christoph Lameter wrote:

> Well yes these are multiple operations that have a sort of atomicity
> requirement together. You want both variables to be updated together and
> therefore the individual RMV operations won't do the trick. The algorithm
> would have to be changed to take advantage of the this_cpu ops or you will
> have to take the penalty of higher synchronization costs.

You're still not getting it, synchronization cost isn't the point.

> > Now clearly my example is very artificial but not far fetched at all,
> 
> Sure but these issues require a different way of structuring the
> algorithm. That is why I have introduced longer cmpxchg operations.

They require nothing of the sort.

> What you can do is what the slub fastpath does in order to allow updates
> without caring about preemption or interrupts.

> 1. You get a reference to the current per cpu data.
> > 2. You obtain a transaction id that is cpu specific and operation specific.
> 3. Speculatively access an object and work with the information of that
> object. Do not change anything.

Pretty hard when the whole point is changing stuff.

> 4. Calculate the next transaction id.
> 5. Replace pointer and transaction id with a new object pointer and new
> transaction id. If there is no match restart at 1.

Which is utter crap, ease up on the drugs. That mandates a
read-copy-update like approach to all per-cpu data.

> Otherwise you are done. Any interrupt or preemption section that also does
> operations on the critical data would increment the transaction id and
> cause a retry of the section.
> 
> So now you have a per cpu access that is fully immune from interrupt and
> preemption issues.

And suffers from allocation issues.. allocating a new piece of data,
copying over the old one, modifying it, installing the new pointer,
managing references to the old pointer etc.. You're on crack, that's
often way too much trouble for nearly no gain.

> > Hence my suggestion to do something like:
> >
> > struct foo {
> > 	percpu_lock_t lock;
> > 	int a;
> > 	int b;
> > }
> >
> > DEFINE_PER_CPU(struct foo, foo);
> >
> > percpu_lock(&foo.lock);
> > __this_cpu_inc(foo.a);
> > __this_cpu_inc(foo.b);
> > percpu_unlock(&foo.lock);
> 
> This is too expensive for my taste.

It doesn't cost a cycle nor a byte extra over what mainline does now, so how
is it too expensive? It's just a clearer way of writing the code since it
cleanly couples locks to data instead of using blanket preempt/bh/irq
disable stmts and assuming things -- and yes, preempt/bh/irq are locks.

> percpu sections are very performance critical components of the kernel.
> Taking locks could be avoided with serialization through transaction ids
> f.e.

Not the fucking point! This isn't about making things go faster, this is
about breaking up the horrid implicit preempt/bh/irq assumptions all
over the place. It really is no different from killing the BKL.

You'll need this even if you're going to rewrite the kernel with your
cmpxchg_double wankery simply because you need to know how stuff is
supposed to work.

All this is simply about de-obfuscating all per-cpu assumptions. This is
about verification and traceability/debuggability.




* Re: per-cpu operation madness vs validation
From: Peter Zijlstra @ 2011-07-27 16:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar

On Wed, 2011-07-27 at 09:20 -0500, Christoph Lameter wrote:
> On Wed, 27 Jul 2011, Peter Zijlstra wrote:
> 
> > > Possibly we could reduce all this percpu madness back to one form
> > > (__this_cpu_*) and require that when used a lock of the percpu_lock_t is
> > > taken.
> >
> > get_cpu_var()/put_cpu_var() were supposed to provide such delineation as
> > well, but you've been actively destroying things like that with the
> > recent per-cpu work.
> 
> The per cpu work is *not* focused on sections that access per cpu data, so
> how could it destroy that? Nothing is changed there so far. The this_cpu
> ops are introducing per cpu atomic operations that are safe and cheap
> regardless of the execution context. The primary initial motivation was
> incrementing per cpu counters without having to disable interrupts
> and/or preemption and it grew from there.

I think you need to look at 20b876918c065818b3574a426d418f68b4f8ad19 and
try again. You removed get_cpu_var()/put_cpu_var() and replaced it with
naked preempt_disable()/preempt_enable(). That's losing information
right there.

Also, I think if you do s/__this_cpu/this_cpu/ on that function you can
simply drop the preempt_enable/disable things.
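
I.e., purely as an illustration (not the code from that commit, and
some_counter is a made-up variable):

/* before: open-coded preempt section around a __this_cpu op */
preempt_disable();
__this_cpu_inc(some_counter);
preempt_enable();

/* after: the preemption handling is folded into the op itself */
this_cpu_inc(some_counter);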

> The atomic per cpu operations (like the this_cpu_cmpxchg) allow the
> construction of cheap sections that would satisfy your goals too (with
> some work and the use of transaction ids instead of locking). See the slub
> fastpath f.e.

The slub fastpath is a shining example of crap. It's horrid code, and
there was nothing wrong with actually disabling preemption over it, even
for -rt. Sure you made it go faster, but that's not the point.

Most of the slub problems with -rt are in the slow path where you still
have per-cpu assumptions and keep preempt/irqs disabled.

I don't think it is possible, nor desirable, to wreck all per-cpu usage
in the kernel that way.


* Re: per-cpu operation madness vs validation
From: Peter Zijlstra @ 2011-07-27 16:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar

On Wed, 2011-07-27 at 18:20 +0200, Peter Zijlstra wrote:
> > > get_cpu_var()/put_cpu_var() were supposed to provide such delineation as
> > > well, but you've been actively destroying things like that with the
> > > recent per-cpu work.
> > 
> > The per cpu work is *not* focused on sections that access per cpu data, so
> > how could it destroy that? Nothing is changed there so far. The this_cpu
> > ops are introducing per cpu atomic operations that are safe and cheap
> > regardless of the execution context. The primary initial motivation was
> > incrementing per cpu counters without having to disable interrupts
> > and/or preemption and it grew from there.
> 
> I think you need to look at 20b876918c065818b3574a426d418f68b4f8ad19 and
> try again. You removed get_cpu_var()/put_cpu_var() and replaced it with
> naked preempt_disable()/preempt_enable(). That's losing information
> right there. 

Also, things like the below hunk are just plain ugly and obfuscate the
code to save one load at best. I'm sorely tempted to revert such crap.

@@ -1468,14 +1465,12 @@ static void x86_pmu_start_txn(struct pmu *pmu)
  */
 static void x86_pmu_cancel_txn(struct pmu *pmu)
 {
-       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-
-       cpuc->group_flag &= ~PERF_EVENT_TXN;
+       __this_cpu_and(cpu_hw_events.group_flag, ~PERF_EVENT_TXN);
        /*
         * Truncate the collected events.
         */
-       cpuc->n_added -= cpuc->n_txn;
-       cpuc->n_events -= cpuc->n_txn;
+       __this_cpu_sub(cpu_hw_events.n_added, __this_cpu_read(cpu_hw_events.n_txn));
+       __this_cpu_sub(cpu_hw_events.n_events, __this_cpu_read(cpu_hw_events.n_txn));
        perf_pmu_enable(pmu);
 }



* Re: per-cpu operation madness vs validation
From: Christoph Lameter @ 2011-07-27 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar

On Wed, 27 Jul 2011, Peter Zijlstra wrote:

> > regardless of the execution context. The primary initial motivation was
> > incrementing per cpu counters without having to disable interrupts
> > and/or preemption and it grew from there.
>
> I think you need to look at 20b876918c065818b3574a426d418f68b4f8ad19 and
> try again. You removed get_cpu_var()/put_cpu_var() and replaced it with
> naked preempt_disable()/preempt_enable(). That's losing information
> right there.

It avoided storing the cpu variable.

> Also, I think if you do s/__this_cpu/this_cpu/ on that function you can
> simply drop the preempt_enable/disable things.

Ok that would be good. Could you do a patch like that?

> > The atomic per cpu operations (like the this_cpu_cmpxchg) allow the
> > construction of cheap sections that would satisfy your goals too (with
> > some work and the use of transaction ids instead of locking). See the slub
> > fastpath f.e.
>
> The slub fastpath is a shining example of crap. It's horrid code, and
> there was nothing wrong with actually disabling preemption over it, even
> for -rt. Sure you made it go faster, but that's not the point.

Going faster is the main point. Serialization overhead is a major cause of
slowdown. Sure, it may look unusual to you, but over time this will clear
up and become easier to use.

> Most of the slub problems with -rt are in the slow path where you still
> have per-cpu assumptions and keep preempt/irqs disabled.

There are patches pending that continue to remove these things from the
slowpath. With the proposed changes pending for 3.1 we will have a free
slowpath that does not require irq disable or preempt disable.

> I don't think it is possible, nor desirable, to wreck all per-cpu usage
> in the kernel that way.

What in the world are you talking about? No major change to basic
get_cpu/put_cpu and friends is necessary. The __this_cpu ops are a new
thing, true, but they also replace open-coded increments in preempt
sections that then have per cpu increments in non-preempt sections.


* Re: per-cpu operation madness vs validation
From: Christoph Lameter @ 2011-07-27 17:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar, eric.dumazet

On Wed, 27 Jul 2011, Peter Zijlstra wrote:

> Also things like the below hunk are just plain ugly and obfuscate the
> code to save one load at best. I'm sorely tempted to revert such crap.
>
> @@ -1468,14 +1465,12 @@ static void x86_pmu_start_txn(struct pmu *pmu)
>   */
>  static void x86_pmu_cancel_txn(struct pmu *pmu)
>  {
> -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> -
> -       cpuc->group_flag &= ~PERF_EVENT_TXN;
> +       __this_cpu_and(cpu_hw_events.group_flag, ~PERF_EVENT_TXN);
>         /*
>          * Truncate the collected events.
>          */
> -       cpuc->n_added -= cpuc->n_txn;
> -       cpuc->n_events -= cpuc->n_txn;
> +       __this_cpu_sub(cpu_hw_events.n_added, __this_cpu_read(cpu_hw_events.n_txn));
> +       __this_cpu_sub(cpu_hw_events.n_events, __this_cpu_read(cpu_hw_events.n_txn));
>         perf_pmu_enable(pmu);
>  }

Yea, that is pretty ugly and went overboard. Too many segment
overrides there. The following patch will make that easier to read. We
could also remove the whole hunk. The __this_cpu ops there only cause the
generation of simple xadds, avoiding register operations.

---
 arch/x86/kernel/cpu/perf_event.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c	2011-07-27 11:59:35.000000000 -0500
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c	2011-07-27 12:03:41.000000000 -0500
@@ -1612,12 +1612,14 @@ static void x86_pmu_start_txn(struct pmu
  */
 static void x86_pmu_cancel_txn(struct pmu *pmu)
 {
+	int x = __this_cpu_read(cpu_hw_events.n_txn);
+
 	__this_cpu_and(cpu_hw_events.group_flag, ~PERF_EVENT_TXN);
 	/*
 	 * Truncate the collected events.
 	 */
-	__this_cpu_sub(cpu_hw_events.n_added, __this_cpu_read(cpu_hw_events.n_txn));
-	__this_cpu_sub(cpu_hw_events.n_events, __this_cpu_read(cpu_hw_events.n_txn));
+	__this_cpu_sub(cpu_hw_events.n_added, x);
+	__this_cpu_sub(cpu_hw_events.n_events, x);
 	perf_pmu_enable(pmu);
 }







* Re: per-cpu operation madness vs validation
From: Christoph Lameter @ 2011-07-27 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Linus Torvalds, Thomas Gleixner, Paul E. McKenney,
	linux-kernel, Ingo Molnar, eric.dumazet

On Wed, 27 Jul 2011, Peter Zijlstra wrote:

> On Wed, 2011-07-27 at 09:16 -0500, Christoph Lameter wrote:
>
> > Well yes these are multiple operations that have a sort of atomicity
> > requirement together. You want both variables to be updated together and
> > therefore the individual RMV operations won't do the trick. The algorithm
> > would have to be changed to take advantage of the this_cpu ops or you will
> > have to take the penalty of higher synchronization costs.
>
> You're still not getting it, synchronization cost isn't the point.

It is for me.

> > What you can do is what the slub fastpath does in order to allow updates
> > without caring about preemption or interrupts.
>
> > 1. You get a reference to the current per cpu data.
> > 2. You obtain a transaction id that is cpu specific and operation specific.
> > 3. Speculatively access an object and work with the information of that
> > object. Do not change anything.
>
> Pretty hard when the whole point is changing stuff.

Essentially you have a pointer to play with. Not unlike a special form of
RCU.

> > 4. Calculate the next transaction id.
> > 5. Replace pointer and transaction id with a new object pointer and new
> > transaction id. If there is no match restart at 1.
>
> Which is utter crap, ease up on the drugs. That mandates a
> read-copy-update like approach to all per-cpu data.

If one wants performance you can get it that way.

> > So now you have a per cpu access that is fully immune from interrupt and
> > preemption issues.
>
> And suffers from allocation issues.. allocating a new piece of data,
> copying over the old one, modifying it, installing the new pointer,
> managing references to the old pointer etc.. You're on crack, that's
> often way too much trouble for nearly no gain.

Maybe you just do not know how to handle it? The allocators (even the page
allocator) have been playing tricks with per cpu data for a very long
time.

> > percpu sections are very performance critical components of the kernel.
> > Taking locks could be avoided with serialization through transaction ids
> > f.e.
>
> Not the fucking point! This isn't about making things go faster, this is
> about breaking up the horrid implicit preempt/bh/irq assumptions all
> over the place. It really is no different from killing the BKL.

There is no breaking of anything. These assumptions already exist
throughout the kernel and the issues with access to variables in
different preemption context/interrupt context already exist. Nothing was
changed there.

> You'll need this even if you're going to rewrite the kernel with your
> cmpxchg_double wankery simply because you need to know how stuff is
> supposed to work.
>
> All this is simply about de-obfuscating all per-cpu assumptions. This is
> about verification and traceability/debuggability.

Ok verification  and traceability are good but they should not
be in the way of making core functionality high performance and low
latency.

The key issue is that the -rt kernel has always had grave issues with
performance when it comes to per cpu data access. Solving that by forcing
the kernel to go slow is not the right approach.


* Re: per-cpu operation madness vs validation
From: Thomas Gleixner @ 2011-07-27 17:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Tejun Heo, Linus Torvalds, Paul E. McKenney,
	linux-kernel, Ingo Molnar, eric.dumazet

On Wed, 27 Jul 2011, Christoph Lameter wrote:
> On Wed, 27 Jul 2011, Peter Zijlstra wrote:
> > All this is simply about de-obfuscating all per-cpu assumptions. This is
> > about verification and traceability/debuggability.
> 
> Ok verification  and traceability are good but they should not
> be in the way of making core functionality high performance and low
> latency.
> 
> The key issue is that the -rt kernel has always had grave issues with
> performance when it comes to per cpu data access. Solving that by forcing
> the kernel to go slow is not the right approach.

Nobody wants the kernel to go slow. All we want and we consider that
also a benefit for mainline is: proper annotation of the per cpu data
access, like we have for RCU and for locking.

Right now everything other than the "atomic" this_cpu stuff needs
protection of some sort, but it's nowhere documented and we have no
way to prove the correctness.

That has absolutely nothing to do with -rt. We already had cases in
mainline where per cpu data structures were accessed without or with
wrong protections because the logic changed over time or people made
the wrong assumptions.

For -rt this lack of documentation and the lack of verification,
debuggability and traceability is a major PITA, but that's true for
non-rt as well, just the PITA is gradually smaller and the bugs which
are there today are just extremely hard to trigger.

And Peter's idea of per_cpu_lock*() annotations will boil down to the
exact same thing which is there today when you compile the kernel w/o
lockdep enabled for per_cpu data correctness. We don't want to change
anything or impose any slowness, we just want a proper way to document
and verify that maze. That's really not too much of a request.

Thanks,

	tglx



* Re: per-cpu operation madness vs validation
From: Christoph Lameter @ 2011-07-27 20:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Tejun Heo, Linus Torvalds, Paul E. McKenney,
	linux-kernel, Ingo Molnar, eric.dumazet

On Wed, 27 Jul 2011, Thomas Gleixner wrote:

> > The key issue is that the -rt kernel has always had grave issues with
> > performance when it comes to per cpu data access. Solving that by forcing
> > the kernel to go slow is not the right approach.
>
> Nobody wants the kernel to go slow. All we want and we consider that
> also a benefit for mainline is: proper annotation of the per cpu data
> access, like we have for RCU and for locking.

I love that stuff.

> For -rt this lack of documentation and the lack of verification,
> debuggability and traceability is a major PITA, but that's true for
> non-rt as well, just the PITA is gradually smaller and the bugs which
> are there today are just extremely hard to trigger.

Right. Sure wish there would be better checks. Or things would not have so
many flavors.

> And Peter's idea of per_cpu_lock*() annotations will boil down to the
> exact same thing which is there today when you compile the kernel w/o
> lockdep enabled for per_cpu data correctness. We don't want to change
> anything or impose any slowness, we just want a proper way to document
> and verify that maze. That's really not too much of a request.

No problem with that. The per cpu atomic ops are made to mostly stand on
their own. However, the correctness is affected by placement in the per cpu
sections that may start with get_cpu() or a preempt_disable().

The reason that I switched from get_cpu() to preempt_disable() in some
functions was because the this_cpu operations eliminated the need to pass
the cpu number to a function. Thus the cpu variable is not needed anymore.
Preempt_disable() doesn't provide it, and so I thought that was the proper
function to use.
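
For reference, the two idioms look like this (my_stat is a placeholder);
get_cpu() is just preempt_disable() plus returning the cpu number, which
the this_cpu ops no longer need:

/* old style: the cpu number is needed to index the per-cpu data */
int cpu = get_cpu();		/* preempt_disable() + smp_processor_id() */
per_cpu(my_stat, cpu)++;
put_cpu();			/* preempt_enable() */

/* with this_cpu ops the cpu number is never materialized */
preempt_disable();
__this_cpu_inc(my_stat);
preempt_enable();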



* Re: per-cpu operation madness vs validation
From: Thomas Gleixner @ 2011-07-27 22:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Tejun Heo, Linus Torvalds, Paul E. McKenney,
	linux-kernel, Ingo Molnar, eric.dumazet

On Wed, 27 Jul 2011, Christoph Lameter wrote:
> On Wed, 27 Jul 2011, Thomas Gleixner wrote:
> 
> > > The key issue is that the -rt kernel has always had grave issues with
> > > performance when it comes to per cpu data access. Solving that by forcing
> > > the kernel to go slow is not the right approach.
> >
> > Nobody wants the kernel to go slow. All we want and we consider that
> > also a benefit for mainline is: proper annotation of the per cpu data
> > access, like we have for RCU and for locking.
> 
> I love that stuff.
> 
> > For -rt this lack of documentation and the lack of verification,
> > debuggability and traceability is a major PITA, but that's true for
> > non-rt as well, just the PITA is gradually smaller and the bugs which
> > are there today are just extremely hard to trigger.
> 
> Right. Sure wish there would be better checks. Or things would not have so
> many flavors.
> 
> > And Peter's idea of per_cpu_lock*() annotations will boil down to the
> > exact same thing which is there today when you compile the kernel w/o
> > lockdep enabled for per_cpu data correctness. We don't want to change
> > anything or impose any slowness, we just want a proper way to document
> > and verify that maze. That's really not too much of a request.
> 
> No problem with that. The per cpu atomic ops are made to mostly stand on
> their own. However, the correctness is affected by placement in the per cpu
> sections that may start with get_cpu() or a preempt_disable().
> 
> The reason that I switched from get_cpu() to preempt_disable() in some
> functions was because the this_cpu operations eliminated the need to pass
> the cpu number to a function. Thus the cpu variable is not needed anymore.
> Preempt_disable() doesn't provide it, and so I thought that was the proper
> function to use.

Yes, you had no other choice, but that does not make them more obvious
in terms of debuggability and verification.

We just have to understand that preempt_disable, bh_disable,
irq_disable are cpu-local BKLs with very subtle semantics which are
pretty close to the original BKL horror. All these mechanisms are per
cpu locks in fact, and we need to get some annotation in place which
will help understandability and debuggability in the first place. The
side effect that it will help RT to deal with that is - of course
desired from our side - but not the primary goal of that exercise.

Thanks,

	tglx


* Re: per-cpu operation madness vs validation
From: Christoph Lameter @ 2011-07-28 13:50 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Tejun Heo, Linus Torvalds, Paul E. McKenney,
	linux-kernel, Ingo Molnar, eric.dumazet

On Thu, 28 Jul 2011, Thomas Gleixner wrote:

> We just have to understand that preempt_disable, bh_disable,
> irq_disable are cpu-local BKLs with very subtle semantics which are
> pretty close to the original BKL horror. All these mechanisms are per
> cpu locks in fact, and we need to get some annotation in place which
> will help understandability and debuggability in the first place. The
> side effect that it will help RT to deal with that is - of course
> desired from our side - but not the primary goal of that exercise.

Well the use of the per cpu atomic operations instead of disabling
preemption or scheduling helps RT too by removing the need to do so from
critical OS paths. F.e. we were able to remove most of that from counter
increments. There is still some more work to be done with these
byte/packet counters there as well.

