* [GIT PULL] slab fixes for 3.2-rc4
@ 2011-11-29 18:02 Pekka Enberg
  2011-11-29 19:29 ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Pekka Enberg @ 2011-11-29 18:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Christoph Lameter, linux-kernel

Hi Linus,

Here are a few important fixes to SLUB issues introduced in the merge window.

The most important ones, from Christoph Lameter and Eric Dumazet, fix
stability issues on non-x86 architectures. The patches from Shaohua Li fix a
performance regression in SLUB that was also introduced in the merge window.

 			Pekka

The following changes since commit 883381d9f1c5a6329bbb796e23ae52c939940310:

   Merge branch 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (2011-11-29 08:59:12 -0800)

are available in the git repository at:

   git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux.git slab/urgent

Christoph Lameter (1):
       slub: use irqsafe_cpu_cmpxchg for put_cpu_partial

Eric Dumazet (1):
       slub: avoid potential NULL dereference or corruption

Shaohua Li (2):
       slub: use correct parameter to add a page to partial list tail
       slub: move discard_slab out of node lock

  mm/slub.c |   42 ++++++++++++++++++++++++++----------------
  1 files changed, 26 insertions(+), 16 deletions(-)


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-11-29 18:02 [GIT PULL] slab fixes for 3.2-rc4 Pekka Enberg
@ 2011-11-29 19:29 ` Linus Torvalds
  2011-11-29 19:38   ` Linus Torvalds
  2011-12-20  9:47   ` Pekka Enberg
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2011-11-29 19:29 UTC (permalink / raw)
  To: Pekka Enberg, Ingo Molnar
  Cc: Andrew Morton, Christoph Lameter, linux-kernel, linux-arch

On Tue, Nov 29, 2011 at 10:02 AM, Pekka Enberg <penberg@kernel.org> wrote:
>
> The most important ones, from Christoph Lameter and Eric Dumazet, fix
> stability issues on non-x86 architectures.
>
> Christoph Lameter (1):
>      slub: use irqsafe_cpu_cmpxchg for put_cpu_partial

Quite frankly, I think "this_cpu_cmpxchg()" is pure crap, and this fix
was wrong, wrong, WRONG.

Dammit, the emulation is just bad.

The *only* architecture that implements "this_cpu_cmpxchg()" specially
seems to be x86.

So every other architecture uses the emulated one, AND THE EMULATED
ONE DOES NOT MATCH THE X86 SEMANTICS!

Sorry for shouting, but that's just *incredibly* stupid.

What's extra stupid is that s390 actually separately implements the
irq-safe version, and does it with code that seems to be generic, and
actually uses "cmpxchg()" like the regular "this_cpu_cmpxchg()" name
implies!

In other words, I would suggest that:

 - we get rid of that totally crazy idiotic "irqsafe" crap. The normal
one should be irq-safe. It's named "cmpxchg", for chrissake!

 - we make the non-arch-specific "this_cpu_cmpxchg()" use the correct
s390 routines instead of the pure and utter *shit* it uses now.

Yes, I'm upset. This "random percpu crap for SLUB" has been going on
too f*cking long, and it has been *continually* plagued by just pure
sh*t like this.

Seriously. Stop taking patches from Christoph in this area until they
have been vetted for sanity and completeness. This kind of ugly "let's
mis-name our functions in really subtle ways so that they work on x86
but nowhere else" needs to stop.

I pulled this crap because clearly it fixes a bug, but just stop it.
Show some TASTE!

                            Linus


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-11-29 19:29 ` Linus Torvalds
@ 2011-11-29 19:38   ` Linus Torvalds
  2011-12-20  9:47   ` Pekka Enberg
  1 sibling, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2011-11-29 19:38 UTC (permalink / raw)
  To: Pekka Enberg, Ingo Molnar, Thomas Gleixner
  Cc: Andrew Morton, Christoph Lameter, linux-kernel, linux-arch

On Tue, Nov 29, 2011 at 11:29 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Dammit, the emulation is just bad.

Btw, wasn't the "crazy random thiscpu ops" one of the topics at the
KS? I thought we agreed to clean these things up and remove most of
them? Did it just get dropped, or is it pending somewhere?

              Linus


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-11-29 19:29 ` Linus Torvalds
  2011-11-29 19:38   ` Linus Torvalds
@ 2011-12-20  9:47   ` Pekka Enberg
  2011-12-20 16:23     ` Tejun Heo
  2011-12-20 16:26     ` Christoph Lameter
  1 sibling, 2 replies; 32+ messages in thread
From: Pekka Enberg @ 2011-12-20  9:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andrew Morton, Christoph Lameter, linux-kernel,
	linux-arch, Tejun Heo, Thomas Gleixner

Hi,

(I'm CC'ing Tejun and Thomas.)

On Tue, Nov 29, 2011 at 10:02 AM, Pekka Enberg <penberg@kernel.org> wrote:
>> The most important ones, from Christoph Lameter and Eric Dumazet, fix
>> stability issues on non-x86 architectures.
>>
>> Christoph Lameter (1):
>>      slub: use irqsafe_cpu_cmpxchg for put_cpu_partial

On Tue, Nov 29, 2011 at 9:29 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Quite frankly, I think "this_cpu_cmpxchg()" is pure crap, and this fix
> was wrong, wrong, WRONG.
>
> Dammit, the emulation is just bad.
>
> The *only* architecture that implements "this_cpu_cmpxchg()" specially
> seems to be x86.
>
> So every other architecture uses the emulated one, AND THE EMULATED
> ONE DOES NOT MATCH THE X86 SEMANTICS!
>
> Sorry for shouting, but that's just *incredibly* stupid.
>
> What's extra stupid is that s390 actually separately implements the
> irq-safe version, and does it with code that seems to be generic, and
> actually uses "cmpxchg()" like the regular "this_cpu_cmpxchg()" name
> implies!
>
> In other words, I would suggest that:
>
>  - we get rid of that totally crazy idiotic "irqsafe" crap. The normal
> one should be irq-safe. It's named "cmpxchg", for chrissake!
>
>  - we make the non-arch-specific "this_cpu_cmpxchg()" use the correct
> s390 routines instead of the pure and utter *shit* it uses now.
>
> Yes, I'm upset. This "random percpu crap for SLUB" has been going on
> too f*cking long, and it has been *continually* plagued by just pure
> sh*t like this.
>
> Seriously. Stop taking patches from Christoph in this area until they
> have been vetted for sanity and completeness. This kind of ugly "let's
> mis-name our functions in really subtle ways so that they work on x86
> but nowhere else" needs to stop.

So, I actually looked into doing something like this and wasn't
actually able to understand the purpose of the various percpu
variants. It seems rather obvious that we can just drop the
non-irqsafe cmpxchg() variant but what about the rest of the percpu
ops? Why do we have preempt safe, irqsafe, and unsafe variants? How
are they supposed to be used?

To illustrate the issue, for "per cpu add" we have:

__this_cpu_add()
this_cpu_add()
irqsafe_cpu_add()
percpu_add()

Why do we need all of them?

                        Pekka


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20  9:47   ` Pekka Enberg
@ 2011-12-20 16:23     ` Tejun Heo
  2011-12-20 16:31       ` Christoph Lameter
  2011-12-20 19:28       ` Linus Torvalds
  2011-12-20 16:26     ` Christoph Lameter
  1 sibling, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-20 16:23 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Christoph Lameter,
	linux-kernel, linux-arch, Thomas Gleixner

Hello, Pekka.

On Tue, Dec 20, 2011 at 11:47:26AM +0200, Pekka Enberg wrote:
> So, I actually looked into doing something like this and wasn't
> actually able to understand the purpose of the various percpu
> variants. It seems rather obvious that we can just drop the
> non-irqsafe cmpxchg() variant but what about the rest of the percpu
> ops? Why do we have preempt safe, irqsafe, and unsafe variants? How
> are they supposed to be used?
> 
> To illustrate the issue, for "per cpu add" we have:
> 
> __this_cpu_add()
> this_cpu_add()
> irqsafe_cpu_add()

Description for 7340a0b152 "this_cpu: Introduce this_cpu_ptr() and
generic this_cpu_* operations" should explain the above three.  In
short,

  __this_cpu_add()	: synchronization is caller's responsibility
  this_cpu_add()	: protected against preemption
  irqsafe_cpu_add()	: protected against irq
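
For illustration, a rough usage sketch of the three (counter name is
hypothetical, not taken from the tree):

    DEFINE_PER_CPU(unsigned long, nr_foo);	/* hypothetical counter */

    /* caller already holds off preemption (lock held, irqs off, ...) */
    __this_cpu_add(nr_foo, 1);

    /* process context, may be preempted, never touched from irq context */
    this_cpu_add(nr_foo, 1);

    /* also updated from interrupt handlers */
    irqsafe_cpu_add(nr_foo, 1);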

> percpu_add()

This is __this_cpu_add() + preemption disabled check.  Should be
removed.  Christoph, is there a use case where __this_cpu_XXX() is
used without preemption disabled?  Why doesn't it have preemption
check?

> Why do we need all of them?

It would be great if we could drop the preempt safe one.  For x86, it
doesn't make any difference.  For archs which can't do it in a single
instruction && irq on/off is expensive, it can be bad.  I don't know
how bad tho.

percpu API needs to be cleaned up.  There are quite a few duplicates -
some are from the days when static and dynamic percpu memories were
different, some got added during the this_cpu_*() stuff.  It has been
on the todo list for a while now but I never got around to doing it.

If I'm not missing something, all we need are,

* per_cpu_ptr()

* get_cpu_var(), put_cpu_var() - it would be more consistent if
  they were get_cpu_ptr() and put_cpu_ptr().

* [__]this_cpu_ptr()

* Hopefully, smaller subset of this_cpu_XXX() ops.
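
As a rough usage sketch of the first three (the per-cpu variable and
field names below are hypothetical, cpu is some valid cpu id):

    DEFINE_PER_CPU(struct foo_stats, foo_stats);	/* hypothetical */

    /* another CPU's copy, e.g. when summing counters */
    struct foo_stats *remote = per_cpu_ptr(&foo_stats, cpu);

    /* local copy, preemption disabled until put_cpu_var() */
    struct foo_stats *local = &get_cpu_var(foo_stats);
    local->count++;
    put_cpu_var(foo_stats);

    /* raw local pointer; caller handles preemption/irqs itself */
    struct foo_stats *p = __this_cpu_ptr(&foo_stats);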

Thanks.

-- 
tejun


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20  9:47   ` Pekka Enberg
  2011-12-20 16:23     ` Tejun Heo
@ 2011-12-20 16:26     ` Christoph Lameter
  2011-12-21  8:06       ` Pekka Enberg
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2011-12-20 16:26 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel,
	linux-arch, Tejun Heo, Thomas Gleixner

On Tue, 20 Dec 2011, Pekka Enberg wrote:

> To illustrate the issue, for "per cpu add" we have:
>
> __this_cpu_add()
> this_cpu_add()
> irqsafe_cpu_add()
> percpu_add()
>
> Why do we need all of them?

These are all operations that frequently occur in hot paths of the OS.

On x86 all variants map to the same instruction. They issue exactly one
asm instruction that is therefore irqsafe and preempt safe. The single
instruction performs the relocation of the pointer relative to the
current cpu area and then performs the RMW operation without the cost
of an atomic operation.

For non-x86 we have the issue that typically separate instructions have
to be used to perform the relocation relative to the current per cpu
area and the RMW operation. The problem is that an interrupt or
reschedule can occur between the address calculation and the RMW
operation, so the RMW may then hit the wrong processor's per cpu data.
We therefore need some way of preventing a change of processors or
interrupts.

The __this_cpu_add() variant simply does nothing to prevent this. It
assumes that the caller has taken a lock or disabled interrupts, which
provides sufficient protection against rescheduling on a different
processor.

The this_cpu_add() variant disables preemption, then does the operations
and then reenables preemption. This is usually sufficient since most per
cpu data is not accessed from an interrupt context.

The irqsafe_cpu_add() variant disables interrupts, does the operations
and then re-enables interrupts. It is needed if counters etc. are
modified from an interrupt context.

percpu_add() is an older variant of the per cpu operations that is (or
was?) x86 specific. this_cpu_xxx() operations are used in core code.
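
The generic fallbacks in include/linux/percpu.h implement these
protection levels roughly as follows (simplified sketch; the real macros
appear verbatim in the patch further down this thread):

    /* this_cpu_add() fallback: protected against preemption only */
    preempt_disable();
    *__this_cpu_ptr(&(pcp)) += val;
    preempt_enable();

    /* irqsafe_cpu_add() fallback: protected against interrupts as well */
    local_irq_save(flags);
    *__this_cpu_ptr(&(pcp)) += val;
    local_irq_restore(flags);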


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 16:23     ` Tejun Heo
@ 2011-12-20 16:31       ` Christoph Lameter
  2011-12-20 19:28       ` Linus Torvalds
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Lameter @ 2011-12-20 16:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pekka Enberg, Linus Torvalds, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Tue, 20 Dec 2011, Tejun Heo wrote:

> This is __this_cpu_add() + preemption disabled check.  Should be
> removed.  Christoph, is there a use case where __this_cpu_XXX() is
> used without preemption disabled?  Why doesn't it have preemption
> check?

We discussed this before and said that it would be possible to add a
preemption check there. We would need to verify that there is no use case
of __this_cpu_XXX operations in a preemptible context. There used to be a
case where we did not care about the races for the vmstat counters but
that seems to have been changed. I am not aware of any other use case like
that.



* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 16:23     ` Tejun Heo
  2011-12-20 16:31       ` Christoph Lameter
@ 2011-12-20 19:28       ` Linus Torvalds
  2011-12-20 20:28         ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2011-12-20 19:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, Christoph Lameter,
	linux-kernel, linux-arch, Thomas Gleixner

On Tue, Dec 20, 2011 at 8:23 AM, Tejun Heo <tj@kernel.org> wrote:
>>
>> To illustrate the issue, for "per cpu add" we have:
>>
>> __this_cpu_add()
>> this_cpu_add()
>> irqsafe_cpu_add()
>
> Description for 7340a0b152 "this_cpu: Introduce this_cpu_ptr() and
> generic this_cpu_* operations" should explain the above three.

I don't think that's relevant.

Sure, they have semantics, but the semantics are stupid and wrong.
Whether they are documented or not isn't even the issue.

There are also too many of them to begin with, and they are generally pointless.

Just grep for "this_cpu_" and notice that we basically have *more*
lines in the header files to define the crap than we have lines in
the rest of the kernel to *use* it.

If that doesn't show how crappy the idiotic interface is, I don't know
what would.

So I would suggest:

 - get rid of *all* of them, leave exactly one version (this_cpu_xyz())

 - ask yourself which of the ops we even need. "this_cpu_write()"
needs to just go away. There's no excuse for it. Are the other ones
needed? I can see "add" and "cmpxchg". Is there *any* reason to
support anything else?

 - afaik, there is exactly *one* user of the "this_cpu_cmpxchg", and
*one* user of "irqsafe_cpu_cmpxchg". And those are both inside
CONFIG_CMPXCHG_LOCAL. Yet we have completely *INSANE* semantics for
the cmpxchg operations.

 - there's the magic "irqsafe_cpu_cmpxchg_double". Again, there is exactly
*one* user of it, but just grep for "cpu_cmpxchg_double" and notice
how we define all the crazy variations of it etc too. Again, *INSANE*.

In general, just grep for "this_cpu_" and notice that we basically
have *more* lines in the header files to define the crap than we have
lines in the rest of the kernel to *use* it.

Seriously, the whole piece of crap needs to go. Everything. It's shit.
It's unsalvageable.

I would suggest that we support exactly two operations, and nothing more.

 - this_cpu_cmpxchg (and the "double") version of it:

   This operation needs to be irq-safe and preempt-safe
*by definition*. The whole concept of "cmpxchg" doesn't make sense
without that. Having three different versions of it with totally
insane semantics just NEEDS TO GO AWAY. There should be one version,
and it damn well is safe to use and does a totally unambiguous
"cmpxchg". Nothing else.

 - this_cpu_add()

   This operation is potentially useful for "sloppy statistics" where
the point is to do things quickly without requiring atomic accesses.
As such, it at least has *some* excuse for having the "irqsafe" vs
"preempt_safe" vs "regular" semantics. And it does have several users,
so.. However, I'd like to do something sane about it, and the
non-preempt version absolutely *needs* to have a debug check to verify
it; Thomas was already complaining about this.
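
(Something like the sketch below would do as that check; the macro name
here is made up, not existing kernel code:

    /* fire if a __this_cpu op runs with both preemption and irqs enabled */
    #define __this_cpu_preempt_check()	WARN_ON_ONCE(preemptible())

wired into the __this_cpu_*() fallbacks.)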

But get rid of everything else. Stop using the "inc" ones, replace
them by "add 1". Don't have millions of lines of duplicate definitions
for crap that nobody cares about.

Also, get rid of the 1-byte and 2-byte versions. I doubt anybody
actually uses them, and it's not even a portable thing to do well (ie
the whole alpha issue with non-atomic byte and word accesses). So
again, it's just noise in the header files that makes it hard to
understand how they work because they are so verbose and do so many
stupid things.

Being "generic" is not actually a good thing. Not when we're talking
about random details like this.

Hmm?

                    Linus


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 19:28       ` Linus Torvalds
@ 2011-12-20 20:28         ` Tejun Heo
  2011-12-21  8:08           ` Pekka Enberg
  2011-12-21 15:16           ` Christoph Lameter
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-20 20:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, Christoph Lameter,
	linux-kernel, linux-arch, Thomas Gleixner

Hello, Linus.

On Tue, Dec 20, 2011 at 11:28:25AM -0800, Linus Torvalds wrote:
> > Description for 7340a0b152 "this_cpu: Introduce this_cpu_ptr() and
> > generic this_cpu_* operations" should explain the above three.
> 
> I don't think that's relevant.
> 
> Sure, they have semantics, but the semantics are stupid and wrong.
> Whether they are documented or not isn't even the issue.

I was trying to point Pekka to documentation so that at least the
existing semantics are clear.

> Being "generic" is not actually a good thing. Not when we're talking
> about random details like this.

Yeah, I generally agree that reducing the API would be great.  Given
the usage, I think (or at least hope) dropping preemption protected
ones wouldn't hurt much but it might be worthwhile to keep
__this_cpu_*() - the ones which expect the caller to take care of
synchronization - w/ assertion on irq disabled.

Christoph, what do you think?  What would be the minimal set that you
can work with?

Thanks.

-- 
tejun


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 16:26     ` Christoph Lameter
@ 2011-12-21  8:06       ` Pekka Enberg
  2011-12-21 15:20         ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Pekka Enberg @ 2011-12-21  8:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel,
	linux-arch, Tejun Heo, Thomas Gleixner

On Tue, 20 Dec 2011, Pekka Enberg wrote:
>> To illustrate the issue, for "per cpu add" we have:
>>
>> __this_cpu_add()
>> this_cpu_add()
>> irqsafe_cpu_add()
>> percpu_add()
>>
>> Why do we need all of them?

On Tue, Dec 20, 2011 at 6:26 PM, Christoph Lameter <cl@linux.com> wrote:
> These are all operations that frequently occur in hot paths of the OS.

It's a provably difficult API to use that has pretty much zero
debugging code. That's a problem.

I still don't understand why we'd want separate preempt safe and
irqsafe variants. It should be enough to have only unsafe and safe
variants where the latter would always do the right thing.

                                Pekka


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 20:28         ` Tejun Heo
@ 2011-12-21  8:08           ` Pekka Enberg
  2011-12-21 17:09             ` Tejun Heo
  2011-12-21 15:16           ` Christoph Lameter
  1 sibling, 1 reply; 32+ messages in thread
From: Pekka Enberg @ 2011-12-21  8:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Christoph Lameter,
	linux-kernel, linux-arch, Thomas Gleixner

On Tue, Dec 20, 2011 at 11:28:25AM -0800, Linus Torvalds wrote:
>> > Description for 7340a0b152 "this_cpu: Introduce this_cpu_ptr() and
>> > generic this_cpu_* operations" should explain the above three.
>>
>> I don't think that's relevant.
>>
>> Sure, they have semantics, but the semantics are stupid and wrong.
>> Whether they are documented or not isn't even the issue.

On Tue, Dec 20, 2011 at 10:28 PM, Tejun Heo <tj@kernel.org> wrote:
> I was trying to point Pekka to documentation so that at least the
> existing semantics are clear.

Sure but well-defined semantics alone are not sufficient for a
reasonable API. It's not at all obvious which one of the four variants
to pick when writing code. I don't see any evidence that people
actually understand the API. On the contrary, I see bugs caused by API
confusion in mm/slub.c itself!

                        Pekka


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 20:28         ` Tejun Heo
  2011-12-21  8:08           ` Pekka Enberg
@ 2011-12-21 15:16           ` Christoph Lameter
  2011-12-21 17:05             ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2011-12-21 15:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Tue, 20 Dec 2011, Tejun Heo wrote:

> the usage, I think (or at least hope) dropping preemption protected
> ones wouldn't hurt much but it might be worthwhile to keep
> __this_cpu_*() - the ones which expect the caller to take care of
> synchronization - w/ assertion on irq disabled.

__this_cpu ops are generally the most useless. You can basically do the
same thing by open coding it. But then on x86 you'd miss out on generating
a simple inc seg:var instruction that does not impact registers. Plus you
avoid the necessity of calculating the address first. Instead of one
instruction you'd have 5.
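
As a sketch (counter name hypothetical), the open-coded equivalent of
__this_cpu_inc(counter) on a generic arch is something like:

    /* separate address calculation plus load/modify/store;
       assumes the caller already prevents migration (preemption/irqs off) */
    unsigned long *p = __this_cpu_ptr(&counter);
    *p += 1;

whereas __this_cpu_inc(counter) lets x86 emit a single seg-prefixed
"inc" instruction that clobbers no general purpose register.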

Dropping preemption protected ones is going to be difficult given their
use in key subsystems.

> Christoph, what do you think?  What would be the minimal set that you
> can work with?

If you're just talking about the slub allocator and the this_cpu_cmpxchg
variants there, then the irqsafe variants of cmpxchg and cmpxchg_double are
sufficient.

However, the this_cpu ops are widely used in many subsystems for keeping
statistics. Their main role is to keep the overhead of incrementing/adding
to counters as minimal as possible. Changes there would generate longer
instruction sequences with higher execution latency. Generally the irqsafe
variants are not needed for counters, so we may be able to toss those.

this_cpu ops are not sloppy unless one intentionally uses __this_cpu_xxx
in a non-preempt-safe context, which was the case for the vmstat counters
for a while.

The number of this_cpu functions may be excessive because I tried to cover
all possible use cases rather than only the forms actually used in the
kernel. So a lot of things could be weeded out. this_cpu ops are a way to
experiment with different forms of synchronization that are particularly
important for fastpaths implementing per cpu caching. This could be of
relevance to many of the allocators in the future.

The way that the cmpxchg things are used is also similar to transactional
memory that is becoming available in the next generation of processors by
Intel and that is already available in the current generation of powerpc
processors by IBM. It is a way to avoid locking overhead.


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21  8:06       ` Pekka Enberg
@ 2011-12-21 15:20         ` Christoph Lameter
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Lameter @ 2011-12-21 15:20 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel,
	linux-arch, Tejun Heo, Thomas Gleixner

On Wed, 21 Dec 2011, Pekka Enberg wrote:

> I still don't understand why we'd want separate preempt safe and
> irqsafe variants. It should be enough to have only unsafe and safe
> variants where the latter would always do the right thing.

The effort to make something irqsafe is higher than making it preempt
safe. If that difference is not important then we could just have safe and
unsafe variants. Traditionally counter operations were only preempt safe
though. So making those irqsafe would increase the overhead.






* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21 15:16           ` Christoph Lameter
@ 2011-12-21 17:05             ` Tejun Heo
  2011-12-22  2:19               ` Linus Torvalds
  2011-12-22 14:58               ` Christoph Lameter
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-21 17:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

Hello, Christoph.

On Wed, Dec 21, 2011 at 09:16:24AM -0600, Christoph Lameter wrote:
> __this_cpu ops are generally the most useless. You can basically do the
> same thing by open coding it. But then on x86 you'd miss out on generating
> a simple inc seg:var instruction that does not impact registers. Plus you
> avoid the necessity of calculating the address first. Instead of one
> instruction you'd have 5.
>
> Dropping preemption protected ones is going to be difficult given their
> use in key subsystems.

The thing is that irqsafe ones are the "complete" ones.  We can use
irqsafe ones instead of preempt safe ones but not the other way.  This
matters only if flipping irq is noticeably more expensive than
inc/dec'ing preempt count but I suspect there are enough such
machines.  (cc'ing arch) Does anyone have better insight here?  How
much more expensive are local irq save/restore compared to inc/dec'ing
preempt count on various archs?

> > Christoph, what do you think?  What would be the minimal set that you
> > can work with?
> 
> If you just talking about the slub allocator and the this_cpu_cmpxchg
> variants there then the irqsafe variants of cmpxchg and cmpxchg_double are
> sufficient there.
> 
> However, the this_cpu ops are widely used in many subsystems for keeping
> statistics. Their main role is to keep the overhead of incrementing/adding
> to counters as minimal as possible. Changes there would cause instructions
> to be generated that are longer in size and also would cause higher
> latency of execution. Generally the irqsafe variants are not needed for
> counters so we may be able to toss those.
> 
> this_cpu ops are not sloppy unless one intentionally uses __this_cpu_xxx
> in a non preempt safe context which was the case for the vmstat counters
> for awhile.
> 
> The amount of this_cpu functions may be excessive because I tried to cover
> all possible use cases rather than actuallly used forms in the kernel. So
> a lot of things could be weeded out. this_cpu ops is a way to experiment
> with different forms of synchronization that are particular important for
> fastpaths implementing per cpu caching. This could be of relevance to many
> of the allocators in the future.
> 
> The way that the cmpxchg things are used is also similar to transactional
> memory that is becoming available in the next generation of processors by
> Intel and that is already available in the current generation of powerpc
> processors by IBM. It is a way to avoid locking overhead.

Hmmm... how about removing the ones which aren't currently in use?
percpu API in general needs a lot more clean up but I think that would
be a good starting point.

Thanks.

-- 
tejun


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21  8:08           ` Pekka Enberg
@ 2011-12-21 17:09             ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-21 17:09 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Christoph Lameter,
	linux-kernel, linux-arch, Thomas Gleixner

Hello,

On Wed, Dec 21, 2011 at 10:08:48AM +0200, Pekka Enberg wrote:
> Sure but well-defined semantics alone are not sufficient for a
> reasonable API. It's not at all obvious which one of the four variants
> to pick when writing code. I don't see any evidence that people
> actually understand the API. On the contrary, I see bugs caused by API
> confusion in mm/slub.c itself!

Sure, I agree.  If nobody beats me to it, I'll try to clean it up for
the merge window after the coming one (ie. 3.4).  But please feel free
to submit patches to gut out cruft and clean up.  I'll be happy to
apply them.

Thanks.

-- 
tejun


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21 17:05             ` Tejun Heo
@ 2011-12-22  2:19               ` Linus Torvalds
  2011-12-22 16:05                 ` Tejun Heo
  2011-12-28 10:25                 ` Benjamin Herrenschmidt
  2011-12-22 14:58               ` Christoph Lameter
  1 sibling, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2011-12-22  2:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Wed, Dec 21, 2011 at 9:05 AM, Tejun Heo <tj@kernel.org> wrote:
>
> machines.  (cc'ing arch) Does anyone have better insight here?  How
> much more expensive are local irq save/restore compared to inc/dec'ing
> preempt count on various archs?

I think powerpc does sw irq disable, so it's pretty much the same.

However, on *most* architectures, if the stupid generic "disable
interrupts and then do the op" is too expensive, they could easily
just do an architecture-specific "safe" version. Especially if they
only need to do cmpxchg and add.

Many architectures can do those safely with load-locked,
store-conditional (ie atomic_add), and do so quickly. Sure, there are
broken cases wher ethat is really slow (original alpha is an example
of that), but I think it's fairly rare.

So both arm (in v6+) and powerpc should be able to just use the
"atomic_add" version, with no real downside.

So I really suspect that we could just say: "make the irq-safe version
be the *only* version", and no architecture will really care. Sure, it
can be more expensive, but it usually isn't. Only when done badly and
stupidly is it nasty.

                  Linus


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21 17:05             ` Tejun Heo
  2011-12-22  2:19               ` Linus Torvalds
@ 2011-12-22 14:58               ` Christoph Lameter
  2011-12-22 16:08                 ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2011-12-22 14:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Wed, 21 Dec 2011, Tejun Heo wrote:

> The thing is that irqsafe ones are the "complete" ones.  We can use
> irqsafe ones instead of preempt safe ones but not the other way.  This
> matters only if flipping irq is noticeably more expensive than
> inc/dec'ing preempt count but I suspect there are enough such
> machines.  (cc'ing arch) Does anyone have better insight here?  How
> much more expensive are local irq save/restore compared to inc/dec'ing
> preempt count on various archs?

Well that would be a pretty nice simplification of the API.
Replace the fallback code for the preempt safe ones with the
irqsafe fallbacks, then drop the irqsafe variants from percpu.h.

> > The way that the cmpxchg things are used is also similar to transactional
> > memory that is becoming available in the next generation of processors by
> > Intel and that is already available in the current generation of powerpc
> > processors by IBM. It is a way to avoid locking overhead.
>
> Hmmm... how about removing the ones which aren't currently in use?

Yep. Could easily be done. We can resurrect the stuff as needed when other
variants become necessary. In particular the _and and _or etc stuff was
just added to be backward compatible with the old per cpu and local_t
interfaces. There may be no use cases left.




* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22  2:19               ` Linus Torvalds
@ 2011-12-22 16:05                 ` Tejun Heo
  2011-12-28 10:25                 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-22 16:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

Hello, Linus.

On Wed, Dec 21, 2011 at 06:19:34PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2011 at 9:05 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > machines.  (cc'ing arch) Does anyone have better insight here?  How
> > much more expensive are local irq save/restore compared to inc/dec'ing
> > preempt count on various archs?
> 
> I think powerpc does sw irq disable, so it's pretty much the same.
> 
> However, on *most* architectures, if the stupid generic "disable
> interrupts and then do the op" is too expensive, they could easily
> just do an architecture-specific "safe" version. Especially if they
> only need to do cmpxchg and add.
> 
> Many architectures can do those safely with load-locked,
> store-conditional (ie atomic_add), and do so quickly. Sure, there are
> broken cases wher ethat is really slow (original alpha is an example
> of that), but I think it's fairly rare.
>
> So both arm (in v6+) and powerpc should be able to just use the
> "atomic_add" version, with no real downside.

Hmmm... right, if actual atomic ops are used we don't need to
guarantee that percpu address translation and operation happen on the
same CPU.  The end result would be correct regardless.  Atomic ops
wouldn't be as efficient as a local single op as on x86, but they shouldn't
be too expensive when the vast majority of operations are happening on
cachelines which are already exclusively owned.
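
As a sketch of what such an arch override could look like (hypothetical,
assumes the per cpu slot can be treated as an atomic_t):

    /* ll/sc arch: the RMW itself is atomic, so a migration between the
       address calculation and the add just updates some other CPU's slot
       atomically - the sum over all CPUs stays correct */
    #define this_cpu_add_4(pcp, val)				\
    	atomic_add(val, (atomic_t *)__this_cpu_ptr(&(pcp)))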

> So I really suspect that we could just say: "make the irq-safe version
> be the *only* version", and no architecture will really care. Sure, it
> can be more expensive, but it usually isn't. Only when done badly and
> stupidly is it nasty.

Yeah, I don't think there's any cross-architecture way of making
this_cpu ops universally faster than explicit irq/preempt_disable -
get_addr - do one or more ops - irq/preempt_enable.  Many archs can't
even load an address offset with a single instruction after all.  Even on
x86, with a large enough number of ops, the segment prefix and full address
on each op would become more expensive than the explicit sequence.  Given
that, I think just providing the irqsafe version and letting all other cases
use the explicit ones is a good tradeoff.
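
I.e. for a batch of updates the explicit sequence (struct and field
names hypothetical) amortizes one address calculation over several ops:

    unsigned long flags;
    struct foo_stats *s;

    local_irq_save(flags);
    s = __this_cpu_ptr(&foo_stats);
    s->packets++;
    s->bytes += len;
    local_irq_restore(flags);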

Alright, I'll do that for the 3.4 merge window.

Thanks.

-- 
tejun


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 14:58               ` Christoph Lameter
@ 2011-12-22 16:08                 ` Tejun Heo
  2011-12-22 17:58                   ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2011-12-22 16:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

Hello, Christoph.

On Thu, Dec 22, 2011 at 08:58:43AM -0600, Christoph Lameter wrote:
> Well that would be a pretty nice simplification of the API.
> Replace the fallback code for the preempt safe ones with the
> irqsafe fallbacks, then drop the irqsafe variants from percpu.h.

Yeah, it seems we're going that direction.

> > > The way that the cmpxchg things are used is also similar to transactional
> > > memory that is becoming available in the next generation of processors by
> > > Intel and that is already available in the current generation of powerpc
> > > processors by IBM. It is a way to avoid locking overhead.
> >
> > Hmmm... how about removing the ones which aren't currently in use?
> 
> Yep. Could easily be done. We can resurrect the stuff as needed when other
> variants become necessary. In particular the _and and _or etc stuff was
> just added to be backward compatible with the old per cpu and local_t
> interfaces. There may be no use cases left.

Yeap, and that one too.  Maybe we can finally kill the duplicate
confusing static/dynamic accessors too.  I'm planning to get to it in
several weeks but if anyone can beat me to it, please go ahead.

Thank you.

-- 
tejun


* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 16:08                 ` Tejun Heo
@ 2011-12-22 17:58                   ` Christoph Lameter
  2011-12-22 18:03                     ` Ingo Molnar
                                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Christoph Lameter @ 2011-12-22 17:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Thu, 22 Dec 2011, Tejun Heo wrote:

> Yeap, and that one too.  Maybe we can finally kill the duplicate
> confusing static/dynamic accessors too.  I'm planning to get to it in
> several weeks but if anyone can beat me to it, please go ahead.

That would be great. I looked at _and and _or and they both still have one
use case (_xor has none though). But it's easy to get rid of the irqsafe
variants once we are willing to take the additional overhead that comes
with disabling interrupts for the fallback cases.


Subject: [percpu] Remove irqsafe_cpu_xxx variants

We simply say that regular this_cpu use must be safe regardless of preemption
and interrupt state. This means no material change for the x86 and s390
implementations of this_cpu operations. However, arches that do not provide
their own implementation of this_cpu operations will now get code generated
that disables interrupts instead of preemption.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 arch/s390/include/asm/percpu.h     |   50 ++++-----
 arch/x86/include/asm/percpu.h      |   28 -----
 include/linux/netdevice.h          |    4
 include/linux/netfilter/x_tables.h |    4
 include/linux/percpu.h             |  190 ++++---------------------------------
 include/net/snmp.h                 |   14 +-
 mm/slub.c                          |    6 -
 net/caif/caif_dev.c                |    4
 net/caif/cffrml.c                  |    4
 9 files changed, 65 insertions(+), 239 deletions(-)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2011-12-22 11:48:28.000000000 -0600
@@ -172,10 +172,10 @@ extern phys_addr_t per_cpu_ptr_to_phys(v
  * equal char, int or long.  percpu_read() evaluates to a lvalue and
  * all others to void.
  *
- * These operations are guaranteed to be atomic w.r.t. preemption.
- * The generic versions use plain get/put_cpu_var().  Archs are
+ * These operations are guaranteed to be atomic.
+ * The generic versions disable interrupts.  Archs are
  * encouraged to implement single-instruction alternatives which don't
- * require preemption protection.
+ * require protection.
  */
 #ifndef percpu_read
 # define percpu_read(var)						\
@@ -347,9 +347,10 @@ do {									\

 #define _this_cpu_generic_to_op(pcp, val, op)				\
 do {									\
-	preempt_disable();						\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
 	*__this_cpu_ptr(&(pcp)) op val;					\
-	preempt_enable();						\
+	local_irq_restore(flags);					\
 } while (0)

 #ifndef this_cpu_write
@@ -447,10 +448,11 @@ do {									\
 #define _this_cpu_generic_add_return(pcp, val)				\
 ({									\
 	typeof(pcp) ret__;						\
-	preempt_disable();						\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
 	__this_cpu_add(pcp, val);					\
 	ret__ = __this_cpu_read(pcp);					\
-	preempt_enable();						\
+	local_irq_restore(flags);					\
 	ret__;								\
 })

@@ -476,10 +478,11 @@ do {									\

 #define _this_cpu_generic_xchg(pcp, nval)				\
 ({	typeof(pcp) ret__;						\
-	preempt_disable();						\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
 	ret__ = __this_cpu_read(pcp);					\
 	__this_cpu_write(pcp, nval);					\
-	preempt_enable();						\
+	local_irq_restore(flags);					\
 	ret__;								\
 })

@@ -501,12 +504,14 @@ do {									\
 #endif

 #define _this_cpu_generic_cmpxchg(pcp, oval, nval)			\
-({	typeof(pcp) ret__;						\
-	preempt_disable();						\
+({									\
+	typeof(pcp) ret__;						\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
 	ret__ = __this_cpu_read(pcp);					\
 	if (ret__ == (oval))						\
 		__this_cpu_write(pcp, nval);				\
-	preempt_enable();						\
+	local_irq_restore(flags);					\
 	ret__;								\
 })

@@ -538,10 +543,11 @@ do {									\
 #define _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
 ({									\
 	int ret__;							\
-	preempt_disable();						\
+	unsigned long flags;						\
+	local_irq_save(flags);						\
 	ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
 			oval1, oval2, nval1, nval2);			\
-	preempt_enable();						\
+	local_irq_restore(flags);					\
 	ret__;								\
 })

@@ -567,9 +573,9 @@ do {									\
 #endif

 /*
- * Generic percpu operations that do not require preemption handling.
+ * Generic percpu operations for context that are safe from preemption/interrupts.
  * Either we do not care about races or the caller has the
- * responsibility of handling preemptions issues. Arch code can still
+ * responsibility of handling preemption/interrupt issues. Arch code can still
  * override these instructions since the arch per cpu code may be more
  * efficient and may actually get race freeness for free (that is the
  * case for x86 for example).
@@ -802,156 +808,4 @@ do {									\
 	__pcpu_double_call_return_bool(__this_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
 #endif

-/*
- * IRQ safe versions of the per cpu RMW operations. Note that these operations
- * are *not* safe against modification of the same variable from another
- * processors (which one gets when using regular atomic operations)
- * They are guaranteed to be atomic vs. local interrupts and
- * preemption only.
- */
-#define irqsafe_cpu_generic_to_op(pcp, val, op)				\
-do {									\
-	unsigned long flags;						\
-	local_irq_save(flags);						\
-	*__this_cpu_ptr(&(pcp)) op val;					\
-	local_irq_restore(flags);					\
-} while (0)
-
-#ifndef irqsafe_cpu_add
-# ifndef irqsafe_cpu_add_1
-#  define irqsafe_cpu_add_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
-# endif
-# ifndef irqsafe_cpu_add_2
-#  define irqsafe_cpu_add_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
-# endif
-# ifndef irqsafe_cpu_add_4
-#  define irqsafe_cpu_add_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
-# endif
-# ifndef irqsafe_cpu_add_8
-#  define irqsafe_cpu_add_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=)
-# endif
-# define irqsafe_cpu_add(pcp, val) __pcpu_size_call(irqsafe_cpu_add_, (pcp), (val))
-#endif
-
-#ifndef irqsafe_cpu_sub
-# define irqsafe_cpu_sub(pcp, val)	irqsafe_cpu_add((pcp), -(val))
-#endif
-
-#ifndef irqsafe_cpu_inc
-# define irqsafe_cpu_inc(pcp)	irqsafe_cpu_add((pcp), 1)
-#endif
-
-#ifndef irqsafe_cpu_dec
-# define irqsafe_cpu_dec(pcp)	irqsafe_cpu_sub((pcp), 1)
-#endif
-
-#ifndef irqsafe_cpu_and
-# ifndef irqsafe_cpu_and_1
-#  define irqsafe_cpu_and_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
-# endif
-# ifndef irqsafe_cpu_and_2
-#  define irqsafe_cpu_and_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
-# endif
-# ifndef irqsafe_cpu_and_4
-#  define irqsafe_cpu_and_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
-# endif
-# ifndef irqsafe_cpu_and_8
-#  define irqsafe_cpu_and_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=)
-# endif
-# define irqsafe_cpu_and(pcp, val) __pcpu_size_call(irqsafe_cpu_and_, (val))
-#endif
-
-#ifndef irqsafe_cpu_or
-# ifndef irqsafe_cpu_or_1
-#  define irqsafe_cpu_or_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
-# endif
-# ifndef irqsafe_cpu_or_2
-#  define irqsafe_cpu_or_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
-# endif
-# ifndef irqsafe_cpu_or_4
-#  define irqsafe_cpu_or_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
-# endif
-# ifndef irqsafe_cpu_or_8
-#  define irqsafe_cpu_or_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=)
-# endif
-# define irqsafe_cpu_or(pcp, val) __pcpu_size_call(irqsafe_cpu_or_, (val))
-#endif
-
-#ifndef irqsafe_cpu_xor
-# ifndef irqsafe_cpu_xor_1
-#  define irqsafe_cpu_xor_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
-# endif
-# ifndef irqsafe_cpu_xor_2
-#  define irqsafe_cpu_xor_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
-# endif
-# ifndef irqsafe_cpu_xor_4
-#  define irqsafe_cpu_xor_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
-# endif
-# ifndef irqsafe_cpu_xor_8
-#  define irqsafe_cpu_xor_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=)
-# endif
-# define irqsafe_cpu_xor(pcp, val) __pcpu_size_call(irqsafe_cpu_xor_, (val))
-#endif
-
-#define irqsafe_cpu_generic_cmpxchg(pcp, oval, nval)			\
-({									\
-	typeof(pcp) ret__;						\
-	unsigned long flags;						\
-	local_irq_save(flags);						\
-	ret__ = __this_cpu_read(pcp);					\
-	if (ret__ == (oval))						\
-		__this_cpu_write(pcp, nval);				\
-	local_irq_restore(flags);					\
-	ret__;								\
-})
-
-#ifndef irqsafe_cpu_cmpxchg
-# ifndef irqsafe_cpu_cmpxchg_1
-#  define irqsafe_cpu_cmpxchg_1(pcp, oval, nval)	irqsafe_cpu_generic_cmpxchg(pcp, oval, nval)
-# endif
-# ifndef irqsafe_cpu_cmpxchg_2
-#  define irqsafe_cpu_cmpxchg_2(pcp, oval, nval)	irqsafe_cpu_generic_cmpxchg(pcp, oval, nval)
-# endif
-# ifndef irqsafe_cpu_cmpxchg_4
-#  define irqsafe_cpu_cmpxchg_4(pcp, oval, nval)	irqsafe_cpu_generic_cmpxchg(pcp, oval, nval)
-# endif
-# ifndef irqsafe_cpu_cmpxchg_8
-#  define irqsafe_cpu_cmpxchg_8(pcp, oval, nval)	irqsafe_cpu_generic_cmpxchg(pcp, oval, nval)
-# endif
-# define irqsafe_cpu_cmpxchg(pcp, oval, nval)		\
-	__pcpu_size_call_return2(irqsafe_cpu_cmpxchg_, (pcp), oval, nval)
-#endif
-
-#define irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
-({									\
-	int ret__;							\
-	unsigned long flags;						\
-	local_irq_save(flags);						\
-	ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2,		\
-			oval1, oval2, nval1, nval2);			\
-	local_irq_restore(flags);					\
-	ret__;								\
-})
-
-#ifndef irqsafe_cpu_cmpxchg_double
-# ifndef irqsafe_cpu_cmpxchg_double_1
-#  define irqsafe_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
-	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-# endif
-# ifndef irqsafe_cpu_cmpxchg_double_2
-#  define irqsafe_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
-	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-# endif
-# ifndef irqsafe_cpu_cmpxchg_double_4
-#  define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
-	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-# endif
-# ifndef irqsafe_cpu_cmpxchg_double_8
-#  define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
-	irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
-# endif
-# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)	\
-	__pcpu_double_call_return_bool(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
-#endif
-
 #endif /* __LINUX_PERCPU_H */
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/mm/slub.c	2011-12-22 11:48:28.000000000 -0600
@@ -1978,7 +1978,7 @@ int put_cpu_partial(struct kmem_cache *s
 		page->pobjects = pobjects;
 		page->next = oldpage;

-	} while (irqsafe_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage);
+	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage);
 	stat(s, CPU_PARTIAL_FREE);
 	return pobjects;
 }
@@ -2304,7 +2304,7 @@ redo:
 		 * Since this is without lock semantics the protection is only against
 		 * code executing on this cpu *not* from access by other cpus.
 		 */
-		if (unlikely(!irqsafe_cpu_cmpxchg_double(
+		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				object, tid,
 				get_freepointer_safe(s, object), next_tid(tid)))) {
@@ -2534,7 +2534,7 @@ redo:
 	if (likely(page == c->page)) {
 		set_freepointer(s, object, c->freelist);

-		if (unlikely(!irqsafe_cpu_cmpxchg_double(
+		if (unlikely(!this_cpu_cmpxchg_double(
 				s->cpu_slab->freelist, s->cpu_slab->tid,
 				c->freelist, tid,
 				object, next_tid(tid)))) {
Index: linux-2.6/net/caif/caif_dev.c
===================================================================
--- linux-2.6.orig/net/caif/caif_dev.c	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/net/caif/caif_dev.c	2011-12-22 11:48:28.000000000 -0600
@@ -69,12 +69,12 @@ static struct caif_device_entry_list *ca

 static void caifd_put(struct caif_device_entry *e)
 {
-	irqsafe_cpu_dec(*e->pcpu_refcnt);
+	this_cpu_dec(*e->pcpu_refcnt);
 }

 static void caifd_hold(struct caif_device_entry *e)
 {
-	irqsafe_cpu_inc(*e->pcpu_refcnt);
+	this_cpu_inc(*e->pcpu_refcnt);
 }

 static int caifd_refcnt_read(struct caif_device_entry *e)
Index: linux-2.6/net/caif/cffrml.c
===================================================================
--- linux-2.6.orig/net/caif/cffrml.c	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/net/caif/cffrml.c	2011-12-22 11:48:28.000000000 -0600
@@ -177,14 +177,14 @@ void cffrml_put(struct cflayer *layr)
 {
 	struct cffrml *this = container_obj(layr);
 	if (layr != NULL && this->pcpu_refcnt != NULL)
-		irqsafe_cpu_dec(*this->pcpu_refcnt);
+		this_cpu_dec(*this->pcpu_refcnt);
 }

 void cffrml_hold(struct cflayer *layr)
 {
 	struct cffrml *this = container_obj(layr);
 	if (layr != NULL && this->pcpu_refcnt != NULL)
-		irqsafe_cpu_inc(*this->pcpu_refcnt);
+		this_cpu_inc(*this->pcpu_refcnt);
 }

 int cffrml_refcnt_read(struct cflayer *layr)
Index: linux-2.6/include/linux/netdevice.h
===================================================================
--- linux-2.6.orig/include/linux/netdevice.h	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/include/linux/netdevice.h	2011-12-22 11:48:28.000000000 -0600
@@ -2115,7 +2115,7 @@ extern void netdev_run_todo(void);
  */
 static inline void dev_put(struct net_device *dev)
 {
-	irqsafe_cpu_dec(*dev->pcpu_refcnt);
+	this_cpu_dec(*dev->pcpu_refcnt);
 }

 /**
@@ -2126,7 +2126,7 @@ static inline void dev_put(struct net_de
  */
 static inline void dev_hold(struct net_device *dev)
 {
-	irqsafe_cpu_inc(*dev->pcpu_refcnt);
+	this_cpu_inc(*dev->pcpu_refcnt);
 }

 /* Carrier loss detection, dial on demand. The functions netif_carrier_on
Index: linux-2.6/include/linux/netfilter/x_tables.h
===================================================================
--- linux-2.6.orig/include/linux/netfilter/x_tables.h	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/include/linux/netfilter/x_tables.h	2011-12-22 11:48:28.000000000 -0600
@@ -471,7 +471,7 @@ DECLARE_PER_CPU(seqcount_t, xt_recseq);
  *
  * Begin packet processing : all readers must wait the end
  * 1) Must be called with preemption disabled
- * 2) softirqs must be disabled too (or we should use irqsafe_cpu_add())
+ * 2) softirqs must be disabled too (or we should use this_cpu_add())
  * Returns :
  *  1 if no recursion on this cpu
  *  0 if recursion detected
@@ -503,7 +503,7 @@ static inline unsigned int xt_write_recs
  *
  * End packet processing : all readers can proceed
  * 1) Must be called with preemption disabled
- * 2) softirqs must be disabled too (or we should use irqsafe_cpu_add())
+ * 2) softirqs must be disabled too (or we should use this_cpu_add())
  */
 static inline void xt_write_recseq_end(unsigned int addend)
 {
Index: linux-2.6/include/net/snmp.h
===================================================================
--- linux-2.6.orig/include/net/snmp.h	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/include/net/snmp.h	2011-12-22 11:48:28.000000000 -0600
@@ -129,33 +129,33 @@ struct linux_xfrm_mib {
 			__this_cpu_inc(mib[0]->mibs[field])

 #define SNMP_INC_STATS_USER(mib, field)	\
-			irqsafe_cpu_inc(mib[0]->mibs[field])
+			this_cpu_inc(mib[0]->mibs[field])

 #define SNMP_INC_STATS_ATOMIC_LONG(mib, field)	\
 			atomic_long_inc(&mib->mibs[field])

 #define SNMP_INC_STATS(mib, field)	\
-			irqsafe_cpu_inc(mib[0]->mibs[field])
+			this_cpu_inc(mib[0]->mibs[field])

 #define SNMP_DEC_STATS(mib, field)	\
-			irqsafe_cpu_dec(mib[0]->mibs[field])
+			this_cpu_dec(mib[0]->mibs[field])

 #define SNMP_ADD_STATS_BH(mib, field, addend)	\
 			__this_cpu_add(mib[0]->mibs[field], addend)

 #define SNMP_ADD_STATS_USER(mib, field, addend)	\
-			irqsafe_cpu_add(mib[0]->mibs[field], addend)
+			this_cpu_add(mib[0]->mibs[field], addend)

 #define SNMP_ADD_STATS(mib, field, addend)	\
-			irqsafe_cpu_add(mib[0]->mibs[field], addend)
+			this_cpu_add(mib[0]->mibs[field], addend)
 /*
  * Use "__typeof__(*mib[0]) *ptr" instead of "__typeof__(mib[0]) ptr"
  * to make @ptr a non-percpu pointer.
  */
 #define SNMP_UPD_PO_STATS(mib, basefield, addend)	\
 	do { \
-		irqsafe_cpu_inc(mib[0]->mibs[basefield##PKTS]);		\
-		irqsafe_cpu_add(mib[0]->mibs[basefield##OCTETS], addend);	\
+		this_cpu_inc(mib[0]->mibs[basefield##PKTS]);		\
+		this_cpu_add(mib[0]->mibs[basefield##OCTETS], addend);	\
 	} while (0)
 #define SNMP_UPD_PO_STATS_BH(mib, basefield, addend)	\
 	do { \
Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2011-12-22 11:48:28.000000000 -0600
@@ -414,22 +414,6 @@ do {									\
 #define this_cpu_xchg_2(pcp, nval)	percpu_xchg_op(pcp, nval)
 #define this_cpu_xchg_4(pcp, nval)	percpu_xchg_op(pcp, nval)

-#define irqsafe_cpu_add_1(pcp, val)	percpu_add_op((pcp), val)
-#define irqsafe_cpu_add_2(pcp, val)	percpu_add_op((pcp), val)
-#define irqsafe_cpu_add_4(pcp, val)	percpu_add_op((pcp), val)
-#define irqsafe_cpu_and_1(pcp, val)	percpu_to_op("and", (pcp), val)
-#define irqsafe_cpu_and_2(pcp, val)	percpu_to_op("and", (pcp), val)
-#define irqsafe_cpu_and_4(pcp, val)	percpu_to_op("and", (pcp), val)
-#define irqsafe_cpu_or_1(pcp, val)	percpu_to_op("or", (pcp), val)
-#define irqsafe_cpu_or_2(pcp, val)	percpu_to_op("or", (pcp), val)
-#define irqsafe_cpu_or_4(pcp, val)	percpu_to_op("or", (pcp), val)
-#define irqsafe_cpu_xor_1(pcp, val)	percpu_to_op("xor", (pcp), val)
-#define irqsafe_cpu_xor_2(pcp, val)	percpu_to_op("xor", (pcp), val)
-#define irqsafe_cpu_xor_4(pcp, val)	percpu_to_op("xor", (pcp), val)
-#define irqsafe_cpu_xchg_1(pcp, nval)	percpu_xchg_op(pcp, nval)
-#define irqsafe_cpu_xchg_2(pcp, nval)	percpu_xchg_op(pcp, nval)
-#define irqsafe_cpu_xchg_4(pcp, nval)	percpu_xchg_op(pcp, nval)
-
 #ifndef CONFIG_M386
 #define __this_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val)
 #define __this_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val)
@@ -445,9 +429,6 @@ do {									\
 #define this_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
 #define this_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)

-#define irqsafe_cpu_cmpxchg_1(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define irqsafe_cpu_cmpxchg_2(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-#define irqsafe_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
 #endif /* !CONFIG_M386 */

 #ifdef CONFIG_X86_CMPXCHG64
@@ -467,7 +448,6 @@ do {									\

 #define __this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
 #define this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
-#define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
 #endif /* CONFIG_X86_CMPXCHG64 */

 /*
@@ -495,13 +475,6 @@ do {									\
 #define this_cpu_xchg_8(pcp, nval)	percpu_xchg_op(pcp, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)

-#define irqsafe_cpu_add_8(pcp, val)	percpu_add_op((pcp), val)
-#define irqsafe_cpu_and_8(pcp, val)	percpu_to_op("and", (pcp), val)
-#define irqsafe_cpu_or_8(pcp, val)	percpu_to_op("or", (pcp), val)
-#define irqsafe_cpu_xor_8(pcp, val)	percpu_to_op("xor", (pcp), val)
-#define irqsafe_cpu_xchg_8(pcp, nval)	percpu_xchg_op(pcp, nval)
-#define irqsafe_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
-
 /*
  * Pretty complex macro to generate cmpxchg16 instruction.  The instruction
  * is not supported on early AMD64 processors so we must be able to emulate
@@ -532,7 +505,6 @@ do {									\

 #define __this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
 #define this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)		percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)
-#define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2)	percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2)

 #endif

Index: linux-2.6/arch/s390/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/percpu.h	2011-12-22 11:48:21.000000000 -0600
+++ linux-2.6/arch/s390/include/asm/percpu.h	2011-12-22 11:50:22.000000000 -0600
@@ -19,7 +19,7 @@
 #define ARCH_NEEDS_WEAK_PER_CPU
 #endif

-#define arch_irqsafe_cpu_to_op(pcp, val, op)				\
+#define arch_this_cpu_to_op(pcp, val, op)				\
 do {									\
 	typedef typeof(pcp) pcp_op_T__;					\
 	pcp_op_T__ old__, new__, prev__;				\
@@ -41,27 +41,27 @@ do {									\
 	preempt_enable();						\
 } while (0)

-#define irqsafe_cpu_add_1(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, +)
-#define irqsafe_cpu_add_2(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, +)
-#define irqsafe_cpu_add_4(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, +)
-#define irqsafe_cpu_add_8(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, +)
-
-#define irqsafe_cpu_and_1(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, &)
-#define irqsafe_cpu_and_2(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, &)
-#define irqsafe_cpu_and_4(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, &)
-#define irqsafe_cpu_and_8(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, &)
-
-#define irqsafe_cpu_or_1(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, |)
-#define irqsafe_cpu_or_2(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, |)
-#define irqsafe_cpu_or_4(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, |)
-#define irqsafe_cpu_or_8(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, |)
-
-#define irqsafe_cpu_xor_1(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, ^)
-#define irqsafe_cpu_xor_2(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, ^)
-#define irqsafe_cpu_xor_4(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, ^)
-#define irqsafe_cpu_xor_8(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, ^)
+#define this_cpu_add_1(pcp, val) arch_this_cpu_to_op(pcp, val, +)
+#define this_cpu_add_2(pcp, val) arch_this_cpu_to_op(pcp, val, +)
+#define this_cpu_add_4(pcp, val) arch_this_cpu_to_op(pcp, val, +)
+#define this_cpu_add_8(pcp, val) arch_this_cpu_to_op(pcp, val, +)
+
+#define this_cpu_and_1(pcp, val) arch_this_cpu_to_op(pcp, val, &)
+#define this_cpu_and_2(pcp, val) arch_this_cpu_to_op(pcp, val, &)
+#define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, &)
+#define this_cpu_and_8(pcp, val) arch_this_cpu_to_op(pcp, val, &)
+
+#define this_cpu_or_1(pcp, val) arch_this_cpu_to_op(pcp, val, |)
+#define this_cpu_or_2(pcp, val) arch_this_cpu_to_op(pcp, val, |)
+#define this_cpu_or_4(pcp, val) arch_this_cpu_to_op(pcp, val, |)
+#define this_cpu_or_8(pcp, val) arch_this_cpu_to_op(pcp, val, |)
+
+#define this_cpu_xor_1(pcp, val) arch_this_cpu_to_op(pcp, val, ^)
+#define this_cpu_xor_2(pcp, val) arch_this_cpu_to_op(pcp, val, ^)
+#define this_cpu_xor_4(pcp, val) arch_this_cpu_to_op(pcp, val, ^)
+#define this_cpu_xor_8(pcp, val) arch_this_cpu_to_op(pcp, val, ^)

-#define arch_irqsafe_cpu_cmpxchg(pcp, oval, nval)			\
+#define arch_this_cpu_cmpxchg(pcp, oval, nval)			\
 ({									\
 	typedef typeof(pcp) pcp_op_T__;					\
 	pcp_op_T__ ret__;						\
@@ -79,10 +79,10 @@ do {									\
 	ret__;								\
 })

-#define irqsafe_cpu_cmpxchg_1(pcp, oval, nval) arch_irqsafe_cpu_cmpxchg(pcp, oval, nval)
-#define irqsafe_cpu_cmpxchg_2(pcp, oval, nval) arch_irqsafe_cpu_cmpxchg(pcp, oval, nval)
-#define irqsafe_cpu_cmpxchg_4(pcp, oval, nval) arch_irqsafe_cpu_cmpxchg(pcp, oval, nval)
-#define irqsafe_cpu_cmpxchg_8(pcp, oval, nval) arch_irqsafe_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
+#define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)

 #include <asm-generic/percpu.h>


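(Illustrative aside, not part of the patch: for architectures that provide
no this_cpu implementation of their own, the interrupt-disabling generic
fallback that the changelog describes would look roughly like the sketch
below. The macro and helper names here are assumptions; the real hunk
against include/linux/percpu.h is not shown in this excerpt.)

	/* Minimal sketch of an irq-safe generic fallback: mask interrupts
	 * instead of only disabling preemption, so the read-modify-write
	 * cannot be torn by an interrupt handler touching the same
	 * per-cpu variable.
	 */
	#define _this_cpu_generic_to_op(pcp, val, op)			\
	do {								\
		unsigned long flags;					\
		local_irq_save(flags);					\
		*__this_cpu_ptr(&(pcp)) op val;				\
		local_irq_restore(flags);				\
	} while (0)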
^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 17:58                   ` Christoph Lameter
@ 2011-12-22 18:03                     ` Ingo Molnar
  2011-12-22 18:31                     ` Linus Torvalds
  2011-12-22 18:47                     ` Tejun Heo
  2 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2011-12-22 18:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Linus Torvalds, Pekka Enberg, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner


* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 22 Dec 2011, Tejun Heo wrote:
> 
> > Yeap, and that one too.  Maybe we can finally kill the duplicate
> > confusing static/dynamic accessors too.  I'm planning to get to it in
> > several weeks but if anyone can beat me to it, please go ahead.
> 
> That would be great. I looked at _and and _or and they both still have one
> use case (_xor has none, though). But it's easy to get rid of the irqsafe
> variants once we are willing to take the additional overhead that comes
> with disabling interrupts for the fallback cases.
> 
> 
> Subject: [percpu] Remove irqsafe_cpu_xxx variants
> 
> We simply say that regular this_cpu use must be safe regardless of preemption
> and interrupt state. That has no material change for x86 and s390 implementations
> of this_cpu operations. However, arches that do not provide their own implementation
> for this_cpu operations will now get code generated that disables interrupts
> instead of preemption.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> ---
>  arch/s390/include/asm/percpu.h     |   50 ++++-----
>  arch/x86/include/asm/percpu.h      |   28 -----
>  include/linux/netdevice.h          |    4
>  include/linux/netfilter/x_tables.h |    4
>  include/linux/percpu.h             |  190 ++++---------------------------------
>  include/net/snmp.h                 |   14 +-
>  mm/slub.c                          |    6 -
>  net/caif/caif_dev.c                |    4
>  net/caif/cffrml.c                  |    4
>  9 files changed, 65 insertions(+), 239 deletions(-)

While this is progress, I think you have missed the essence of 
Linus's observations: percpu.h is *way* too complex and offers 
way too many variants. The irqsafe madness was just the most 
blatant problem.

Note that even with your patch applied, linux/percpu.h is 800+ 
lines long, while the total number of use cases is smaller than 
that - and then I haven't even considered all the arch percpu.h 
files.

Why not implement Linus's suggestion of just one or two 
__this_cpu() ops and be content with that model?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 17:58                   ` Christoph Lameter
  2011-12-22 18:03                     ` Ingo Molnar
@ 2011-12-22 18:31                     ` Linus Torvalds
  2011-12-23 16:55                       ` Christoph Lameter
  2011-12-22 18:47                     ` Tejun Heo
  2 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2011-12-22 18:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Thu, Dec 22, 2011 at 9:58 AM, Christoph Lameter <cl@linux.com> wrote:
>
> Subject: [percpu] Remove irqsafe_cpu_xxx variants

This looks good, but just remove the whole silly "xyz_op_return()" set
too. Don't even try to fix it up.

As far as I can tell, there are just a couple of users of that
horrible interface, it really isn't worth it. Just remove it and
opencode it. They are actually broken on x86 ('xadd' only exists if
CONFIG_X86_XADD is on - i386 didn't have it), and while most people
probably don't care (i386? What's that?), I note that the x86
add_return stuff doesn't take that into account.

If somebody cares about the end result of the add that much, they can
damn well just do it themselves.

                            Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 17:58                   ` Christoph Lameter
  2011-12-22 18:03                     ` Ingo Molnar
  2011-12-22 18:31                     ` Linus Torvalds
@ 2011-12-22 18:47                     ` Tejun Heo
  2 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-22 18:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Thu, Dec 22, 2011 at 11:58:51AM -0600, Christoph Lameter wrote:
> Subject: [percpu] Remove irqsafe_cpu_xxx variants
> 
> We simply say that regular this_cpu use must be safe regardless of preemption
> and interrupt state. That has no material change for x86 and s390 implementations
> of this_cpu operations. However, arches that do not provide their own implementation
> for this_cpu operations will now get code generated that disables interrupts
> instead of preemption.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>

Applied to wq/for-3.3.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 18:31                     ` Linus Torvalds
@ 2011-12-23 16:55                       ` Christoph Lameter
  2011-12-23 20:54                         ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2011-12-23 16:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Thu, 22 Dec 2011, Linus Torvalds wrote:

> As far as I can tell, there are just a couple of users of that
> horrible interface, it really isn't worth it. Just remove it and
> opencode it. They are actually broken on x86 ('xadd' only exists if
> CONFIG_X86_XADD is on - i386 didn't have it), and while most people
> probably don't care (i386? What's that?), I note that the x86
> add_return stuff doesn't take that into account.

There is an #ifndef CONFIG_M386 around the implementation of these for
x86. Use on i386 will not generate an xadd instruction but will fall back
to a generic implementation.
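
(For reference, a rough sketch of the guard being described; it has the
same shape as the x86 hunk quoted earlier in the thread. On CONFIG_M386
builds the name is simply not defined, so the generic fallback is used.)

	#ifndef CONFIG_M386
	#define __this_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val)
	#endif /* !CONFIG_M386 -- i386 falls through to the generic code */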

The add_return stuff was already available with the earlier per-cpu APIs
(local_t and percpu) that this_cpu replaces, and is still available through
the atomic operations.

If one wants to exploit the per cpu atomicity then one may want to know
what the result was. Inspecting the value is not possible later in the
execution path if there are potential races on the processor through
preemption as well as interrupts. The _return variants also avoid
additional load instructions and thereby reduce code path size.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-23 16:55                       ` Christoph Lameter
@ 2011-12-23 20:54                         ` Linus Torvalds
  2012-01-04 15:30                           ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2011-12-23 20:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Fri, Dec 23, 2011 at 8:55 AM, Christoph Lameter <cl@linux.com> wrote:
>
> There is an #ifndef CONFIG_M386 around the implementation of these for
> x86. Use on i386 will not generate an xadd instruction but will fall back
> to a generic implementation.

Ok, good. It's not really the right thing to do, but it will at least work.

> The add_return stuff was already available with the earlier per cpu apis
> (local_t and percpu) that this_cpu replaces and is still available through
> the atomic operations.

Sure. The "atomic_op_return()" form actually *makes*sense*. It has
atomicity guarantees, and the return value is meaningful.

The same IS SIMPLY NOT TRUE of the "this_cpu" versions.
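
(Illustration only, with an invented counter and helper: for a genuinely
atomic add_return the returned value is a fact about the shared counter
at the instant of the update, which is exactly what a semaphore-style
fast path needs.)

	static atomic_t sem_count = ATOMIC_INIT(1);	/* hypothetical */

	if (atomic_add_return(-1, &sem_count) < 0)
		wait_for_release();	/* assumed slow-path helper */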

> If one wants to exploit the per cpu atomicity then one may want to know
> what the result was.

No. You're blathering and doing "magical thinking" instead of actually
thinking things through.

Think about the semantics. What does it mean to get a result of the op
on a percpu variable?

You have two cases:

Case 1: you actually know what CPU you are on, and the number may have
some meaning. But in that case you are also *pinned* to that CPU, and
there is no difference between

   x = percpu_value;
   x += y;
   percpu_value = x;
   .. look at 'x' .. to see what you did

and

    percpu_value += x;
   ..look at percpu_value to see what you did ..

and

    x = xadd percpu_value,y

so there are no "atomicity advantages" - xadd is the same as
open-coding it. You are pinned to the CPU, all three ways are 100%
equivalent.

The other case is:

Case 2: you're not pinned to the CPU, and you just do statistics and
add some value to memory. BUT NOW THE RESULT HAS ABSOLUTELY NO
MEANING!

Now "xadd" would possibly give different results from "add and read
the result", since you migth be moving around between cpu's, but there
is absolutely no way you could ever care, sine the value you read back
is meaningless regardless of whether it's the return value of xadd, or
the value from some other CPU. There are other cpu's with the same
percpu variable, and they are all equivalent as far as you are
concerned, because you're not pinned to one of them.
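
(A concrete sketch of the point; stats_counter, SOME_THRESHOLD and
do_something() are made up for illustration.)

	long v = this_cpu_add_return(stats_counter, 1);	/* lands on CPU A */
	/* ...preempted here, the task migrates to CPU B... */
	if (v > SOME_THRESHOLD)		/* v describes CPU A's copy only;  */
		do_something();		/* CPU B's copy is a different     */
					/* number, so the test tells you   */
					/* nothing about "this" cpu.       */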

See? xadd buys you *nothing* from a semantic standpoint. And anybody
who *thinks* it buys you something is just confused and has just
introduced a bug.

So the "thiscpu_op_return" functions are broken-by-design! They have
no meaning. They are completely crazy.

That's why they should be removed. Anybody who thinks they add value
has not thought through what they actually do or mean.

                            Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22  2:19               ` Linus Torvalds
  2011-12-22 16:05                 ` Tejun Heo
@ 2011-12-28 10:25                 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 32+ messages in thread
From: Benjamin Herrenschmidt @ 2011-12-28 10:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Christoph Lameter, Pekka Enberg, Ingo Molnar,
	Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

On Wed, 2011-12-21 at 18:19 -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2011 at 9:05 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > machines.  (cc'ing arch) Does anyone have better insight here?  How
> > much more expensive are local irq save/restore compared to inc/dec'ing
> > preempt count on various archs?
> 
> I think powerpc does sw irq disable, so it's pretty much the same.

On 64-bit only, but it's probably still better than going for an atomic
op; our atomics tend to be handled at the L2 level and so are sloooow.

 .../...

> So I really suspect that we could just say: "make the irq-safe version
> be the *only* version", and no architecture will really care. Sure, it
> can be more expensive, but it usually isn't. Only when done badly and
> stupidly is it nasty.

Agreed, keep it simple, or we'll just grow more bugs like this one.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-23 20:54                         ` Linus Torvalds
@ 2012-01-04 15:30                           ` Christoph Lameter
  2012-01-04 16:07                             ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2012-01-04 15:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Fri, 23 Dec 2011, Linus Torvalds wrote:

> You have two cases:
>
> Case 1: you actually know what CPU you are on, and the number may have
> some meaning. But in that case you are also *pinned* to that CPU, and
> there is no difference between
>
>    x = percpu_value;
>    x += y;
>    percpu_value = x;
>    .. look at 'x' .. to see what you did
>
> and
>
>     percpu_value += x;
>    ..look at percpu_value to see what you did ..
>
> and
>
>     x = xadd percpu_value,y
>
> so there are no "atomicity advantages" - xadd is the same as
> open-coding it. You are pinned to the CPU, all three ways are 100%
> equivalent.

As mentioned before, the main point of using these operations (in the
form of __this_cpu_op_return) when the cpu is pinned is to reduce the
number of instructions. __this_cpu_add_return allows replacing ~5
instructions with one.
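
(A rough sketch of the comparison, with made-up names: the open-coded
preemption-safe sequence versus the single helper when the caller is
already pinned to a cpu.)

	DEFINE_PER_CPU(long, my_count);		/* hypothetical counter */

	long x, delta = 1;

	/* open-coded: disable preemption, find the per-cpu address, load,
	 * add, store, re-enable - several instructions on x86 */
	preempt_disable();
	x = __get_cpu_var(my_count) + delta;
	__get_cpu_var(my_count) = x;
	preempt_enable();

	/* vs. a single segment-prefixed xadd when already pinned */
	x = __this_cpu_add_return(my_count, delta);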

These operations are used e.g. in the vm statistics, which have to scale
well and have the lightest possible impact on the rest of the vm code.

"atomicity advantages" were never claimed to exist for the __this_cpu_xx
operations.

> The other case is:
>
> Case 2: you're not pinned to the CPU, and you just do statistics and
> add some value to memory. BUT NOW THE RESULT HAS ABSOLUTELY NO
> MEANING!
>
> Now "xadd" would possibly give different results from "add and read
> the result", since you migth be moving around between cpu's, but there
> is absolutely no way you could ever care, sine the value you read back
> is meaningless regardless of whether it's the return value of xadd, or
> the value from some other CPU. There are other cpu's with the same
> percpu variable, and they are all equivalent as far as you are
> concerned, because you're not pinned to one of them.

You would care if you had special actions that have to be triggered on all
processors if, for example, a threshold was breached on a single processor's
counters (and one would fold all differentials into a global counter) or a
per-cpu counter reaches zero (and one would need to replenish lists of
objects or some such thing). In that case it matters that the operational
result can be checked later regardless of the cpu we are on right now.
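
(Sketch of that pattern with invented names: fold a per-cpu differential
into a global counter once it crosses a threshold; it assumes the caller
already runs with preemption disabled around the sequence.)

	if (__this_cpu_add_return(stat_diff, delta) > STAT_THRESHOLD) {
		long d = __this_cpu_xchg(stat_diff, 0);
		atomic_long_add(d, &global_stat);	/* fold into global */
	}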

There are no use cases of the this_cpu_xx_return operations as far as I
can see so we could remove those.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-04 15:30                           ` Christoph Lameter
@ 2012-01-04 16:07                             ` Linus Torvalds
  2012-01-04 17:00                               ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2012-01-04 16:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Wed, Jan 4, 2012 at 7:30 AM, Christoph Lameter <cl@linux.com> wrote:
>
> As mentioned before the main point of the use of these operations (in the
> form of __this_cpu_op_return) when the cpu is pinned is to reduce the
> number of instructions. __this_cpu_add_return allows replacing ~ 5
> instructions with one.

And that's fine if it's something really core, and something *so*
important that you can tell the difference between one instruction and
three.

Which isn't the case here. In fact, on many (most?) x86
microarchitectures xadd is actually slower than a regular
add-from-memory-and-store - the big advantage of it is that with the
"lock" prefix you do get special atomicity guarantees, and some
algorithms (e.g. semaphores) do want to know the value of the add
atomically in order to know if there were other things going on.

The thing is, I care about maintainability and not having
cross-architecture problems etc. And right now many of the cpulocal
things are *much* more of a maintainability headache than they are
worth.

                     Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-04 16:07                             ` Linus Torvalds
@ 2012-01-04 17:00                               ` Christoph Lameter
  2012-01-04 23:10                                 ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2012-01-04 17:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Wed, 4 Jan 2012, Linus Torvalds wrote:

> On Wed, Jan 4, 2012 at 7:30 AM, Christoph Lameter <cl@linux.com> wrote:
> >
> > As mentioned before the main point of the use of these operations (in the
> > form of __this_cpu_op_return) when the cpu is pinned is to reduce the
> > number of instructions. __this_cpu_add_return allows replacing ~ 5
> > instructions with one.
>
> And that's fine if it's something really core, and something *so*
> important that you can tell the difference between one instruction and
> three.
>
> Which isn't the case here. In fact, on many (most?) x86
> microarchitectures xadd is actually slower than a regular
> add-from-memory-and-store - the big advantage of it is that with the
> "lock" prefix you do get special atomicity guarantees, and some
> algorithms (e.g. semaphores) do want to know the value of the add
> atomically in order to know if there were other things going on.

xadd is 3 cycles. add is one cycle.

What we are also doing here is using a segment override to relocate the
per-cpu address to the current cpu. So we are already getting two
additions for the price of one xadd. If we manually calculate the address
then we have another memory reference to get the per-cpu offset for this
processor (otherwise we get it from the segment register). And then we
need to store the results. We use registers etc. etc.

I cannot imagine that this would be the same speed.

> The thing is, I care about maintainability and not having
> cross-architecture problems etc. And right now many of the cpulocal
> things are *much* more of a maintainability headache than they are
> worth.

The cpu local things and xadd support have been around for a pretty long
time in various forms and they work reliably. I have tried to build on
this by adding the cmpxchg/cmpxchg_double functionality, which caused some
issues because of the fallback stuff. That seems to have been addressed,
though, since we are now willing to make the preempt/irq tradeoff that we
were not able to get agreement on during the cleanup of the old APIs a
year or so ago.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-04 17:00                               ` Christoph Lameter
@ 2012-01-04 23:10                                 ` Linus Torvalds
  2012-01-05 19:15                                   ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2012-01-04 23:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Wed, Jan 4, 2012 at 9:00 AM, Christoph Lameter <cl@linux.com> wrote:
>
> xadd is 3 cycles. add is one cycle.

On some uarchs. On new uarchs it can be a single cycle, I think, and
on some uarchs it will even be microcoded and/or only go in one pipe
because it has that odd behavior that it writes both to memory and a
register, and thus doesn't fit the normal fastpaths.

The point is, xadd isn't actually any faster than just doing the
regular "add and read". It's *slower*.

There really isn't ever any reason to use xadd on percpu variables.
That's my point. You claimed that there was a performance advantage.
There really isn't.

So why are you still arguing?

                           Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-04 23:10                                 ` Linus Torvalds
@ 2012-01-05 19:15                                   ` Christoph Lameter
  2012-01-05 19:27                                     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2012-01-05 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Wed, 4 Jan 2012, Linus Torvalds wrote:

> On Wed, Jan 4, 2012 at 9:00 AM, Christoph Lameter <cl@linux.com> wrote:
> >
> > xadd is 3 cycles. add is one cycle.
>
> On some uarchs. On new uarchs it can be a single cycle, I think, and
> on some uarchs it will even be microcoded and/or only go in one pipe
> because it has that odd behavior that it writes both to memory and a
> register, and thus doesn't fit the normal fastpaths.

> The point is, xadd isn't actually any faster than just doing the
> regular "add and read". It's *slower*.

Ok, that assumes that both the add and the read are this_cpu operations
that use the segment override to relocate the address. I thought you wanted
to open-code everything. So then we would have to use the following?

	this_cpu_add(var, count);
	result = this_cpu_read(var);


XADD and ADD have the same cycle count if the ADD is used to add to a
memory location. Both use 4 microops.

The MOVE from memory costs another 2 so we are at 6. Two segment overrides
for each instruction add 2 more (nothing in the optimization manual but as
far as I can recall another uop is needed for the address calculation). So
this ends up at 8.

On the other hand

	result = this_cpu_add_return(var, count)

is 4 + 1.
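
(Recapping the arithmetic as stated above: add-to-memory 4 uops + load 2
+ two segment overrides 2 = 8 uops for the two-instruction form, versus
xadd 4 + one segment override 1 = 5 for this_cpu_add_return.)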

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-05 19:15                                   ` Christoph Lameter
@ 2012-01-05 19:27                                     ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2012-01-05 19:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
	linux-kernel, linux-arch, Thomas Gleixner

On Thu, Jan 5, 2012 at 11:15 AM, Christoph Lameter <cl@linux.com> wrote:
>
> XADD and ADD have the same cycle count if the ADD is used to add to a
> memory location. Both use 4 microops.

Christ, stop counting uops. They are a random internal
microarchitectural detail, and counting them ignores things like
processor uop scheduling and decoding issues.

The thing that matters is (a) performance and (b) code sanity.

The "local_cpu_return()" operations fail *seriously* in the code
sanity department. They are a fundamentally insane operation.

And nothing you have said says "it's a huge performance win" to me.
First off, there aren't even very many users, and the users that do
exist don't seem to be all that performance-critical.

For statistics, just regular "add" is the normal thing to do. The
add_return thing is insane, for all the reasons I already outlined. It
*fundamentally* doesn't have any sane semantics. Just remove it.

                         Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2012-01-05 19:27 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-29 18:02 [GIT PULL] slab fixes for 3.2-rc4 Pekka Enberg
2011-11-29 19:29 ` Linus Torvalds
2011-11-29 19:38   ` Linus Torvalds
2011-12-20  9:47   ` Pekka Enberg
2011-12-20 16:23     ` Tejun Heo
2011-12-20 16:31       ` Christoph Lameter
2011-12-20 19:28       ` Linus Torvalds
2011-12-20 20:28         ` Tejun Heo
2011-12-21  8:08           ` Pekka Enberg
2011-12-21 17:09             ` Tejun Heo
2011-12-21 15:16           ` Christoph Lameter
2011-12-21 17:05             ` Tejun Heo
2011-12-22  2:19               ` Linus Torvalds
2011-12-22 16:05                 ` Tejun Heo
2011-12-28 10:25                 ` Benjamin Herrenschmidt
2011-12-22 14:58               ` Christoph Lameter
2011-12-22 16:08                 ` Tejun Heo
2011-12-22 17:58                   ` Christoph Lameter
2011-12-22 18:03                     ` Ingo Molnar
2011-12-22 18:31                     ` Linus Torvalds
2011-12-23 16:55                       ` Christoph Lameter
2011-12-23 20:54                         ` Linus Torvalds
2012-01-04 15:30                           ` Christoph Lameter
2012-01-04 16:07                             ` Linus Torvalds
2012-01-04 17:00                               ` Christoph Lameter
2012-01-04 23:10                                 ` Linus Torvalds
2012-01-05 19:15                                   ` Christoph Lameter
2012-01-05 19:27                                     ` Linus Torvalds
2011-12-22 18:47                     ` Tejun Heo
2011-12-20 16:26     ` Christoph Lameter
2011-12-21  8:06       ` Pekka Enberg
2011-12-21 15:20         ` Christoph Lameter
