* [GIT PULL] slab fixes for 3.2-rc4
@ 2011-11-29 18:02 Pekka Enberg
  2011-11-29 19:29 ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Pekka Enberg @ 2011-11-29 18:02 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, Christoph Lameter, linux-kernel

Hi Linus,

Here's a few important fixes for SLUB issues introduced in the merge window. The most important ones, from Christoph Lameter and Eric Dumazet, fix stability issues on non-x86 architectures. The patches from Shaohua Li fix a performance regression in SLUB that was also introduced in the merge window.

			Pekka

The following changes since commit 883381d9f1c5a6329bbb796e23ae52c939940310:

  Merge branch 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (2011-11-29 08:59:12 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux.git slab/urgent

Christoph Lameter (1):
      slub: use irqsafe_cpu_cmpxchg for put_cpu_partial

Eric Dumazet (1):
      slub: avoid potential NULL dereference or corruption

Shaohua Li (2):
      slub: use correct parameter to add a page to partial list tail
      slub: move discard_slab out of node lock

 mm/slub.c |   42 ++++++++++++++++++++++++++----------------
 1 files changed, 26 insertions(+), 16 deletions(-)

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-11-29 18:02 [GIT PULL] slab fixes for 3.2-rc4 Pekka Enberg
@ 2011-11-29 19:29 ` Linus Torvalds
  2011-11-29 19:38   ` Linus Torvalds
  2011-12-20  9:47   ` Pekka Enberg
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2011-11-29 19:29 UTC (permalink / raw)
To: Pekka Enberg, Ingo Molnar
Cc: Andrew Morton, Christoph Lameter, linux-kernel, linux-arch

On Tue, Nov 29, 2011 at 10:02 AM, Pekka Enberg <penberg@kernel.org> wrote:
>
> The most important ones, from Christoph Lameter and Eric Dumazet, fix
> stability issues on non-x86 architectures.
>
> Christoph Lameter (1):
>       slub: use irqsafe_cpu_cmpxchg for put_cpu_partial

Quite frankly, I think "this_cpu_cmpxchg()" is pure crap, and this fix was wrong, wrong, WRONG.

Dammit, the emulation is just bad. The *only* architecture that implements "this_cpu_cmpxchg()" specially seems to be x86. So every other architecture uses the emulated one, AND THE EMULATED ONE DOES NOT MATCH THE X86 SEMANTICS!

Sorry for shouting, but that's just *incredibly* stupid.

What's extra stupid is that s390 actually separately implements the irq-safe version, and does it with code that seems to be generic, and actually uses "cmpxchg()" like the regular "this_cpu_cmpxchg()" name implies!

In other words, I would suggest that:

 - we get rid of that totally crazy idiotic "irqsafe" crap. The normal one should be irq-safe. It's named "cmpxchg", for chrissake!

 - we make the non-arch-specific "this_cpu_cmpxchg()" use the correct s390 routines instead of the pure and utter *shit* it uses now.

Yes, I'm upset. This "random percpu crap for SLUB" has been going on too f*cking long, and it has been *continually* plagued by just pure sh*t like this.

Seriously. Stop taking patches from Christoph in this area until they have been vetted for sanity and completeness. This kind of ugly "let's mis-name our functions in really subtle ways so that they work on x86 but nowhere else" needs to stop. I pulled this crap because clearly it fixes a bug, but just stop it.

Show some TASTE!

                     Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-11-29 19:29 ` Linus Torvalds
@ 2011-11-29 19:38   ` Linus Torvalds
  2011-12-20  9:47   ` Pekka Enberg
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2011-11-29 19:38 UTC (permalink / raw)
To: Pekka Enberg, Ingo Molnar, Thomas Gleixner
Cc: Andrew Morton, Christoph Lameter, linux-kernel, linux-arch

On Tue, Nov 29, 2011 at 11:29 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Dammit, the emulation is just bad.

Btw, wasn't the "crazy random thiscpu ops" one of the topics at the KS? I thought we agreed to clean these things up and remove most of them? Did it just get dropped, or is it pending somewhere?

                     Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-11-29 19:29 ` Linus Torvalds
  2011-11-29 19:38   ` Linus Torvalds
@ 2011-12-20  9:47   ` Pekka Enberg
  2011-12-20 16:23     ` Tejun Heo
  2011-12-20 16:26     ` Christoph Lameter
  1 sibling, 2 replies; 32+ messages in thread
From: Pekka Enberg @ 2011-12-20 9:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Ingo Molnar, Andrew Morton, Christoph Lameter, linux-kernel, linux-arch, Tejun Heo, Thomas Gleixner

Hi,

(I'm CC'ing Tejun and Thomas.)

On Tue, Nov 29, 2011 at 10:02 AM, Pekka Enberg <penberg@kernel.org> wrote:
>> The most important ones, from Christoph Lameter and Eric Dumazet, fix
>> stability issues on non-x86 architectures.
>>
>> Christoph Lameter (1):
>>       slub: use irqsafe_cpu_cmpxchg for put_cpu_partial

On Tue, Nov 29, 2011 at 9:29 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Quite frankly, I think "this_cpu_cmpxchg()" is pure crap, and this fix
> was wrong, wrong, WRONG.
>
> Dammit, the emulation is just bad.
>
> The *only* architecture that implements "this_cpu_cmpxchg()" specially
> seems to be x86.
>
> So every other architecture uses the emulated one, AND THE EMULATED
> ONE DOES NOT MATCH THE X86 SEMANTICS!
>
> Sorry for shouting, but that's just *incredibly* stupid.
>
> What's extra stupid is that s390 actually separately implements the
> irq-safe version, and does it with code that seems to be generic, and
> actually uses "cmpxchg()" like the regular "this_cpu_cmpxchg()" name
> implies!
>
> In other words, I would suggest that:
>
>  - we get rid of that totally crazy idiotic "irqsafe" crap. The normal
>    one should be irq-safe. It's named "cmpxchg", for chrissake!
>
>  - we make the non-arch-specific "this_cpu_cmpxchg()" use the correct
>    s390 routines instead of the pure and utter *shit* it uses now.
>
> Yes, I'm upset. This "random percpu crap for SLUB" has been going on
> too f*cking long, and it has been *continually* plagued by just pure
> sh*t like this.
>
> Seriously. Stop taking patches from Christoph in this area until they
> have been vetted for sanity and completeness. This kind of ugly "let's
> mis-name our functions in really subtle ways so that they work on x86
> but nowhere else" needs to stop.

So, I actually looked into doing something like this and wasn't actually able to understand the purpose of the various percpu variants. It seems rather obvious that we can just drop the non-irqsafe cmpxchg() variant but what about the rest of the percpu ops? Why do we have preempt safe, irqsafe, and unsafe variants? How are they supposed to be used?

To illustrate the issue, for "per cpu add" we have:

    __this_cpu_add()
    this_cpu_add()
    irqsafe_cpu_add()
    percpu_add()

Why do we need all of them?

			Pekka

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20  9:47 ` Pekka Enberg
@ 2011-12-20 16:23   ` Tejun Heo
  2011-12-20 16:31     ` Christoph Lameter
  2011-12-20 19:28     ` Linus Torvalds
  2011-12-20 16:26   ` Christoph Lameter
  1 sibling, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-20 16:23 UTC (permalink / raw)
To: Pekka Enberg
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Christoph Lameter, linux-kernel, linux-arch, Thomas Gleixner

Hello, Pekka.

On Tue, Dec 20, 2011 at 11:47:26AM +0200, Pekka Enberg wrote:
> So, I actually looked into doing something like this and wasn't
> actually able to understand the purpose of the various percpu
> variants. It seems rather obvious that we can just drop the
> non-irqsafe cmpxchg() variant but what about the rest of the percpu
> ops? Why do we have preempt safe, irqsafe, and unsafe variants? How
> are they supposed to be used?
>
> To illustrate the issue, for "per cpu add" we have:
>
> __this_cpu_add()
> this_cpu_add()
> irqsafe_cpu_add()

Description for 7340a0b152 "this_cpu: Introduce this_cpu_ptr() and generic this_cpu_* operations" should explain the above three. In short,

  __this_cpu_add()    : synchronization is caller's responsibility
  this_cpu_add()      : protected against preemption
  irqsafe_cpu_add()   : protected against irq

> percpu_add()

This is __this_cpu_add() + preemption disabled check. Should be removed. Christoph, is there a use case where __this_cpu_XXX() is used without preemption disabled? Why doesn't it have preemption check?

> Why do we need all of them?

It would be great if we can drop the preempt safe one. For x86, it doesn't make any difference. For archs which can't do it in a single instruction && irq on/off is expensive, it can be bad. I don't know how bad tho.

percpu API needs to be cleaned up. There are quite a few duplicates - some are from the days when static and dynamic percpu memories were different, some got added during the this_cpu_*() stuff. It has been on the todo list for a while now but I never got around to doing it. If I'm not missing something, all we need are:

* per_cpu_ptr()

* get_cpu_var(), put_cpu_var() - it would be more consistent if they're get_cpu_ptr() and put_cpu_ptr().

* [__]this_cpu_ptr()

* Hopefully, a smaller subset of this_cpu_XXX() ops.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread
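To make the three sets of semantics concrete, here is a condensed sketch of what the generic (non-arch) fallbacks amount to; the sketch_ names are illustrative only, and the real macros in include/linux/percpu.h additionally dispatch on operand size:

	/* Synchronization is the caller's responsibility. */
	#define sketch__this_cpu_add(pcp, val)		\
	do {						\
		*__this_cpu_ptr(&(pcp)) += (val);	\
	} while (0)

	/* Protected against preemption, but not against local interrupts. */
	#define sketch_this_cpu_add(pcp, val)		\
	do {						\
		preempt_disable();			\
		*__this_cpu_ptr(&(pcp)) += (val);	\
		preempt_enable();			\
	} while (0)

	/* Protected against local interrupts (and thus preemption too). */
	#define sketch_irqsafe_cpu_add(pcp, val)	\
	do {						\
		unsigned long flags;			\
		local_irq_save(flags);			\
		*__this_cpu_ptr(&(pcp)) += (val);	\
		local_irq_restore(flags);		\
	} while (0)

On x86 all three can compile down to the same single segment-prefixed add instruction, which is why the distinction is invisible there and only bites the architectures that use the emulated fallbacks.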
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 16:23 ` Tejun Heo
@ 2011-12-20 16:31   ` Christoph Lameter
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Lameter @ 2011-12-20 16:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Pekka Enberg, Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

On Tue, 20 Dec 2011, Tejun Heo wrote:
> This is __this_cpu_add() + preemption disabled check. Should be
> removed. Christoph, is there a use case where __this_cpu_XXX() is
> used without preemption disabled? Why doesn't it have preemption
> check?

We discussed this before and said that it would be possible to add a preemption check there. We would need to verify that there is no use case of __this_cpu_XXX operations in preemptable context. There used to be a case where we did not care about the races for the vmstat counters but that seems to have been changed. Not aware of any other use case like that.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 16:23 ` Tejun Heo
  2011-12-20 16:31   ` Christoph Lameter
@ 2011-12-20 19:28   ` Linus Torvalds
  2011-12-20 20:28     ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2011-12-20 19:28 UTC (permalink / raw)
To: Tejun Heo
Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, Christoph Lameter, linux-kernel, linux-arch, Thomas Gleixner

On Tue, Dec 20, 2011 at 8:23 AM, Tejun Heo <tj@kernel.org> wrote:
>>
>> To illustrate the issue, for "per cpu add" we have:
>>
>>     __this_cpu_add()
>>     this_cpu_add()
>>     irqsafe_cpu_add()
>
> Description for 7340a0b152 "this_cpu: Introduce this_cpu_ptr() and
> generic this_cpu_* operations" should explain the above three.

I don't think that's relevant. Sure, they have semantics, but the semantics are stupid and wrong. Whether they are documented or not isn't even the issue.

There are also too many of them to begin with, and they are generally pointless. Just grep for "this_cpu_" and notice that we basically have *more* lines in the header files to define the crap than we have lines in the rest of the kernel to *use* it. If that doesn't show how crappy the idiotic interface is, I don't know what would.

So I would suggest:

 - get rid of *all* of them, leave exactly one version (this_cpu_xyz())

 - ask yourself which of the ops we even need. "this_cpu_write()" needs to just go away. There's no excuse for it. Are the other ones needed? I can see "add" and "cmpxchg". Is there *any* reason to support anything else?

 - afaik, there is exactly *one* user of the "this_cpu_cmpxchg", and *one* user of "irqsafe_cpu_cmpxchg". And those are both inside CONFIG_CMPXCHG_LOCAL. Yet we have completely *INSANE* semantics for the cmpxchg operations.

 - there's the magic "irqsafe_cmpxchg_double". Again, there is exactly *one* user of it, but just grep for "cpu_cmpxchg_double" and notice how we define all the crazy variations of it etc too. Again, *INSANE*.

Seriously, the whole piece of crap needs to go. Everything. It's shit. It's unsalvageable.

I would suggest that we support exactly two operations, and nothing more.

 - this_cpu_cmpxchg() (and the "double" version of it): This operation needs to be irq-safe and preempt-safe *by definition*. The whole concept of "cmpxchg" doesn't make sense without that. Having three different versions of it with totally insane semantics just NEEDS TO GO AWAY. There should be one version, and it damn well is safe to use and does a totally unambiguous "cmpxchg". Nothing else.

 - this_cpu_add(): This operation is potentially useful for "sloppy statistics" where the point is to do things quickly without requiring atomic accesses. As such, it at least has *some* excuse for having the "irqsafe" vs "preempt_safe" vs "regular" semantics. And it does have several users, so.. However, I'd like to do something sane about it, and the non-preempt version absolutely *needs* to have a debug check to verify it; Thomas was already complaining about this.

But get rid of everything else. Stop using the "inc" ones, replace them by "add 1". Don't have millions of lines of duplicate definitions for crap that nobody cares about.

Also, get rid of the 1-byte and 2-byte versions. I doubt anybody actually uses them, and it's not even a portable thing to do well (ie the whole alpha issue with non-atomic byte and word accesses).
So again, it's just noise in the header files that makes it hard to understand how they work, because they are so verbose and do so many stupid things.

Being "generic" is not actually a good thing. Not when we're talking about random details like this.

Hmm?

                     Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
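For context, the usage pattern the cmpxchg argument is about looks roughly like this trimmed sketch of SLUB's put_cpu_partial() (based on the code touched by the patch later in this thread), written with the single irq-safe this_cpu_cmpxchg() Linus is asking for:

	/* Lockless push of 'page' onto this CPU's partial list. */
	do {
		oldpage = this_cpu_read(s->cpu_slab->partial);
		page->pobjects = pobjects;
		page->next = oldpage;
	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
							!= oldpage);

The retry loop is only correct if the cmpxchg is atomic against local interrupts, since the freeing path that modifies the same list can run from interrupt context; that is exactly why an ambiguous, maybe-irq-safe cmpxchg is a trap.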
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 19:28 ` Linus Torvalds
@ 2011-12-20 20:28   ` Tejun Heo
  2011-12-21  8:08     ` Pekka Enberg
  2011-12-21 15:16     ` Christoph Lameter
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-20 20:28 UTC (permalink / raw)
To: Linus Torvalds
Cc: Pekka Enberg, Ingo Molnar, Andrew Morton, Christoph Lameter, linux-kernel, linux-arch, Thomas Gleixner

Hello, Linus.

On Tue, Dec 20, 2011 at 11:28:25AM -0800, Linus Torvalds wrote:
> > Description for 7340a0b152 "this_cpu: Introduce this_cpu_ptr() and
> > generic this_cpu_* operations" should explain the above three.
>
> I don't think that's relevant.
>
> Sure, they have semantics, but the semantics are stupid and wrong.
> Whether they are documented or not isn't even the issue.

I was trying to point Pekka to documentation so that at least the existing semantics are clear.

> Being "generic" is not actually a good thing. Not when we're talking
> about random details like this.

Yeah, I generally agree that reducing the API would be great. Given the usage, I think (or at least hope) dropping preemption protected ones wouldn't hurt much but it might be worthwhile to keep __this_cpu_*() - the ones which expect the caller to take care of synchronization - w/ assertion on irq disabled.

Christoph, what do you think? What would be the minimal set that you can work with?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 20:28 ` Tejun Heo
@ 2011-12-21  8:08   ` Pekka Enberg
  2011-12-21 17:09     ` Tejun Heo
  0 siblings, 1 reply; 32+ messages in thread
From: Pekka Enberg @ 2011-12-21 8:08 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Christoph Lameter, linux-kernel, linux-arch, Thomas Gleixner

On Tue, Dec 20, 2011 at 11:28:25AM -0800, Linus Torvalds wrote:
>> > Description for 7340a0b152 "this_cpu: Introduce this_cpu_ptr() and
>> > generic this_cpu_* operations" should explain the above three.
>>
>> I don't think that's relevant.
>>
>> Sure, they have semantics, but the semantics are stupid and wrong.
>> Whether they are documented or not isn't even the issue.

On Tue, Dec 20, 2011 at 10:28 PM, Tejun Heo <tj@kernel.org> wrote:
> I was trying to point Pekka to documentation so that at least the
> existing semantics are clear.

Sure, but well-defined semantics alone are not sufficient for a reasonable API. It's not at all obvious which one of the four variants to pick when writing code. I don't see any evidence that people actually understand the API. On the contrary, I see bugs caused by API confusion in mm/slub.c itself!

			Pekka

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21  8:08 ` Pekka Enberg
@ 2011-12-21 17:09   ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-21 17:09 UTC (permalink / raw)
To: Pekka Enberg
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Christoph Lameter, linux-kernel, linux-arch, Thomas Gleixner

Hello,

On Wed, Dec 21, 2011 at 10:08:48AM +0200, Pekka Enberg wrote:
> Sure, but well-defined semantics alone are not sufficient for a
> reasonable API. It's not at all obvious which one of the four variants
> to pick when writing code. I don't see any evidence that people
> actually understand the API. On the contrary, I see bugs caused by API
> confusion in mm/slub.c itself!

Sure, I agree. If nobody beats me to it, I'll try to clean it up for the merge window after the coming one (ie. 3.4). But please feel free to submit patches to gut out cruft and clean up. I'll be happy to apply them.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 20:28 ` Tejun Heo
  2011-12-21  8:08   ` Pekka Enberg
@ 2011-12-21 15:16   ` Christoph Lameter
  2011-12-21 17:05     ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2011-12-21 15:16 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

On Tue, 20 Dec 2011, Tejun Heo wrote:
> the usage, I think (or at least hope) dropping preemption protected
> ones wouldn't hurt much but it might be worthwhile to keep
> __this_cpu_*() - the ones which expect the caller to take care of
> synchronization - w/ assertion on irq disabled.

__this_cpu ops are generally the most useless. You can basically do the same thing by open coding it. But then on x86 you'd miss out on generating a simple inc seg:var instruction that does not impact registers. Plus you avoid the necessity of calculating the address first. Instead of one instruction you'd have 5.

Dropping preemption protected ones is going to be difficult given their use in key subsystems.

> Christoph, what do you think? What would be the minimal set that you
> can work with?

If you are just talking about the slub allocator and the this_cpu_cmpxchg variants there then the irqsafe variants of cmpxchg and cmpxchg_double are sufficient there.

However, the this_cpu ops are widely used in many subsystems for keeping statistics. Their main role is to keep the overhead of incrementing/adding to counters as minimal as possible. Changes there would cause instructions to be generated that are longer in size and also would cause higher latency of execution. Generally the irqsafe variants are not needed for counters so we may be able to toss those.

this_cpu ops are not sloppy unless one intentionally uses __this_cpu_xxx in a non preempt safe context which was the case for the vmstat counters for a while.

The amount of this_cpu functions may be excessive because I tried to cover all possible use cases rather than actually used forms in the kernel. So a lot of things could be weeded out. this_cpu ops are a way to experiment with different forms of synchronization that are particularly important for fastpaths implementing per cpu caching. This could be of relevance to many of the allocators in the future.

The way that the cmpxchg things are used is also similar to transactional memory that is becoming available in the next generation of processors by Intel and that is already available in the current generation of powerpc processors by IBM. It is a way to avoid locking overhead.

^ permalink raw reply	[flat|nested] 32+ messages in thread
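To make the statistics case concrete, a minimal sketch (the event counter and helper are hypothetical):

	DEFINE_PER_CPU(unsigned long, nr_hypothetical_events);

	static inline void count_hypothetical_event(void)
	{
		/*
		 * On x86 this becomes a single "inc %gs:..." style
		 * instruction: no address calculation, no registers
		 * clobbered, no explicit protection sequence.
		 */
		this_cpu_inc(nr_hypothetical_events);
	}

The open-coded fallback has to compute the per-cpu address and protect the read-modify-write separately, which is where the roughly five instructions Christoph mentions come from.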
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21 15:16 ` Christoph Lameter
@ 2011-12-21 17:05   ` Tejun Heo
  2011-12-22  2:19     ` Linus Torvalds
  2011-12-22 14:58     ` Christoph Lameter
  0 siblings, 2 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-21 17:05 UTC (permalink / raw)
To: Christoph Lameter
Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

Hello, Christoph.

On Wed, Dec 21, 2011 at 09:16:24AM -0600, Christoph Lameter wrote:
> __this_cpu ops are generally the most useless. You can basically do the
> same thing by open coding it. But then on x86 you'd miss out on generating
> a simple inc seg:var instruction that does not impact registers. Plus you
> avoid the necessity of calculating the address first. Instead of one
> instruction you'd have 5.
>
> Dropping preemption protected ones is going to be difficult given their
> use in key subsystems.

The thing is that irqsafe ones are the "complete" ones. We can use irqsafe ones instead of preempt safe ones but not the other way. This matters only if flipping irq is noticeably more expensive than inc/dec'ing preempt count but I suspect there are enough such machines. (cc'ing arch) Does anyone have better insight here? How much more expensive are local irq save/restore compared to inc/dec'ing preempt count on various archs?

> > Christoph, what do you think? What would be the minimal set that you
> > can work with?
>
> If you are just talking about the slub allocator and the this_cpu_cmpxchg
> variants there then the irqsafe variants of cmpxchg and cmpxchg_double are
> sufficient there.
>
> However, the this_cpu ops are widely used in many subsystems for keeping
> statistics. Their main role is to keep the overhead of incrementing/adding
> to counters as minimal as possible. Changes there would cause instructions
> to be generated that are longer in size and also would cause higher
> latency of execution. Generally the irqsafe variants are not needed for
> counters so we may be able to toss those.
>
> this_cpu ops are not sloppy unless one intentionally uses __this_cpu_xxx
> in a non preempt safe context which was the case for the vmstat counters
> for a while.
>
> The amount of this_cpu functions may be excessive because I tried to cover
> all possible use cases rather than actually used forms in the kernel. So
> a lot of things could be weeded out. this_cpu ops are a way to experiment
> with different forms of synchronization that are particularly important for
> fastpaths implementing per cpu caching. This could be of relevance to many
> of the allocators in the future.
>
> The way that the cmpxchg things are used is also similar to transactional
> memory that is becoming available in the next generation of processors by
> Intel and that is already available in the current generation of powerpc
> processors by IBM. It is a way to avoid locking overhead.

Hmmm... how about removing the ones which aren't currently in use? percpu API in general needs a lot more clean up but I think that would be a good starting point.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21 17:05 ` Tejun Heo
@ 2011-12-22  2:19   ` Linus Torvalds
  2011-12-22 16:05     ` Tejun Heo
  2011-12-28 10:25     ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2011-12-22 2:19 UTC (permalink / raw)
To: Tejun Heo
Cc: Christoph Lameter, Pekka Enberg, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

On Wed, Dec 21, 2011 at 9:05 AM, Tejun Heo <tj@kernel.org> wrote:
>
> machines. (cc'ing arch) Does anyone have better insight here? How
> much more expensive are local irq save/restore compared to inc/dec'ing
> preempt count on various archs?

I think powerpc does sw irq disable, so it's pretty much the same.

However, on *most* architectures, if the stupid generic "disable interrupts and then do the op" is too expensive, they could easily just do an architecture-specific "safe" version. Especially if they only need to do cmpxchg and add.

Many architectures can do those safely with load-locked, store-conditional (ie atomic_add), and do so quickly. Sure, there are broken cases where that is really slow (original alpha is an example of that), but I think it's fairly rare. So both arm (in v6+) and powerpc should be able to just use the "atomic_add" version, with no real downside.

So I really suspect that we could just say: "make the irq-safe version be the *only* version", and no architecture will really care. Sure, it can be more expensive, but it usually isn't. Only when done badly and stupidly is it nasty.

                     Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22  2:19 ` Linus Torvalds
@ 2011-12-22 16:05   ` Tejun Heo
  0 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-22 16:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christoph Lameter, Pekka Enberg, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

Hello, Linus.

On Wed, Dec 21, 2011 at 06:19:34PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2011 at 9:05 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > machines. (cc'ing arch) Does anyone have better insight here? How
> > much more expensive are local irq save/restore compared to inc/dec'ing
> > preempt count on various archs?
>
> I think powerpc does sw irq disable, so it's pretty much the same.
>
> However, on *most* architectures, if the stupid generic "disable
> interrupts and then do the op" is too expensive, they could easily
> just do an architecture-specific "safe" version. Especially if they
> only need to do cmpxchg and add.
>
> Many architectures can do those safely with load-locked,
> store-conditional (ie atomic_add), and do so quickly. Sure, there are
> broken cases where that is really slow (original alpha is an example
> of that), but I think it's fairly rare.
>
> So both arm (in v6+) and powerpc should be able to just use the
> "atomic_add" version, with no real downside.

Hmmm... right, if actual atomic ops are used we don't need to guarantee that percpu address translation and operation happen on the same CPU. The end result would be correct regardless. Atomic ops wouldn't be as efficient as a single local op on x86, but they shouldn't be too expensive when the vast majority of operations are happening on cachelines which are already exclusively owned.

> So I really suspect that we could just say: "make the irq-safe version
> be the *only* version", and no architecture will really care. Sure, it
> can be more expensive, but it usually isn't. Only when done badly and
> stupidly is it nasty.

Yeah, I don't think there's any cross-architecture way of making this_cpu ops universally faster than an explicit irq/preempt_disable - get_addr - do one or more ops - irq/preempt_enable sequence. Many archs can't even load an address offset with a single instruction, after all. Even on x86, with a large enough number of ops, the segment prefix and full address on each op would become more expensive than the explicit sequence.

Given that, I think just providing the irqsafe version and letting all other cases use the explicit ones is a good tradeoff. Alright, I'll do that for the 3.4 merge window.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread
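A sketch of the atomic-op approach discussed above (the wrapper name is hypothetical, and it assumes the arch implements atomic_long_add() as an LL/SC or otherwise cheap atomic sequence):

	static inline void sketch_this_cpu_add_long(atomic_long_t __percpu *pcp,
						    long val)
	{
		/*
		 * No irq or preempt fiddling: even if we migrate between
		 * taking the per-cpu address and doing the add, we merely
		 * bump the previous CPU's copy atomically, and the sum
		 * across CPUs stays correct, which is all the counter
		 * style users need.
		 */
		atomic_long_add(val, this_cpu_ptr(pcp));
	}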
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22  2:19 ` Linus Torvalds
  2011-12-22 16:05   ` Tejun Heo
@ 2011-12-28 10:25   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 32+ messages in thread
From: Benjamin Herrenschmidt @ 2011-12-28 10:25 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tejun Heo, Christoph Lameter, Pekka Enberg, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

On Wed, 2011-12-21 at 18:19 -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2011 at 9:05 AM, Tejun Heo <tj@kernel.org> wrote:
> >
> > machines. (cc'ing arch) Does anyone have better insight here? How
> > much more expensive are local irq save/restore compared to inc/dec'ing
> > preempt count on various archs?
>
> I think powerpc does sw irq disable, so it's pretty much the same.

On 64-bit only, but it's probably still better than going for an atomic op; our atomics tend to be handled at the l2 level and so are sloooow.

 .../...

> So I really suspect that we could just say: "make the irq-safe version
> be the *only* version", and no architecture will really care. Sure, it
> can be more expensive, but it usually isn't. Only when done badly and
> stupidly is it nasty.

Agreed, keep it simple, or we'll just grow more bugs like this one.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21 17:05 ` Tejun Heo
  2011-12-22  2:19   ` Linus Torvalds
@ 2011-12-22 14:58   ` Christoph Lameter
  2011-12-22 16:08     ` Tejun Heo
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2011-12-22 14:58 UTC (permalink / raw)
To: Tejun Heo
Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

On Wed, 21 Dec 2011, Tejun Heo wrote:
> The thing is that irqsafe ones are the "complete" ones. We can use
> irqsafe ones instead of preempt safe ones but not the other way. This
> matters only if flipping irq is noticeably more expensive than
> inc/dec'ing preempt count but I suspect there are enough such
> machines. (cc'ing arch) Does anyone have better insight here? How
> much more expensive are local irq save/restore compared to inc/dec'ing
> preempt count on various archs?

Well that would be a pretty nice simplification of the API. Replace the fallback code for the preempt safe ones with the irqsafe fallbacks, then drop the irqsafe variants from percpu.h.

> > The way that the cmpxchg things are used is also similar to transactional
> > memory that is becoming available in the next generation of processors by
> > Intel and that is already available in the current generation of powerpc
> > processors by IBM. It is a way to avoid locking overhead.
>
> Hmmm... how about removing the ones which aren't currently in use?

Yep. Could easily be done. We can resurrect the stuff as needed when other variants become necessary. In particular the _and and _or etc stuff was just added to be backward compatible with the old per cpu and local_t interfaces. There may be no use cases left.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 14:58 ` Christoph Lameter
@ 2011-12-22 16:08   ` Tejun Heo
  2011-12-22 17:58     ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Tejun Heo @ 2011-12-22 16:08 UTC (permalink / raw)
To: Christoph Lameter
Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

Hello, Christoph.

On Thu, Dec 22, 2011 at 08:58:43AM -0600, Christoph Lameter wrote:
> Well that would be a pretty nice simplification of the API.
> Replace the fallback code for the preempt safe ones with the
> irqsafe fallbacks, then drop the irqsafe variants from percpu.h.

Yeah, it seems we're going that direction.

> > > The way that the cmpxchg things are used is also similar to transactional
> > > memory that is becoming available in the next generation of processors by
> > > Intel and that is already available in the current generation of powerpc
> > > processors by IBM. It is a way to avoid locking overhead.
> >
> > Hmmm... how about removing the ones which aren't currently in use?
>
> Yep. Could easily be done. We can resurrect the stuff as needed when other
> variants become necessary. In particular the _and and _or etc stuff was
> just added to be backward compatible with the old per cpu and local_t
> interfaces. There may be no use cases left.

Yeap, and that one too. Maybe we can finally kill the duplicate confusing static/dynamic accessors too. I'm planning to get to it in several weeks but if anyone can beat me to it, please go ahead.

Thank you.

--
tejun

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4 2011-12-22 16:08 ` Tejun Heo @ 2011-12-22 17:58 ` Christoph Lameter 2011-12-22 18:03 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Christoph Lameter @ 2011-12-22 17:58 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner On Thu, 22 Dec 2011, Tejun Heo wrote: > Yeap, and that one too. Maybe we can finally kill the duplicate > confusing static/dynamic accessors too. I'm planning to get to it in > several weeks but if anyone can beat me to it, please go ahead. That would be great. I looked at _and and _or and they both still have one use case (_xor has none though). But its easy to get rid of the irqsafe variants once we are willing to take the additional overhead that comes with disabling interrupts for the fallback cases. Subject: [percpu] Remove irqsafe_cpu_xxx variants We simply say that regular this_cpu use must be safe regardless of preemption and interrupt state. That has no material change for x86 and s390 implementations of this_cpu operations. However, arches that do not provide their own implementation for this_cpu operations will now get code generated that disables interrupts instead of preemption. Signed-off-by: Christoph Lameter <cl@linux.com> --- arch/s390/include/asm/percpu.h | 50 ++++----- arch/x86/include/asm/percpu.h | 28 ----- include/linux/netdevice.h | 4 include/linux/netfilter/x_tables.h | 4 include/linux/percpu.h | 190 ++++--------------------------------- include/net/snmp.h | 14 +- mm/slub.c | 6 - net/caif/caif_dev.c | 4 net/caif/cffrml.c | 4 9 files changed, 65 insertions(+), 239 deletions(-) Index: linux-2.6/include/linux/percpu.h =================================================================== --- linux-2.6.orig/include/linux/percpu.h 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/include/linux/percpu.h 2011-12-22 11:48:28.000000000 -0600 @@ -172,10 +172,10 @@ extern phys_addr_t per_cpu_ptr_to_phys(v * equal char, int or long. percpu_read() evaluates to a lvalue and * all others to void. * - * These operations are guaranteed to be atomic w.r.t. preemption. - * The generic versions use plain get/put_cpu_var(). Archs are + * These operations are guaranteed to be atomic. + * The generic versions disable interrupts. Archs are * encouraged to implement single-instruction alternatives which don't - * require preemption protection. + * require protection. 
*/ #ifndef percpu_read # define percpu_read(var) \ @@ -347,9 +347,10 @@ do { \ #define _this_cpu_generic_to_op(pcp, val, op) \ do { \ - preempt_disable(); \ + unsigned long flags; \ + local_irq_save(flags); \ *__this_cpu_ptr(&(pcp)) op val; \ - preempt_enable(); \ + local_irq_restore(flags); \ } while (0) #ifndef this_cpu_write @@ -447,10 +448,11 @@ do { \ #define _this_cpu_generic_add_return(pcp, val) \ ({ \ typeof(pcp) ret__; \ - preempt_disable(); \ + unsigned long flags; \ + local_irq_save(flags); \ __this_cpu_add(pcp, val); \ ret__ = __this_cpu_read(pcp); \ - preempt_enable(); \ + local_irq_restore(flags); \ ret__; \ }) @@ -476,10 +478,11 @@ do { \ #define _this_cpu_generic_xchg(pcp, nval) \ ({ typeof(pcp) ret__; \ - preempt_disable(); \ + unsigned long flags; \ + local_irq_save(flags); \ ret__ = __this_cpu_read(pcp); \ __this_cpu_write(pcp, nval); \ - preempt_enable(); \ + local_irq_restore(flags); \ ret__; \ }) @@ -501,12 +504,14 @@ do { \ #endif #define _this_cpu_generic_cmpxchg(pcp, oval, nval) \ -({ typeof(pcp) ret__; \ - preempt_disable(); \ +({ \ + typeof(pcp) ret__; \ + unsigned long flags; \ + local_irq_save(flags); \ ret__ = __this_cpu_read(pcp); \ if (ret__ == (oval)) \ __this_cpu_write(pcp, nval); \ - preempt_enable(); \ + local_irq_restore(flags); \ ret__; \ }) @@ -538,10 +543,11 @@ do { \ #define _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \ ({ \ int ret__; \ - preempt_disable(); \ + unsigned long flags; \ + local_irq_save(flags); \ ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2, \ oval1, oval2, nval1, nval2); \ - preempt_enable(); \ + local_irq_restore(flags); \ ret__; \ }) @@ -567,9 +573,9 @@ do { \ #endif /* - * Generic percpu operations that do not require preemption handling. + * Generic percpu operations for context that are safe from preemption/interrupts. * Either we do not care about races or the caller has the - * responsibility of handling preemptions issues. Arch code can still + * responsibility of handling preemption/interrupt issues. Arch code can still * override these instructions since the arch per cpu code may be more * efficient and may actually get race freeness for free (that is the * case for x86 for example). @@ -802,156 +808,4 @@ do { \ __pcpu_double_call_return_bool(__this_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2)) #endif -/* - * IRQ safe versions of the per cpu RMW operations. Note that these operations - * are *not* safe against modification of the same variable from another - * processors (which one gets when using regular atomic operations) - * They are guaranteed to be atomic vs. local interrupts and - * preemption only. 
- */ -#define irqsafe_cpu_generic_to_op(pcp, val, op) \ -do { \ - unsigned long flags; \ - local_irq_save(flags); \ - *__this_cpu_ptr(&(pcp)) op val; \ - local_irq_restore(flags); \ -} while (0) - -#ifndef irqsafe_cpu_add -# ifndef irqsafe_cpu_add_1 -# define irqsafe_cpu_add_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=) -# endif -# ifndef irqsafe_cpu_add_2 -# define irqsafe_cpu_add_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=) -# endif -# ifndef irqsafe_cpu_add_4 -# define irqsafe_cpu_add_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=) -# endif -# ifndef irqsafe_cpu_add_8 -# define irqsafe_cpu_add_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), +=) -# endif -# define irqsafe_cpu_add(pcp, val) __pcpu_size_call(irqsafe_cpu_add_, (pcp), (val)) -#endif - -#ifndef irqsafe_cpu_sub -# define irqsafe_cpu_sub(pcp, val) irqsafe_cpu_add((pcp), -(val)) -#endif - -#ifndef irqsafe_cpu_inc -# define irqsafe_cpu_inc(pcp) irqsafe_cpu_add((pcp), 1) -#endif - -#ifndef irqsafe_cpu_dec -# define irqsafe_cpu_dec(pcp) irqsafe_cpu_sub((pcp), 1) -#endif - -#ifndef irqsafe_cpu_and -# ifndef irqsafe_cpu_and_1 -# define irqsafe_cpu_and_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=) -# endif -# ifndef irqsafe_cpu_and_2 -# define irqsafe_cpu_and_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=) -# endif -# ifndef irqsafe_cpu_and_4 -# define irqsafe_cpu_and_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=) -# endif -# ifndef irqsafe_cpu_and_8 -# define irqsafe_cpu_and_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), &=) -# endif -# define irqsafe_cpu_and(pcp, val) __pcpu_size_call(irqsafe_cpu_and_, (val)) -#endif - -#ifndef irqsafe_cpu_or -# ifndef irqsafe_cpu_or_1 -# define irqsafe_cpu_or_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=) -# endif -# ifndef irqsafe_cpu_or_2 -# define irqsafe_cpu_or_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=) -# endif -# ifndef irqsafe_cpu_or_4 -# define irqsafe_cpu_or_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=) -# endif -# ifndef irqsafe_cpu_or_8 -# define irqsafe_cpu_or_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), |=) -# endif -# define irqsafe_cpu_or(pcp, val) __pcpu_size_call(irqsafe_cpu_or_, (val)) -#endif - -#ifndef irqsafe_cpu_xor -# ifndef irqsafe_cpu_xor_1 -# define irqsafe_cpu_xor_1(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=) -# endif -# ifndef irqsafe_cpu_xor_2 -# define irqsafe_cpu_xor_2(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=) -# endif -# ifndef irqsafe_cpu_xor_4 -# define irqsafe_cpu_xor_4(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=) -# endif -# ifndef irqsafe_cpu_xor_8 -# define irqsafe_cpu_xor_8(pcp, val) irqsafe_cpu_generic_to_op((pcp), (val), ^=) -# endif -# define irqsafe_cpu_xor(pcp, val) __pcpu_size_call(irqsafe_cpu_xor_, (val)) -#endif - -#define irqsafe_cpu_generic_cmpxchg(pcp, oval, nval) \ -({ \ - typeof(pcp) ret__; \ - unsigned long flags; \ - local_irq_save(flags); \ - ret__ = __this_cpu_read(pcp); \ - if (ret__ == (oval)) \ - __this_cpu_write(pcp, nval); \ - local_irq_restore(flags); \ - ret__; \ -}) - -#ifndef irqsafe_cpu_cmpxchg -# ifndef irqsafe_cpu_cmpxchg_1 -# define irqsafe_cpu_cmpxchg_1(pcp, oval, nval) irqsafe_cpu_generic_cmpxchg(pcp, oval, nval) -# endif -# ifndef irqsafe_cpu_cmpxchg_2 -# define irqsafe_cpu_cmpxchg_2(pcp, oval, nval) irqsafe_cpu_generic_cmpxchg(pcp, oval, nval) -# endif -# ifndef irqsafe_cpu_cmpxchg_4 -# define irqsafe_cpu_cmpxchg_4(pcp, oval, nval) irqsafe_cpu_generic_cmpxchg(pcp, oval, nval) -# endif -# ifndef 
irqsafe_cpu_cmpxchg_8 -# define irqsafe_cpu_cmpxchg_8(pcp, oval, nval) irqsafe_cpu_generic_cmpxchg(pcp, oval, nval) -# endif -# define irqsafe_cpu_cmpxchg(pcp, oval, nval) \ - __pcpu_size_call_return2(irqsafe_cpu_cmpxchg_, (pcp), oval, nval) -#endif - -#define irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \ -({ \ - int ret__; \ - unsigned long flags; \ - local_irq_save(flags); \ - ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2, \ - oval1, oval2, nval1, nval2); \ - local_irq_restore(flags); \ - ret__; \ -}) - -#ifndef irqsafe_cpu_cmpxchg_double -# ifndef irqsafe_cpu_cmpxchg_double_1 -# define irqsafe_cpu_cmpxchg_double_1(pcp1, pcp2, oval1, oval2, nval1, nval2) \ - irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) -# endif -# ifndef irqsafe_cpu_cmpxchg_double_2 -# define irqsafe_cpu_cmpxchg_double_2(pcp1, pcp2, oval1, oval2, nval1, nval2) \ - irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) -# endif -# ifndef irqsafe_cpu_cmpxchg_double_4 -# define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, oval1, oval2, nval1, nval2) \ - irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) -# endif -# ifndef irqsafe_cpu_cmpxchg_double_8 -# define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \ - irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) -# endif -# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \ - __pcpu_double_call_return_bool(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2)) -#endif - #endif /* __LINUX_PERCPU_H */ Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/mm/slub.c 2011-12-22 11:48:28.000000000 -0600 @@ -1978,7 +1978,7 @@ int put_cpu_partial(struct kmem_cache *s page->pobjects = pobjects; page->next = oldpage; - } while (irqsafe_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage); + } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != oldpage); stat(s, CPU_PARTIAL_FREE); return pobjects; } @@ -2304,7 +2304,7 @@ redo: * Since this is without lock semantics the protection is only against * code executing on this cpu *not* from access by other cpus. 
*/ - if (unlikely(!irqsafe_cpu_cmpxchg_double( + if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, object, tid, get_freepointer_safe(s, object), next_tid(tid)))) { @@ -2534,7 +2534,7 @@ redo: if (likely(page == c->page)) { set_freepointer(s, object, c->freelist); - if (unlikely(!irqsafe_cpu_cmpxchg_double( + if (unlikely(!this_cpu_cmpxchg_double( s->cpu_slab->freelist, s->cpu_slab->tid, c->freelist, tid, object, next_tid(tid)))) { Index: linux-2.6/net/caif/caif_dev.c =================================================================== --- linux-2.6.orig/net/caif/caif_dev.c 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/net/caif/caif_dev.c 2011-12-22 11:48:28.000000000 -0600 @@ -69,12 +69,12 @@ static struct caif_device_entry_list *ca static void caifd_put(struct caif_device_entry *e) { - irqsafe_cpu_dec(*e->pcpu_refcnt); + this_cpu_dec(*e->pcpu_refcnt); } static void caifd_hold(struct caif_device_entry *e) { - irqsafe_cpu_inc(*e->pcpu_refcnt); + this_cpu_inc(*e->pcpu_refcnt); } static int caifd_refcnt_read(struct caif_device_entry *e) Index: linux-2.6/net/caif/cffrml.c =================================================================== --- linux-2.6.orig/net/caif/cffrml.c 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/net/caif/cffrml.c 2011-12-22 11:48:28.000000000 -0600 @@ -177,14 +177,14 @@ void cffrml_put(struct cflayer *layr) { struct cffrml *this = container_obj(layr); if (layr != NULL && this->pcpu_refcnt != NULL) - irqsafe_cpu_dec(*this->pcpu_refcnt); + this_cpu_dec(*this->pcpu_refcnt); } void cffrml_hold(struct cflayer *layr) { struct cffrml *this = container_obj(layr); if (layr != NULL && this->pcpu_refcnt != NULL) - irqsafe_cpu_inc(*this->pcpu_refcnt); + this_cpu_inc(*this->pcpu_refcnt); } int cffrml_refcnt_read(struct cflayer *layr) Index: linux-2.6/include/linux/netdevice.h =================================================================== --- linux-2.6.orig/include/linux/netdevice.h 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/include/linux/netdevice.h 2011-12-22 11:48:28.000000000 -0600 @@ -2115,7 +2115,7 @@ extern void netdev_run_todo(void); */ static inline void dev_put(struct net_device *dev) { - irqsafe_cpu_dec(*dev->pcpu_refcnt); + this_cpu_dec(*dev->pcpu_refcnt); } /** @@ -2126,7 +2126,7 @@ static inline void dev_put(struct net_de */ static inline void dev_hold(struct net_device *dev) { - irqsafe_cpu_inc(*dev->pcpu_refcnt); + this_cpu_inc(*dev->pcpu_refcnt); } /* Carrier loss detection, dial on demand. 
The functions netif_carrier_on Index: linux-2.6/include/linux/netfilter/x_tables.h =================================================================== --- linux-2.6.orig/include/linux/netfilter/x_tables.h 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/include/linux/netfilter/x_tables.h 2011-12-22 11:48:28.000000000 -0600 @@ -471,7 +471,7 @@ DECLARE_PER_CPU(seqcount_t, xt_recseq); * * Begin packet processing : all readers must wait the end * 1) Must be called with preemption disabled - * 2) softirqs must be disabled too (or we should use irqsafe_cpu_add()) + * 2) softirqs must be disabled too (or we should use this_cpu_add()) * Returns : * 1 if no recursion on this cpu * 0 if recursion detected @@ -503,7 +503,7 @@ static inline unsigned int xt_write_recs * * End packet processing : all readers can proceed * 1) Must be called with preemption disabled - * 2) softirqs must be disabled too (or we should use irqsafe_cpu_add()) + * 2) softirqs must be disabled too (or we should use this_cpu_add()) */ static inline void xt_write_recseq_end(unsigned int addend) { Index: linux-2.6/include/net/snmp.h =================================================================== --- linux-2.6.orig/include/net/snmp.h 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/include/net/snmp.h 2011-12-22 11:48:28.000000000 -0600 @@ -129,33 +129,33 @@ struct linux_xfrm_mib { __this_cpu_inc(mib[0]->mibs[field]) #define SNMP_INC_STATS_USER(mib, field) \ - irqsafe_cpu_inc(mib[0]->mibs[field]) + this_cpu_inc(mib[0]->mibs[field]) #define SNMP_INC_STATS_ATOMIC_LONG(mib, field) \ atomic_long_inc(&mib->mibs[field]) #define SNMP_INC_STATS(mib, field) \ - irqsafe_cpu_inc(mib[0]->mibs[field]) + this_cpu_inc(mib[0]->mibs[field]) #define SNMP_DEC_STATS(mib, field) \ - irqsafe_cpu_dec(mib[0]->mibs[field]) + this_cpu_dec(mib[0]->mibs[field]) #define SNMP_ADD_STATS_BH(mib, field, addend) \ __this_cpu_add(mib[0]->mibs[field], addend) #define SNMP_ADD_STATS_USER(mib, field, addend) \ - irqsafe_cpu_add(mib[0]->mibs[field], addend) + this_cpu_add(mib[0]->mibs[field], addend) #define SNMP_ADD_STATS(mib, field, addend) \ - irqsafe_cpu_add(mib[0]->mibs[field], addend) + this_cpu_add(mib[0]->mibs[field], addend) /* * Use "__typeof__(*mib[0]) *ptr" instead of "__typeof__(mib[0]) ptr" * to make @ptr a non-percpu pointer. 
*/ #define SNMP_UPD_PO_STATS(mib, basefield, addend) \ do { \ - irqsafe_cpu_inc(mib[0]->mibs[basefield##PKTS]); \ - irqsafe_cpu_add(mib[0]->mibs[basefield##OCTETS], addend); \ + this_cpu_inc(mib[0]->mibs[basefield##PKTS]); \ + this_cpu_add(mib[0]->mibs[basefield##OCTETS], addend); \ } while (0) #define SNMP_UPD_PO_STATS_BH(mib, basefield, addend) \ do { \ Index: linux-2.6/arch/x86/include/asm/percpu.h =================================================================== --- linux-2.6.orig/arch/x86/include/asm/percpu.h 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/arch/x86/include/asm/percpu.h 2011-12-22 11:48:28.000000000 -0600 @@ -414,22 +414,6 @@ do { \ #define this_cpu_xchg_2(pcp, nval) percpu_xchg_op(pcp, nval) #define this_cpu_xchg_4(pcp, nval) percpu_xchg_op(pcp, nval) -#define irqsafe_cpu_add_1(pcp, val) percpu_add_op((pcp), val) -#define irqsafe_cpu_add_2(pcp, val) percpu_add_op((pcp), val) -#define irqsafe_cpu_add_4(pcp, val) percpu_add_op((pcp), val) -#define irqsafe_cpu_and_1(pcp, val) percpu_to_op("and", (pcp), val) -#define irqsafe_cpu_and_2(pcp, val) percpu_to_op("and", (pcp), val) -#define irqsafe_cpu_and_4(pcp, val) percpu_to_op("and", (pcp), val) -#define irqsafe_cpu_or_1(pcp, val) percpu_to_op("or", (pcp), val) -#define irqsafe_cpu_or_2(pcp, val) percpu_to_op("or", (pcp), val) -#define irqsafe_cpu_or_4(pcp, val) percpu_to_op("or", (pcp), val) -#define irqsafe_cpu_xor_1(pcp, val) percpu_to_op("xor", (pcp), val) -#define irqsafe_cpu_xor_2(pcp, val) percpu_to_op("xor", (pcp), val) -#define irqsafe_cpu_xor_4(pcp, val) percpu_to_op("xor", (pcp), val) -#define irqsafe_cpu_xchg_1(pcp, nval) percpu_xchg_op(pcp, nval) -#define irqsafe_cpu_xchg_2(pcp, nval) percpu_xchg_op(pcp, nval) -#define irqsafe_cpu_xchg_4(pcp, nval) percpu_xchg_op(pcp, nval) - #ifndef CONFIG_M386 #define __this_cpu_add_return_1(pcp, val) percpu_add_return_op(pcp, val) #define __this_cpu_add_return_2(pcp, val) percpu_add_return_op(pcp, val) @@ -445,9 +429,6 @@ do { \ #define this_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) #define this_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define irqsafe_cpu_cmpxchg_1(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define irqsafe_cpu_cmpxchg_2(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define irqsafe_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) #endif /* !CONFIG_M386 */ #ifdef CONFIG_X86_CMPXCHG64 @@ -467,7 +448,6 @@ do { \ #define __this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2) #define this_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2) -#define irqsafe_cpu_cmpxchg_double_4(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2) #endif /* CONFIG_X86_CMPXCHG64 */ /* @@ -495,13 +475,6 @@ do { \ #define this_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval) #define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) -#define irqsafe_cpu_add_8(pcp, val) percpu_add_op((pcp), val) -#define irqsafe_cpu_and_8(pcp, val) percpu_to_op("and", (pcp), val) -#define irqsafe_cpu_or_8(pcp, val) percpu_to_op("or", (pcp), val) -#define irqsafe_cpu_xor_8(pcp, val) percpu_to_op("xor", (pcp), val) -#define irqsafe_cpu_xchg_8(pcp, nval) percpu_xchg_op(pcp, nval) -#define irqsafe_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval) - /* * Pretty complex macro to generate cmpxchg16 instruction. 
The instruction * is not supported on early AMD64 processors so we must be able to emulate @@ -532,7 +505,6 @@ do { \ #define __this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2) #define this_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2) -#define irqsafe_cpu_cmpxchg_double_8(pcp1, pcp2, o1, o2, n1, n2) percpu_cmpxchg16b_double(pcp1, o1, o2, n1, n2) #endif Index: linux-2.6/arch/s390/include/asm/percpu.h =================================================================== --- linux-2.6.orig/arch/s390/include/asm/percpu.h 2011-12-22 11:48:21.000000000 -0600 +++ linux-2.6/arch/s390/include/asm/percpu.h 2011-12-22 11:50:22.000000000 -0600 @@ -19,7 +19,7 @@ #define ARCH_NEEDS_WEAK_PER_CPU #endif -#define arch_irqsafe_cpu_to_op(pcp, val, op) \ +#define arch_this_cpu_to_op(pcp, val, op) \ do { \ typedef typeof(pcp) pcp_op_T__; \ pcp_op_T__ old__, new__, prev__; \ @@ -41,27 +41,27 @@ do { \ preempt_enable(); \ } while (0) -#define irqsafe_cpu_add_1(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, +) -#define irqsafe_cpu_add_2(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, +) -#define irqsafe_cpu_add_4(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, +) -#define irqsafe_cpu_add_8(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, +) - -#define irqsafe_cpu_and_1(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, &) -#define irqsafe_cpu_and_2(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, &) -#define irqsafe_cpu_and_4(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, &) -#define irqsafe_cpu_and_8(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, &) - -#define irqsafe_cpu_or_1(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, |) -#define irqsafe_cpu_or_2(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, |) -#define irqsafe_cpu_or_4(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, |) -#define irqsafe_cpu_or_8(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, |) - -#define irqsafe_cpu_xor_1(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, ^) -#define irqsafe_cpu_xor_2(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, ^) -#define irqsafe_cpu_xor_4(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, ^) -#define irqsafe_cpu_xor_8(pcp, val) arch_irqsafe_cpu_to_op(pcp, val, ^) +#define this_cpu_add_1(pcp, val) arch_this_cpu_to_op(pcp, val, +) +#define this_cpu_add_2(pcp, val) arch_this_cpu_to_op(pcp, val, +) +#define this_cpu_add_4(pcp, val) arch_this_cpu_to_op(pcp, val, +) +#define this_cpu_add_8(pcp, val) arch_this_cpu_to_op(pcp, val, +) + +#define this_cpu_and_1(pcp, val) arch_this_cpu_to_op(pcp, val, &) +#define this_cpu_and_2(pcp, val) arch_this_cpu_to_op(pcp, val, &) +#define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, &) +#define this_cpu_and_8(pcp, val) arch_this_cpu_to_op(pcp, val, &) + +#define this_cpu_or_1(pcp, val) arch_this_cpu_to_op(pcp, val, |) +#define this_cpu_or_2(pcp, val) arch_this_cpu_to_op(pcp, val, |) +#define this_cpu_or_4(pcp, val) arch_this_cpu_to_op(pcp, val, |) +#define this_cpu_or_8(pcp, val) arch_this_cpu_to_op(pcp, val, |) + +#define this_cpu_xor_1(pcp, val) arch_this_cpu_to_op(pcp, val, ^) +#define this_cpu_xor_2(pcp, val) arch_this_cpu_to_op(pcp, val, ^) +#define this_cpu_xor_4(pcp, val) arch_this_cpu_to_op(pcp, val, ^) +#define this_cpu_xor_8(pcp, val) arch_this_cpu_to_op(pcp, val, ^) -#define arch_irqsafe_cpu_cmpxchg(pcp, oval, nval) \ +#define arch_this_cpu_cmpxchg(pcp, oval, nval) \ ({ \ typedef typeof(pcp) pcp_op_T__; \ pcp_op_T__ ret__; \ @@ -79,10 +79,10 @@ do { \ ret__; \ }) -#define irqsafe_cpu_cmpxchg_1(pcp, oval, nval) arch_irqsafe_cpu_cmpxchg(pcp, oval, 
nval) -#define irqsafe_cpu_cmpxchg_2(pcp, oval, nval) arch_irqsafe_cpu_cmpxchg(pcp, oval, nval) -#define irqsafe_cpu_cmpxchg_4(pcp, oval, nval) arch_irqsafe_cpu_cmpxchg(pcp, oval, nval) -#define irqsafe_cpu_cmpxchg_8(pcp, oval, nval) arch_irqsafe_cpu_cmpxchg(pcp, oval, nval) +#define this_cpu_cmpxchg_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval) +#define this_cpu_cmpxchg_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval) +#define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval) +#define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval) #include <asm-generic/percpu.h> ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 17:58 ` Christoph Lameter
@ 2011-12-22 18:03   ` Ingo Molnar
  2011-12-22 18:31   ` Linus Torvalds
  2011-12-22 18:47   ` Tejun Heo
  2 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2011-12-22 18:03 UTC (permalink / raw)
To: Christoph Lameter
Cc: Tejun Heo, Linus Torvalds, Pekka Enberg, Andrew Morton, linux-kernel, linux-arch, Thomas Gleixner

* Christoph Lameter <cl@linux.com> wrote:

> On Thu, 22 Dec 2011, Tejun Heo wrote:
>
> > Yeap, and that one too. Maybe we can finally kill the duplicate
> > confusing static/dynamic accessors too. I'm planning to get to it in
> > several weeks but if anyone can beat me to it, please go ahead.
>
> That would be great. I looked at _and and _or and they both still have one
> use case (_xor has none though). But its easy to get rid of the irqsafe
> variants once we are willing to take the additional overhead that comes
> with disabling interrupts for the fallback cases.
>
> Subject: [percpu] Remove irqsafe_cpu_xxx variants
>
> We simply say that regular this_cpu use must be safe regardless of preemption
> and interrupt state. That has no material change for x86 and s390 implementations
> of this_cpu operations. However, arches that do not provide their own implementation
> for this_cpu operations will now get code generated that disables interrupts
> instead of preemption.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>
>
> ---
>  arch/s390/include/asm/percpu.h     |   50 ++++-----
>  arch/x86/include/asm/percpu.h      |   28 -----
>  include/linux/netdevice.h          |    4
>  include/linux/netfilter/x_tables.h |    4
>  include/linux/percpu.h             |  190 ++++---------------------------------
>  include/net/snmp.h                 |   14 +-
>  mm/slub.c                          |    6 -
>  net/caif/caif_dev.c                |    4
>  net/caif/cffrml.c                  |    4
>  9 files changed, 65 insertions(+), 239 deletions(-)

While this is progress, I think you have missed the essence of Linus's observations: percpu.h is *way* too complex and is offering way too many variants. The irqsafe madness was just the most blatant problem.

Note that even with your patch applied linux/percpu.h is 800+ lines long, while the total number of use cases is smaller than that - and then I haven't even considered all the arch percpu.h files.

Why not implement Linus's suggestion of just one or two __this_cpu() ops and be content with that model?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 17:58 ` Christoph Lameter
  2011-12-22 18:03 ` Ingo Molnar
@ 2011-12-22 18:31 ` Linus Torvalds
  2011-12-23 16:55   ` Christoph Lameter
  2011-12-22 18:47 ` Tejun Heo
  2 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2011-12-22 18:31 UTC (permalink / raw)
To: Christoph Lameter
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Thu, Dec 22, 2011 at 9:58 AM, Christoph Lameter <cl@linux.com> wrote:
>
> Subject: [percpu] Remove irqsafe_cpu_xxx variants

This looks good, but just remove the whole silly "xyz_op_return()" set
too. Don't even try to fix it up.

As far as I can tell, there are just a couple of users of that horrible
interface, and it really isn't worth it. Just remove it and open-code
it. They are actually broken on x86 ('xadd' only exists if
CONFIG_X86_XADD is on - i386 didn't have it), and while most people
probably don't care (i386? What's that?), I note that the x86
add_return stuff doesn't take that into account.

If somebody cares about the end result of the add that much, they can
damn well just do it themselves.

		Linus
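[What "open-code it" amounts to, as a minimal sketch: the per-cpu
variable stat_counter and the helper stat_add_return() below are
invented for illustration, but the primitives used are the stock kernel
ones of this era. A caller that wants the post-add value simply pins
itself and reads the counter back - no xadd required.]

	#include <linux/percpu.h>
	#include <linux/preempt.h>

	DEFINE_PER_CPU(long, stat_counter);	/* invented example counter */

	static long stat_add_return(long delta)
	{
		long val;

		preempt_disable();			/* stay on one CPU for the RMW */
		__this_cpu_add(stat_counter, delta);	/* plain add, no xadd */
		val = __this_cpu_read(stat_counter);	/* read back the new value */
		preempt_enable();

		return val;
	}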
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 18:31 ` Linus Torvalds
@ 2011-12-23 16:55   ` Christoph Lameter
  2011-12-23 20:54     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2011-12-23 16:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Thu, 22 Dec 2011, Linus Torvalds wrote:

> As far as I can tell, there are just a couple of users of that
> horrible interface, and it really isn't worth it. Just remove it and
> open-code it. They are actually broken on x86 ('xadd' only exists if
> CONFIG_X86_XADD is on - i386 didn't have it), and while most people
> probably don't care (i386? What's that?), I note that the x86
> add_return stuff doesn't take that into account.

There is an #ifndef CONFIG_M386 around the implementation of these for
x86. Use on i386 will not generate an xadd instruction but falls back
to a generic implementation.

The add_return stuff was already available with the earlier per cpu
APIs (local_t and percpu) that this_cpu replaces, and is still
available through the atomic operations.

If one wants to exploit the per cpu atomicity then one may want to know
what the result was. Inspecting the value later in the execution path
is not possible if there are potential races on the processor through
preemption as well as interrupts. The _return variants also avoid
additional load instructions and thereby reduce code path size.
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-23 16:55 ` Christoph Lameter
@ 2011-12-23 20:54   ` Linus Torvalds
  2012-01-04 15:30     ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2011-12-23 20:54 UTC (permalink / raw)
To: Christoph Lameter
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Fri, Dec 23, 2011 at 8:55 AM, Christoph Lameter <cl@linux.com> wrote:
>
> There is an #ifndef CONFIG_M386 around the implementation of these
> for x86. Use on i386 will not generate an xadd instruction but falls
> back to a generic implementation.

Ok, good. It's not really the right thing to do, but it will at least
work.

> The add_return stuff was already available with the earlier per cpu
> APIs (local_t and percpu) that this_cpu replaces, and is still
> available through the atomic operations.

Sure. The "atomic_op_return()" form actually *makes sense*. It has
atomicity guarantees, and the return value is meaningful.

The same IS SIMPLY NOT TRUE of the "this_cpu" versions.

> If one wants to exploit the per cpu atomicity then one may want to
> know what the result was.

No. You're blathering and doing "magical thinking" instead of actually
thinking things through. Think about the semantics. What does it mean
to get a result of the op on a percpu variable?

You have two cases:

 Case 1: you actually know what CPU you are on, and the number may
 have some meaning. But in that case you are also *pinned* to that
 CPU, and there is no difference between

	x = percpu_value;
	x += y;
	percpu_value = x;
	.. look at 'x' .. to see what you did

 and

	percpu_value += x;
	.. look at percpu_value to see what you did ..

 and

	x = xadd percpu_value,y

 so there are no "atomicity advantages" - xadd is the same as
 open-coding it. You are pinned to the CPU, all three ways are 100%
 equivalent.

The other case is:

 Case 2: you're not pinned to the CPU, and you just do statistics and
 add some value to memory. BUT NOW THE RESULT HAS ABSOLUTELY NO
 MEANING!

 Now "xadd" would possibly give different results from "add and read
 the result", since you might be moving around between cpu's, but
 there is absolutely no way you could ever care, since the value you
 read back is meaningless regardless of whether it's the return value
 of xadd, or the value from some other CPU. There are other cpu's with
 the same percpu variable, and they are all equivalent as far as you
 are concerned, because you're not pinned to one of them.

See? xadd buys you *nothing* from a semantic standpoint. And anybody
who *thinks* it buys you something is just confused and just
introduced a bug.

So the "thiscpu_op_return" functions are broken-by-design! They have
no meaning. They are completely crazy. That's why they should be
removed. Anybody who thinks they add value has not thought through
what they actually do or mean.

		Linus
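[Case 1 above, written out as a compilable sketch; the percpu_value
variable and the pinned_add() helper are invented, and pinning is done
with get_cpu()/put_cpu(). While pinned, the open-coded
read/modify/write returns exactly what an xadd would have returned,
which is the claimed equivalence.]

	#include <linux/percpu.h>
	#include <linux/smp.h>

	DEFINE_PER_CPU(long, percpu_value);	/* mirrors the pseudocode above */

	static long pinned_add(long y)
	{
		long x;

		get_cpu();				/* pinned: this is "case 1" */

		x = __this_cpu_read(percpu_value);	/* open-coded RMW ... */
		x += y;
		__this_cpu_write(percpu_value, x);	/* ... same result as xadd */

		put_cpu();
		return x;	/* meaningful only because we stayed pinned */
	}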
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-23 20:54 ` Linus Torvalds
@ 2012-01-04 15:30   ` Christoph Lameter
  2012-01-04 16:07     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2012-01-04 15:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Fri, 23 Dec 2011, Linus Torvalds wrote:

> You have two cases:
>
>  Case 1: you actually know what CPU you are on, and the number may
>  have some meaning. But in that case you are also *pinned* to that
>  CPU, and there is no difference between
>
>	x = percpu_value;
>	x += y;
>	percpu_value = x;
>	.. look at 'x' .. to see what you did
>
>  and
>
>	percpu_value += x;
>	.. look at percpu_value to see what you did ..
>
>  and
>
>	x = xadd percpu_value,y
>
>  so there are no "atomicity advantages" - xadd is the same as
>  open-coding it. You are pinned to the CPU, all three ways are 100%
>  equivalent.

As mentioned before, the main point of the use of these operations (in
the form of __this_cpu_op_return) when the cpu is pinned is to reduce
the number of instructions. __this_cpu_add_return allows replacing ~5
instructions with one. These operations are used e.g. in VM statistics
that have to scale well and have the lightest possible impact on the
rest of the vm code. "Atomicity advantages" were never claimed to
exist for the __this_cpu_xx operations.

> The other case is:
>
>  Case 2: you're not pinned to the CPU, and you just do statistics and
>  add some value to memory. BUT NOW THE RESULT HAS ABSOLUTELY NO
>  MEANING!
>
>  Now "xadd" would possibly give different results from "add and read
>  the result", since you might be moving around between cpu's, but
>  there is absolutely no way you could ever care, since the value you
>  read back is meaningless regardless of whether it's the return value
>  of xadd, or the value from some other CPU. There are other cpu's
>  with the same percpu variable, and they are all equivalent as far as
>  you are concerned, because you're not pinned to one of them.

You would care if you had special actions that have to be triggered on
all processors if, for example, a threshold was breached on a single
processor's counters (and one would fold all differentials into a
global counter), or a per cpu counter reaches zero (and one would need
to replenish lists of objects or some such thing). In that case it
matters that the operational result be checked later, regardless of
the cpu we are on right now.

There are no use cases of the this_cpu_xx_return operations as far as
I can see, so we could remove those.
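[The threshold case described here is the vmstat-style folding
pattern. A sketch of its shape follows - the names cpu_diff,
global_total and DIFF_THRESHOLD are invented, and like the real vmstat
hot path it assumes the caller already runs with preemption disabled,
so the __this_cpu forms are legal. The value handed back by
__this_cpu_add_return() is precisely what makes the threshold check
possible:]

	#include <linux/percpu.h>
	#include <linux/atomic.h>
	#include <linux/compiler.h>

	static atomic_long_t global_total;	/* invented global counter */
	static DEFINE_PER_CPU(long, cpu_diff);	/* invented per-cpu differential */
	#define DIFF_THRESHOLD	64		/* invented threshold */

	/* caller must have preemption disabled */
	static void count_event(long delta)
	{
		long val = __this_cpu_add_return(cpu_diff, delta);

		if (unlikely(val >= DIFF_THRESHOLD || val <= -DIFF_THRESHOLD)) {
			/* threshold breached: fold the differential back */
			atomic_long_add(val, &global_total);
			__this_cpu_write(cpu_diff, 0);
		}
	}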
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-04 15:30 ` Christoph Lameter
@ 2012-01-04 16:07   ` Linus Torvalds
  2012-01-04 17:00     ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2012-01-04 16:07 UTC (permalink / raw)
To: Christoph Lameter
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Wed, Jan 4, 2012 at 7:30 AM, Christoph Lameter <cl@linux.com> wrote:
>
> As mentioned before, the main point of the use of these operations
> (in the form of __this_cpu_op_return) when the cpu is pinned is to
> reduce the number of instructions. __this_cpu_add_return allows
> replacing ~5 instructions with one.

And that's fine if it's something really core, and something *so*
important that you can tell the difference between one instruction and
three.

Which isn't the case here. In fact, on many (most?) x86
microarchitectures xadd is actually slower than a regular
add-from-memory-and-store - the big advantage of it is that with the
"lock" prefix you do get special atomicity guarantees, and some
algorithms (i.e. semaphores) do want to know the value of the add
atomically in order to know if there were other things going on.

The thing is, I care about maintainability and not having
cross-architecture problems etc. And right now many of the cpulocal
things are *much* more of a maintainability headache than they are
worth.

		Linus
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-04 16:07 ` Linus Torvalds
@ 2012-01-04 17:00   ` Christoph Lameter
  2012-01-04 23:10     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2012-01-04 17:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Wed, 4 Jan 2012, Linus Torvalds wrote:

> On Wed, Jan 4, 2012 at 7:30 AM, Christoph Lameter <cl@linux.com> wrote:
> >
> > As mentioned before, the main point of the use of these operations
> > (in the form of __this_cpu_op_return) when the cpu is pinned is to
> > reduce the number of instructions. __this_cpu_add_return allows
> > replacing ~5 instructions with one.
>
> And that's fine if it's something really core, and something *so*
> important that you can tell the difference between one instruction
> and three.
>
> Which isn't the case here. In fact, on many (most?) x86
> microarchitectures xadd is actually slower than a regular
> add-from-memory-and-store - the big advantage of it is that with the
> "lock" prefix you do get special atomicity guarantees, and some
> algorithms (i.e. semaphores) do want to know the value of the add
> atomically in order to know if there were other things going on.

xadd is 3 cycles. add is one cycle. What we are doing here is also the
use of a segment override to allow us to relocate the per cpu address
to the current cpu. So we are already getting two additions for the
price of one xadd.

If we manually calculate the address then we have another memory
reference to get the per cpu offset for this processor (otherwise we
get it from the segment register). And then we need to store the
results. We use registers etc etc. Cannot imagine that this would be
the same speed.

> The thing is, I care about maintainability and not having
> cross-architecture problems etc. And right now many of the cpulocal
> things are *much* more of a maintainability headache than they are
> worth.

The cpu local things and xadd support have been around for a pretty
long time in various forms and they work reliably. I have tried to add
onto this by adding the cmpxchg/cmpxchg_double functionality, which
caused some issues because of the fallback stuff. That seems to have
been addressed though, since we are now willing to make the
preempt/irq tradeoff that we were not able to get agreement on during
the cleanup of the old APIs a year or so ago.
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-04 17:00 ` Christoph Lameter
@ 2012-01-04 23:10   ` Linus Torvalds
  2012-01-05 19:15     ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2012-01-04 23:10 UTC (permalink / raw)
To: Christoph Lameter
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Wed, Jan 4, 2012 at 9:00 AM, Christoph Lameter <cl@linux.com> wrote:
>
> xadd is 3 cycles. add is one cycle.

On some uarchs. On new uarchs it can be a single cycle, I think, and
on some uarchs it will even be microcoded and/or only go in one pipe
because it has that odd behavior that it writes both to memory and a
register, and thus doesn't fit the normal fastpaths.

The point is, xadd isn't actually any faster than just doing the
regular "add and read". It's *slower*.

There really isn't ever any reason to use xadd on percpu variables.
That's my point. You claimed that there was a performance advantage.
There really isn't. So why are you still arguing?

		Linus
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-04 23:10 ` Linus Torvalds
@ 2012-01-05 19:15   ` Christoph Lameter
  2012-01-05 19:27     ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2012-01-05 19:15 UTC (permalink / raw)
To: Linus Torvalds
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Wed, 4 Jan 2012, Linus Torvalds wrote:

> On Wed, Jan 4, 2012 at 9:00 AM, Christoph Lameter <cl@linux.com> wrote:
> >
> > xadd is 3 cycles. add is one cycle.
>
> On some uarchs. On new uarchs it can be a single cycle, I think, and
> on some uarchs it will even be microcoded and/or only go in one pipe
> because it has that odd behavior that it writes both to memory and a
> register, and thus doesn't fit the normal fastpaths.
>
> The point is, xadd isn't actually any faster than just doing the
> regular "add and read". It's *slower*.

OK, that assumes that both the add and the read are this_cpu
operations that use the segment override to relocate the address. I
thought you wanted to open-code everything. So then we would have to
use the following?

	this_cpu_add(var, count);
	result = this_cpu_read(var);

XADD and ADD have the same cycle count if the ADD is used to add to a
memory location. Both use 4 microops. The MOVE from memory costs
another 2, so we are at 6. Two segment overrides for each instruction
add 2 more (nothing in the optimization manual, but as far as I can
recall another uop is needed for the address calculation). So this
ends up at 8. On the other hand,

	result = this_cpu_add_return(var, count);

is 4 + 1.
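[The two alternatives being compared, side by side as compilable C;
var is an invented DEFINE_PER_CPU variable, and the instruction
comments paraphrase the x86 argument above rather than measured
compiler output:]

	#include <linux/percpu.h>

	DEFINE_PER_CPU(long, var);	/* invented for the comparison */

	static long two_ops(long count)
	{
		this_cpu_add(var, count);	/* add %reg,%gs:var - one gs-relative RMW */
		/* the task may migrate here, which is exactly Linus's
		   objection to trusting the value read back below */
		return this_cpu_read(var);	/* mov %gs:var,%reg - a second gs access */
	}

	static long one_op(long count)
	{
		return this_cpu_add_return(var, count);	/* xadd %reg,%gs:var */
	}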
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2012-01-05 19:15 ` Christoph Lameter
@ 2012-01-05 19:27   ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 2012-01-05 19:27 UTC (permalink / raw)
To: Christoph Lameter
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Thu, Jan 5, 2012 at 11:15 AM, Christoph Lameter <cl@linux.com> wrote:
>
> XADD and ADD have the same cycle count if the ADD is used to add to a
> memory location. Both use 4 microops.

Christ, stop counting uops. They are some random internal
microarchitectural detail, and counting them ignores things like
processor uop scheduling and decoding issues.

The thing that matters is (a) performance and (b) code sanity.

The "local_cpu_return()" operations fail *seriously* in the code
sanity department. They are a fundamentally insane operation. And
nothing you have said says "it's a huge performance win" to me. First
off, there aren't even very many users, and the users there are seem
to not even be all that performance-critical.

For statistics, just regular "add" is the normal thing to do. The
add_return thing is insane, for all the reasons I already outlined. It
*fundamentally* doesn't have any sane semantics.

Just remove it.

		Linus
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-22 17:58 ` Christoph Lameter
  2011-12-22 18:03 ` Ingo Molnar
  2011-12-22 18:31 ` Linus Torvalds
@ 2011-12-22 18:47 ` Tejun Heo
  2 siblings, 0 replies; 32+ messages in thread
From: Tejun Heo @ 2011-12-22 18:47 UTC (permalink / raw)
To: Christoph Lameter
Cc: Linus Torvalds, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel, linux-arch, Thomas Gleixner

On Thu, Dec 22, 2011 at 11:58:51AM -0600, Christoph Lameter wrote:
> Subject: [percpu] Remove irqsafe_cpu_xxx variants
>
> We simply say that regular this_cpu use must be safe regardless of
> preemption and interrupt state. That has no material change for x86
> and s390 implementations of this_cpu operations. However, arches that
> do not provide their own implementation for this_cpu operations will
> now get code generated that disables interrupts instead of preemption.
>
> Signed-off-by: Christoph Lameter <cl@linux.com>

Applied to wq/for-3.3. Thanks.

-- 
tejun
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20  9:47 ` Pekka Enberg
  2011-12-20 16:23   ` Tejun Heo
@ 2011-12-20 16:26   ` Christoph Lameter
  2011-12-21  8:06     ` Pekka Enberg
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Lameter @ 2011-12-20 16:26 UTC (permalink / raw)
To: Pekka Enberg
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel,
    linux-arch, Tejun Heo, Thomas Gleixner

On Tue, 20 Dec 2011, Pekka Enberg wrote:

> To illustrate the issue, for "per cpu add" we have:
>
>  __this_cpu_add()
>  this_cpu_add()
>  irqsafe_cpu_add()
>  percpu_add()
>
> Why do we need all of them?

These are all operations that frequently occur in hot paths of the OS.

On x86 all variants map to the same instruction. They issue exactly
one asm instruction that is therefore irqsafe and preempt safe. The
single instruction performs the relocation of the pointer relative to
the current cpu area and then performs the RMW operation, without the
cost of an atomic operation.

For non-x86 we have the issue that typically separate instructions
have to be used to perform the relocation relative to the current per
cpu area and the RMW operation. The problem is that an interrupt or a
reschedule can occur between the address calculation and the RMW
operation. The RMW operation may then occur on the wrong processor's
per cpu data. So we need some way of preventing the change of
processors or interrupts.

The __this_cpu_add() variant simply does nothing to prevent this. It
just assumes that the caller has taken a lock or disabled interrupts
in a way that provides sufficient measures to prevent rescheduling on
a different processor.

The this_cpu_add() variant disables preemption, then does the
operation and then re-enables preemption. This is usually sufficient
since most per cpu data is not accessed from an interrupt context.

The irqsafe_cpu_add() variant disables interrupts, does the operation
and then re-enables interrupts. It is needed if counters etc. are
modified from an interrupt context.

percpu_add() is an older variant of the per cpu operations that is
(or was?) x86 specific. this_cpu_xxx() operations are used in core
code.
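[The three variants differ only in the protection wrapped around the
generic fallback's read-modify-write. Collapsed to the "add" case, the
non-arch fallbacks have roughly this shape; the my_ prefix marks these
as a paraphrase of the include/linux/percpu.h macros of this era, not
a copy:]

	#include <linux/percpu.h>
	#include <linux/preempt.h>
	#include <linux/irqflags.h>

	#define my__this_cpu_add(pcp, val)		/* caller serializes */	\
	do {									\
		*__this_cpu_ptr(&(pcp)) += (val);				\
	} while (0)

	#define my_this_cpu_add(pcp, val)		/* migration-proof */	\
	do {									\
		preempt_disable();						\
		*__this_cpu_ptr(&(pcp)) += (val);				\
		preempt_enable();						\
	} while (0)

	#define my_irqsafe_cpu_add(pcp, val)	/* also interrupt-proof */	\
	do {									\
		unsigned long flags;						\
		local_irq_save(flags);						\
		*__this_cpu_ptr(&(pcp)) += (val);				\
		local_irq_restore(flags);					\
	} while (0)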
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-20 16:26 ` Christoph Lameter
@ 2011-12-21  8:06   ` Pekka Enberg
  2011-12-21 15:20     ` Christoph Lameter
  0 siblings, 1 reply; 32+ messages in thread
From: Pekka Enberg @ 2011-12-21 8:06 UTC (permalink / raw)
To: Christoph Lameter
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel,
    linux-arch, Tejun Heo, Thomas Gleixner

On Tue, 20 Dec 2011, Pekka Enberg wrote:
>> To illustrate the issue, for "per cpu add" we have:
>>
>>  __this_cpu_add()
>>  this_cpu_add()
>>  irqsafe_cpu_add()
>>  percpu_add()
>>
>> Why do we need all of them?

On Tue, Dec 20, 2011 at 6:26 PM, Christoph Lameter <cl@linux.com> wrote:
> These are all operations that frequently occur in hot paths of the OS.

It's a provably difficult API to use that has pretty much zero
debugging code. That's a problem.

I still don't understand why we'd want separate preempt safe and
irqsafe variants. It should be enough to have only unsafe and safe
variants where the latter would always do the right thing.

			Pekka
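[One form the missing debug coverage could take - an invented checked
wrapper that warns when an "unsafe" op runs in a context where it
could race. Mainline did not have this at the time, and later grew
checks in a similar spirit:]

	#include <linux/kernel.h>
	#include <linux/percpu.h>
	#include <linux/preempt.h>

	/* __this_cpu ops promise that the caller already prevents
	   migration; warn once if that promise is broken */
	#define checked__this_cpu_add(pcp, val)		\
	do {						\
		WARN_ON_ONCE(preemptible());		\
		__this_cpu_add(pcp, val);		\
	} while (0)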
* Re: [GIT PULL] slab fixes for 3.2-rc4
  2011-12-21  8:06 ` Pekka Enberg
@ 2011-12-21 15:20   ` Christoph Lameter
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Lameter @ 2011-12-21 15:20 UTC (permalink / raw)
To: Pekka Enberg
Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, linux-kernel,
    linux-arch, Tejun Heo, Thomas Gleixner

On Wed, 21 Dec 2011, Pekka Enberg wrote:

> I still don't understand why we'd want separate preempt safe and
> irqsafe variants. It should be enough to have only unsafe and safe
> variants where the latter would always do the right thing.

The effort to make something irqsafe is higher than making it preempt
safe. If that difference is not important then we could just have safe
and unsafe variants. Traditionally, counter operations were only
preempt safe though, so making those irqsafe would increase the
overhead.