Date: Wed, 4 Jan 2012 11:00:34 -0600 (CST)
From: Christoph Lameter
To: Linus Torvalds
Cc: Tejun Heo, Pekka Enberg, Ingo Molnar, Andrew Morton,
    linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org,
    Thomas Gleixner
Subject: Re: [GIT PULL] slab fixes for 3.2-rc4
References: <20111220162315.GC10752@google.com>
    <20111220202854.GH10752@google.com>
    <20111221170535.GB9213@google.com>
    <20111222160822.GE17084@google.com>

On Wed, 4 Jan 2012, Linus Torvalds wrote:

> On Wed, Jan 4, 2012 at 7:30 AM, Christoph Lameter wrote:
> >
> > As mentioned before, the main point of using these operations (in the
> > form of __this_cpu_op_return) when the cpu is pinned is to reduce the
> > number of instructions. __this_cpu_add_return allows replacing ~5
> > instructions with one.
>
> And that's fine if it's something really core, and something *so*
> important that you can tell the difference between one instruction and
> three.
>
> Which isn't the case here. In fact, on many (most?) x86
> microarchitectures xadd is actually slower than a regular
> add-from-memory-and-store - the big advantage of it is that with the
> "lock" prefix you do get special atomicity guarantees, and some
> algorithms (like semaphores) do want to know the value of the add
> atomically in order to know if there were other things going on.

xadd is 3 cycles; add is one cycle. What we are doing here is also using
a segment override to relocate the per-cpu address to the current cpu, so
we are already getting two additions for the price of one xadd. If we
manually calculate the address instead, then we have an extra memory
reference to get the per-cpu offset for this processor (otherwise we get
it from the segment register), and then we need to store the result, use
registers, etc. I cannot imagine that this would be the same speed.

> The thing is, I care about maintainability and not having
> cross-architecture problems etc. And right now many of the cpulocal
> things are *much* more of a maintainability headache than they are
> worth.

The cpu-local things and xadd support have been around for a pretty long
time in various forms, and they work reliably. I have tried to build on
this by adding the cmpxchg/cmpxchg_double functionality, which caused
some issues because of the fallback stuff. That seems to have been
addressed, though, since we are now willing to make the preempt/irq
tradeoff that we could not get agreement on during the cleanup of the
old APIs a year or so ago.