Nick Piggin wrote:
> On Tuesday 26 August 2008 22:23, Gregory Haskins wrote:
>> Nick Piggin wrote:
>>> On Tuesday 26 August 2008 06:15, Gregory Haskins wrote:
>>>> double_lock_balance() currently favors logically lower cpus since
>>>> they often do not have to release their own lock to acquire a
>>>> second lock.  The result is that logically higher cpus can get
>>>> starved when there is a lot of pressure on the RQs.  This can
>>>> result in higher latencies on higher cpu-ids.
>>>>
>>>> This patch makes the algorithm more fair by forcing all paths to
>>>> release both locks before acquiring them again.  Since callsites to
>>>> double_lock_balance already consider it a potential
>>>> preemption/reschedule point, they have the proper logic to recheck
>>>> for atomicity violations.
>>>>
>>>> Signed-off-by: Gregory Haskins
>>>> ---
>>>>
>>>>  kernel/sched.c |   17 +++++------------
>>>>  1 files changed, 5 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>> index 6e0bde6..b7326cd 100644
>>>> --- a/kernel/sched.c
>>>> +++ b/kernel/sched.c
>>>> @@ -2790,23 +2790,16 @@ static int double_lock_balance(struct rq *this_rq, struct rq *busiest)
>>>>  	__acquires(busiest->lock)
>>>>  	__acquires(this_rq->lock)
>>>>  {
>>>> -	int ret = 0;
>>>> -
>>>>  	if (unlikely(!irqs_disabled())) {
>>>>  		/* printk() doesn't work good under rq->lock */
>>>>  		spin_unlock(&this_rq->lock);
>>>>  		BUG_ON(1);
>>>>  	}
>>>> -	if (unlikely(!spin_trylock(&busiest->lock))) {
>>>> -		if (busiest < this_rq) {
>>>> -			spin_unlock(&this_rq->lock);
>>>> -			spin_lock(&busiest->lock);
>>>> -			spin_lock_nested(&this_rq->lock, SINGLE_DEPTH_NESTING);
>>>> -			ret = 1;
>>>> -		} else
>>>> -			spin_lock_nested(&busiest->lock, SINGLE_DEPTH_NESTING);
>>>> -	}
>>>> -	return ret;
>>>> +
>>>> +	spin_unlock(&this_rq->lock);
>>>> +	double_rq_lock(this_rq, busiest);
>>>
>>> Rather than adding the extra atomic operation, can't you just put
>>> this into the unlikely spin_trylock failure path rather than the
>>> unfair logic there?
>>
>> The trick is that we *must* first release this_rq before proceeding,
>> or the new proposal doesn't work as intended.  This patch effectively
>> breaks up the this_rq->lock critical section evenly across all CPUs,
>> as if every caller hit the case that is common for higher cpus.
>
> I don't exactly see why my proposal would introduce any more latency,
> because we only trylock while holding the existing lock -- this will
> only ever add a small ~constant time to the critical section,
> regardless of whether it is a high or low CPU runqueue.

It's because we are trying to create a break in the critical section of
this_rq->lock, not improve the acquisition of busiest->lock.  So whether
you do spin_lock or spin_trylock on busiest does not matter.  Busiest
will not be contended in the case that I am concerned with: in my
example below, rq[N] will not be contended because cpuN is blocked on
rq[0] after already having released rq[N].  So it's the contention
against this_rq that is the problem.

Or am I missing your point completely?
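To make the "break" concrete: after the patch, double_lock_balance()
unconditionally drops this_rq->lock and then funnels through
double_rq_lock(), which re-takes the pair in address order.  Quoting
from memory, so treat this as a sketch of the helper rather than a
verbatim copy of kernel/sched.c:

	static void double_rq_lock(struct rq *rq1, struct rq *rq2)
		__acquires(rq1->lock)
		__acquires(rq2->lock)
	{
		BUG_ON(!irqs_disabled());
		if (rq1 == rq2) {
			spin_lock(&rq1->lock);
			__acquire(rq2->lock);	/* Fake it out ;) */
		} else if (rq1 < rq2) {
			/* always take the lower-addressed lock first */
			spin_lock(&rq1->lock);
			spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
		} else {
			spin_lock(&rq2->lock);
			spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
		}
	}

The ordered re-acquire is purely for deadlock avoidance.  The fairness
win comes from the unconditional spin_unlock(&this_rq->lock) that
precedes it, which gives every waiter a chance to queue on this_rq.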
>> This modification decreased latency by a factor of more than 8 (from
>> over 400us to under 50us) on cpus 6 and 7 in my 8-way box, namely
>> because they were not forced to wait for all the other lower cores to
>> finish; instead, completions of double_lock_balance were handled in
>> true FIFO order w.r.t. other calls to double_lock_balance().  It has
>> to do with the positioning within your FIFO ticket locks (though even
>> if ticket locks are not present on a given architecture, we should
>> still see an improvement).
>>
>> When a low cpu wants to double lock, it tends to hold this_rq and
>> gets in line for busiest_rq with no bearing on how long it has held
>> this_rq.  Therefore the following scenario can occur:
>>
>> cpu 0                    cpu N
>> ----------------------------------
>> rq[0] locked
>> ..
>> ..
>> ..
>>                          double_lock(N, 0)
>>                          rq[N] released
>>                          blocked on rq[0]
>>                          ..
>>                          ..
>>                          ..
>>                          ..
>> double_lock(0, N)
>> rq[N] locked
>> double_lock returns
>> ..
>> ..
>> ..
>> ..
>> rq[0] released           rq[0] locked
>>                          double_lock returns
>>                          ...
>>                          ...
>>                          ...
>>
>> ---------------------------------
>>
>> So double lock acquisition favors the lower cpus unfairly.  They will
>> always win, even if they were not first.  Now with the combination of
>> my patch plus your ticket locks, entry into the double lock becomes
>> FIFO because the "blocked on rq[0]" would have inserted the waiter in
>> the time-ordered head of rq[0].
>
> Right, but I don't think it is particularly wrong to allow a given
> CPU to double_lock_balance ahead of another guy if we're already
> holding the lock.

It's not "wrong".  It's just a latency source ;)

> _So long as_ the lock we are trying to acquire is uncontended, and we
> don't introduce this skewed unfairness due to lower CPUs being allowed
> to hold their lower lock while higher CPUs have to release their lock
> and first queue on the lower.
>
> The difference is that with my patch, there is a small window where
> the guy who asks for the double lock first will go through second.  I
> don't think this really adds a fundamental amount of latency, and the
> performance benefit should not be ignored.
>
> Linux's traditional, and I suppose largest, user base does not require
> realtime or really strict fairness, so IMO it is always questionable
> to make changes like this.

Please take a look at the v2 series that I sent out yesterday.  I have
now predicated this on CONFIG_PREEMPT, per your comments.

-Greg
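P.S. For anyone following along, the CONFIG_PREEMPT split in v2 looks
roughly like the following.  This is a sketch from memory; see the
actual series for the authoritative version:

	#ifdef CONFIG_PREEMPT
	/*
	 * Fair variant: always break the this_rq->lock critical section,
	 * then take both locks in order.  Returning 1 tells the caller
	 * that this_rq was dropped and its state must be revalidated.
	 */
	static int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
		__releases(this_rq->lock)
		__acquires(busiest->lock)
		__acquires(this_rq->lock)
	{
		spin_unlock(&this_rq->lock);
		double_rq_lock(this_rq, busiest);
		return 1;
	}
	#else
	/*
	 * !PREEMPT: keep the cheaper trylock-based variant, which only
	 * drops this_rq->lock when lock ordering forces it to.
	 */
	static int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
		__releases(this_rq->lock)
		__acquires(busiest->lock)
		__acquires(this_rq->lock)
	{
		int ret = 0;

		if (unlikely(!spin_trylock(&busiest->lock))) {
			if (busiest < this_rq) {
				spin_unlock(&this_rq->lock);
				spin_lock(&busiest->lock);
				spin_lock_nested(&this_rq->lock,
						 SINGLE_DEPTH_NESTING);
				ret = 1;
			} else
				spin_lock_nested(&busiest->lock,
						 SINGLE_DEPTH_NESTING);
		}
		return ret;
	}
	#endif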