Date: Tue, 13 May 2008 23:05:13 +0200
From: Ingo Molnar
To: Linus Torvalds
Cc: Matthew Wilcox, Sven Wegener, "Zhang, Yanmin", Andi Kleen, LKML,
    Alexander Viro, Andrew Morton, Thomas Gleixner, "H. Peter Anvin",
    Peter Zijlstra
Subject: Re: [git pull] scheduler fixes
Message-ID: <20080513210513.GA17492@elte.hu>

* Linus Torvalds wrote:

> > Why does it signal to its waiters that "resource is available",
> > when in reality that resource is not available but immediately
> > serialized via a lock? (even if the lock might technically be some
> > _other_ object)
>
> So you have 'n' resources available, and you use a counting semaphore
> for that resource counting.
>
> But you'd still need a spinlock to actually protect the list of
> resources themselves. The spinlock protects a different thing than
> the semaphore. The semaphore isn't about mutual exclusion - it's
> about counting resources and waiting when there are too many things
> in flight.
>
> And you seem to think that a counting semaphore is about mutual
> exclusion. It has nothing what-so-ever to do with that.

i was reacting to the point that Matthew made:

  " The first thing that each task does is grab a spinlock, so if you
    put as much in flight as early as possible, you end up with
    horrible contention on that spinlock. "

We were talking about the case of double, parallel up()s. My point was
that the best guess is to put two tasks in flight in the
synchronization primitive. Matthew's point was that the wakeups of the
two tasks should be chained: one task gets woken first, which then
wakes the second task in a chain.

[ Matthew, i'm paraphrasing your opinion so please correct me if i'm
  misinterpreting your point. ]

My argument is that chaining like that in the synchronization
primitive is bad for parallelism in general, because wakeup latencies
are just too long and any action required in the context of the
"first waker" throttles parallelism artificially and introduces
artificial workload delays.

Even in the simple list+lock case you are talking about (see the
sketch below) it's beneficial to keep as many wakers in flight as
possible.
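
[ A minimal userspace sketch of the list+lock pattern above, purely
  for illustration: a counting semaphore counts the 'n' free resources
  and a spinlock protects only the free-list. POSIX sem_t and
  pthread_spinlock_t stand in for the kernel primitives, the
  pool/resource names are made up, and init/teardown is omitted. ]

#include <pthread.h>
#include <semaphore.h>

struct resource {
	struct resource *next;
	/* ... payload ... */
};

struct pool {
	sem_t               count;      /* how many resources are free */
	pthread_spinlock_t  lock;       /* protects ->free_list only   */
	struct resource    *free_list;
};

static struct resource *pool_get(struct pool *p)
{
	struct resource *r;

	sem_wait(&p->count);            /* sleep until one is free     */

	pthread_spin_lock(&p->lock);    /* short critical section:     */
	r = p->free_list;               /* pop one entry off the list  */
	p->free_list = r->next;
	pthread_spin_unlock(&p->lock);

	return r;
}

static void pool_put(struct pool *p, struct resource *r)
{
	pthread_spin_lock(&p->lock);    /* push the entry back         */
	r->next = p->free_list;
	p->free_list = r;
	pthread_spin_unlock(&p->lock);

	sem_post(&p->count);            /* one more resource available */
}

Any number of getters can be in flight at once here - the section they
then serialize on is a handful of pointer operations, i.e. the ~100
nsec list op mentioned below.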
The reason is that even with the worst-possible cacheline bouncing
imaginable, it's much better to spin a bit on a spinlock (which is
"horrible bouncing" only on paper, in practice it's nicely ordered
waiting on a FIFO spinlock, with a list op every 100 nsecs or so) than
to implement "soft bouncing" of tasks via an artificial chain of
wakeups.

That artificial chain of wakeups has a latency of a few microseconds
even in the best case - and in the worst case it can be a delay of
milliseconds - throttling not just the current parallelism of the
workload but also hiding potential future parallelism. The hardware is
so much faster at messaging. [*] [**]

	Ingo

[*]  Not without qualifications - maybe not so on 1024 CPU systems,
     but certainly so on anything sane up to 16way or so. But if you
     worry about 1024 CPU systems i'd suggest first taking a look at
     the current practice in kernel/semaphore.c of taking the
     semaphore-internal spinlock again _after_ a task has woken up,
     just to remove itself from the list of waiters. That is rather
     unnecessary serialization - at up() time we are already holding
     the lock, so we should remove the entry there. That's what the
     mutex code does too.

[**] The only place where i can see staggered/chained wakeups help is
     in the specific case when the wakee runs into a heavy lock _right
     after_ wakeup. XFS might be running into that, and my reply above
     talks about that hypothetical scenario.

     If that is so, it is handled incorrectly, because in that case we
     don't have any true parallelism at the point of up(), and we know
     it right there. There is no reason to pretend at that point that
     we have more parallelism, when all we do is block on a heavy lock
     right after wakeup. Instead, the correct implementation in that
     case would be to have a wait queue for that heavy lock (which in
     other words can be thought of as a virtual single resource which
     is either 'available' or 'unavailable'), and _then_ after that to
     use a counting semaphore for the rest of the program flow, which
     is hopefully more parallel.

     I.e. precisely map the true parallelism of the code via the
     synchronization primitives: do not over-synchronize it and do not
     under-synchronize it. And if there's any doubt which one should
     be used, under-synchronize - because while the scheduler is
     rather good at dealing with too many tasks, it cannot conjure
     runnable tasks out of thin air.

     Btw., the AIM7 regression was exactly that: in the 50% regressed
     workload the semaphore code hid the true parallelism of the
     workload, we only had 5 tasks on the runqueue and the scheduler
     had no chance to saturate all CPUs.

     In the "good" case (be that spinlock based or proper-semaphores
     based BKL) there were 2000 runnable tasks on the runqueues, and
     the scheduler sorted them out and batched the workload just fine!
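
[ To make [**] concrete, a userspace sketch of the two-stage layering
  suggested there: the serializing front end sleeps on the heavy lock
  itself (a plain mutex standing in for a kernel waitqueue), and only
  the genuinely parallel back end is governed by a counting semaphore.
  The names (heavy_lock, parallel_slots, do_request) and the slot
  count are made up for illustration. ]

#include <pthread.h>
#include <semaphore.h>

#define PARALLEL_SLOTS 8        /* illustrative capacity value */

static pthread_mutex_t heavy_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t parallel_slots;    /* sem_init(&parallel_slots, 0, PARALLEL_SLOTS) at startup */

static void do_request(void)
{
	/*
	 * Stage 1: the part that cannot be parallel anyway. Waiters
	 * sleep on the lock itself, instead of being woken early only
	 * to pile up on it.
	 */
	pthread_mutex_lock(&heavy_lock);
	/* ... touch the serialized state ... */
	pthread_mutex_unlock(&heavy_lock);

	/*
	 * Stage 2: the genuinely parallel part, throttled by a
	 * counting semaphore that reflects how much can usefully be
	 * in flight.
	 */
	sem_wait(&parallel_slots);
	/* ... parallel work ... */
	sem_post(&parallel_slots);
}

That way a post on the counting semaphore really does mean "parallel
capacity is available", and the scheduler only sees runnable tasks
that the workload can actually use.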