From mboxrd@z Thu Jan  1 00:00:00 1970
From: Linus Torvalds
Date: Sat, 19 May 2012 10:08:54 -0700
Subject: Re: Plumbers: Tweaking scheduler policy micro-conf RFP
To: Peter Zijlstra
Cc: Vincent Guittot, paulmck@linux.vnet.ibm.com, smuckle@quicinc.com, khilman@ti.com, Robin.Randhawa@arm.com, suresh.b.siddha@intel.com, thebigcorporation@gmail.com, venki@google.com, panto@antoniou-consulting.com, mingo@elte.hu, paul.brett@intel.com, pdeschrijver@nvidia.com, pjt@google.com, efault@gmx.de, fweisbec@gmail.com, geoff@infradead.org, rostedt@goodmis.org, tglx@linutronix.de, amit.kucheria@linaro.org, linux-kernel, linaro-sched-sig@lists.linaro.org, Morten Rasmussen, Juri Lelli
In-Reply-To: <1337193010.27694.146.camel@twins>
References: <1337084609.27020.156.camel@laptop> <1337086834.27020.162.camel@laptop> <1337096141.27694.82.camel@twins> <1337193010.27694.146.camel@twins>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 16, 2012 at 11:30 AM, Peter Zijlstra wrote:
>
> The reason I want to push this into the arch is that the current one
> size fits all topology really doesn't fit.

NAK NAK NAK.

Ingo, please don't take any of these patches if they are starting to make NUMA scheduling be some arch-specific crap.

Peter - you're way off base. You are totally and utterly wrong for several reasons:

 - x86 isn't that special. In fact, x86 covers almost all the possible NUMA architectures just within x86, so making some arch-specific NUMA crap is *idiotic*.

   Your argument that SMT, multi-core with NUMA effects within the core, yadda yadda, is somehow x86-specific is pure crap. Everybody else already does that or will do that. More importantly, most of the other architectures don't even *matter* enough for them to ever write their own NUMA scheduler.

 - NUMA topology details aren't even that important! Christ, people. You're never going to be perfect anyway. And outside of some benchmarks, and some trivial loads, the loads aren't going to be well-behaved enough to really let you even try.

You don't even know what the right scheduling policy should be. We already know that even the *existing* power-aware scheduling is pure crap and nobody believes it actually works.

Trying to redesign something from scratch when you don't even understand it, AND THEN MAKING IT ARCH-SPECIFIC, is just f*cking moronic. The *only* thing that will ever result in is some crap code that handles the one or two machines you tested it on right, for the one or two loads you tested it with.

If you cannot make it simple enough, and generic enough, that it works reasonably well for POWER and s390 (and ARM), don't even bother. Seriously. If you hate the s390 'book' domain and it really causes problems, rip it out.

NUMA code has *never* worked well. And I'm not talking just Linux. I'm talking about all the other systems that tried to do it, and tried too effin hard.

Stop the idiocy. Make things *simpler* and *less* capable instead of trying to cover some "real" topology. Nobody cares in real life. Seriously.
And the hardware people are going to keep on making things different. Don't try to build up some perfect NUMA topology and then try to see how insanely well you can match a particular machine. Make some generic "roughly like this" topology with (say) three or four levels of NUMAness, and then have architectures say "this is roughly what my machine looks like".

And if you cannot do that kind of "roughly generic NUMA", then stop working on this crap immediately, rather than waste everybody's time with some crazy arch-specific random scheduling.

Make the levels be something like

 (a) "share core resources" (SMT or shared inner cache, like a shared L2 when there is a big L3)
 (b) share socket
 (c) shared board/node

and don't try to make it any more clever. Don't try to describe just *how* much resources are shared. Nobody cares at that level. If it's not a difference of an order of magnitude, it's damn well the same f*cking thing!

Don't think that SMT is all that different from "two cores sharing the same L2". Sure, they are different, but you won't find real-life loads that care enough for us to ever bother with the differences.

Don't try to describe it any more than that. Seriously. Any time you add "implementation detail" level knowledge (like the s390 'book', or the various forms of SMT vs "shared decoder" (Bulldozer/Trinity) vs "true cores sharing L2" (clusters of cores sharing an L2, with a big shared L3 within the socket)), you're just shooting yourself in the foot. You're adding crap that shouldn't be added.

In particular, don't even try to give random "weights" to how close things are to each other. Sure, you can parse (and generate) those complex NUMA tables, but nobody is *ever* smart enough to really use them. Once you move data between boards/nodes, screw the number of hops. You are NOT going to get some scheduling decision right that says "node X is closer to node Y than to node Z". Especially since the load is invariably going to access non-node memory too *anyway*.

Seriously, if you think you need some complex data structures to describe the relationship between cores, you're barking up the wrong tree. Make it a simple three-level tree like the above. No weights. No *nothing*. If there isn't an order of magnitude difference in performance and/or a *major* power domain issue, they're at the same level. Nothing smarter than that, because it will just be f*cking stupid, and it will be a nightmare to maintain, and nobody will understand it anyway.

And the *only* thing that should be architecture-specific is the architecture CPU init code that says "ok, this core is an SMT thread, so it is in the same (a) level NUMA domain as that other core".

I'm very very serious about this. Try to make the scheduler have a *simple* model that people can actually understand. For example, maybe it can literally be a multi-level balancing thing, where the per-cpu runqueues are grouped into a "shared core resources" balancer that balances within the SMT or shared-L2 domain. And then there's an upper-level balancer (that runs much more seldom) that balances within the socket. And then one that balances within the node/board. And finally one that balances across boards.

Then, at each level, just say "spread out" (unrelated loads that want maximum throughput) vs "try to group together" (for power reasons and to avoid cache bouncing for related loads).

I dunno what the details need to be, and the above is just some random off-the-cuff example. But what I *do* know is that we don't want any arch-specific code.
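
Just to make that off-the-cuff example a bit more concrete, here is a rough, completely made-up sketch in C of the kind of model I mean. Hypothetical names and numbers throughout - this is not the existing sched_domain code, just the shape of the thing:

/*
 * Rough, completely made-up sketch of the multi-level model above.
 * Hypothetical names and numbers, not the existing sched_domain code.
 */
#include <stdio.h>

/* The only topology the generic scheduler code ever sees. */
enum topo_level {
	TOPO_CORE,		/* (a) shares core resources: SMT or shared L2 */
	TOPO_SOCKET,		/* (b) shares the socket */
	TOPO_NODE,		/* (c) shares the board/node */
	TOPO_NR_LEVELS,
};

/* Per-level policy: spread for throughput, pack for power/cache locality. */
enum balance_goal { SPREAD, PACK };

static const struct {
	enum balance_goal goal;
	unsigned int interval_ms;	/* outer levels run much more seldom */
} level_cfg[TOPO_NR_LEVELS] = {
	[TOPO_CORE]   = { SPREAD,  4 },
	[TOPO_SOCKET] = { SPREAD, 16 },
	[TOPO_NODE]   = { PACK,   64 },
};

/*
 * The only arch-specific piece: init code fills in which group each CPU
 * belongs to at each level ("this is roughly what my machine looks like").
 * No weights, no distances, nothing fancier.
 */
static const int cpu_group[4][TOPO_NR_LEVELS] = {
	{ 0, 0, 0 },	/* cpu0 */
	{ 0, 0, 0 },	/* cpu1: SMT sibling of cpu0 */
	{ 1, 0, 0 },	/* cpu2: another core, same socket */
	{ 2, 1, 0 },	/* cpu3: other socket, same board/node */
};

/* Stand-in for the real work: move tasks around within one level's group. */
static void balance_level(int cpu, enum topo_level lvl)
{
	printf("cpu%d: %s within group %d at level %d\n", cpu,
	       level_cfg[lvl].goal == SPREAD ? "spread" : "pack",
	       cpu_group[cpu][lvl], (int)lvl);
}

/* Periodic tick: each level only fires when its (longer) interval expires. */
static void periodic_balance(int cpu, unsigned int now_ms)
{
	int lvl;

	for (lvl = 0; lvl < TOPO_NR_LEVELS; lvl++)
		if (now_ms % level_cfg[lvl].interval_ms == 0)
			balance_level(cpu, lvl);
}

int main(void)
{
	unsigned int t;

	for (t = 4; t <= 64; t += 4)
		periodic_balance(0, t);
	return 0;
}

Again, the point is only that the generic code sees three dumb levels and one spread-vs-pack knob per level, and the only arch-specific part is filling in the group table.
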
And I *do* know that the real world simply isn't simple enough that we could ever do a perfect job, so don't even try - instead aim for "understandable, maintainable, and gets the main issues roughly right".

            Linus