Date: Mon, 6 Feb 2006 07:02:43 +0100
From: Ingo Molnar
To: Andrew Morton
Cc: Paul Jackson, dgc@sgi.com, steiner@sgi.com, Simon.Derr@bull.net,
	ak@suse.de, linux-kernel@vger.kernel.org, clameter@sgi.com
Subject: Re: [PATCH 1/5] cpuset memory spread basic implementation
Message-ID: <20060206060243.GA11918@elte.hu>
References: <20060204071910.10021.8437.sendpatchset@jackhammer.engr.sgi.com>
	<20060205203711.2c855971.akpm@osdl.org>
In-Reply-To: <20060205203711.2c855971.akpm@osdl.org>

* Andrew Morton wrote:

> Paul Jackson wrote:
> >
> > This policy can provide substantial improvements for jobs that
> > need to place thread-local data on the corresponding node, but
> > that need to access large file system data sets that have to
> > be spread across the several nodes in the job's cpuset in order
> > to fit.  Without this patch, especially for jobs that might
> > have one thread reading in the data set, the memory allocation
> > across the nodes in the job's cpuset can become very uneven.
>
> It all seems rather ironic.  We do vast amounts of development to
> make certain microbenchmarks look good, then we run a real workload
> on the thing and find that all those microbenchmark-inspired tweaks
> actually deoptimised the real workload.  So now we need to add
> per-task knobs to turn off the previously-added microbenchmark
> tweaks.
>
> What happens if one process does lots of filesystem activity and
> another one (concurrent or subsequent) wants lots of thread-local
> storage?  Won't the same thing happen?
>
> IOW: this patch seems to be a highly specific bandaid which is
> repairing an ill-advised problem of our own making, does it not?

I suspect it all depends on whether the workload is 'global' or
'local'.  Let's consider the hypothetical case of a 16-node box with
64 CPUs and 1 TB of RAM, which could have two fundamental types of
workloads:

 - lots of per-CPU tasks which are highly independent and each does
   its own stuff.  For this case we really want to allocate everything
   per-CPU and as close to the task as possible.

 - 90% of the 1 TB of RAM is in a shared 'database' that is accessed
   by all nodes in an unpredictable pattern, from any CPU.  For this
   case we want to 'spread out' the pagecache as much as possible.
   If we don't spread it out, then one task - e.g. an initialization
   process - could create a really bad distribution for the pagecache:
   big contiguous ranges allocated on the same node.  If the workload
   has randomness but also occasional "medium-range" locality, an
   uneven portion of the accesses could go to the same node, the one
   hosting some big contiguous chunk of the database.
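Spreading the pagecache for a whole job is what the per-cpuset flag in
Paul's series is meant to provide.  A rough sketch of how a job
launcher might flip it on - assuming the cpuset filesystem is mounted
at /dev/cpuset, that the flag ends up being called memory_spread_page
as in this series, and using a made-up cpuset name "dbjob":

/*
 * Rough sketch: enable pagecache spreading for one cpuset.
 * Assumptions (not from the mail): the cpuset fs is mounted at
 * /dev/cpuset, the flag is named memory_spread_page as in this
 * patch series, and "dbjob" is a made-up cpuset name.
 */
#include <stdio.h>

static int cpuset_spread_page(const char *cpuset, int enable)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/dev/cpuset/%s/memory_spread_page", cpuset);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", enable);
	return fclose(f);
}

int main(void)
{
	if (cpuset_spread_page("dbjob", 1) != 0) {
		perror("memory_spread_page");
		return 1;
	}
	return 0;
}

The point of attaching the policy to the cpuset is that every task in
the job would get the spreading behaviour without being modified.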
So we want to spread out the pagecache in as finegrained a way as
possible (though perhaps not too finegrained, so that block IO can
still be reasonably batched).

We normally optimize for the first case, and it works pretty well on
both SMP and NUMA.  We do pretty well with the second workload on SMP
too, but on NUMA the non-spreading can hurt.  So it makes sense to
artificially 'interleave' all the RAM that goes into the pagecache, to
have a good statistical distribution of pages.

Neither workload is broken, nor did we make any design mistake by
optimizing the SMP case for the first one - it really is the common
thing on most boxes.  But the second workload does happen too, and it
conflicts with the first workload's needs.  The difference between the
workloads cannot be bridged by the kernel: the two very different
access patterns result from the problem the application is trying to
solve, and the kernel cannot influence that.  I suspect there is no
other way but to let the application tell the kernel which strategy it
wants used.

	Ingo
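For the per-task side of "telling the kernel", an application can
already request interleaving of its own allocations through the NUMA
memory-policy API.  A minimal sketch, assuming libnuma's <numaif.h>
(link with -lnuma), a box where nodes 0-3 exist and are allowed, and a
made-up 256 MB buffer standing in for the shared data:

/*
 * Minimal sketch (not from the mail): ask the kernel to interleave
 * this task's future allocations across nodes 0-3.  Assumes those
 * nodes exist and are allowed by the task's cpuset.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	unsigned long nodemask = 0xfUL;		/* nodes 0,1,2,3 */
	size_t len = 1UL << 28;			/* 256 MB stand-in buffer */
	char *buf;

	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
			  sizeof(nodemask) * 8) < 0) {
		perror("set_mempolicy");
		return 1;
	}

	buf = malloc(len);
	if (!buf)
		return 1;
	/* Pages are faulted in here and spread round-robin over the
	 * nodes in the mask. */
	memset(buf, 0, len);
	free(buf);
	return 0;
}

This only covers memory the task allocates for itself; the per-cpuset
flag sketched above is what would apply a similar spreading policy to
kernel-side pagecache allocations for every task in the job.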