Date: Tue, 24 Sep 2002 14:17:38 -0700
From: "Martin J. Bligh"
To: Erich Focht, linux-kernel
cc: LSE, Ingo Molnar, Michael Hohnbaum
Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler
Message-ID: <265747254.1032877057@[10.10.2.3]>
In-Reply-To: <200209242304.44799.efocht@ess.nec.de>
References: <200209242304.44799.efocht@ess.nec.de>

>> > 2: I have no idea how tasks sharing the mm structure will behave. I'd
>> > like them to run on different nodes (that's why node_mem is not in mm),
>> > but they could (legally) free pages which they did not allocate and
>> > have wrong values in node_mem[].
>>
>> Yes, that really ought to be per-process, not per-task. Which means
>> locking or atomics ... and overhead. Ick.
>
> Hmm, I think it is sometimes ok to have it per task. For example, OpenMP
> parallel jobs working on huge arrays: the "first touch" of these arrays
> leads to page faults generated by the different tasks, and thus a
> different node_mem[] array for each task. As long as they just allocate
> memory, all is well; if they only release it at the end of the job, all
> is well too. This probably goes wrong if we have a long-running task
> that spawns short-lived clones. They inherit node_mem from the parent,
> but pages added by them to the common mm are not reflected in the
> parent's node_mem after their death.

But you're still left with the choice of whether to base decisions on the
per-task or the per-process information:

1. Per-process requires cross-node collation every time the data is read.
   Bad.
2. Per-task leads to obviously bad decisions whenever there are
   significant amounts of data shared between the tasks of a process
   (which was often the point of making them threads in the first place).

Yes, I can imagine a situation for which it would work, as you describe
above ... but that's not the point - this is a general policy, and I don't
think it works in general ... as you say yourself, "it is *sometimes* ok"
;-)

> The first patch needs a correction: in load_balance(), add
>
>     if (!busiest)
>             goto out;
>
> after the call to find_busiest_queue(). This works alone. On top of this
> pooling NUMA scheduler we can put whichever node affinity approach fits
> best, with or without memory allocation. I'll update the patches and
> their setup code (thanks for the comments!) and resend them.

Excellent! I'll try out the correction above and get back to you - it
might be a day or so, as I'm embroiled in some other code at the moment.

M.
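
To make the node_mem[] drift described above concrete, here is a small
userspace toy program. It is not code from Erich's patch: the struct
layout and the helper functions are invented purely for illustration. It
shows how a short-lived clone's page faults stay visible in the shared mm
but vanish from the parent's per-task counters once the clone exits.

    /*
     * Toy model of the per-task node_mem[] bookkeeping discussed above.
     * NOT the patch code: these structs and helpers are made up for
     * illustration only.
     */
    #include <stdio.h>

    #define MAX_NUMNODES 4

    struct toy_mm {
            int pages_on_node[MAX_NUMNODES]; /* ground truth for the mm */
    };

    struct toy_task {
            struct toy_mm *mm;               /* shared between clones */
            int node_mem[MAX_NUMNODES];      /* per-task view */
    };

    /* A task faults a page in on 'node': both views are updated. */
    static void fault_in_page(struct toy_task *t, int node)
    {
            t->mm->pages_on_node[node]++;
            t->node_mem[node]++;
    }

    int main(void)
    {
            struct toy_mm mm = { {0} };
            struct toy_task parent = { &mm, {0} };

            fault_in_page(&parent, 0);      /* parent touches node 0 */

            /* Clone shares the mm, starts with a copy of node_mem[]. */
            struct toy_task clone = parent;
            fault_in_page(&clone, 1);       /* clone first-touches node 1, */
                                            /* then exits: its counters die */
                                            /* with it.                     */

            printf("mm truth     : node0=%d node1=%d\n",
                   mm.pages_on_node[0], mm.pages_on_node[1]);
            printf("parent's view: node0=%d node1=%d\n",
                   parent.node_mem[0], parent.node_mem[1]);
            return 0;
    }

Compiled and run, the mm ends up with a page on node 1 that the parent's
node_mem[] never sees, which is exactly the bookkeeping skew being
debated.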
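
And for the correction quoted above, a rough standalone sketch of where
the check sits. The control flow is loosely modeled on the
load_balance()/find_busiest_queue() pair discussed in the thread, but the
types and function bodies below are simplified stand-ins, not the patched
kernel code.

    /*
     * Sketch of the guard the correction adds: if find_busiest_queue()
     * finds nothing worth stealing from, load_balance() must bail out
     * instead of dereferencing a NULL runqueue.  Simplified stand-ins,
     * not real scheduler code.
     */
    #include <stdio.h>

    struct runqueue {
            int nr_running;
    };

    static struct runqueue runqueues[2] = { { 3 }, { 3 } };

    /* Returns the most loaded runqueue, or NULL if nothing is
     * sufficiently busier than this_rq (the case the check handles). */
    static struct runqueue *find_busiest_queue(struct runqueue *this_rq,
                                               int *imbalance)
    {
            struct runqueue *busiest = NULL;
            int max = this_rq->nr_running;

            for (int i = 0; i < 2; i++) {
                    if (runqueues[i].nr_running > max + 1) {
                            max = runqueues[i].nr_running;
                            busiest = &runqueues[i];
                    }
            }
            *imbalance = busiest ? (max - this_rq->nr_running) / 2 : 0;
            return busiest;
    }

    static void load_balance(struct runqueue *this_rq)
    {
            int imbalance;
            struct runqueue *busiest = find_busiest_queue(this_rq, &imbalance);

            if (!busiest)           /* <-- the missing check: nothing to pull */
                    goto out;

            printf("pulling %d task(s) from the busiest queue\n", imbalance);
            busiest->nr_running -= imbalance;
            this_rq->nr_running += imbalance;
    out:
            return;
    }

    int main(void)
    {
            load_balance(&runqueues[0]);    /* balanced: takes the goto out path */
            runqueues[1].nr_running = 9;
            load_balance(&runqueues[0]);    /* imbalanced: pulls tasks */
            return 0;
    }

The point is simply that find_busiest_queue() can legitimately return
NULL when nothing is busier than the current queue, and without the goto
the caller would dereference it.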