Message-ID: <55522005.1080705@redhat.com>
Date: Tue, 12 May 2015 11:45:09 -0400
From: Rik van Riel
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101
 Thunderbird/31.4.0
MIME-Version: 1.0
To: dedekind1@gmail.com
CC: linux-kernel@vger.kernel.org, mgorman@suse.de, peterz@infradead.org,
 jhladky@redhat.com
Subject: Re: [PATCH] numa,sched: only consider less busy nodes as numa
 balancing destination
References: <1430908530.7444.145.camel@sauron.fi.intel.com>
 <20150506114128.0c846a37@cuia.bos.redhat.com>
 <1431090801.1418.87.camel@sauron.fi.intel.com>
 <554D1681.7040902@redhat.com>
 <1431438610.20417.0.camel@sauron.fi.intel.com>
In-Reply-To: <1431438610.20417.0.camel@sauron.fi.intel.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

On 05/12/2015 09:50 AM, Artem Bityutskiy wrote:
> On Fri, 2015-05-08 at 16:03 -0400, Rik van Riel wrote:
>> Currently the load balancer has a preference for moving
>> tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
>> but there is no resistance to moving tasks away from their
>> preferred nodes (NUMA_RESIST_LOWER, false). That setting
>> was arrived at after a fair amount of experimenting, and
>> is probably correct.
>
> FYI, (NUMA_RESIST_LOWER, true) does not make any difference for me.

I am not surprised by this. The idle balancing code simply takes a
runnable-but-not-running task off the run queue of the busiest CPU
in the system.

On a system with some idle time, there are likely only one or two
such tasks waiting on the run queue of the busiest CPU, which leaves
little or no choice to the NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER
code.

The idle balancing code, through find_busiest_queue(), already tries
to select a CPU where at least one of the runnable tasks is on the
wrong NUMA node. However, that task may well be the current task,
leading us to steal the other (runnable, but waiting on the queue)
task instead, and move that one to the wrong NUMA node.

I have a few poorly formed ideas on what could be done about that;
rough, untested sketches of each follow below:

1) have fbq_classify_rq take the current task on the rq into account,
   and adjust the fbq classification if all the runnable-but-queued
   tasks are on the right node

2) ensure that rq->nr_numa_running and rq->nr_preferred_running also
   get incremented for kernel threads that are bound to a particular
   CPU - currently CPU-bound kernel threads make the NUMA statistics
   look like a CPU has tasks that do not belong on its NUMA node

3) have detach_tasks take env->fbq_type into account when deciding
   whether to look at NUMA affinity at all

4) maybe have detach_tasks fail if env->fbq_type is regular or remote,
   but no !numa or on-the-wrong-node tasks were found? not sure if
   that would cause problems, or what kind...
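For reference, this is what the classification looks like in
kernel/sched/fair.c (quoted from memory, so check the tree); note
that rq->curr gets counted just like the queued tasks:

	static inline enum fbq_type fbq_classify_rq(struct rq *rq)
	{
		if (rq->nr_running > rq->nr_numa_running)
			return regular;
		if (rq->nr_running > rq->nr_preferred_running)
			return remote;
		return all;
	}

Idea (1) could look something like the sketch below. It is untested,
and curr_is_numa()/curr_is_preferred() do not exist - they stand in
for whatever bookkeeping we would need in order to subtract rq->curr's
contribution from the counters:

	static inline enum fbq_type fbq_classify_rq(struct rq *rq)
	{
		int nr_running = rq->nr_running;
		int nr_numa = rq->nr_numa_running;
		int nr_preferred = rq->nr_preferred_running;

		/* detach_tasks() can never steal rq->curr; ignore it */
		if (rq->curr != rq->idle) {
			nr_running--;
			if (curr_is_numa(rq))		/* made up */
				nr_numa--;
			if (curr_is_preferred(rq))	/* made up */
				nr_preferred--;
		}

		if (nr_running > nr_numa)
			return regular;
		if (nr_running > nr_preferred)
			return remote;
		return all;
	}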
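For (2), the accounting lives in account_numa_enqueue(). A minimal
sketch, assuming we want to treat a kernel thread that is pinned to a
single CPU as both "numa" and "preferred" - that semantic choice is
the part I am least sure about, and account_numa_dequeue() would need
the matching change:

	static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
	{
		/* a kthread pinned to one CPU has nowhere better to be */
		bool pinned_kthread = (p->flags & PF_KTHREAD) &&
				      p->nr_cpus_allowed == 1;

		rq->nr_numa_running += (p->numa_preferred_nid != -1) ||
				       pinned_kthread;
		rq->nr_preferred_running += (p->numa_preferred_nid ==
					     task_node(p)) || pinned_kthread;
	}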
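For (3) and (4), maybe something along these lines inside the task
scan loop of detach_tasks() - again just a sketch of the intent, not
a tested patch:

	/* inside the scan loop of detach_tasks() */
	if (env->fbq_type < all &&
	    p->numa_preferred_nid == cpu_to_node(env->src_cpu)) {
		/*
		 * find_busiest_queue() chose this rq because it has
		 * !numa or misplaced tasks; leave the well-placed
		 * tasks alone.
		 */
		goto next;
	}

If I read detach_tasks() right, skipped tasks just get rotated to the
tail of the list, so when every task turns out to be well placed the
loop ends via env->loop_max and we return with nothing detached -
which would be the "fail" in (4), with whatever consequences that has.

--
All rights reversed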