Message-ID: <55522005.1080705@redhat.com>
Date: Tue, 12 May 2015 11:45:09 -0400
From: Rik van Riel
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101
 Thunderbird/31.4.0
MIME-Version: 1.0
To: dedekind1@gmail.com
CC: linux-kernel@vger.kernel.org, mgorman@suse.de, peterz@infradead.org,
 jhladky@redhat.com
Subject: Re: [PATCH] numa,sched: only consider less busy nodes as numa
 balancing destination
References: <1430908530.7444.145.camel@sauron.fi.intel.com>
 <20150506114128.0c846a37@cuia.bos.redhat.com>
 <1431090801.1418.87.camel@sauron.fi.intel.com>
 <554D1681.7040902@redhat.com>
 <1431438610.20417.0.camel@sauron.fi.intel.com>
In-Reply-To: <1431438610.20417.0.camel@sauron.fi.intel.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

On 05/12/2015 09:50 AM, Artem Bityutskiy wrote:
> On Fri, 2015-05-08 at 16:03 -0400, Rik van Riel wrote:
>> Currently the load balancer has a preference for moving
>> tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
>> but there is no resistance to moving tasks away from their
>> preferred nodes (NUMA_RESIST_LOWER, false). That setting
>> was arrived at after a fair amount of experimenting, and
>> is probably correct.
>
> FYI, (NUMA_RESIST_LOWER, true) does not make any difference for me.

I am not surprised by this. The idle balancing code simply takes a
runnable-but-not-running task off the run queue of the busiest CPU
in the system.

On a system with some idle time, there are likely only one or two
such tasks waiting on the run queue of the busiest CPU, which leaves
little or no choice to the NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER
code.

The idle balancing code, through find_busiest_queue(), already tries
to select a CPU where at least one of the runnable tasks is on the
wrong NUMA node. However, that task may well be the current task,
leading us to steal the other (runnable, but waiting on the queue)
task instead, and move that one to the wrong NUMA node.

I have a few poorly formed ideas on what could be done about that;
rough, untested sketches of each follow below:

1) have fbq_classify_rq take the current task on the rq into account,
   and adjust the fbq classification if all the runnable-but-queued
   tasks are on the right node

2) ensure that rq->nr_numa_running and rq->nr_preferred_running also
   get incremented for kernel threads that are bound to a particular
   CPU - currently CPU-bound kernel threads make the NUMA statistics
   look like a CPU has tasks that do not belong on its NUMA node

3) have detach_tasks take env->fbq_type into account when deciding
   whether to look at NUMA affinity at all

4) maybe have detach_tasks fail if env->fbq_type is regular or remote,
   but no !numa or on-the-wrong-node tasks were found? not sure if
   that would cause problems, or what kind...
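For reference, this is what the classification looks like in
kernel/sched/fair.c (quoted from memory, so check the tree); note
that rq->curr gets counted just like the queued tasks:

	static inline enum fbq_type fbq_classify_rq(struct rq *rq)
	{
		if (rq->nr_running > rq->nr_numa_running)
			return regular;
		if (rq->nr_running > rq->nr_preferred_running)
			return remote;
		return all;
	}

Idea (1) could look something like the sketch below. It is untested,
and curr_is_numa()/curr_is_preferred() do not exist - they stand in
for whatever bookkeeping we would need in order to subtract rq->curr's
contribution from the counters:

	static inline enum fbq_type fbq_classify_rq(struct rq *rq)
	{
		int nr_running = rq->nr_running;
		int nr_numa = rq->nr_numa_running;
		int nr_preferred = rq->nr_preferred_running;

		/* detach_tasks() can never steal rq->curr; ignore it */
		if (rq->curr != rq->idle) {
			nr_running--;
			if (curr_is_numa(rq))		/* made up */
				nr_numa--;
			if (curr_is_preferred(rq))	/* made up */
				nr_preferred--;
		}

		if (nr_running > nr_numa)
			return regular;
		if (nr_running > nr_preferred)
			return remote;
		return all;
	}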
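For (2), the accounting lives in account_numa_enqueue(). A minimal
sketch, assuming we want to treat a kernel thread that is pinned to a
single CPU as both "numa" and "preferred" - that semantic choice is
the part I am least sure about, and account_numa_dequeue() would need
the matching change:

	static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
	{
		/* a kthread pinned to one CPU has nowhere better to be */
		bool pinned_kthread = (p->flags & PF_KTHREAD) &&
				      p->nr_cpus_allowed == 1;

		rq->nr_numa_running += (p->numa_preferred_nid != -1) ||
				       pinned_kthread;
		rq->nr_preferred_running += (p->numa_preferred_nid ==
					     task_node(p)) || pinned_kthread;
	}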
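For (3) and (4), maybe something along these lines inside the task
scan loop of detach_tasks() - again just a sketch of the intent, not
a tested patch:

	/* inside the scan loop of detach_tasks() */
	if (env->fbq_type < all &&
	    p->numa_preferred_nid == cpu_to_node(env->src_cpu)) {
		/*
		 * find_busiest_queue() chose this rq because it has
		 * !numa or misplaced tasks; leave the well-placed
		 * tasks alone.
		 */
		goto next;
	}

If I read detach_tasks() right, skipped tasks just get rotated to the
tail of the list, so when every task turns out to be well placed the
loop ends via env->loop_max and we return with nothing detached -
which would be the "fail" in (4), with whatever consequences that has.

--
All rights reversed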