From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932389AbbEHUDu (ORCPT <rfc822;w@1wt.eu>);
	Fri, 8 May 2015 16:03:50 -0400
Received: from mx1.redhat.com ([209.132.183.28]:38293 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752628AbbEHUDs (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Fri, 8 May 2015 16:03:48 -0400
Message-ID: <554D1681.7040902@redhat.com>
Date: Fri, 08 May 2015 16:03:13 -0400
From: Rik van Riel <riel@redhat.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: dedekind1@gmail.com
CC: linux-kernel@vger.kernel.org, mgorman@suse.de, peterz@infradead.org,
        jhladky@redhat.com
Subject: Re: [PATCH] numa,sched: only consider less busy nodes as numa balancing
 destination
References: <1430908530.7444.145.camel@sauron.fi.intel.com>	 <20150506114128.0c846a37@cuia.bos.redhat.com> <1431090801.1418.87.camel@sauron.fi.intel.com>
In-Reply-To: <1431090801.1418.87.camel@sauron.fi.intel.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/08/2015 09:13 AM, Artem Bityutskiy wrote:
> On Wed, 2015-05-06 at 11:41 -0400, Rik van Riel wrote:
>> On Wed, 06 May 2015 13:35:30 +0300
>> Artem Bityutskiy <dedekind1@gmail.com> wrote:
>>
>>> we observe a tremendous regression between kernel version 3.16 and 3.17
>>> (and up), and I've bisected it to this commit:
>>>
>>> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
>>
>> Artem, Jirka, does this patch fix (or at least improve) the issues you
>> have been seeing?  Does it introduce any new regressions?
> 
> Hi Rik,
> 
> first of all thanks for your help!
> 
> I've tried this patch and it has very small effect. I've also ran the
> benchmark with auto-NUMA disabled too, which is useful, I think. I used
> the tip of Linuses tree (v4.1-rc2+).

Trying with NUMA balancing disabled was extremely useful!
I now have an idea what is going on with your workload.
I suspect Peter and Mel aren't going to like it...

>  Kernel         Avg response time, ms
> ------------------------------------------------------
> Vanilla                1481
> Patched                1240
> Reverted               256
> Disabled               309
> 
> 
> Vanilla: pristine v4.1-rc2+
> Patched: Vanilla + this patch
> Reverted: Vanilla + a revert of a43455a
> Disabled: Vanilla and auto-NUMA disabled via procfs

My hypothesis: the NUMA code moving tasks at all is what is
hurting your workload.  On a two-node system, you only have
the current node the task is on, and the task's preferred
node, which may or may not be the same node.

In case the preferred node is different, with the patch
reverted the kernel would only try to move a task to the
preferred node if that node was running fewer tasks than
it has CPU cores. It would never attempt task swaps, or
anything else.
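
As a rough illustration of the reverted behaviour described above,
the check amounts to something like the following. This is a minimal
sketch, not kernel code: struct node_stats, node_has_capacity() and
should_move_to_preferred() are hypothetical names made up for this
example.

```c
#include <stdbool.h>

/* Hypothetical per-node summary; not a real kernel structure. */
struct node_stats {
	unsigned int nr_running;	/* runnable tasks on the node */
	unsigned int nr_cpus;		/* CPU cores in the node */
};

/* A node has spare capacity if it runs fewer tasks than it has cores. */
static bool node_has_capacity(const struct node_stats *ns)
{
	return ns->nr_running < ns->nr_cpus;
}

/*
 * Sketch of the behaviour with a43455a reverted, as described above:
 * only move a task when its preferred node differs from its current
 * node and the preferred node has a free core. No task swaps, no
 * search for other candidate nodes.
 */
static bool should_move_to_preferred(int cur_nid, int pref_nid,
				     const struct node_stats *pref)
{
	if (cur_nid == pref_nid)
		return false;

	return node_has_capacity(pref);
}
```

With 3 tasks on a 4-core preferred node the move is attempted; with
4 tasks it is not, and the task simply stays where it is.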

With both the vanilla kernel, and with my new patch, the
NUMA balancing code will try to move a task to a better
location (from a NUMA point of view).

This works well when dealing with tasks that are constantly
running, but fails catastrophically when dealing with tasks
that go to sleep, wake back up, go back to sleep, wake back
up, and generally mess up, in a random way, the load
statistics that the NUMA balancing code uses.

If the normal scheduler load balancer is moving tasks in
the opposite direction from the NUMA balancer, things will
not converge, and tasks will end up with worse memory
locality than with no NUMA balancing at all.

Currently the load balancer has a preference for moving
tasks to their preferred nodes (NUMA_FAVOUR_HIGHER, true),
but there is no resistance to moving tasks away from their
preferred nodes (NUMA_RESIST_LOWER, false).  That setting
was arrived at after a fair amount of experimenting, and
is probably correct.
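
A toy model of how those two settings interact, assuming each one
simply gates one direction of the decision. The struct, enum and
function names below are hypothetical and are not taken from the
scheduler source; only the two feature names come from the text
above.

```c
#include <stdbool.h>

/* Toy stand-ins for the two scheduler feature flags named above. */
struct numa_lb_features {
	bool favour_higher;	/* models NUMA_FAVOUR_HIGHER */
	bool resist_lower;	/* models NUMA_RESIST_LOWER */
};

enum lb_verdict { LB_PREFER, LB_NEUTRAL, LB_RESIST };

/*
 * How the load balancer would regard moving a task from src_nid to
 * dst_nid, given the task's preferred node, in this simplified model.
 */
static enum lb_verdict lb_numa_verdict(const struct numa_lb_features *f,
				       int src_nid, int dst_nid,
				       int pref_nid)
{
	if (src_nid == dst_nid)
		return LB_NEUTRAL;	/* no NUMA effect */

	if (f->favour_higher && dst_nid == pref_nid)
		return LB_PREFER;	/* move towards the preferred node */

	if (f->resist_lower && src_nid == pref_nid)
		return LB_RESIST;	/* move away from the preferred node */

	return LB_NEUTRAL;
}
```

With the current settings (favour_higher = true, resist_lower =
false), a move away from the preferred node comes out as LB_NEUTRAL,
so nothing stops the load balancer from undoing a NUMA placement.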

I am still curious what my current patch does for Jirka's
workload (if anything). I have no idea whether his workload
suffers from similar issues as Artem's workload, or whether
they perform relatively poorly for different reasons.

 END CONCLUSION


 BEGIN RAMBLING UNFORMED IDEA

I do not have a solid idea in my mind on how to solve the
problem above, but I have some poorly formed ideas...

1) It may be worthwhile for the load balancer to keep track of
how many times it moves a task to a NUMA node where it has
worse locality, in order to give it CPU time now.

2) The NUMA balancing code, in turn, may resist/skip moving
tasks to nodes with better NUMA locality, when the load balancer
has moved that task away in the past, and is likely to move it
away again.

3) The statistic from (1) could be a floating average of
se.statistics.nr_forced_migrations, which would require some
modifications to migrate_degrades_locality() and can_migrate_task()
so that they do the evaluation even when they do not factor it
into their decisions.

4) I am not sure yet how to weigh that floating average against
the NUMA locality. Should the floating average of forced
migrations only block NUMA locality when it is large, and when
the difference in NUMA locality score between nodes is small?
How do we weigh these things?

5) What am I forgetting / overlooking?
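
One possible shape for ideas (1) to (4), purely as a sketch under
stated assumptions: the floating average is a fixed-point
exponentially decaying average sampled once per period, and the NUMA
move is blocked only when that average is high and the locality gain
is small. struct numa_fm_stats, fm_update(), numa_move_allowed() and
every constant below are hypothetical; the thresholds are arbitrary
placeholders, not tuned values, and how to weigh the two inputs
remains an open question.

```c
#include <stdbool.h>
#include <stdint.h>

#define FM_SHIFT	10			/* fixed point: 1.0 == 1 << 10 */
#define FM_DECAY_SHIFT	2			/* new sample weighs 1/4 (assumed) */
#define FM_MAX_DELTA	1024			/* clamp to avoid overflow */
#define FM_HIGH		(2u << FM_SHIFT)	/* "large" average (assumed) */
#define GAIN_SMALL	100			/* "small" locality gain (assumed) */

/* Hypothetical per-task state; not a real kernel structure. */
struct numa_fm_stats {
	uint32_t forced_avg;	/* fixed-point average of forced migrations */
	uint64_t last_forced;	/* previous snapshot of the counter */
};

/*
 * Fold one period's forced migrations into the floating average.
 * nr_forced is the cumulative counter, in the spirit of
 * se.statistics.nr_forced_migrations.
 */
static void fm_update(struct numa_fm_stats *s, uint64_t nr_forced)
{
	uint64_t delta = nr_forced - s->last_forced;
	uint32_t sample;

	s->last_forced = nr_forced;
	if (delta > FM_MAX_DELTA)
		delta = FM_MAX_DELTA;

	sample = (uint32_t)delta << FM_SHIFT;
	/* avg = avg * 3/4 + sample * 1/4 */
	s->forced_avg = s->forced_avg - (s->forced_avg >> FM_DECAY_SHIFT)
			+ (sample >> FM_DECAY_SHIFT);
}

/*
 * Should the NUMA balancer move the task to a node with a better
 * locality score? Skip the move only when the load balancer has been
 * forcing the task around a lot AND the gain is small.
 */
static bool numa_move_allowed(const struct numa_fm_stats *s,
			      long cur_score, long dst_score)
{
	long gain = dst_score - cur_score;

	if (gain <= 0)
		return false;

	return !(s->forced_avg > FM_HIGH && gain < GAIN_SMALL);
}
```

A task that is rarely force-migrated still follows small locality
gains, while a task the load balancer keeps pushing around is moved
only when the locality difference between nodes is substantial.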

-- 
All rights reversed