From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932389AbbEHUDu (ORCPT ); Fri, 8 May 2015 16:03:50 -0400
Received: from mx1.redhat.com ([209.132.183.28]:38293 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752628AbbEHUDs (ORCPT ); Fri, 8 May 2015 16:03:48 -0400
Message-ID: <554D1681.7040902@redhat.com>
Date: Fri, 08 May 2015 16:03:13 -0400
From: Rik van Riel
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: dedekind1@gmail.com
CC: linux-kernel@vger.kernel.org, mgorman@suse.de, peterz@infradead.org, jhladky@redhat.com
Subject: Re: [PATCH] numa,sched: only consider less busy nodes as numa balancing destination
References: <1430908530.7444.145.camel@sauron.fi.intel.com> <20150506114128.0c846a37@cuia.bos.redhat.com> <1431090801.1418.87.camel@sauron.fi.intel.com>
In-Reply-To: <1431090801.1418.87.camel@sauron.fi.intel.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/08/2015 09:13 AM, Artem Bityutskiy wrote:
> On Wed, 2015-05-06 at 11:41 -0400, Rik van Riel wrote:
>> On Wed, 06 May 2015 13:35:30 +0300
>> Artem Bityutskiy wrote:
>>
>>> we observe a tremendous regression between kernel version 3.16 and 3.17
>>> (and up), and I've bisected it to this commit:
>>>
>>> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
>>
>> Artem, Jirka, does this patch fix (or at least improve) the issues you
>> have been seeing? Does it introduce any new regressions?
>
> Hi Rik,
>
> first of all thanks for your help!
>
> I've tried this patch and it has very small effect. I've also ran the
> benchmark with auto-NUMA disabled too, which is useful, I think. I used
> the tip of Linuses tree (v4.1-rc2+).

Trying with NUMA balancing disabled was extremely useful!
I now have an idea what is going on with your workload. I suspect Peter
and Mel aren't going to like it...

> Kernel          Avg response time, ms
> ------------------------------------------------------
> Vanilla         1481
> Patched         1240
> Reverted        256
> Disabled        309
>
>
> Vanilla: pristine v4.1-rc2+
> Patched: Vanilla + this patch
> Reverted: Vanilla + a revert of a43455a
> Disabled: Vanilla and auto-NUMA disabled via procfs

My hypothesis: the NUMA code moving tasks at all is what is hurting
your workload.

On a two-node system, you only have the current node the task is on,
and the task's preferred node, which may or may not be the same node.

In case the preferred node is different, with a43455a reverted the
kernel would only try to move a task to the preferred node if that
node was running fewer tasks than it has CPU cores. It would never
attempt task swaps, or anything else.

With both the vanilla kernel, and with my new patch, the NUMA
balancing code will try to move a task to a better location (from a
NUMA point of view).

This works well when dealing with tasks that are constantly running,
but fails catastrophically when dealing with tasks that go to sleep,
wake back up, go back to sleep, wake back up, and generally mess up,
in a random way, the load statistics that the NUMA balancing code
uses.

If the normal scheduler load balancer is moving tasks in the opposite
direction from the NUMA balancer, things will not converge, and tasks
will have worse memory locality than with no NUMA balancing at all.

Currently the load balancer has a preference for moving tasks to
their preferred nodes (NUMA_FAVOUR_HIGHER, true), but there is no
resistance to moving tasks away from their preferred nodes
(NUMA_RESIST_LOWER, false). That setting was arrived at after a fair
amount of experimenting, and is probably correct.

I am still curious what my current patch does for Jirka's workload
(if anything).
I have no idea whether his workload suffers from issues similar to
those of Artem's workload, or whether they perform relatively poorly
for different reasons.

END CONCLUSION

BEGIN RAMBLING UNFORMED IDEA

I do not have a solid idea in my mind on how to solve the problem
above, but I have some poorly formed ideas...

1) It may be worthwhile for the load balancer to keep track of how
   many times it moves a task to a NUMA node where it has worse
   locality, in order to give it CPU time now.

2) The NUMA balancing code, in turn, may resist/skip moving tasks to
   nodes with better NUMA locality, when the load balancer has moved
   that task away in the past, and is likely to move it away again.

3) The statistic from (1) could be a floating average of
   se.statistics.nr_forced_migrations, which would require some
   modifications to migrate_degrades_locality() and
   can_migrate_task() so that the evaluation is done even when its
   result does not factor into the migration decision.

4) I am not sure yet how to weigh that floating average against the
   NUMA locality. Should the floating average of forced migrations
   only block NUMA locality when it is large, and when the difference
   in NUMA locality score between nodes is small? How do we weigh
   these things?

5) What am I forgetting / overlooking?

--
All rights reversed
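
A minimal sketch of how ideas (3) and (4) might fit together, written
as standalone C rather than kernel code. Everything in it is an
assumption made for illustration: struct numa_hint, fm_update(),
numa_move_worthwhile(), the 3/4 decay factor and the linear threshold
are all hypothetical names and choices, and nothing here is taken
from the scheduler. The floating average is kept in fixed point and
updated once per sampling period; a NUMA move is then allowed only
when the locality gain exceeds a threshold that grows with that
average.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: 1.0 is represented as 1 << FM_SHIFT. */
#define FM_SHIFT        10
/* Hypothetical decay: each period keeps 3/4 of the old average. */
#define FM_DECAY_SHIFT  2

/* Hypothetical per-task state; not an existing kernel structure. */
struct numa_hint {
	uint32_t forced_avg;	/* floating average of forced migrations */
};

/*
 * Fold the number of forced migrations seen during the last period
 * into the floating average: avg = 3/4 * avg + 1/4 * sample.
 * Assumes forced_this_period is small enough not to overflow the shift.
 */
static void fm_update(struct numa_hint *h, uint32_t forced_this_period)
{
	uint32_t sample = forced_this_period << FM_SHIFT;

	h->forced_avg = h->forced_avg
		      - (h->forced_avg >> FM_DECAY_SHIFT)
		      + (sample >> FM_DECAY_SHIFT);
}

/*
 * Decide whether a NUMA move is worth attempting.
 *
 * locality_gain: destination locality score minus source locality
 *                score, in arbitrary units (assumption).
 * gain_per_forced_migration: tunable weight (assumption); the gain
 *                required for each forced migration per period.
 *
 * The required gain scales linearly with the floating average, so a
 * large average blocks moves whose locality gain is small.
 */
static bool numa_move_worthwhile(const struct numa_hint *h,
				 uint32_t locality_gain,
				 uint32_t gain_per_forced_migration)
{
	uint64_t required =
		((uint64_t)h->forced_avg * gain_per_forced_migration)
		>> FM_SHIFT;

	return locality_gain > required;
}
```

With this weighting, a task that the load balancer never forces away
is moved for any positive locality gain, while a task averaging one
forced migration per period needs a gain above
gain_per_forced_migration. Whether a linear threshold is the right
shape is exactly the open question in (4).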