Message-ID: <1473092813.4412.6.camel@gmail.com>
Subject: [v2 patch v3.18+ regression fix] sched: Further improve spurious CPU_IDLE active migrations
From: Mike Galbraith
To: Vincent Guittot
Cc: Peter Zijlstra, LKML, Rik van Riel
Date: Mon, 05 Sep 2016 18:26:53 +0200
In-Reply-To:
References: <1472535775.3960.3.camel@suse.de>
	 <20160831100117.GV10121@twins.programming.kicks-ass.net>
	 <1472638699.3942.14.camel@suse.de>
	 <1472639782.3942.27.camel@gmail.com>
	 <1472703062.3979.60.camel@gmail.com>
Content-Type: text/plain; charset="us-ascii"
X-Mailer: Evolution 3.16.5
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Mailing-List: linux-kernel@vger.kernel.org

Coming back to this: how about the below instead, which only increases
the group imbalance threshold when sd_llc_size == 2?  Newer L3-equipped
processors are then unaffected.

43f4d666 partially cured spurious migrations, but when there are
completely idle groups on a lightly loaded processor, and there is a
buddy pair occupying the busiest group, we will not attempt to migrate
due to select_idle_sibling() buddy placement, leaving the busiest queue
with one task.  We skip balancing, but increment nr_balance_failed until
we kick active balancing, and bounce a buddy pair endlessly, demolishing
throughput.

Increase the group imbalance threshold to two when sd_llc_size == 2,
allowing buddies to share an L2 without affecting processors with a
larger L3.

The regression was detected on an X5472 box, which has 4 MC groups of
2 cores each.

netperf -l 60 -H 127.0.0.1 -t UDP_STREAM -i5,1 -I 95,5

pre:
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      : 66.421%
!!!                       Local CPU util  :  0.000%
!!!                       Remote CPU util :  0.000%

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00     1779143      0    15539.49
212992           60.00     1773551           15490.65

post:
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   60.00     3719377      0    32486.01
212992           60.00     3717492           32469.54

Signed-off-by: Mike Galbraith
Fixes: caeb178c ("sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd")
Cc: <stable@vger.kernel.org> # v3.18+
---
 kernel/sched/fair.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)
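For illustration, a minimal userspace sketch of the decision the hunk
below makes (not kernel code: the per-cpu sd_llc_size read is modeled
as a plain llc_size parameter, the group_overloaded check is elided,
and group_is_balanced() is a made-up name):

#include <stdbool.h>
#include <stdio.h>

/* Return true when the idle-cpu diff is too small to act on. */
static bool group_is_balanced(int local_idle_cpus, int busiest_idle_cpus,
                              int llc_size)
{
        /*
         * Domains whose last level cache spans only 2 CPUs (cores
         * sharing an L2, no L3) tolerate a diff of 2, leaving
         * communicating buddies alone; everyone else keeps the old
         * threshold of 1.
         */
        int imbalance = (llc_size == 2) ? 2 : 1;

        return local_idle_cpus <= busiest_idle_cpus + imbalance;
}

int main(void)
{
        /* X5472-like case: local MC group fully idle, buddies busiest. */
        printf("llc=2: balanced=%d\n", group_is_balanced(2, 0, 2)); /* 1 */
        /* L3-equipped box keeps the old behavior. */
        printf("llc=8: balanced=%d\n", group_is_balanced(2, 0, 8)); /* 0 */
        return 0;
}

With llc_size == 2 the fully idle local group leaves the buddy pair
alone instead of kicking active balance; with a larger LLC the old
diff-greater-than-1 rule is unchanged.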
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7249,12 +7249,19 @@ static struct sched_group *find_busiest_
 		 * This cpu is idle. If the busiest group is not overloaded
 		 * and there is no imbalance between this and busiest group
 		 * wrt idle cpus, it is balanced. The imbalance becomes
-		 * significant if the diff is greater than 1 otherwise we
-		 * might end up to just move the imbalance on another group
+		 * significant if the diff is greater than 1 for most CPUs,
+		 * or 2 for older CPUs having multiple groups of 2 cores
+		 * sharing an L2, otherwise we may end up uselessly moving
+		 * the imbalance to another group, or starting a tug of war
+		 * with idle L2 groups constantly ripping communicating
+		 * tasks apart, and no L3 to mitigate the cache miss pain.
 		 */
-		if ((busiest->group_type != group_overloaded) &&
-		    (local->idle_cpus <= (busiest->idle_cpus + 1)))
-			goto out_balanced;
+		if (busiest->group_type != group_overloaded) {
+			int imbalance = __this_cpu_read(sd_llc_size) == 2 ? 2 : 1;
+
+			if (local->idle_cpus <= busiest->idle_cpus + imbalance)
+				goto out_balanced;
+		}
 	} else {
 		/*
 		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use