Subject: Re: [PATCH RFC] select_idle_sibling experiments
From: Bastien Philbert
To: Matt Fleming, Chris Mason, Peter Zijlstra, Ingo Molnar, Mike Galbraith,
 linux-kernel@vger.kernel.org
Cc: Mel Gorman
Date: Tue, 5 Apr 2016 17:05:54 -0400
Message-ID: <570428B2.5020900@gmail.com>
In-Reply-To: <20160405200302.GL2701@codeblueprint.co.uk>
References: <20160405180822.tjtyyc3qh4leflfj@floor.thefacebook.com> <20160405200302.GL2701@codeblueprint.co.uk>

On 2016-04-05 04:03 PM, Matt Fleming wrote:
> On Tue, 05 Apr, at 02:08:22PM, Chris Mason wrote:
>>
>> I started with a small-ish program to benchmark wakeup latencies. The
>> basic idea is a bunch of worker threads who sit around and burn CPU.
>> Every once in a while they send a message to a message thread.
>
> This reminds me of something I've been looking at recently; a similar
> workload in Mel's mmtests based on pgbench with 1-client that also has
> this problem of cpu_idle() being false at an inconvenient time in
> select_idle_sibling(), so we move the task off the cpu and the cpu
> then immediately goes idle.
>
> This leads to tasks bouncing around the socket as we search for idle
> cpus.
>
>> It has knobs for cpu think time, and for how long the messenger thread
>> waits before replying. Here's how I'm running it with my patch:
>
> [...]
>
> Cool, I'll go have a play with this.
>
>> Now, on to the patch. I pushed some code around and narrowed the
>> problem down to select_idle_sibling(). We have cores going into and out
>> of idle fast enough that even this cut our latencies in half:
>>
>> static int select_idle_sibling(struct task_struct *p, int target)
>>                                 goto next;
>>
>>                         for_each_cpu(i, sched_group_cpus(sg)) {
>> -                               if (i == target || !idle_cpu(i))
>> +                               if (!idle_cpu(i))
>>                                         goto next;
>>                         }
>>
>> IOW, by the time we get down to for_each_cpu(), the idle_cpu() check
>> done at the top of the function is no longer valid.
>
> Yeah. The problem is that because we're racing with the cpu going in
> and out of idle, and since you're exploiting that race condition, this
> is highly tuned to your specific workload.
>
> Which is a roundabout way of saying, this is probably going to
> negatively impact other workloads.
>
>> I tried a few variations on select_idle_sibling() that preserved the
>> underlying goal of returning idle cores before idle SMT threads. They
>> were all horrible in different ways, and none of them were fast.
>
> I toyed with ignoring cpu_idle() in select_idle_sibling() for my
> workload. That actually was faster ;)
>
>> The patch below just makes select_idle_sibling pick the first idle
>> thread it can find. When I ran it through production workloads here, it
>> was faster than the patch we've been carrying around for the last few
>> years.
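For what it's worth, here is a tiny userspace model of the behaviour change
being described (this is deliberately not the kernel function; cpu_is_idle[],
the sibling arrays, and both pick_* helpers are made up purely for
illustration): the old scan only accepts a core whose SMT siblings are all
idle, while the patched scan simply takes the first idle thread it can see.

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Toy stand-in for idle_cpu(); in the kernel this races with wakeups. */
static bool cpu_is_idle[NR_CPUS] = { false, true, false, true,
				     false, false, true, false };

/* Old behaviour: only accept a core whose SMT siblings are all idle. */
static int pick_fully_idle_core(const int *siblings, int len)
{
	for (int i = 0; i < len; i++) {
		if (!cpu_is_idle[siblings[i]])
			return -1;	/* one busy sibling rejects the core */
	}
	return siblings[0];
}

/* Patched behaviour: take the first idle SMT thread we can find. */
static int pick_first_idle(const int *siblings, int len)
{
	for (int i = 0; i < len; i++) {
		if (cpu_is_idle[siblings[i]])
			return siblings[i];
	}
	return -1;
}

int main(void)
{
	int core0[] = { 0, 4 };	/* hypothetical SMT sibling pairs */
	int core1[] = { 1, 5 };
	int core3[] = { 3, 7 };

	printf("fully idle core0: %d\n", pick_fully_idle_core(core0, 2)); /* -1 */
	printf("fully idle core1: %d\n", pick_fully_idle_core(core1, 2)); /* -1 */
	printf("first idle in core1: %d\n", pick_first_idle(core1, 2));   /* 1 */
	printf("first idle in core3: %d\n", pick_first_idle(core3, 2));   /* 3 */
	return 0;
}

Obviously the real problem is that cpu_is_idle[] can change between the check
and the enqueue, which is exactly the race Matt points out below.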
>
> It would be really nice if we had a lightweight way to gauge the
> "idleness" of a cpu, and whether we expect it to be idle again soon.
>
The best way to do this would be to embed the state in a structure we already
touch during scheduling, so that checking it is quick. Otherwise I am curious
whether a helper macro would prove useful for this. It seems idleness tracking
has to be accounted for when scheduling; to keep it lightweight enough that we
avoid extra work on every context switch, the challenge is making the
bookkeeping cheap enough that its benefit outweighs the cost of doing it from
the existing scheduling functions. A rough sketch of what I have in mind is
appended at the end of this mail.

> Failing that, could we just force the task onto 'target' when it makes
> sense and skip the idle search (and the race) altogether?
>
Doesn't this possibly cause a context switch, or even an expensive migration
to another CPU on certain architectures? It seems we need to add reference
counting or some other tracking of idle CPUs somewhere.

Bastien
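As mentioned above, here is a rough userspace sketch of the kind of
lightweight idleness hint I mean. Everything here is hypothetical and made up
for illustration (cpu_idle_hint, cpu_enter_idle(), cpu_exit_idle() and
cpu_looks_idle() are not existing kernel APIs); in the kernel the flag would
presumably live in an already hot per-cpu structure so reading it is close to
free.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/*
 * Hypothetical per-cpu idleness hint.  In the kernel this would sit in an
 * already used per-cpu structure (e.g. the runqueue) so reading it costs a
 * cache line we are likely touching anyway.
 */
static atomic_bool cpu_idle_hint[NR_CPUS];

/* Called from the idle entry path (made-up hook for this sketch). */
static void cpu_enter_idle(int cpu)
{
	atomic_store_explicit(&cpu_idle_hint[cpu], true, memory_order_relaxed);
}

/* Called when the cpu picks up work again. */
static void cpu_exit_idle(int cpu)
{
	atomic_store_explicit(&cpu_idle_hint[cpu], false, memory_order_relaxed);
}

/* The cheap check a select_idle_sibling()-like path could use; only a hint. */
static inline bool cpu_looks_idle(int cpu)
{
	return atomic_load_explicit(&cpu_idle_hint[cpu], memory_order_relaxed);
}

int main(void)
{
	cpu_enter_idle(2);
	cpu_enter_idle(5);
	cpu_exit_idle(5);

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu%d looks idle: %d\n", cpu, cpu_looks_idle(cpu));
	return 0;
}

The hint is racy by design, just like idle_cpu() is today; the point is only
that the read side stays a single relaxed load, so checking it from the wakeup
path should not add measurable cost.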