Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs

From: Mel Gorman <mgorman@techsingularity.net>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Rik van Riel <riel@surriel.com>, Yi Wang <wang.yi59@zte.com.cn>,
	zhong.weidong@zte.com.cn, Yi Liu <liu.yi24@zte.com.cn>,
	Frederic Weisbecker <frederic@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH v2] sched/core: Don't mix isolcpus and housekeeping CPUs
Date: Wed, 24 Oct 2018 09:56:36 +0100
Message-ID: <20181024085636.GB23537@techsingularity.net>
In-Reply-To: <1540350169-18581-1-git-send-email-srikar@linux.vnet.ibm.com>

On Wed, Oct 24, 2018 at 08:32:49AM +0530, Srikar Dronamraju wrote:
> Load balancer and NUMA balancer are not supposed to work on isolcpus.
> 
> Currently when setting sched affinity, there are no checks to see if the
> requested cpumask has CPUs from both isolcpus and housekeeping CPUs.
> 
> If the user passes a mix of isolcpus and housekeeping CPUs, then the
> NUMA balancer can pick an isolcpu to schedule on.
> With this change, if a combination of isolcpus and housekeeping CPUs is
> provided, then we restrict ourselves to housekeeping CPUs.
> 
> For example: System with 32 CPUs
> $ grep -o "isolcpus=[,,1-9]*" /proc/cmdline
> isolcpus=1,5,9,13
> $ grep -i cpus_allowed /proc/$$/status
> Cpus_allowed:   ffffdddd
> Cpus_allowed_list:      0,2-4,6-8,10-12,14-31
> 
> Running "perf bench numa mem --no-data_rand_walk -p 4 -t 8 -G 0 -P 3072
> -T 0 -l 50 -c -s 1000" which calls sched_setaffinity with all CPUs in
> the system.
> 

Forgive my naivety, but is it wrong for a process to bind to both isolated
CPUs and housekeeping CPUs? It would certainly be a bit odd because the
application is asking for some protection, but no guarantees are given
and the application is not told via an error code that there is a
problem. Asking the application to parse dmesg in the hope of finding the
right error message is going to be fragile.

Would it be more appropriate to fail sched_setaffinity when there is a
mix of isolated and housekeeping CPUs? In that case, an info message in
dmesg may be appropriate as it'll likely be a once-off configuration
error that's obvious due to an application failure. Alternatively,
should NUMA balancing ignore isolated CPUs? The latter seems unusual as
the application has specified a mask that allows those CPUs and it's not
clear why NUMA balancing should ignore them. If anything, an application
that wants to avoid all interference should also be using memory policies
to bind to nodes so that it behaves predictably with respect to access
latencies (presumably, if an application cannot tolerate kernel threads
interfering, then it also cannot tolerate remote access latencies), or it
should disable NUMA balancing entirely to avoid incurring minor faults.
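
To make the "fail sched_setaffinity on a mixed mask" option concrete, here
is a minimal userspace sketch of the check. It is not the patch under
review and not existing kernel code: check_affinity_request() is a
hypothetical name, and plain 64-bit masks stand in for the kernel's
struct cpumask. The only thing it illustrates is returning an error code
to the caller instead of silently narrowing the mask.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only. Bit N of each mask
 * represents CPU N, so this sketch handles at most 64 CPUs.
 *
 * Returns 0 if the request is entirely isolated CPUs or entirely
 * housekeeping CPUs, and -EINVAL if it is empty or mixes the two.
 */
int check_affinity_request(uint64_t requested, uint64_t isolated,
			   uint64_t online)
{
	/* Housekeeping CPUs are the online CPUs that are not isolated. */
	uint64_t housekeeping = online & ~isolated;
	/* Ignore requested CPUs that are not online. */
	uint64_t valid = requested & online;

	if (!valid)
		return -EINVAL;

	/* A mask touching both sets is the ambiguous case: reject it. */
	if ((valid & isolated) && (valid & housekeeping))
		return -EINVAL;

	return 0;
}
```

With the 32-CPU example from the changelog (isolcpus=1,5,9,13, so an
isolated mask of 0x2222 and a housekeeping mask of 0xffffdddd), a request
for all CPUs (0xffffffff) would get -EINVAL, while 0xffffdddd or 0x2222 on
their own would be accepted.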

Thanks.

-- 
Mel Gorman
SUSE Labs