From: Srikar Dronamraju
To: Michal Hocko
Cc: Ingo Molnar, Peter Zijlstra, LKML, Mel Gorman, Rik van Riel
Subject: Re: [PATCH] sched: Fix numabalancing to work with isolated cpus
Date: Thu, 6 Apr 2017 12:49:50 +0530
Message-Id: <20170406071950.GA5843@linux.vnet.ibm.com>
In-Reply-To: <20170405164437.GT6035@dhcp22.suse.cz>
References: <1491326848-5748-1-git-send-email-srikar@linux.vnet.ibm.com> <20170405125743.GB7258@dhcp22.suse.cz> <20170405152215.GA6019@linux.vnet.ibm.com> <20170405164437.GT6035@dhcp22.suse.cz>
X-Mailing-List: linux-kernel@vger.kernel.org

> > > > The isolated cpus are part of the cpus allowed list. In the above case,
> > > > numabalancing ends up scheduling some of these tasks on isolated cpus.
> > >
> > > Why is this bad? If the task is allowed to run on isolated CPUs then why
> >
> > 1.
> > kernel-parameters.txt states: isolcpus as "Isolate CPUs from the
> > general scheduler." So the expectation that numabalancing can schedule
> > tasks on it is wrong.

> Right but if the task is allowed to run on isolated cpus then the numa
> balancing for this task should be allowed to run on those cpus, no?

No, numabalancing or any other scheduler balancing should not be looking
at tasks that are bound to isolated cpus. Here is an example similar to
the one I gave in my reply to Mel.

Let's consider a 2-node, 24-core system with 12 cores in each node: cores
0-11 in one node and cores 12-23 in the other. Let's also disable
smt/hyperthreading and isolate cores 6-11,12-17 with isolcpus. Now run a
48-thread ebizzy workload and give it a cpu list of 11,12-17 using
taskset.

All 48 ebizzy threads will only ever run on core 11. They will never
spread to other cores, whether in the same node (including the isolated
cpus in that node) or in the other node. That is, whether numabalancing
is running or not, and whether my fix is applied or not, all threads stay
confined to core 11 even though cpus_allowed is 11,12-17.

> Say your application would be bound _only_ to isolated cpus. Should that
> imply no numa balancing at all?

Yes, it implies no numa balancing.

> > 2. If numabalancing was disabled, the task would never run on the
> > isolated CPUs.

> I am confused. I thought you said "However a task might call
> sched_setaffinity() that includes all possible cpus in the system
> including the isolated cpus." So the task is allowed to run there.
> Or am I missing something?

Peter, Rik and Ingo can correct me here, but I feel most programs that
call sched_setaffinity, including perf bench, are written with the
assumption that they are never run with isolcpus. If they were written
with isolcpus in mind, they would either have looked at sched_getaffinity
and modified the mask, or they would have explicitly looked up the
isolated cpu map and derived the mask from it.
Just because the program calls sched_setaffinity with all cpus, including
the isolcpus, we can't assume that the user wanted us to do scheduling on
top of the isolated CPUs. That would break isolcpus.

> > 3. With the faulty behaviour, it was observed that tasks scheduled on
> > the isolated cpus might end up taking more time, because they never get
> > a chance to move back to a node which has local memory.

> I am not sure I understand.

Let's say we ran a 48-thread perf bench (2 process/12 threads each) on a
24-cpu, 2-node, 12-cores-per-node system with 4 isolated cpus, all of
them in one node. The assumption here is that each process would
eventually consolidate on one node. However, let's say one of the perf
threads was moved to an isolated cpu on the wrong node, i.e. it ends up
accessing remote memory. (Why does it end up on the wrong node? Because
when we compare numa faults, this thread's fault count may be low at
present; it should have moved back to the right node eventually.)
However, once it gets scheduled on the isolated cpus, it won't be picked
for moving out, so it ends up taking far more time to finish. I have
observed this in practice.

> > 4. The isolated cpus may be idle at that point, but actual work may be
> > scheduled on isolcpus later (when numabalancing had already scheduled
> > work on to it). Since the scheduler doesn't do any balancing on
> > isolcpus even if they are overloaded and the rest of the system is
> > completely free, the isolcpus stay overloaded.

> Please note that I do not claim the patch is wrong. I am still not sure
> myself, but the changelog is missing the most important information: "why
> the change is the right thing".

I am open to editing the changelog. I assumed that the isolcpus kernel
parameter made it clear that no scheduling algorithm may interfere with
isolcpus. Would stating this in the changelog clarify to you that this
change is right?