From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 19 Aug 2016 11:52:12 +0200
From: Michael Holzheu
To: Tejun Heo
Cc: Heiko Carstens, Peter Zijlstra, Ming Lei, Thomas Gleixner, LKML,
 Yasuaki Ishimatsu, Andrew Morton, Lai Jiangshan, Martin Schwidefsky
Subject: Re: [bisected] "sched: Allow per-cpu kernel threads to run on
 online && !active" causes warning
In-Reply-To: <20160818144208.GA3166@htj.duckdns.org>
References: <20160815111908.GA3903@osiris>
 <20160815224801.GA3672@mtj.duckdns.org>
 <20160816075505.GB3896@osiris>
 <20160816152027.GD9516@htj.duckdns.org>
 <20160816152949.GL30192@twins.programming.kicks-ass.net>
 <20160816154205.GE9516@htj.duckdns.org>
 <20160816221953.GA3373@osiris>
 <20160817135855.GH9516@htj.duckdns.org>
 <20160818113051.10cdab65@TP-holzheu>
 <20160818144208.GA3166@htj.duckdns.org>
Organization: IBM
Message-Id: <20160819115212.1c5eba20@TP-holzheu>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 18 Aug 2016 10:42:08 -0400, Tejun Heo wrote:

> Hello, Michael.
>
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" is not 100% correct. Currently we use
> > the CPU topology information to assign newly coming CPUs to the
> > "best fitting" node.
> >
> > Example:
> >
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> >    assignment:
> >
> >    - N1: cpu 1 on chip 1
> >    - N2: cpu 2 on chip 2
> >
> > 2) A new cpu 3 is configured that lives on chip 2
> > 3) We assign cpu 3 to N2
> >
> > We do this only if the nodes are balanced. If N2 had already one
> > more cpu than N1 we would assign the new cpu to N1.
>
> I see.  Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns.  Is it
> just to segment memory of a machine into multiple pieces?

Correct.

> If so, why
> is that necessary, do you hit some scalability issues w/o NUMA nodes?

Yes, we hit a scalability issue. Our performance team found that on
big (> 1 TB), overcommitted (memory / swap ratio > 1 : 2) systems we
see the following problems:

 - Zone locks are highly contended because ZONE_NORMAL is big:
   * zone->lock
   * zone->lru_lock
 - One kswapd is not enough for swapping

We hope that fake NUMA resolves those problems, because each node gets
a separate memory subsystem with its own zone locks and its own kswapd
thread.

> As for the solution, if blind RR isn't good enough, although it sounds
> like it could given that the balancing wasn't all that strong to begin
> with, would it be an option to implement an interface which just
> requests a new CPU rather than a specific one and then pick one of the
> vacant possible CPUs considering node balancing?

IMHO this is a promising idea.

To put it in my own words:

 - At boot time we already pin all remaining "not configured" logical
   CPUs to nodes. So all possible cpus are pinned to nodes and
   cpu_to_node() will work.
 - If a new physical cpu gets configured, we get the CPU topology
   information from the system and find the best node.
 - We get a logical cpu number from the node pool and assign the new
   physical cpu to that number.

If that works, we would be as good as before. We will look into the
code to see whether this is possible.

Michael
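
For illustration only, the three steps above can be modelled in a few
lines of user-space Python. This is a rough sketch under stated
assumptions, not kernel code: the class FakeNumaCpuPool, its methods,
and the round-robin pinning at boot are all hypothetical, and the "best
node" rule is just one reading of the balancing example quoted earlier
(prefer the node that already hosts the chip, unless that node would end
up ahead of a less loaded one).

```python
class FakeNumaCpuPool:
    """Hypothetical model of per-node pools of logical CPU numbers."""

    def __init__(self, nr_nodes, nr_possible_cpus):
        # Step 1: at "boot", pin every possible logical CPU to a node.
        # Round-robin pinning is an assumption made for this sketch.
        self.cpu_to_node = {cpu: cpu % nr_nodes
                            for cpu in range(nr_possible_cpus)}
        # Vacant (not yet configured) logical CPUs, grouped by node.
        self.vacant = {node: [] for node in range(nr_nodes)}
        for cpu, node in self.cpu_to_node.items():
            self.vacant[node].append(cpu)
        # Configured logical CPU -> chip of the physical CPU behind it.
        self.chip_of = {}

    def _chips_on(self, node):
        return {chip for cpu, chip in self.chip_of.items()
                if self.cpu_to_node[cpu] == node}

    def _load(self, node):
        return sum(1 for cpu in self.chip_of
                   if self.cpu_to_node[cpu] == node)

    def best_node(self, chip):
        # Step 2: use topology (here only the chip id) to pick a node.
        candidates = [n for n in sorted(self.vacant) if self.vacant[n]]
        if not candidates:
            raise RuntimeError("no vacant logical CPU left")
        min_load = min(self._load(n) for n in candidates)
        # Only the least-loaded nodes are eligible, to keep the balance.
        balanced = [n for n in candidates if self._load(n) == min_load]
        for node in balanced:
            if chip in self._chips_on(node):
                return node  # same chip already lives on this node
        return balanced[0]

    def configure(self, chip):
        # Step 3: take a logical CPU number from the chosen node's pool.
        node = self.best_node(chip)
        cpu = self.vacant[node].pop(0)
        self.chip_of[cpu] = chip
        return cpu, node


pool = FakeNumaCpuPool(nr_nodes=2, nr_possible_cpus=6)
for chip in (1, 2, 2, 2):
    print(pool.configure(chip))
```

With two nodes and physical CPUs arriving on chips 1, 2, 2, 2 in that
order, the third CPU joins the node that already hosts chip 2, and the
fourth goes to the other node because the first one would otherwise be
ahead. The logical CPU to node mapping never changes after boot, which
is the property that keeps cpu_to_node() valid for all possible CPUs.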