Date: Tue, 16 Aug 2016 09:55:05 +0200
From: Heiko Carstens
To: Tejun Heo
Cc: Ming Lei, Thomas Gleixner, Peter Zijlstra, LKML, Yasuaki Ishimatsu,
    Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky
Subject: Re: [bisected] "sched: Allow per-cpu kernel threads to run on
    online && !active" causes warning
Message-Id: <20160816075505.GB3896@osiris>
In-Reply-To: <20160815224801.GA3672@mtj.duckdns.org>
References: <20160727125412.GB3912@osiris> <20160730112552.GA3744@osiris>
    <20160815111908.GA3903@osiris> <20160815224801.GA3672@mtj.duckdns.org>

Hi Tejun,

> On Mon, Aug 15, 2016 at 01:19:08PM +0200, Heiko Carstens wrote:
> > I can imagine several ways to
> > fix this for s390, but before doing that I'm
> > wondering if the workqueue code is correct with
> >
> > a) assuming that the cpu_to_node() mapping is valid for all _possible_ cpus
> >    that early
>
> This can be debatable and making it "first registration sticks" is
> likely easy enough.
>
> > and
> >
> > b) that the cpu_to_node() mapping never changes
>
> However, this part isn't just from workqueue.  It just hits in a more
> obvious way.  For example, memory allocation has the same problem and
> we would have to synchronize memory allocations against cpu <-> node
> mapping changing.  It'd be silly to add the complexity and overhead of
> making the mapping dynamic when there's nothing inherently
> dynamic about it.  The surface area is pretty big here.
>
> I have no idea how s390 fakenuma works.  Is that very different from
> x86's?  IIRC, x86's fakenuma isn't all that dynamic.

I'm not asking to make the cpu <-> node mapping completely dynamic. We
already have code in place to keep the cpu <-> node mapping static;
currently this happens too late, but that can be fixed quite easily.

Unfortunately we do not always know to which node a cpu belongs when we
register it. Currently all cpus are registered to node 0, and this is
corrected only when a cpu is brought online.

The problem is "standby" cpus on s390: we know they are present, but we
cannot use them yet. The mechanism is the following:

We detect a standby cpu and register it via register_cpu(). Since the
node isn't known yet for this cpu, cpu_to_node() returns 0, so all
standby cpus are registered under node 0.

The new standby cpu has a "configure" sysfs attribute. If somebody
writes "1" to it, we signal the hypervisor that we want to use the cpu,
and it allocates one. If this request succeeds we finally know where the
cpu is located topology-wise and can fix up everything (and can also
make the cpu to node mapping static).
Note: as long as a cpu isn't configured it cannot be brought online.

If the cpu is then finally brought online, the change_cpu_under_node()
code within drivers/base/cpu.c fixes up the node symlinks, so at least
the sysfs representation is also correct.

If the cpu is later brought offline, deconfigured, etc., we do not
change the cpu_to_node mapping anymore.

So the question is how to define "first registration sticks". :)