Date: Mon, 15 Aug 2016 13:19:08 +0200
From: Heiko Carstens <heiko.carstens@de.ibm.com>
To: Ming Lei, Tejun Heo
Cc: Thomas Gleixner, Peter Zijlstra, LKML, Yasuaki Ishimatsu,
	Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky
Subject: Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
References: <20160727125412.GB3912@osiris> <20160730112552.GA3744@osiris>
Message-Id: <20160815111908.GA3903@osiris>

On Mon, Aug 08, 2016 at 03:45:05PM +0800, Ming Lei wrote:
> On Sat, Jul 30, 2016 at 7:25 PM, Heiko Carstens wrote:
> > On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
> >> On Wed, 27 Jul 2016, Heiko Carstens wrote:
> >> > [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
> >> > [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
> >> > [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
> >> > [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
> >> >
> >> > For some unknown reason select_task_rq() gets called with a task that has
> >> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> >> > within select_task_rq() will set cpu to nr_cpu_ids, which in turn causes
> >> > the warning later on.
> >> >
> >> > It only happens with more than one node, otherwise it seems to work fine.
> >> >
> >> > Any idea what could be wrong here?
> >>
> >> create_worker()
> >>   tsk = kthread_create_on_node();
> >>   kthread_bind_mask(tsk, pool->attrs->cpumask);
> >>     do_set_cpus_allowed(tsk, mask);
> >>       set_cpus_allowed_common(tsk, mask);
> >>         cpumask_copy(&tsk->cpus_allowed, mask);
> >>         tsk->nr_cpus_allowed = cpumask_weight(mask);
> >>   wake_up_process(task);
> >>
> >> So this looks like pool->attrs->cpumask is simply empty.....
> >
> > Just had some time to look into this a bit more. Looks like we initialize
> > the cpu_to_node masks (way) too late on s390 for fake NUMA. So Peter's
> > patch just revealed that problem.
> >
> > I'll see if initializing the masks earlier will fix this, but I think it
> > will.
>
> Hello,
>
> Is there any fix for this issue? I can see the issue on arm64 running
> a v4.7 kernel too.
> And the oops can be avoided by reverting commit e9d867a ("sched: Allow
> per-cpu kernel threads to run on online && !active").

I don't know about the arm64 issue. The s390 problem results from
initializing the cpu_to_node mapping too late.

However, the workqueue code seems to assume that we know the cpu_to_node
mapping for all _possible_ cpus very early, and apparently it also assumes
that this mapping is stable and never changes. That assumption contradicts
the purpose of 346404682434 ("numa, cpu hotplug: change links of CPU and
node when changing node number by onlining CPU"). So something is wrong
here...

On s390 with fake NUMA we wouldn't even know the mapping of all _possible_
cpus at boot time. When establishing the node mapping we try hard to map
our existing cpu topology into a sane node mapping. However, we simply
don't know where non-present cpus are located topology-wise.

Even for present cpus the answer is not always available, since present
cpus can be in either the state "configured" (topology location known, cpu
online possible) or "deconfigured" (topology location unknown, cpu online
not possible).

I can imagine several ways to fix this for s390, but before doing that I'm
wondering if the workqueue code is correct in

a) assuming that the cpu_to_node() mapping is valid for all _possible_
   cpus that early, and

b) assuming that the cpu_to_node() mapping never changes.

Tejun?
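
For reference, the init-time caching I'm referring to looks roughly like
this (a simplified sketch of wq_numa_init() in kernel/workqueue.c from my
reading of v4.7; allocation and error reporting trimmed, so take it as an
approximation, not the literal code):

	static void __init wq_numa_init(void)
	{
		int node, cpu;

		if (num_possible_nodes() <= 1)
			return;

		/*
		 * The per-node cpumasks are computed exactly once, at
		 * init time, from cpu_to_node() for every _possible_
		 * cpu, and are never rebuilt afterwards.
		 */
		for_each_possible_cpu(cpu) {
			node = cpu_to_node(cpu);
			if (WARN_ON(node == NUMA_NO_NODE))
				return;	/* wq NUMA support stays off */
			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
		}
	}

If cpu_to_node() gives a different answer later on, e.g. once we find out
where a formerly deconfigured cpu actually lives, these cached masks are
stale, and an unbound pool whose attrs->cpumask is derived from them can
end up with an empty cpumask - which is exactly the nr_cpus_allowed == 0
case from the trace above.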