Date: Mon, 15 Aug 2016 13:19:08 +0200
From: Heiko Carstens <heiko.carstens@de.ibm.com>
To: Ming Lei, Tejun Heo
Cc: Thomas Gleixner, Peter Zijlstra, LKML, Yasuaki Ishimatsu,
	Andrew Morton, Lai Jiangshan, Michael Holzheu, Martin Schwidefsky
Subject: Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning
References: <20160727125412.GB3912@osiris> <20160730112552.GA3744@osiris>
Message-Id: <20160815111908.GA3903@osiris>

On Mon, Aug 08, 2016 at 03:45:05PM +0800, Ming Lei wrote:
> On Sat, Jul 30, 2016 at 7:25 PM, Heiko Carstens wrote:
> > On Wed, Jul 27, 2016 at 05:23:05PM +0200, Thomas Gleixner wrote:
> >> On Wed, 27 Jul 2016, Heiko Carstens wrote:
> >> > [    3.162961] ([<0000000000176c30>] select_task_rq+0xc0/0x1a8)
> >> > [    3.162963] ([<0000000000177d64>] try_to_wake_up+0x2e4/0x478)
> >> > [    3.162968] ([<000000000015d46c>] create_worker+0x174/0x1c0)
> >> > [    3.162971] ([<0000000000161a98>] alloc_unbound_pwq+0x360/0x438)
> >> >
> >> > For some unknown reason select_task_rq() gets called with a task that has
> >> > nr_cpus_allowed == 0. Hence "cpu = cpumask_any(tsk_cpus_allowed(p));"
> >> > within select_task_rq() will set cpu to nr_cpu_ids, which in turn causes
> >> > the warning later on.
> >> >
> >> > It only happens with more than one node, otherwise it seems to work fine.
> >> >
> >> > Any idea what could be wrong here?
> >>
> >> create_worker()
> >>   tsk = kthread_create_on_node();
> >>   kthread_bind_mask(tsk, pool->attrs->cpumask);
> >>     do_set_cpus_allowed(tsk, mask);
> >>       set_cpus_allowed_common(tsk, mask);
> >>         cpumask_copy(&tsk->cpus_allowed, mask);
> >>         tsk->nr_cpus_allowed = cpumask_weight(mask);
> >>   wake_up_process(task);
> >>
> >> So this looks like pool->attrs->cpumask is simply empty.....
> >
> > Just had some time to look into this a bit more. Looks like we initialize
> > the cpu_to_node masks (way) too late on s390 for fake NUMA. So Peter's
> > patch just revealed that problem.
> >
> > I'll see if initializing the masks earlier will fix this, but I think it
> > will.
>
> Hello,
>
> Is there any fix for this issue? I can see the issue on arm64 running
> a v4.7 kernel too.
> And the oops can be avoided by reverting commit e9d867a ("sched: Allow
> per-cpu kernel threads to run on online && !active").

I don't know about the arm64 issue. The s390 problem results from
initializing the cpu_to_node mapping too late.

However, the workqueue code seems to assume that we know the cpu_to_node
mapping for all _possible_ cpus very early, and apparently it also assumes
that this mapping is stable and never changes. That assumption contradicts
the purpose of 346404682434 ("numa, cpu hotplug: change links of CPU and
node when changing node number by onlining CPU"). So something is wrong
here...

On s390 with fake NUMA we wouldn't even know the mapping of all _possible_
cpus at boot time. When establishing the node mapping we try hard to map
our existing cpu topology into a sane node mapping. However, we simply
don't know where non-present cpus are located topology-wise.

Even for present cpus the answer is not always available, since present
cpus can be in either the state "configured" (topology location known, cpu
online possible) or "deconfigured" (topology location unknown, cpu online
not possible).

I can imagine several ways to fix this for s390, but before doing that I'm
wondering if the workqueue code is correct in

a) assuming that the cpu_to_node() mapping is valid for all _possible_
   cpus that early, and

b) assuming that the cpu_to_node() mapping never changes.

Tejun?
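
For reference, the init-time caching I'm referring to looks roughly like
this (a simplified sketch of wq_numa_init() in kernel/workqueue.c from my
reading of v4.7; allocation and error reporting trimmed, so take it as an
approximation, not the literal code):

	static void __init wq_numa_init(void)
	{
		int node, cpu;

		if (num_possible_nodes() <= 1)
			return;

		/*
		 * The per-node cpumasks are computed exactly once, at
		 * init time, from cpu_to_node() for every _possible_
		 * cpu, and are never rebuilt afterwards.
		 */
		for_each_possible_cpu(cpu) {
			node = cpu_to_node(cpu);
			if (WARN_ON(node == NUMA_NO_NODE))
				return;	/* wq NUMA support stays off */
			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
		}
	}

If cpu_to_node() gives a different answer later on, e.g. once we find out
where a formerly deconfigured cpu actually lives, these cached masks are
stale, and an unbound pool whose attrs->cpumask is derived from them can
end up with an empty cpumask - which is exactly the nr_cpus_allowed == 0
case from the trace above.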