From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754569AbcHSJwZ (ORCPT <rfc822;w@1wt.eu>);
        Fri, 19 Aug 2016 05:52:25 -0400
Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:36655 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1754498AbcHSJwX (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Fri, 19 Aug 2016 05:52:23 -0400
Date: Fri, 19 Aug 2016 11:52:12 +0200
From: Michael Holzheu <holzheu@linux.vnet.ibm.com>
To: Tejun Heo <tj@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Ming Lei <tom.leiming@gmail.com>, Thomas Gleixner <tglx@linutronix.de>,
        LKML <linux-kernel@vger.kernel.org>,
        Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Lai Jiangshan <laijs@cn.fujitsu.com>,
        Martin Schwidefsky <schwidefsky@de.ibm.com>
Subject: Re: [bisected] "sched: Allow per-cpu kernel threads to run on
 online && !active" causes warning
In-Reply-To: <20160818144208.GA3166@htj.duckdns.org>
References: <CACVXFVNrMjk46pB_E=5fQP2njN8cntSKJ_BMnR-Z4ZmxsMpqyg@mail.gmail.com>
        <20160815111908.GA3903@osiris>
        <20160815224801.GA3672@mtj.duckdns.org>
        <20160816075505.GB3896@osiris>
        <20160816152027.GD9516@htj.duckdns.org>
        <20160816152949.GL30192@twins.programming.kicks-ass.net>
        <20160816154205.GE9516@htj.duckdns.org>
        <20160816221953.GA3373@osiris>
        <20160817135855.GH9516@htj.duckdns.org>
        <20160818113051.10cdab65@TP-holzheu>
        <20160818144208.GA3166@htj.duckdns.org>
Organization: IBM
X-Mailer: Claws Mail 3.9.3 (GTK+ 2.24.23; x86_64-pc-linux-gnu)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Message-Id: <20160819115212.1c5eba20@TP-holzheu>
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 18 Aug 2016 10:42:08 -0400,
Tejun Heo <tj@kernel.org> wrote:

> Hello, Michael.
> 
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" is not 100% correct. Currently we use
> > the CPU topology information to assign newly coming CPUs to the
> > "best fitting" node.
> > 
> > Example:
> > 
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> >    assignment:
> > 
> >    - N1: cpu 1 on chip 1
> >    - N2: cpu 2 on chip 2
> > 
> > 2) A new cpu 3 is configured that lives on chip 2
> > 3) We assign cpu 3 to N2
> > 
> > We do this only if the nodes are balanced. If N2 had already one
> > more cpu than N1 we would assign the new cpu to N1.
> 
> I see.  Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns.  Is it
> just to segment memory of a machine into multiple pieces?

Correct.

> If so, why
> is that necessary, do you hit some scalability issues w/o NUMA nodes?

Yes, we hit a scalability issue. Our performance team found that on
big (> 1 TB) overcommitted (memory / swap ratio > 1 : 2) systems we
see problems:

 - Zone locks are highly contended because ZONE_NORMAL is big:
   * zone->lock
   * zone->lru_lock
 - One kswapd is not enough for swapping

We hope that those problems are resolved by fake NUMA because for each
node a separate memory subsystem is created with separate zone locks
and kswapd threads.

> As for the solution, if blind RR isn't good enough, although it sounds
> like it could given that the balancing wasn't all that strong to begin
> with, would it be an option to implement an interface which just
> requests a new CPU rather than a specific one and then pick one of the
> vacant possible CPUs considering node balancing?

IMHO this is a promising idea. To put it in my own words:

 - At boot time we already pin all remaining "not configured" logical
   CPUs to nodes. So all possible cpus are pinned to nodes and
   cpu_to_node() will work.

 - If a new physical cpu gets configured, we get the CPU topology
   information from the system and find the best node.

 - We get a logical cpu number from the node pool and assign the
   new physical cpu to that number.

If that works, we would be as good as before. We will look into the
code to see whether this is possible.
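
To make the three steps concrete, here is a minimal userspace sketch.
It is only a model under stated assumptions, not the s390 code: all
names (cpu_to_node_map, find_best_node, alloc_logical_cpu,
configure_cpu), the sizes and the round-robin boot-time pinning are
hypothetical. "Best node" means the node that already holds the most
CPUs of the same chip, unless that would leave the nodes unbalanced,
in which case the least loaded node is used.

```c
#define NR_NODES 2
#define NR_CHIPS 2
#define NR_CPUS  8
#define NO_CPU   (-1)

/* Assumed boot-time pinning: every possible logical CPU has a node. */
static const int cpu_to_node_map[NR_CPUS] = { 0, 1, 0, 1, 0, 1, 0, 1 };

/* 1 if the logical CPU number is already backed by a physical CPU. */
static int cpu_configured[NR_CPUS];

struct node_state {
	int nr_cpus;              /* configured CPUs on this node */
	int chip_cpus[NR_CHIPS];  /* configured CPUs per chip on this node */
};

static struct node_state nodes[NR_NODES];

/* Step 2: use the topology (chip) to find the best fitting node. */
static int find_best_node(int chip)
{
	int best = 0, least = 0;
	int n;

	for (n = 1; n < NR_NODES; n++) {
		if (nodes[n].chip_cpus[chip] > nodes[best].chip_cpus[chip])
			best = n;
		if (nodes[n].nr_cpus < nodes[least].nr_cpus)
			least = n;
	}
	/* Follow the topology only while the nodes stay balanced. */
	if (nodes[best].chip_cpus[chip] == 0 ||
	    nodes[best].nr_cpus > nodes[least].nr_cpus)
		return least;
	return best;
}

/* Step 3: take a vacant logical CPU number from the node's pool. */
static int alloc_logical_cpu(int node)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu_to_node_map[cpu] == node && !cpu_configured[cpu]) {
			cpu_configured[cpu] = 1;
			return cpu;
		}
	}
	return NO_CPU;  /* pool of this node is exhausted */
}

/* A new physical CPU on 'chip' appears; return its logical number. */
int configure_cpu(int chip)
{
	int node, cpu;

	if (chip < 0 || chip >= NR_CHIPS)
		return NO_CPU;
	node = find_best_node(chip);
	cpu = alloc_logical_cpu(node);
	if (cpu == NO_CPU)
		return NO_CPU;
	nodes[node].nr_cpus++;
	nodes[node].chip_cpus[chip]++;
	return cpu;
}
```

In this model, configuring one CPU on chip 0 and then three CPUs on
chip 1 yields the logical CPUs 0, 1, 3 and 2: the second chip 1 CPU
follows its sibling onto node 1, and the third one goes to node 0
because node 1 already has one CPU more. The sketch simply fails when
the chosen node's pool is empty; a real implementation would probably
want to fall back to another node.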

Michael