From: Eduardo Habkost
Date: Tue, 17 Jun 2014 11:07:00 -0300
Message-ID: <20140617140700.GG3222@otherpad.lan.raisama.net>
In-Reply-To: <539FD767.2020905@ozlabs.ru>
References: <1402905233-26510-1-git-send-email-aik@ozlabs.ru> <539EA7DD.8040306@ozlabs.ru> <20140616205150.GD8629@otherpad.lan.raisama.net> <539FD767.2020905@ozlabs.ru>
Subject: Re: [Qemu-devel] [PATCH 0/7] spapr: rework memory nodes
To: Alexey Kardashevskiy
Cc: Nishanth Aravamudan, qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Alexander Graf

On Tue, Jun 17, 2014 at 03:51:35PM +1000, Alexey Kardashevskiy wrote:
> On 06/17/2014 06:51 AM, Eduardo Habkost wrote:
> > On Mon, Jun 16, 2014 at 06:16:29PM +1000, Alexey Kardashevskiy wrote:
> >> On 06/16/2014 05:53 PM, Alexey Kardashevskiy wrote:
> >>> c4177479 "spapr: make sure RMA is in first mode of first memory node"
> >>> introduced a regression which prevents running guests with a memoryless
> >>> NUMA node#0, which may happen on real POWER8 boxes and which would make
> >>> sense to debug in QEMU.
> >>>
> >>> This patchset aims to fix that and also fix various code problems in
> >>> memory node generation.
> >>>
> >>> These 2 patches could be merged (the resulting patch looks rather ugly):
> >>>   spapr: Use DT memory node rendering helper for other nodes
> >>>   spapr: Move DT memory node rendering to a helper
> >>>
> >>> Please comment. Thanks!
> >>
> >> Sure enough, I forgot to add an example of what I am trying to run
> >> without errors and warnings:
> >>
> >> /home/aik/qemu-system-ppc64 \
> >>  -enable-kvm \
> >>  -machine pseries \
> >>  -nographic \
> >>  -vga none \
> >>  -drive id=id0,if=none,file=virtimg/fc20_24GB.qcow2,format=qcow2 \
> >>  -device scsi-disk,id=id1,drive=id0 \
> >>  -m 2080 \
> >>  -smp 8 \
> >>  -numa node,nodeid=0,cpus=0-7,mem=0 \
> >>  -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >>  -numa node,nodeid=4,cpus=4-7,mem=1040
> >
> > (Note: I will ignore the "cpus" argument for the discussion below.)
>
> The example is quite bad; I should not have used the same CPUs in two
> nodes. SPAPR allows this, but QEMU does not really support it, and I am
> not touching that now.
>
> > I understand now that the non-contiguous node IDs are guest-visible.
> >
> > But I still would like to understand the motivations for your use case,
> > to understand which solution makes more sense.
>
> One example is two CPUs on one die, where one of the CPUs is connected to
> the memory bus and the other is not; instead, it is connected to the
> first CPU (via a super fast bus), and the first CPU acts as a bridge.
>
> > If you really want 5 nodes, you just need to write this:
> >  -numa node,nodeid=0,cpus=0-7,mem=0 \
> >  -numa node,nodeid=1 \
> >  -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >  -numa node,nodeid=3 \
> >  -numa node,nodeid=4,cpus=4-7,mem=1040
> >
> > If you just want 3 nodes, you can just write this:
> >  -numa node,nodeid=0,cpus=0-7,mem=0 \
> >  -numa node,nodeid=1,cpus=0-3,mem=1040 \
> >  -numa node,nodeid=4,cpus=4-7,mem=1040
> >
> > But you seem to claim you need 3 nodes with non-contiguous IDs. In that
> > case, what exactly is the guest-visible difference you expect to get
> > between:
> >  -numa node,nodeid=0,cpus=0-7,mem=0 \
> >  -numa node,nodeid=1 \
> >  -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >  -numa node,nodeid=3 \
> >  -numa node,nodeid=4,cpus=4-7,mem=1040
> > and
> >  -numa node,nodeid=0,cpus=0-7,mem=0 \
> >  -numa node,nodeid=2,cpus=0-3,mem=1040 \
> >  -numa node,nodeid=4,cpus=4-7,mem=1040
> > ?
> >
> > Because your patch makes both exactly the same, and I guess you don't
> > want that (otherwise you could simply use the 5-node command line above
> > and we wouldn't need patch 7/7).
>
> If that is the canonical and kosher way of using NUMA in QEMU, OK, we can
> use it. I just fail to see why we need a requirement for node IDs to be
> consecutive here. And it confuses me as a user a bit that I can add
> "-numa node,nodeid=22" (no memory, no CPUs) but do not get to see it in
> the guest.

I agree with you that it is confusing. But before we support that use
case, we need to make sure auto-allocation is handled properly, because
it would be hard to fix it later without breaking compatibility.

We probably just need a "present" field on struct NodeInfo, so
machine-specific code and auto-allocation code can differentiate nodes
that are not present on the command line from empty nodes that were
specified on the command line.

In the meantime, people can use the 5-node example above as a workaround.

> btw, how is it supposed to work with memory hotplug? The current "-numa"
> does not support gaps in memory, and I would expect that we will need
> them. Any plans here?

The DIMM device used for memory hotplug has a "node" property for the
NUMA node ID.

-- 
Eduardo
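
[For illustration, a minimal sketch of what such a hotplug could look
like from the QEMU monitor, assuming the memory-backend-ram object and
pc-dimm device from the memory hotplug series that was in flight at the
time; the exact names and options here are an assumption, not something
this thread confirms:

  (qemu) object_add memory-backend-ram,id=mem1,size=1G
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1,node=2

The "node" property on the DIMM device is what assigns the hotplugged
memory to a specific guest NUMA node.]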