All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eduardo Habkost <ehabkost@redhat.com>
To: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	Alexey Kardashevskiy <aik@ozlabs.ru>,
	Hu Tao <hutao@cn.fujitsu.com>,
	qemu-devel@nongnu.org, Anton Blanchard <anton@samba.org>,
	David Rientjes <rientjes@google.com>,
	Igor Mammedov <imammedo@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH v3] numa: enable sparse node numbering
Date: Thu, 26 Jun 2014 15:39:32 -0300	[thread overview]
Message-ID: <20140626183932.GI3222@otherpad.lan.raisama.net> (raw)
In-Reply-To: <20140626175807.GM4323@linux.vnet.ibm.com>

On Thu, Jun 26, 2014 at 10:58:07AM -0700, Nishanth Aravamudan wrote:
> On 26.06.2014 [17:09:25 +0800], Hu Tao wrote:
> > On Wed, Jun 25, 2014 at 09:23:17PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jun 25, 2014 at 01:52:56PM -0300, Eduardo Habkost wrote:
> > > > On Wed, Jun 25, 2014 at 09:13:59AM -0700, Nishanth Aravamudan wrote:
> > > > > On 25.06.2014 [13:21:34 +0200], Igor Mammedov wrote:
> > > > > > On Tue, 24 Jun 2014 10:40:38 -0700
> > > > > > Nishanth Aravamudan <nacc@linux.vnet.ibm.com> wrote:
> > > > > <snip>
> > > > > > > diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> > > > > > > index 277230d..b90bf66 100644
> > > > > > > --- a/include/sysemu/sysemu.h
> > > > > > > +++ b/include/sysemu/sysemu.h
> > > > > > > @@ -145,11 +145,13 @@ extern int mem_prealloc;
> > > > > > >   */
> > > > > > >  #define MAX_CPUMASK_BITS 255
> > > > > > >  
> > > > > > > -extern int nb_numa_nodes;
> > > > > > > +extern int nb_numa_nodes; /* Number of NUMA nodes */
> > > > > > > +extern int max_numa_node; /* Highest specified NUMA node ID */
> > > > > > >  typedef struct node_info {
> > > > > > >      uint64_t node_mem;
> > > > > > >      DECLARE_BITMAP(node_cpu, MAX_CPUMASK_BITS);
> > > > > > >      struct HostMemoryBackend *node_memdev;
> > > > > > > +    bool present;
> > > > > > How about dropping 'present' and replacing array with a list
> > > > > > of only present nodes?
> > > > > 
> > > > > If that would be preferred, I can move to that. I assume a simple
> > > > > linked-list is fine. Does qemu provide any infrastructure for defining
> > > > > lists? I'll look through the source but any pointers would be helpful.
> > > > > 
> > > > > Generally speaking, sparse NUMA nodes aren't that common and when they
> > > > > exist, the gaps aren't large. But it does seem to make sense if we have
> > > > > sparse IDs at all, we might as well move to a list.
> > > > > 
> > > > > In any case, moving to the list means we'd have a nodeid as part of the
> > > > > structure instead.
> > > > > 
> > > > > > That way it will be one more step closer to converting numa
> > > > > > infrastructure to a set of QOM objects.
> > > > > 
> > > > > Sounds like a good idea to me. I'll respin the patch soon.
> > > > 
> > > > Having a list makes sense, the only difference is that keeping a sparse
> > > > array sorted is much easier than making a sorted list (because the ACPI
> > > > tables are nodeid-ordered). That's why I suggested keeping the array
> > > > initially.
> > > > 
> > > > Adding a "present" field to the array is a trivial and easy-to-review
> > > > change. Changing NodeInfo to use linked lists is a more complex change
> > > > that I wouldn't want to include after soft freeze.
> > > > 
> > > > In other words:
> > > >  * Having a list is better than a sparse array; but:
> > > >  * Having a small sparse array with the "present" field is better
> > > >    than broken sparse nodeid support (IMO).
> > > 
> > > I agree here. This patchset is still RFC but if that's
> > > the only issue I might apply it.
> > 
> > I don't think it is ready for 2.1 since it lacks arch-specific changes.
> > The patch itself only makes qemu be able to bring up guest with sparse
> > numa nodes, without arch-specific changes, guest sees only one node(at
> > least on x86).
> 
> Well, it does no harm or foul to have it, on it's own. In that, nothing
> is worse or better than it was. Although I would argue the qemu core is
> better than it was :) I have sent the powerpc enabling patch, but even
> with that, we need Alexey's series of 6 patches (which do fix a
> regression) to fully support the sparse numbering. Maybe that means I've
> convinced myself it's best to wait until after 2.1.
> 
> I honestly am not sure how to "fix" the x86 code to support the sparse
> numbering -- the APIC code seems to rely on it being continuous. From
> some googling, it seems like the most common reason for non-continuous
> numbering on x86 is faulty hardware. I will keep fiddling if no one else
> steps up, but I'm not an x86 NUMA expert :)

There are two ACPI table building methods: the fw_cfg-based (where
SeaBIOS builds the SRAT table), and the acpi-build.c code.

The fw_cfg-based method assumes contigous nodeids: it builds a table
with nb_numa_nodes entries, carrying only the memory size of each node.

The acpi-build.c code could easily use the "present" flag, and doesn't
seem to require contiguous nodeids. The ACPI specs don't seem to have
any requirement about contiguous proximity domain IDs, but I am not sure
about how guest OSes would react.

acpi-build.c has some confusing code to make sure the table has at least
nb_numa_nodes+2 entries (adding disabled entries at the end of the table
if necessary), but I don't understand why. Maybe it is just for
compatibility with SeaBIOS versions that calculated the SRAT table size
based on nb_numa_nodes beforehand.

But we need to this carefully, as changing the behavior again later
would require compatibility code, and would make things worse.  I don't
even know if guest OSes would properly handle non-contiguous node IDs.

In either case, QEMU is currently broken when nodeids are not contiguous
or nodeids are repeated in the command-line. So I suggest we do this on
QEMU 2.1:

 * Add the "present" field (required to allow us to detect
   command-lines with missing nodeids);
 * Reject the command-line if a nodeid is specified twice;
 * Make the PC initialization code reject the configuration if some
   nodeid is not present. This is a bug fix, not a regression, as
   non-contiguous nodeids were never supported.

This way, we are not introducing new behavior to keep compatibility in
the future, but we are fixing the bug where QEMU doesn't complain about
non-contiguous nodeids. And after 2.1, we can make PC support
non-contiguous nodeids if we agree it is useful.

-- 
Eduardo

  reply	other threads:[~2014-06-26 18:39 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-23 19:33 [Qemu-devel] [RFC PATCH] numa: enable sparse node numbering Nishanth Aravamudan
2014-06-24  0:48 ` [Qemu-devel] [RFC PATCH v2] " Nishanth Aravamudan
2014-06-24  1:08   ` [Qemu-devel] [RFC PATCH] ppc/spapr: support sparse NUMA " Nishanth Aravamudan
2014-06-24  3:34     ` Nishanth Aravamudan
2014-06-24  7:43   ` [Qemu-devel] [RFC PATCH v2] numa: enable sparse " Hu Tao
2014-06-24 17:40   ` [Qemu-devel] [RFC PATCH v3] " Nishanth Aravamudan
2014-06-25 11:21     ` Igor Mammedov
2014-06-25 16:13       ` Nishanth Aravamudan
2014-06-25 16:52         ` Eduardo Habkost
2014-06-25 17:04           ` Nishanth Aravamudan
2014-06-25 18:24             ` Michael S. Tsirkin
2014-06-26  6:29             ` Igor Mammedov
2014-06-25 18:23           ` Michael S. Tsirkin
2014-06-26  9:09             ` Hu Tao
2014-06-26 17:58               ` Nishanth Aravamudan
2014-06-26 18:39                 ` Eduardo Habkost [this message]
2014-06-26 19:37     ` Eduardo Habkost
2014-06-30 18:26       ` Nishanth Aravamudan
2014-06-30 21:48         ` Eduardo Habkost
2014-06-24  7:33 ` [Qemu-devel] [RFC PATCH] " Igor Mammedov
2014-06-24  8:58   ` Alexey Kardashevskiy
2014-06-24 16:39   ` Nishanth Aravamudan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140626183932.GI3222@otherpad.lan.raisama.net \
    --to=ehabkost@redhat.com \
    --cc=aik@ozlabs.ru \
    --cc=anton@samba.org \
    --cc=hutao@cn.fujitsu.com \
    --cc=imammedo@redhat.com \
    --cc=mst@redhat.com \
    --cc=nacc@linux.vnet.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.