Re: [PATCH v2 2/2] migration: savevm_state_handler_insert: constant-time element insertion

From: Michael Roth <mdroth@linux.vnet.ibm.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Laurent Vivier <lvivier@redhat.com>
Cc: david@gibson.dropbear.id.au,
	Scott Cheloha <cheloha@linux.vnet.ibm.com>,
	qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>
Subject: Re: [PATCH v2 2/2] migration: savevm_state_handler_insert: constant-time element insertion
Date: Fri, 18 Oct 2019 11:38:37 -0500
Message-ID: <157141671749.15348.15966144834012002565@sif>
In-Reply-To: <20191018094352.GC2990@work-vm>

Quoting Dr. David Alan Gilbert (2019-10-18 04:43:52)
> * Laurent Vivier (lvivier@redhat.com) wrote:
> > On 18/10/2019 10:16, Dr. David Alan Gilbert wrote:
> > > * Scott Cheloha (cheloha@linux.vnet.ibm.com) wrote:
> > >> savevm_state's SaveStateEntry TAILQ is a priority queue.  Priority
> > >> sorting is maintained by searching from head to tail for a suitable
> > >> insertion spot.  Insertion is thus an O(n) operation.
> > >>
> > >> If we instead keep track of the head of each priority's subqueue
> > >> within that larger queue we can reduce this operation to O(1) time.
> > >>
> > >> savevm_state_handler_remove() becomes slightly more complex to
> > >> accommodate these gains: we need to replace the head of a priority's
> > >> subqueue when removing it.
> > >>
> > >> With O(1) insertion, booting VMs with many SaveStateEntry objects is
> > >> more plausible.  For example, a ppc64 VM with maxmem=8T has 40000 such
> > >> objects to insert.
> > > 
> > > Separate from reviewing this patch, I'd like to understand why you've
> > > got 40000 objects.  This feels very very wrong and is likely to cause
> > > problems to random other bits of qemu as well.
> > 
> > I think the 40000 objects are the "dr-connectors" that are used to plug
> > peripherals (memory, pci card, cpus, ...).
> 
> Yes, Scott confirmed that in the reply to the previous version.
> IMHO nothing in qemu is designed to deal with that many devices/objects
> - I'm sure that something other than the migration code is going to get upset.

The device/object management aspect seems to handle things *mostly* okay, at
least ever since QOM child properties started being tracked by a hash table
instead of a linked list. It's worth noting that that change (b604a854) was
done to better handle IRQ pins for ARM guests with lots of CPUs. I think it is
inevitable that certain machine types/configurations will call for large
numbers of objects and I think it is fair to improve things to allow for this
sort of scalability.

But I agree it shouldn't be abused, and you're right that there are some
problem areas that arise. Trying to outline them:

 a) introspection commands like 'info qom-tree' become pretty unwieldy,
    and with large enough numbers of objects might even break things (QMP
    response size limits maybe?)
 b) various related lists like reset handlers, vmstate/savevm handlers might
    grow quite large

I think we could work around a) by flagging certain "internal-only"
objects as 'hidden'. Introspection routines could then filter these
out, and routines like qom-set/qom-get could return something similar
to EACCES so they are never used/useful to management tools.
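
As a rough illustration of the 'hidden' idea, here is a minimal
standalone sketch. Everything in it is hypothetical: the Obj type, the
hidden field, and the list_visible()/get_value() helpers are made up
for this example and are not part of the QOM API. It only shows the
two behaviours described above: listing skips hidden objects, and
direct access to a hidden object fails with -EACCES.

```c
/* Sketch only: hypothetical types and names, not the QOM API. */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct Obj {
    const char *path;
    int value;
    bool hidden;            /* hypothetical "internal-only" marker */
} Obj;

/* Store the paths of all non-hidden objects in out[]; return how many. */
static size_t list_visible(const Obj *objs, size_t n, const char **out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (!objs[i].hidden) {
            out[count++] = objs[i].path;
        }
    }
    return count;
}

/*
 * Read an object's value by path.
 * Returns 0 on success, -ENOENT if no such path exists, and -EACCES
 * if the object exists but is marked hidden.
 */
static int get_value(const Obj *objs, size_t n, const char *path, int *value)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(objs[i].path, path) == 0) {
            if (objs[i].hidden) {
                return -EACCES;
            }
            *value = objs[i].value;
            return 0;
        }
    }
    return -ENOENT;
}
```

Whether a hidden object should report -EACCES or simply look
nonexistent (-ENOENT) to management tools is a design choice this
sketch does not try to settle.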

In cases like b) we can optimize things where it makes sense like with
Scott's patch here. In most cases these lists need to be walked one way
or another, whether it's done internally by the object or through common
interfaces provided by QEMU. It's really just the O(n^2) type handling
where relying on common interfaces becomes drastically less efficient,
but I think we should avoid implementing things in that way anyway, or
improve them as needed.
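
For reference, the general technique (tracking the head of each
priority's subqueue so insertion no longer scans the list) can be
sketched as below. This is a simplified standalone illustration, not
the patch itself: it uses TAILQ from <sys/queue.h> rather than QEMU's
QTAILQ, and Entry, Handlers, PRI_MAX, and the handlers_*() functions
are names invented for the example.

```c
/* Sketch only: hypothetical names, not QEMU's actual code. */
#include <stddef.h>
#include <sys/queue.h>

#define PRI_MAX 3                       /* hypothetical highest priority */

typedef struct Entry {
    int priority;                       /* 0..PRI_MAX, higher sorts first */
    TAILQ_ENTRY(Entry) link;
} Entry;

typedef struct Handlers {
    TAILQ_HEAD(EntryList, Entry) list;  /* kept in descending priority */
    Entry *pri_head[PRI_MAX + 1];       /* first entry of each priority */
} Handlers;

static void handlers_init(Handlers *h)
{
    TAILQ_INIT(&h->list);
    for (int i = 0; i <= PRI_MAX; i++) {
        h->pri_head[i] = NULL;
    }
}

static void handlers_insert(Handlers *h, Entry *ne)
{
    Entry *lower = NULL;

    /*
     * Find the head of the nearest non-empty lower-priority subqueue.
     * This loop is bounded by PRI_MAX, not by the list length.
     */
    for (int i = ne->priority - 1; i >= 0 && lower == NULL; i--) {
        lower = h->pri_head[i];
    }

    if (lower != NULL) {
        /* Lands at the end of ne's own priority subqueue. */
        TAILQ_INSERT_BEFORE(lower, ne, link);
    } else {
        /* Nothing of lower priority exists, so ne goes last. */
        TAILQ_INSERT_TAIL(&h->list, ne, link);
    }

    if (h->pri_head[ne->priority] == NULL) {
        h->pri_head[ne->priority] = ne;
    }
}

static void handlers_remove(Handlers *h, Entry *e)
{
    /* If e heads its subqueue, promote its successor or clear the slot. */
    if (h->pri_head[e->priority] == e) {
        Entry *next = TAILQ_NEXT(e, link);

        h->pri_head[e->priority] =
            (next != NULL && next->priority == e->priority) ? next : NULL;
    }
    TAILQ_REMOVE(&h->list, e, link);
}
```

Entries with equal priority keep their insertion order, and removal
pays the extra cost of fixing up the per-priority head when the
removed entry happens to be that head.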

> 
> Is perhaps the structure wrong somewhere - should there be a single DRC
> device that knows about all DRCs?

That's an interesting proposition, I think it's worth exploring further,
but from a high level:

 - each SpaprDrc has migration state, and some sub-classes of SpaprDrc
   (e.g. SpaprDrcPhysical) have additional migration state. These are
   sent as-needed as separate VMState entries in the migration stream.
   Moving to a single DRC means we're either sending them as a flat
   array or a sparse list, which would put just as much load on the
   migration code (at least, with Scott's changes in place). It would
   also be difficult to do all this in a way which maintains migration
   compatibility with older machine types.
 - other aspects of modeling these as QOM objects, such as look-ups,
   reset-handling, and memory allocations, wouldn't be dramatically
   improved upon by handling it all internally within the object

AFAICT the biggest issue with modeling the DRCs as individual objects
is actually how we deal with introspection, and we should try to
improve that. What do you think of the alternative suggestion above of
marking certain objects as 'hidden' from various introspection
interfaces?

> 
> Dave
> 
> 
> > https://github.com/qemu/qemu/blob/master/hw/ppc/spapr_drc.c
> > 
> > They are part of SPAPR specification.
> > 
> > https://raw.githubusercontent.com/qemu/qemu/master/docs/specs/ppc-spapr-hotplug.txt
> > 
> > CC Michael Roth
> > 
> > Thanks,
> > Laurent
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>