On 22/06/14 15:36, Shriram Rajagopalan wrote:
>
>
> On Jun 19, 2014 4:16 PM, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
> >
> > On 19/06/14 11:23, Hongyang Yang wrote:
> > > On 06/19/2014 05:36 PM, Andrew Cooper wrote:
> > >> On 19/06/14 10:13, Hongyang Yang wrote:
> > >>> Hi Andrew, Ian,
> > >>>
> > >>> On 06/18/2014 02:04 AM, Andrew Cooper wrote:
> > >>>> On 17/06/14 17:40, Ian Campbell wrote:
> > >>>>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> > >>>>>> +The following features are not yet fully specified and will be
> > >>>>>> +included in a future draft.
> > >>>>>> +
> > >>>>>> +* Remus
> > >>>>> What is the plan for Remus here?
> > >>>>>
> > >>>>> It has pretty large implications for the flow of a migration
> > >>>>> stream and
> > >>>>> therefore on the code in the final two patches, I suspect it will
> > >>>>> require high level changes to those functions, so I'm reluctant to
> > >>>>> spend
> > >>>>> a lot of time on them as they are.
> > >>>>
> > >>>> I don't believe too much change will be required to the final two
> > >>>> patches, but it does depend on fixing the current qemu record layer
> > >>>> violations.
> > >>>>
> > >>>> It will be much easier to do after a prototype to the libxl level
> > >>>> fixes.
> > >>>
> > >>> I'm trying to port Remus to migration v2...
> > >>
> > >> Ah fantastic! Here I was expecting to eventually have to brave that
> > >> code myself.
> > >>
> > >> How is it going?  How are you finding hacking on v2 compared to the
> > >> legacy code? (I think you are the first person who isn't me trying to
> > >> extend it)  Is there anything I can do while still developing v2 to
> > >> make things easier?
> > >
> > > It's just starting, and so far only on the libxc side, based on your
> > > patch series. The v2 code is cleaner than the legacy code, easier to
> > > understand, and yes, it makes hacking easier. I may need your help as
> > > the hacking goes on...
> > >
> > >>
> > >>
> > >> I really need to get a prototype libxl framing document sorted, but
> > >> in principle my plan (given only a minimum understanding of the
> > >> algorithm) is this:
> > >>
> > >> ...
> > >> * Write page data update
> > >> * Write vcpu context etc
> > >> * Write a REMUS_CHECKPOINT record (or appropriate name)
> > >> * Call the checkpoint callback, passing ownership of the fd to libxl
> > >> ** libxl writes a libxl qemu record into the stream
> > >> * checkpoint callback returns to libxc, returning ownership of the fd
> > >> * libxc chooses between sending an END record or looping
> > >> ...
> > >>
> > >> The fd ownership is expected to work exactly the same on the
> > >> receiving side, using the REMUS_CHECKPOINT record as an indicator.
> > >
> > > It mostly looks plausible, but the save side and restore side need to
> > > be synchronised; otherwise, the following problem may exist:
> > >   the sending side is in libxl and sends qemu records while the
> > >   receiving side is still in libxc; by the time it switches to libxl,
> > >   part of a record may be lost.
> > > Maybe a handshake would solve the problem, whether it's in libxl or
> > > libxc, but the current migration frame does not support sending
> > > messages from the receiving side to the sending side, so it would
> > > need modifications. We should support this feature.
> >
> > Ah yes I see.
> >
> > How about this?
> >
> > Libxc REMUS_CHECKPOINT is defined as a 0-length record (like the current
> > END record).
> > Libxl REMUS_CHECKPOINT is defined as containing at least a "last
> > checkpoint" bit in the header.
> >
> > Libxc writes a libxc REMUS_CHECKPOINT record into the stream and always
> > hands the fd to libxl.
> > Libxl then writes a libxl REMUS_CHECKPOINT record, including the last
> > checkpoint bit if needed.
> >
>
> I am a bit lost on this part. A silly question: the last I recall (a
> long time ago), the v2 format didn't allow for the page compression to
> be done asynchronously. Has this limitation changed?
>

The v2 format specifies records in a stream; nothing more.  It has no
bearing on whether the page compression happens asynchronously wrt
unpausing the domain or not.
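
To make "records in a stream" concrete, here is a minimal sketch of
reading and writing framed records. The layout is an assumption made for
illustration (a 32-bit type, a 32-bit body length, and the body padded
to 8 bytes, little-endian) and is not quoted from the draft. The type
numbers, `write_record` and `read_record` are hypothetical names.

```python
import io
import struct

# Hypothetical record type numbers, chosen only for this sketch.
REC_TYPE_END = 0x00000000
REC_TYPE_REMUS_CHECKPOINT = 0x0000FFFF  # placeholder, not from any spec

# Assumed header layout: type, body length; little-endian 32-bit fields.
HEADER = struct.Struct("<II")


def write_record(stream, rec_type, body=b""):
    """Write one record: header, body, then zero padding to 8 bytes."""
    stream.write(HEADER.pack(rec_type, len(body)))
    stream.write(body)
    stream.write(b"\0" * (-len(body) % 8))


def read_record(stream):
    """Read one record and return (type, body), skipping the padding."""
    header = stream.read(HEADER.size)
    if len(header) != HEADER.size:
        raise EOFError("truncated record header")
    rec_type, length = HEADER.unpack(header)
    padded = length + (-length % 8)
    data = stream.read(padded)
    if len(data) != padded:
        raise EOFError("truncated record body")
    return rec_type, data[:length]


if __name__ == "__main__":
    buf = io.BytesIO()
    write_record(buf, REC_TYPE_REMUS_CHECKPOINT)  # zero-length marker
    write_record(buf, REC_TYPE_END)
    buf.seek(0)
    print(read_record(buf), read_record(buf))
```

Nothing in the framing says when a record has to be produced relative to
unpausing the domain; that choice belongs to whichever code writes the
stream.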

I presume you actually mean the current implementation...

> IOW, in the current migration process, the dirty page data is written
> out while the guest remains suspended. With remus, the compressed page
> data is written out after resuming the guest. This deferred write out
> logic needs to be incorporated into v2 code.
>

... which is the way it is because the first implementation was done
with regular basic migration as a top priority.  This can certainly be
reworked when remus support is reintroduced.

> > This means that it is libxl on the receiving side which determines
> > whether the last checkpoint has been reached, and libxc must always pass
> > the fd up.  This fixes the synchronisation issues, without requiring a
> > back channel, but still maintaining appropriate layering.
> >
>
> So there is a TODO item in the current libxl-remus patches. We need an
> explicit acknowledgement from the receiver side that it has gotten the
> memory checkpoint. Whether it is from libxc or libxl on the receiver
> side does not matter, as long as the ack signifies reception of the
> memory checkpoint.
> The need for an explicit memory ack is because the disk and memory
> checkpoint channels are independent.
> We need both acks before releasing the buffered network output on the
> receiver side.
>   The disk channel (blktap2 or DRBD) has always sent an explicit ack.
> But not the memory channel. Though it's over TCP, on a given iteration,
> memory checkpoint data may still reside on the sender side socket
> buffer while the disk checkpoint has reached the other end -- which
> isn't good.
>
> Existing libxc code does a fdatasync or fsync on the fd at the end of
> each iteration. I don't think it works as intended on TCP sockets.
> Please correct me if I am wrong about this.
>

That is a very sensible need for an explicit ack, although it would seem
to make more sense at the libxl level rather than the libxc level.
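
As a rough illustration of what such an ack could look like, here is a
minimal sketch, not taken from the libxl-remus patches. The sender
blocks until the receiver confirms it has read the whole checkpoint, so
data still sitting in the sender's socket buffer cannot be mistaken for
a delivered checkpoint. The one-byte `ACK`, the 4-byte length prefix and
all function names are assumptions made for this example.

```python
import socket
import threading

ACK = b"\x06"  # hypothetical one-byte acknowledgement


def send_checkpoint(sock, payload):
    """Send a length-prefixed checkpoint, then block until it is acked."""
    sock.sendall(len(payload).to_bytes(4, "big") + payload)
    if sock.recv(1) != ACK:
        raise ConnectionError("checkpoint was not acknowledged")
    # Only at this point would buffered network output be released.
    return True


def recv_exact(sock, count):
    """Read exactly count bytes, or raise if the peer closes early."""
    chunks = bytearray()
    while len(chunks) < count:
        chunk = sock.recv(count - len(chunks))
        if not chunk:
            raise ConnectionError("peer closed mid-checkpoint")
        chunks += chunk
    return bytes(chunks)


def receive_checkpoint(sock):
    """Receive one checkpoint in full, then acknowledge it."""
    size = int.from_bytes(recv_exact(sock, 4), "big")
    payload = recv_exact(sock, size)
    sock.sendall(ACK)
    return payload


if __name__ == "__main__":
    sender, receiver = socket.socketpair()
    worker = threading.Thread(target=receive_checkpoint, args=(receiver,))
    worker.start()
    print(send_checkpoint(sender, b"memory checkpoint"))
    worker.join()
    sender.close()
    receiver.close()
```

The ack travels over the same connection in the reverse direction, so a
scheme like this only works if the stream is bidirectional.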

~Andrew