On 22/06/14 15:36, Shriram Rajagopalan wrote:
>
> On Jun 19, 2014 4:16 PM, "Andrew Cooper" wrote:
> >
> > On 19/06/14 11:23, Hongyang Yang wrote:
> > > On 06/19/2014 05:36 PM, Andrew Cooper wrote:
> > >> On 19/06/14 10:13, Hongyang Yang wrote:
> > >>> Hi Andrew, Ian,
> > >>>
> > >>> On 06/18/2014 02:04 AM, Andrew Cooper wrote:
> > >>>> On 17/06/14 17:40, Ian Campbell wrote:
> > >>>>> On Wed, 2014-06-11 at 19:14 +0100, Andrew Cooper wrote:
> > >>>>>> +The following features are not yet fully specified and will be
> > >>>>>> +included in a future draft.
> > >>>>>> +
> > >>>>>> +* Remus
> > >>>>> What is the plan for Remus here?
> > >>>>>
> > >>>>> It has pretty large implications for the flow of a migration
> > >>>>> stream and therefore on the code in the final two patches. I
> > >>>>> suspect it will require high level changes to those functions,
> > >>>>> so I'm reluctant to spend a lot of time on them as they are.
> > >>>>
> > >>>> I don't believe too much change will be required to the final two
> > >>>> patches, but it does depend on fixing the current qemu record layer
> > >>>> violations.
> > >>>>
> > >>>> It will be much easier to do after a prototype of the libxl level
> > >>>> fixes.
> > >>>
> > >>> I'm trying to port Remus to migration v2...
> > >>
> > >> Ah fantastic! Here I was expecting to eventually have to brave that
> > >> code myself.
> > >>
> > >> How is it going? How are you finding hacking on v2 compared to the
> > >> legacy code? (I think you are the first person who isn't me trying to
> > >> extend it.) Is there anything I can do while still developing v2 to
> > >> make things easier?
> > >
> > > It's just starting, and only on the libxc side, based on your patch
> > > series. The v2 code is cleaner than the legacy code, easy to
> > > understand, and yes, it makes hacking easier. Maybe I will need your
> > > help as the hacking goes on...
> > >
> > >>
> > >> I really need to get a prototype libxl framing document sorted, but
> > >> in principle my plan (given only a minimum understanding of the
> > >> algorithm) is this:
> > >>
> > >> ...
> > >> * Write page data update
> > >> * Write vcpu context etc
> > >> * Write a REMUS_CHECKPOINT record (or appropriate name)
> > >> * Call the checkpoint callback, passing ownership of the fd to libxl
> > >> ** libxl writes a libxl qemu record into the stream
> > >> * checkpoint callback returns to libxl, returning ownership of the fd
> > >> * libxc chooses between sending an END record or looping
> > >> ...
> > >>
> > >> The fd ownership is expected to work exactly the same on the
> > >> receiving side, using the REMUS_CHECKPOINT record as an indicator.
> > >
> > > It mostly looks plausible, but the save side and restore side need to
> > > be synchronised; otherwise the following problem may exist: the
> > > sending side is in libxl and sends qemu records while the receiving
> > > side is still in libxc, so after it switches to libxl, part of a
> > > record may be lost. Maybe a handshake will solve the problem, whether
> > > it's in libxl or libxc, but the current migration frame does not
> > > support sending messages from the receiving side to the sending side,
> > > so it needs modifications. We should support this feature.
> >
> > Ah yes I see.
> >
> > How about this?
> >
> > Libxc REMUS_CHECKPOINT is defined as a 0-length record (like the current
> > END record).
> > Libxl REMUS_CHECKPOINT is defined as containing at least a "last
> > checkpoint" bit in the header.
> >
> > Libxc writes a libxc REMUS_CHECKPOINT record into the stream and always
> > hands the fd to libxl.
> > Libxl then writes a libxl REMUS_CHECKPOINT record, including the last
> > checkpoint bit if needed.
> >
>
> I am a bit lost on this part. A silly question: the last I recall (a
> long time ago), the v2 format didn't allow for the page compression to
> be done asynchronously. Has this limitation changed?
>

The v2 format specifies records in a stream; nothing more. It has no
bearing on whether the page compression happens asynchronously wrt
unpausing the domain or not. I presume you actually mean the current
implementation...

> IOW, in the current migration process, the dirty page data is written
> out while the guest remains suspended. With remus, the compressed page
> data is written out after resuming the guest. This deferred write out
> logic needs to be incorporated into v2 code.
>

... which is the way it is because the first implementation was done
with regular basic migration as a top priority. This can certainly be
reworked when remus support is reintroduced.

> > This means that it is libxl on the receiving side which determines
> > whether the last checkpoint has been reached, and libxc must always pass
> > the fd up. This fixes the synchronisation issues, without requiring a
> > back channel, but still maintaining appropriate layering.
> >
>
> So there is a TODO item in the current libxl-remus patches. We need an
> explicit acknowledgement from the receiver side that it has gotten the
> memory checkpoint. Whether it is from libxc or libxl on the receiver
> side does not matter, as long as the ack signifies reception of the
> memory checkpoint.
> The need for an explicit memory ack is because the disk and memory
> checkpoint channels are independent.
> We need both acks before releasing the buffered network output on the
> receiver side.
> The disk channel (blktap2 or DRBD) has always sent an explicit ack,
> but not the memory channel. Though it's over TCP, on a given iteration,
> memory checkpoint data may still reside in the sender side socket
> buffer while the disk checkpoint has reached the other end -- which
> isn't good.
>
> Existing libxc code does a fdatasync or fsync on the fd at the end of
> each iteration. I don't think it works as intended on TCP sockets.
> Please correct me if I am wrong about this.
>

That is a very sensible need for an explicit ack, although it would seem
to make more sense at the libxl level rather than the libxc level.

~Andrew
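
For illustration only, the scheme discussed above (a zero-length libxc
checkpoint record, a libxl checkpoint record carrying a "last checkpoint"
flag, and an explicit ack from the receiver) might be sketched roughly as
follows. This is not the v2 specification or any existing libxc/libxl
code: the record type numbers, the 8-byte header, the one-byte ack, and
carrying the flag in the record body rather than in a header are all
assumptions made up for this sketch.

```python
import socket
import struct
import threading

# Hypothetical record types and header layout, invented for this sketch;
# the real migration v2 stream defines its own values and framing.
REC_PAGE_DATA = 1
REC_LIBXC_CHECKPOINT = 2    # zero-length, like the END record
REC_LIBXL_CHECKPOINT = 3    # carries the "last checkpoint" flag
FLAG_LAST_CHECKPOINT = 0x1
HDR = struct.Struct("<II")  # (record type, body length)
ACK = b"\x06"               # hypothetical one-byte acknowledgement


def send_record(sock, rec_type, body=b""):
    sock.sendall(HDR.pack(rec_type, len(body)) + body)


def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("stream closed mid-record")
        buf += chunk
    return buf


def recv_record(sock):
    rec_type, length = HDR.unpack(recv_exact(sock, HDR.size))
    return rec_type, recv_exact(sock, length)


def send_checkpoint(sock, pages, last):
    """Sender: page data, libxc marker, libxl marker, then wait for the ack."""
    send_record(sock, REC_PAGE_DATA, pages)
    send_record(sock, REC_LIBXC_CHECKPOINT)  # "libxc" hands the fd up here
    flags = FLAG_LAST_CHECKPOINT if last else 0
    send_record(sock, REC_LIBXL_CHECKPOINT, struct.pack("<I", flags))
    # sendall() returning only means the local kernel accepted the bytes,
    # so block until the peer says it has the whole checkpoint.
    if recv_exact(sock, 1) != ACK:
        raise RuntimeError("unexpected reply instead of ack")


def receive_checkpoint(sock):
    """Receiver: return (pages, last), acking once the checkpoint is read."""
    pages = b""
    while True:
        rec_type, body = recv_record(sock)
        if rec_type == REC_PAGE_DATA:
            pages += body
        elif rec_type == REC_LIBXC_CHECKPOINT:
            # "libxc" always passes the fd up; "libxl" reads its own record
            # and decides whether this was the last checkpoint.
            rec_type, body = recv_record(sock)
            if rec_type != REC_LIBXL_CHECKPOINT:
                raise ValueError("expected libxl checkpoint record")
            (flags,) = struct.unpack("<I", body)
            sock.sendall(ACK)
            return pages, bool(flags & FLAG_LAST_CHECKPOINT)
        else:
            raise ValueError("unknown record type %d" % rec_type)


def demo():
    """Run two checkpoints over a local socket pair."""
    sender, receiver = socket.socketpair()
    seen = []

    def receive_all():
        while True:
            pages, last = receive_checkpoint(receiver)
            seen.append((pages, last))
            if last:
                break

    worker = threading.Thread(target=receive_all)
    worker.start()
    send_checkpoint(sender, b"dirty-1", last=False)
    send_checkpoint(sender, b"dirty-2", last=True)
    worker.join()
    sender.close()
    receiver.close()
    return seen


if __name__ == "__main__":
    print(demo())
```

In this sketch the receiver never needs to send anything except the ack,
and the sender cannot start the next iteration until the ack arrives,
which is the property an fsync on a TCP socket would not provide.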