From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:47944)
    by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1YTRyi-0001az-M0 for qemu-devel@nongnu.org;
    Thu, 05 Mar 2015 04:22:02 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from ) id 1YTRyd-0000Wo-Ve for qemu-devel@nongnu.org;
    Thu, 05 Mar 2015 04:22:00 -0500
Received: from mx1.redhat.com ([209.132.183.28]:60509)
    by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1YTRyd-0000Wk-KS for qemu-devel@nongnu.org;
    Thu, 05 Mar 2015 04:21:55 -0500
Date: Thu, 5 Mar 2015 09:21:39 +0000
From: "Dr. David Alan Gilbert"
Message-ID: <20150305092138.GA2381@work-vm>
References: <1424883128-9841-1-git-send-email-dgilbert@redhat.com>
    <1424883128-9841-2-git-send-email-dgilbert@redhat.com>
    <20150305032119.GK18072@voom.fritz.box>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150305032119.GK18072@voom.fritz.box>
Subject: Re: [Qemu-devel] [PATCH v5 01/45] Start documenting how postcopy works.
To: David Gibson
Cc: aarcange@redhat.com, yamahata@private.email.ne.jp, quintela@redhat.com,
    qemu-devel@nongnu.org, amit.shah@redhat.com, pbonzini@redhat.com,
    yanghy@cn.fujitsu.com

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Wed, Feb 25, 2015 at 04:51:24PM +0000, Dr. David Alan Gilbert (git) wrote:
> > From: "Dr. David Alan Gilbert"
> >
> > Signed-off-by: Dr. David Alan Gilbert
> > ---
> >  docs/migration.txt | 189 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 189 insertions(+)
> >
> > diff --git a/docs/migration.txt b/docs/migration.txt
> > index 0492a45..c6c3798 100644
> > --- a/docs/migration.txt
> > +++ b/docs/migration.txt
> > @@ -294,3 +294,192 @@ save/send this state when we are in the middle of a pio operation
> >  (that is what ide_drive_pio_state_needed() checks).
> >  If DRQ_STAT is
> >  not enabled, the values on that fields are garbage and don't need to
> >  be sent.
> > +
> > += Return path =
> > +
> > +In most migration scenarios there is only a single data path that runs
> > +from the source VM to the destination, typically along a single fd (although
> > +possibly with another fd or similar for some fast way of throwing pages across).
> > +
> > +However, some uses need two way communication; in particular the Postcopy destination
> > +needs to be able to request pages on demand from the source.
> > +
> > +For these scenarios there is a 'return path' from the destination to the source;
> > +qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
> > +path.
> > +
> > +  Source side
> > +     Forward path - written by migration thread
> > +     Return path  - opened by main thread, read by return-path thread
> > +
> > +  Destination side
> > +     Forward path - read by main thread
> > +     Return path  - opened by main thread, written by main thread AND postcopy
> > +                    thread (protected by rp_mutex)
> > +
> > += Postcopy =
> > +'Postcopy' migration is a way to deal with migrations that refuse to converge;
> > +its plus side is that there is an upper bound on the amount of migration traffic
> > +and time it takes, the down side is that during the postcopy phase, a failure of
> > +*either* side or the network connection causes the guest to be lost.
> > +
> > +In postcopy the destination CPUs are started before all the memory has been
> > +transferred, and accesses to pages that are yet to be transferred cause
> > +a fault that's translated by QEMU into a request to the source QEMU.
> > +
> > +Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
> > +doesn't finish in a given time the switch is made to postcopy.
> > +
> > +=== Enabling postcopy ===
> > +
> > +To enable postcopy (prior to the start of migration):
> > +
> > +migrate_set_capability x-postcopy-ram on
> > +
> > +The migration will still start in precopy mode, however issuing:
> > +
> > +migrate_start_postcopy
> > +
> > +will now cause the transition from precopy to postcopy.
> > +It can be issued immediately after migration is started or any
> > +time later on.  Issuing it after the end of a migration is harmless.
>
> It's not quite clear to me what this means.  Does
> "migrate_start_postcopy" mean it will immediately transfer execution
> and transfer any remaining pages postcopy, or does it just mean it
> will start postcopying once the remaining data to transfer is small
> enough?

The former; it will flip into postcopy soon after that command is issued,
irrespective of the amount of data remaining.

> What's the reason for this rather awkward two stage activation of
> postcopy?

We need to keep track of the pages that are received during the precopy phase,
and do some madvise and other setup on the destination RAM area before precopy
starts; so the destination needs to know that we might want to do postcopy,
and it needs to be told early.

In the earliest posted version of my patches I had a time-limit setting, and
after the time limit expired QEMU would switch into the second phase of
postcopy by itself, but Paolo suggested migrate_start_postcopy:

https://lists.nongnu.org/archive/html/qemu-devel/2014-07/msg00943.html

and it works out simpler anyway.

> > +=== Postcopy device transfer ===
> > +
> > +Loading of device data may cause the device emulation to access guest RAM
> > +that may trigger faults that have to be resolved by the source, as such
> > +the migration stream has to be able to respond with page data *during* the
> > +device load, and hence the device data has to be read from the stream completely
> > +before the device load begins to free the stream up.
> > +This is achieved by
> > +'packaging' the device data into a blob that's read in one go.
> > +
> > +Source behaviour
> > +
> > +Until postcopy is entered the migration stream is identical to normal
> > +precopy, except for the addition of a 'postcopy advise' command at
> > +the beginning, to tell the destination that postcopy might happen.
> > +When postcopy starts the source sends the page discard data and then
> > +forms the 'package' containing:
> > +
> > +   Command: 'postcopy ram listen'
> > +   The device state
> > +      A series of sections, identical to the precopy streams device state stream
> > +      containing everything except postcopiable devices (i.e. RAM)
> > +   Command: 'postcopy ram run'
> > +
> > +The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
> > +contents are formatted in the same way as the main migration stream.
>
> It seems to me the "ram listen", "ram run" and CMD_PACKAGED really
> have to be used in conjunction this way, they don't really have any use
> on their own.  So why not make it all CMD_POSTCOPY_TRANSITION and have
> the "listen" and "run" take effect implicitly at the beginning and end
> of the device data.

CMD_PACKAGED seemed like something that would be generally useful; it's fairly
complicated on its own, so it seemed best to keep it separate.

(Reading your comment here I notice I've still got it as 'postcopy ram listen'
when I removed the 'ram' based on previous review comments; I've fixed that
locally.)

Dave

> > +Destination behaviour
> > +
> > +Initially the destination looks the same as precopy, with a single thread
> > +reading the migration stream; the 'postcopy advise' and 'discard' commands
> > +are processed to change the way RAM is managed, but don't affect the stream
> > +processing.
> > +
> > +------------------------------------------------------------------------------
> > +                        1      2   3     4 5                      6   7
> > +main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
> > +thread                             |       |
> > +                                   |     (page request)
> > +                                   |        \___
> > +                                   v            \
> > +listen thread:                     --- page -- page -- page -- page -- page --
> > +
> > +                                   a   b        c
> > +------------------------------------------------------------------------------
> > +
> > +On receipt of CMD_PACKAGED (1)
> > +   All the data associated with the package - the ( ... ) section in the
> > +diagram - is read into memory (into a QEMUSizedBuffer), and the main thread
> > +recurses into qemu_loadvm_state_main to process the contents of the package (2)
> > +which contains commands (3,6) and devices (4...)
> > +
> > +On receipt of 'postcopy ram listen' - 3 -(i.e. the 1st command in the package)
> > +a new thread (a) is started that takes over servicing the migration stream,
> > +while the main thread carries on loading the package.   It loads normal
> > +background page data (b) but if during a device load a fault happens (5) the
> > +returned page (c) is loaded by the listen thread allowing the main threads
> > +device load to carry on.
> > +
> > +The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
> > +CPUs start running.
> > +At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
> > +and is no longer used by migration, while the listen thread carries
> > +on servicing page data until the end of migration.
> > +
> > +=== Postcopy states ===
> > +
> > +Postcopy moves through a series of states (see postcopy_state) from
> > +ADVISE->LISTEN->RUNNING->END
> > +
> > + Advise: Set at the start of migration if postcopy is enabled, even
> > +         if it hasn't had the start command; here the destination
> > +         checks that its OS has the support needed for postcopy, and performs
> > +         setup to ensure the RAM mappings are suitable for later postcopy.
> > +         (Triggered by reception of POSTCOPY_ADVISE command)
> > +
> > + Listen: The first command in the package, POSTCOPY_LISTEN, switches
> > +         the destination state to Listen, and starts a new thread
> > +         (the 'listen thread') which takes over the job of receiving
> > +         pages off the migration stream, while the main thread carries
> > +         on processing the blob.  With this thread able to process page
> > +         reception, the destination now 'sensitises' the RAM to detect
> > +         any access to missing pages (on Linux using the 'userfault'
> > +         system).
> > +
> > + Running: POSTCOPY_RUN causes the destination to synchronise all
> > +         state and start the CPUs and IO devices running.  The main
> > +         thread now finishes processing the migration package and
> > +         now carries on as it would for normal precopy migration
> > +         (although it can't do the cleanup it would do as it
> > +         finishes a normal migration).
> > +
> > + End: The listen thread can now quit, and perform the cleanup of migration
> > +         state, the migration is now complete.
> > +
> > +=== Source side page maps ===
> > +
> > +The source side keeps two bitmaps during postcopy; 'the migration bitmap'
> > +and 'sent map'.  The 'migration bitmap' is basically the same as in
> > +the precopy case, and holds a bit to indicate that page is 'dirty' -
> > +i.e. needs sending.  During the precopy phase this is updated as the CPU
> > +dirties pages, however during postcopy the CPUs are stopped and nothing
> > +should dirty anything any more.
> > +
> > +The 'sent map' is used for the transition to postcopy. It is a bitmap that
> > +has a bit set whenever a page is sent to the destination, however during
> > +the transition to postcopy mode it is masked against the migration bitmap
> > +(sentmap &= migrationbitmap) to generate a bitmap recording pages that
> > +have previously been sent but are now dirty again.
> > +This masked
> > +sentmap is sent to the destination which discards those now dirty pages
> > +before starting the CPUs.
> > +
> > +Note that once in postcopy mode, the sent map is still updated; however,
> > +its contents are not necessarily consistent with the pages already sent
> > +due to the masking with the migration bitmap.
> > +
> > +=== Destination side page maps ===
> > +
> > +(Needs to be changed so we can update both easily - at the moment updates are done
> > + with a lock)
> > +The destination keeps a state for each page which is 'missing', 'received'
> > +or 'requested'; these three states are encoded in a 2 bit state array.
> > +Incoming requests from the kernel cause the state to transition from 'missing'
> > +to 'requested'.   Received pages cause a transition from either 'missing' or
> > +'requested' to 'received'; the kernel is notified on reception to wake up
> > +any threads that were waiting for the page.
> > +If the kernel requests a page that has already been 'received' the kernel is
> > +notified without re-requesting.
> > +
> > +This leads to three valid page states and four state transitions:
> > +page states:
> > +    missing   - page not yet received or requested
> > +    received  - page received
> > +    requested - page requested but not yet received
> > +
> > +state transitions:
> > +      received -> missing   (only during setup/discard)
> > +       missing -> received  (normal incoming page)
> > +     requested -> received  (incoming page previously requested)
> > +       missing -> requested (userfault request)
>
> --
> David Gibson                   | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you.  NOT _the_ _other_
>                                | _way_ _around_!
> http://www.ozlabs.org/~dgibson

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
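
For illustration, the destination-side page states and transitions quoted
above can be sketched roughly as below. This is only a sketch and not code
from the patch series: PageState, PageEvent, page_state_get(),
page_state_set() and page_state_next() are hypothetical names, the packing of
four pages per byte is an assumption about how a "2 bit state array" might be
laid out, and the cases the text does not describe (discard of a 'requested'
page, a repeated fault on a 'requested' page, a duplicate incoming page) are
simply assumed to need no request or wakeup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: the patch describes these states only in prose. */
typedef enum {
    PAGE_MISSING   = 0, /* not yet received or requested */
    PAGE_RECEIVED  = 1, /* page data has arrived */
    PAGE_REQUESTED = 2, /* requested from the source, not yet received */
} PageState;

typedef enum {
    EV_DISCARD,      /* setup/discard of a page */
    EV_USERFAULT,    /* kernel reports an access to the page */
    EV_PAGE_ARRIVED, /* page data read off the migration stream */
} PageEvent;

/* Assumed layout: four pages per byte, two bits each. */
static PageState page_state_get(const uint8_t *map, size_t page)
{
    unsigned shift = (unsigned)(page % 4) * 2;

    return (PageState)((map[page / 4] >> shift) & 0x3u);
}

static void page_state_set(uint8_t *map, size_t page, PageState st)
{
    unsigned shift = (unsigned)(page % 4) * 2;
    unsigned byte = map[page / 4];

    assert(st <= PAGE_REQUESTED);
    byte &= ~(0x3u << shift);       /* clear the page's two bits */
    byte |= (unsigned)st << shift;  /* store the new state */
    map[page / 4] = (uint8_t)byte;
}

/*
 * Returns the next state for one page.  *send_request is set when the
 * source has to be asked for the page; *wake_kernel is set when waiting
 * threads should be woken.
 */
static PageState page_state_next(PageState cur, PageEvent ev,
                                 bool *send_request, bool *wake_kernel)
{
    *send_request = false;
    *wake_kernel = false;

    switch (ev) {
    case EV_DISCARD:
        /* received -> missing (only during setup/discard) */
        return cur == PAGE_RECEIVED ? PAGE_MISSING : cur;
    case EV_USERFAULT:
        if (cur == PAGE_MISSING) {
            *send_request = true;   /* missing -> requested */
            return PAGE_REQUESTED;
        }
        if (cur == PAGE_RECEIVED) {
            *wake_kernel = true;    /* notify without re-requesting */
        }
        return cur;
    case EV_PAGE_ARRIVED:
        if (cur != PAGE_RECEIVED) {
            *wake_kernel = true;    /* missing/requested -> received */
        }
        return PAGE_RECEIVED;
    }
    return cur;
}
```

A caller would look up the current state with page_state_get(), pass it
through page_state_next() and store the result with page_state_set(); in the
scheme described in the patch that whole sequence would currently sit under
the lock mentioned in the "Destination side page maps" note.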