From: Ori Mamluk
Date: Tue, 07 Feb 2012 16:18:06 +0200
Message-ID: <4F31329E.50204@zerto.com>
Subject: Re: [Qemu-devel] [RFC PATCH] replication agent module
To: Stefan Hajnoczi
Cc: Kevin Wolf, Tomer Ben Or, Oded Kedem, dlaor@redhat.com,
    qemu-devel@nongnu.org, Luiz Capitulino

On 07/02/2012 15:50, Stefan Hajnoczi wrote:

First let me say that I'm not completely used to inline replies, so I
initially missed some of your mail before.

> On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf wrote:
>> Am 07.02.2012 11:29, schrieb Ori Mamluk:
>>> Repagent is a new module that allows an external replication system
>>> to replicate a volume of a Qemu VM.
>
> I recently joked with Kevin that QEMU is on its way to reimplementing
> the Linux block and device-mapper layers.  Now we have drbd, thanks! :P
>
> Except for image files, the way to do this on a Linux host would be
> using drbd block devices.  We still haven't figured out a nice way to
> make image files full-fledged Linux block devices, so we're
> reimplementing all the block code in QEMU userspace.
>
>>> This RFC patch adds the repagent client module to Qemu.
>>>
>>> Documentation of the module role and API is in the patch at
>>> replication/qemu-repagent.txt
>>>
>>> The main motivation behind the module is to allow replication of VMs
>>> in a virtualization environment like RhevM.
>>>
>>> To achieve this we need basic replication support in Qemu.
>>>
>>> This is the first submission of this module, which was written as a
>>> Proof Of Concept, and used successfully for replicating and
>>> recovering a Qemu VM.
>>
>> I'll mostly ignore the code for now and just comment on the design.
>>
>> One thing to consider for the next version of the RFC would be to
>> split this into a series of smaller patches.  This one has become
>> quite large, which makes it hard to review (and yes, please use
>> git send-email).
>>
>>> Points and open issues:
>>>
>>> * The module interfaces the Qemu storage stack at the block.c
>>>   generic layer.  Is this the right place to intercept/inject IOs?
>>
>> There are two ways to intercept I/O requests.  The first one is what
>> you chose: just add some code to bdrv_co_do_writev, and I think it's
>> reasonable to do this.
>>
>> The other one would be to add a special block driver for a
>> replication: protocol that writes to two different places (the real
>> block driver for the image, and the network connection).  Generally
>> this feels even a bit more elegant, but it brings new problems with
>> it: for example, when you create an external snapshot, you need to
>> pay attention not to lose the replication because the protocol is
>> somewhere in the middle of a backing file chain.
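
Just to make sure I understand the second option: is a pass-through
driver along these lines roughly what you mean?  This is only a sketch
for discussion, not code from the patch - repagent_mirror_write is a
hypothetical hook standing in for whatever hands the data to the agent,
and the open/close and snapshot plumbing you mention is left out.

/* Sketch only, not from the patch: a minimal pass-through driver in
 * the style of the existing block drivers.  Guest writes go to
 * bs->file as usual and a copy is handed to the replication agent. */

#include "block_int.h"
#include "module.h"

/* Hypothetical hook into the repagent code. */
void repagent_mirror_write(int64_t sector_num, int nb_sectors,
                           QEMUIOVector *qiov);

static int coroutine_fn replication_co_writev(BlockDriverState *bs,
                                              int64_t sector_num,
                                              int nb_sectors,
                                              QEMUIOVector *qiov)
{
    /* The real write to the image always comes first; the guest never
     * depends on the replica for data integrity. */
    int ret = bdrv_co_writev(bs->file, sector_num, nb_sectors, qiov);
    if (ret < 0) {
        return ret;
    }

    /* Best-effort copy to the replication hub. */
    repagent_mirror_write(sector_num, nb_sectors, qiov);
    return ret;
}

static int coroutine_fn replication_co_readv(BlockDriverState *bs,
                                             int64_t sector_num,
                                             int nb_sectors,
                                             QEMUIOVector *qiov)
{
    /* Reads are passed straight through to the image. */
    return bdrv_co_readv(bs->file, sector_num, nb_sectors, qiov);
}

static BlockDriver bdrv_replication = {
    .format_name    = "replication",
    .bdrv_co_readv  = replication_co_readv,
    .bdrv_co_writev = replication_co_writev,
};

static void bdrv_replication_init(void)
{
    bdrv_register(&bdrv_replication);
}

block_init(bdrv_replication_init);

The appeal I see is that block.c itself stays untouched; the cost is
exactly the problem you point out, because the driver then has to
survive external snapshots and other changes to the backing file chain.
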
>>> * The patch performs IO reads invoked by a new thread (a TCP
>>>   listener thread).  See repaget_read_vol in repagent.c.  It is not
>>>   protected by any lock - is this OK?
>>
>> No, definitely not.  Block layer code expects that it holds
>> qemu_global_mutex.
>>
>> I'm not sure if a thread is the right solution.  You should probably
>> use something that resembles other asynchronous code in qemu, i.e.
>> either callback or coroutine based.
>
> There is a flow control problem here which is interesting.  If the
> rephub is slower than the writer or unavailable, then eventually we
> either need to stop replicating writes or we need to throttle the
> guest writes.  I haven't read through the whole patch yet, but the
> flow control solution is very closely tied to how you use
> threads/coroutines and how you use network sockets.

In general the replication is naturally less important than the main
(production) volume, so the solution aims to never throttle the guest
writes.  In the current stage both IOs need to complete before we
report completion to the guest, but the volume IO is a real write to
storage while the rephub IO may involve only a copy to memory.  Later
on we can get rid of waiting for the replicated IO altogether by adding
a bitmap - but this is only for a later stage.

>>> + * Read a protected volume - allows the Rephub to read a protected
>>>     volume, to enable the protected hub to synchronize the content
>>>     of a protected volume.
>>
>> We were discussing using NBD as the protocol for any data that is
>> transferred from/to the replication hub, so that we can use the
>> existing NBD client and server code that qemu has.  It seems you came
>> to the conclusion to use a different protocol?  What are the reasons?
>>
>> The other message types could possibly be implemented as QMP
>> commands.  I guess we might need to attach multiple QMP monitors for
>> this to work (one for libvirt, one for the rephub).  I'm not sure if
>> there is a fundamental problem with this or if it just needs to be
>> done.
>
> Agreed.  You can already query block devices using QMP 'query-block'.
> By adding in-process NBD server support you could then launch an NBD
> server for each volume which you wish to replicate.  However, in this
> case it sounds almost like you want the reverse - you could provide an
> NBD server on the rephub and QEMU would mirror writes to it (the NBD
> client code is already in QEMU).
>
> There is also interest from other external software (like libvirt) to
> be able to read volumes while the VM is running.
>
> BTW, do you poll the volumes or how do you handle hotplug?  Does
> anything special need to be done when a volume is unplugged?

We assume that we handle the hotplug top-down - via the management
system, not from the VM.  In general we don't protect 'all volumes' of
a VM - the management system (either RhevM or the Rephub, depending on
the design) specifically instructs Qemu to start protecting a volume.

> Stefan
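
One more note on the bitmap I mentioned above, to make the idea a
little more concrete.  This is an illustration only, not code from the
patch; rephub_try_send and send_chunk_from_volume are hypothetical
placeholders for the transport and for reading back from the volume.

/* Rough illustration of the bitmap idea.  Instead of making the guest
 * wait for the rephub, writes that cannot be mirrored immediately are
 * only marked dirty; a background resync later reads the marked
 * regions from the volume and ships them to the hub. */

#include <stdbool.h>
#include <stdint.h>

#define CHUNK_SECTORS   128                    /* tracking granularity */
#define VOLUME_SECTORS  (16 * 1024 * 1024)     /* example volume size  */
#define BITMAP_WORDS    ((VOLUME_SECTORS / CHUNK_SECTORS + 63) / 64)

static uint64_t dirty_bitmap[BITMAP_WORDS];

/* Hypothetical helpers; both return true on success. */
bool rephub_try_send(int64_t sector, int nb_sectors,
                     const void *data, size_t len);
bool send_chunk_from_volume(int64_t sector, int nb_sectors);

static void mark_dirty(int64_t sector, int nb_sectors)
{
    int64_t first = sector / CHUNK_SECTORS;
    int64_t last  = (sector + nb_sectors - 1) / CHUNK_SECTORS;

    for (int64_t c = first; c <= last; c++) {
        dirty_bitmap[c / 64] |= UINT64_C(1) << (c % 64);
    }
}

/* Called from the write path: never blocks the guest on the rephub. */
void replicate_write(int64_t sector, int nb_sectors,
                     const void *data, size_t len)
{
    if (!rephub_try_send(sector, nb_sectors, data, len)) {
        /* Hub is slow or unreachable: remember the region, move on. */
        mark_dirty(sector, nb_sectors);
    }
}

/* Background resync: walk the bitmap, re-read dirty chunks from the
 * volume, send them and clear the bits that were shipped. */
void resync_pass(void)
{
    for (int64_t c = 0; c < VOLUME_SECTORS / CHUNK_SECTORS; c++) {
        if (dirty_bitmap[c / 64] & (UINT64_C(1) << (c % 64))) {
            if (send_chunk_from_volume(c * CHUNK_SECTORS, CHUNK_SECTORS)) {
                dirty_bitmap[c / 64] &= ~(UINT64_C(1) << (c % 64));
            }
        }
    }
}

The guest write path only ever sets bits, so a slow or disconnected hub
costs us the currency of the replica rather than guest latency, and the
resync pass catches up once the hub is reachable again.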