Date: Tue, 12 Dec 2017 12:16:38 +0100
From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH for-2.12 0/4] qmp dirty bitmap API
Message-ID: <20171212111638.GD3879@localhost.localdomain>
To: John Snow
Cc: Vladimir Sementsov-Ogievskiy, qemu-devel@nongnu.org, qemu-block@nongnu.org, famz@redhat.com, armbru@redhat.com, mnestratov@virtuozzo.com, mreitz@redhat.com, nshirokovskiy@virtuozzo.com, stefanha@redhat.com, den@openvz.org, pbonzini@redhat.com

On 11.12.2017 at 19:40, John Snow wrote:
> 
> 
> On 12/11/2017 06:15 AM, Kevin Wolf wrote:
> > On 09.12.2017 at 01:57, John Snow wrote:
> >> Here's an idea of what this API might look like without revealing
> >> explicit merge/split primitives.
> >>
> >> A new bitmap property that lets us set retention:
> >>
> >> :: block-dirty-bitmap-set-retention bitmap=foo slices=10
> >>
> >> Or something similar, where the default property for all bitmaps is
> >> zero -- the current behavior: no copies retained.
> >>
> >> By setting it to a non-zero positive integer, the incremental backup
> >> mode will automatically save a disabled copy when possible.
> >
> > -EMAGIC
> >
> > Operations that create or delete user-visible objects should be
> > explicit, not automatic. You're trying to implement management layer
> > functionality in qemu here, but incomplete enough that the artifacts of
> > it are still visible externally. (A complete solution within qemu
> > wouldn't expose low-level concepts such as bitmaps on an external
> > interface, but you would expose something like checkpoints.)
> >
> > Usually it's not a good idea to have a design where qemu implements
> > enough to restrict management tools to whatever use case we had in mind,
> > but not enough to make the management tool's life substantially easier
> > (by not having to care about some low-level concepts).
> >
> >> "What happens if we exceed our retention?"
> >>
> >> (A) We push the last one out automatically, or
> >> (B) We fail the operation immediately.
> >>
> >> A is more convenient, but potentially unsafe if the management tool or
> >> user wasn't aware that was going to happen.
> >> B is more annoying, but definitely safer as it means we cannot lose
> >> a bitmap accidentally.
> >
> > Both mean that the management layer has not only to deal with the
> > deletion of bitmaps as it wants to have them, but also to keep the
> > retention counter somewhere and predict what qemu is going to do to the
> > bitmaps and whether any corrective action needs to be taken.
> >
> > This is making things more complex rather than simpler.
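The two retention policies being debated above, plus the force-cycle escape hatch proposed below, can be modelled in a few lines of Python. All names here are hypothetical illustrations of the proposed semantics, not actual QEMU code or API:

```python
# Sketch of the proposed retention semantics for dirty bitmaps.
# Policy "A": silently push the oldest slice out when retention is full.
# Policy "B": fail the operation unless force_cycle=True is given.
# retention=0 models the current behaviour: no copies are retained.

class RetentionError(Exception):
    pass

class Bitmap:
    def __init__(self, name, retention=0):
        self.name = name
        self.retention = retention
        self.slices = []        # retained slice IDs, oldest first
        self.next_slice = 0     # monotonically increasing, never reused

    def incremental_backup(self, policy="B", force_cycle=False):
        """Record a new slice, applying the retention policy when full."""
        dropped = []
        if self.retention and len(self.slices) >= self.retention:
            if policy == "A" or force_cycle:
                dropped.append(self.slices.pop(0))  # oldest slice pushed out
            else:
                raise RetentionError("retention exceeded for " + self.name)
        self.slices.append(self.next_slice)
        self.next_slice += 1
        # Mirror the proposed QMP-style return value so the caller learns
        # which slices were dropped on its behalf.
        return {"return": {"dropped-slices": [{self.name: s} for s in dropped]}}
```

Kevin's objection is visible directly in the model: the caller cannot know whether `dropped-slices` will be empty without tracking `retention` and `slices` itself, i.e. it has to predict qemu's behaviour to stay consistent.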
> >
> >> I would argue for B with perhaps a force-cycle=true|false that defaults
> >> to false to let management tools say "Yes, go ahead, remove the old one"
> >> with additionally some return to let us know it happened:
> >>
> >> {"return": {
> >>   "dropped-slices": [ {"bitmap0": 0}, ...]
> >> }}
> >>
> >> This would introduce some concept of bitmap slices into the mix as ID'd
> >> children of a bitmap. I would propose that these slices are numbered and
> >> monotonically increasing. "bitmap0" as an object starts with no slices,
> >> but every incremental backup creates slice 0, slice 1, slice 2, and so
> >> on. Even after we start deleting some, they stay ordered. These numbers
> >> then stand in for points in time.
> >>
> >> The counter can (must?) be reset and all slices forgotten when
> >> performing a full backup while providing a bitmap argument.
> >>
> >> "How can a user make use of the slices once they're made?"
> >>
> >> Let's consider something like mode=partial in contrast to
> >> mode=incremental, and an example where we have 6 prior slices:
> >> 0,1,2,3,4,5, (and, unnamed, the 'active' slice.)
> >>
> >> mode=partial bitmap=foo slice=4
> >>
> >> This would create a backup from slice 4 to the current time α. This
> >> includes all clusters from 4, 5, and the active bitmap.
> >>
> >> I don't think it is meaningful to define any end point that isn't the
> >> current time, so I've omitted that as a possibility.
> >
> > John, what are you doing here? This adds option after option, and even
> > an additional slice object, only complicating an easy thing more and
> > more. I'm not sure if that was your intention, but I feel I'm starting
> > to understand better how Linus's rants come about.
> >
> > Let me summarise what this means for the management layer:
> >
> > * The management layer has to manage bitmaps. They have direct control
> >   over creation and deletion of bitmaps. So far so good.
> >
> > * It also has to manage slices in those bitmap objects; and these
> >   slices are what contain the actual bitmaps. In order to identify a
> >   bitmap in qemu, you need:
> >
> >   a) the node name
> >   b) the bitmap ID, and
> >   c) the slice number
> >
> >   The slice number is assigned by qemu and libvirt has to wait until
> >   qemu tells it about the slice number of a newly created slice. If
> >   libvirt doesn't receive the reply to the command that started the
> >   block job, it needs to be able to query this information from qemu,
> >   e.g. in query-block-jobs.
> >
> > * Slices are automatically created when you start a backup job with a
> >   bitmap. It doesn't matter whether you even intend to do an incremental
> >   backup against this point in time. qemu knows better.
> >
> > * In order to delete a slice that you don't need any more, you have to
> >   create more slices (by doing more backups), but you don't get to
> >   decide which one is dropped. qemu helpfully just drops the oldest one.
> >   It doesn't matter if you want to keep an older one so you can do an
> >   incremental backup for a longer timespan. Don't worry about your
> >   backup strategy, qemu knows better.
> >
> > * Of course, just creating a new backup job doesn't mean that removing
> >   the old slice works, even if you give the respective option. That's
> >   what the 'dropped-slices' return is for. So once again wait for
> >   whatever qemu did and reproduce it in the data structures of the
> >   management tool. It's also more information that needs to be exposed
> >   in query-block-jobs because libvirt might miss the return value.
> >
> > * Hmm... What happens if you start n backup block jobs, with n > slices?
> >   Sounds like a great way to introduce subtle bugs in both qemu and the
> >   management layer.
> >
> > Do you really think working with this API would be fun for libvirt?
> >
> >> "Does a partial backup create a new point in time?"
> >>
> >> If yes: This means that the next incremental backup must necessarily be
> >> based off of the last partial backup that was made. This seems a little
> >> inconvenient. This would mean that point in time α becomes "slice 6."
> >
> > Or based off any of the previous points in time, provided that qemu
> > didn't helpfully decide to delete it. Can't I still create a backup
> > starting from slice 4 then?
> >
> > Also, a more general question about incremental backup: How does it play
> > with snapshots? Shouldn't we expect that people sometimes use both
> > snapshots and backups? Can we restrict the backup job to considering
> > bitmaps only from a single node or should we be able to reference
> > bitmaps of a backing file as well?
> >
> >> If no: This means that we lose the point in time when we made the
> >> partial and we cannot chain off of the partial backup. It does mean that
> >> the next incremental backup will work as normally expected, however.
> >> This means that point in time α cannot again be referenced by the
> >> management client.
> >>
> >> This mirrors the dynamic between "incremental" and "differential"
> >> backups.
> >>
> >> ..hmmm..
> >>
> >> You know, incremental backups are just a special case of "partial" here
> >> where slice is the last recorded slice... Let's look at an API like
> >> this:
> >>
> >> mode=<incremental|differential> bitmap=<name> [slice=N]
> >>
> >> Incremental: We create a new slice if the bitmap has room for one.
> >> Differential: We don't create a new slice. The data in the active bitmap
> >> α does not get cleared after the bitmap operation.
> >>
> >> Slice:
> >> If not specified, assume we want only the active slice. This is the
> >> current behavior in QEMU 2.11.
> >> If specified, we create a temporary merge between bitmaps [N..α] and use
> >> that for the backup operation.
> >>
> >> "Can we delete slices?"
> >>
> >> Sure.
> >>
> >> :: block-dirty-bitmap-slice-delete bitmap=foo slice=4
> >>
> >> "Can we create a slice without making a bitmap?"
> >>
> >> It would be easy to do, but I'm not sure I see the utility. In using it,
> >> it means if you don't specify the slice manually for the next backup
> >> that you will necessarily be getting something not usable.
> >>
> >> but we COULD do it, it would just be banking the changes in the active
> >> bitmap into a new slice.
> >
> > Okay, with explicit management this is getting a little more reasonable
> > now. However, I don't understand what slices buy us then compared to
> > just separate bitmaps.
> >
> > Essentially, bitmaps form a second kind of backing chain. Backup always
> > wants to use the combined bitmaps of some subchain. I see two easy ways
> > to do this: Either pass an array of bitmaps to consider to the job, or
> > store the "backing link" in the bitmap so that we can just specify a
> > "base bitmap" like we usually do with normal backing files.
> >
> > The backup block job can optionally append a new bitmap to the chain
> > like external snapshots do for backing chains. Deleting a bitmap in the
> > chain is the merge operation, similar to a commit block job for backing
> > chains.
> >
> > We know these mechanisms very well because the block layer has been
> > using them for ages.
> >
> >>> I also have another idea:
> >>> implement a new object: point-in-time or checkpoint. They should have
> >>> names, and a simple add/remove API.
> >>> And they will be backed by dirty bitmaps, so checkpoint deletion is
> >>> a bitmap merge (and deletion of one of them), and
> >>> checkpoint creation is disabling the active-checkpoint-bitmap and
> >>> starting a new active-checkpoint-bitmap.
> >>
> >> Yes, exactly! I think that's pretty similar to what I am thinking of
> >> with slices.
> >>
> >> This sounds a little safer to me in that we can examine an operation to
> >> see if it's sane or not.
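The checkpoint idea sketched above (creation disables the active bitmap and starts a fresh one; deletion merges a bitmap into its neighbour) can be modelled with plain sets standing in for dirty bitmaps. Everything here is a hypothetical illustration of the proposed semantics, not QEMU code:

```python
# Model: each checkpoint names a point in time; the bitmap *paired* with a
# checkpoint records the writes that happened before it (since the previous
# checkpoint). The single enabled "active" bitmap collects writes since the
# newest checkpoint. Bitmaps are sets of dirty cluster indices.

class Disk:
    def __init__(self):
        self.checkpoints = []   # ordered (name, frozen bitmap) pairs
        self.active = set()     # the one enabled bitmap

    def write(self, cluster):
        self.active.add(cluster)

    def add_checkpoint(self, name):
        # Disable (freeze) the active bitmap and start a new one.
        self.checkpoints.append((name, self.active))
        self.active = set()

    def remove_checkpoint(self, name):
        # Deleting a checkpoint merges its bitmap into the next one
        # (or into the active bitmap if it was the newest checkpoint),
        # just like committing one link of a backing chain.
        i = next(i for i, (n, _) in enumerate(self.checkpoints) if n == name)
        _, bits = self.checkpoints.pop(i)
        if i < len(self.checkpoints):
            nxt_name, nxt_bits = self.checkpoints[i]
            self.checkpoints[i] = (nxt_name, nxt_bits | bits)
        else:
            self.active |= bits

    def dirty_since(self, name):
        # Temporary merge of all bitmaps newer than the named checkpoint;
        # this is what an incremental backup from that point would copy.
        i = next(i for i, (n, _) in enumerate(self.checkpoints) if n == name)
        out = set(self.active)
        for _, bits in self.checkpoints[i + 1:]:
            out |= bits
        return out
```

Deleting a checkpoint only widens the interval covered by its neighbour, so `dirty_since()` for every *surviving* checkpoint stays correct, which is exactly why the merge-on-delete semantics compose safely.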
> >
> > Exposing checkpoints is a reasonable high-level API. The important part
> > then is that you don't expose bitmaps + slices, but only checkpoints
> > without bitmaps. The bitmaps are an implementation detail.
> >
> >>> Then we can implement merging of several bitmaps (from one of the
> >>> checkpoints to the current moment) in
> >>> NBD meta-context-query handling.
> >>>
> >> Note:
> >>
> >> I should say that I've had discussions with Stefan in the past over
> >> things like differential mode and the feeling I got from him was that he
> >> felt that data should be copied from QEMU precisely *once*, viewing any
> >> subsequent copying of the same data as redundant and wasteful.
> >
> > That's a management layer decision. Apparently there are users who want
> > to copy from qemu multiple times, otherwise we wouldn't be talking about
> > slices and retention.
> >
> > Kevin
> 
> Sorry. *lol*

Though I do hope that my rant was at least somewhat constructive, if
only by making differently broken suggestions. ;-)

Kevin