From: Vladimir Sementsov-Ogievskiy
Date: Tue, 21 Aug 2018 12:29:50 +0300
Subject: Re: [Qemu-devel] [RFC v2] new, node-graph-based fleecing and backup
To: Max Reitz, qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: eblake@redhat.com, armbru@redhat.com, kwolf@redhat.com, famz@redhat.com,
    jsnow@redhat.com, pbonzini@redhat.com, stefanha@redhat.com, den@openvz.org

20.08.2018 21:30, Vladimir Sementsov-Ogievskiy wrote:
> 20.08.2018 20:25, Max Reitz wrote:
>> On 2018-08-20 16:49, Vladimir Sementsov-Ogievskiy wrote:
>>> 20.08.2018 16:32, Max Reitz wrote:
>>>> On 2018-08-20 11:42, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 18.08.2018 00:50, Max Reitz wrote:
>>>>>> On 2018-08-14 19:01, Vladimir Sementsov-Ogievskiy wrote:
>>>> [...]
>>>>
>>>>>>> Proposal:
>>>>>>>
>>>>>>> For fleecing we need two nodes:
>>>>>>>
>>>>>>> 1. fleecing hook. It's a filter which should be inserted on top of
>>>>>>> the active disk. Its main purpose is handling guest writes by a
>>>>>>> copy-on-write operation, i.e. it's a substitution for the
>>>>>>> write-notifier in the backup job.
>>>>>>>
>>>>>>> 2. fleecing cache. It's a target node for COW operations by the
>>>>>>> fleecing hook. It also represents a point-in-time snapshot of the
>>>>>>> active disk for the readers.
>>>>>> It's not really COW, it's copy-before-write, isn't it?  It's something
>>>>>> else entirely.  COW is about writing data to an overlay *instead* of
>>>>>> writing it to the backing file.  Ideally, you don't copy anything,
>>>>>> actually.  It's just a side effect that you need to copy things if
>>>>>> your cluster size doesn't happen to match exactly what you're
>>>>>> overwriting.
>>>>> Hmm. I'm not against. But the COW term was already used in backup to
>>>>> describe this.
>>>> Bad enough. :-)
>>> So, have we agreed on the new "CBW" abbreviation? :)
>> It is already used for the USB mass-storage command block wrapper, but I
>> suppose that is sufficiently different not to cause much confusion. :-)
>>
>> (Or at least that's the only other use I know of.)
>>
>> [...]
>>
>>>>> 2. We already have the fleecing scheme, where we have to create some
>>>>> subgraph between nodes.
>>>> Yes, but how do the permissions work right now, and why wouldn't they
>>>> work with your schema?
>>> Now it uses the backup job, with shared_perm = all for its source and
>>> target nodes.
>> Uh-huh.
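As a concrete illustration of the alternative to shared_perm = all: here
is roughly what the fleecing hook's child-permission callback could look
like if it stops sharing WRITE on the source. This is only a sketch
modeled on the .bdrv_child_perm convention; the helper names are
approximate and the code is untested:

  static void fleecing_hook_child_perm(BlockDriverState *bs, BdrvChild *c,
                                       const BdrvChildRole *role,
                                       BlockReopenQueue *reopen_queue,
                                       uint64_t perm, uint64_t shared,
                                       uint64_t *nperm, uint64_t *nshared)
  {
      /* start from the default filter behaviour */
      bdrv_filter_default_perms(bs, c, role, reopen_queue,
                                perm, shared, nperm, nshared);

      if (c == bs->backing) {
          /* The source: the hook forwards guest writes to it and must be
           * the only writer, so nobody can bypass the CBW logic. */
          *nperm |= BLK_PERM_WRITE;
          *nshared &= ~BLK_PERM_WRITE;
      }
  }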
>>
>> So the issue is...  Hm, what exactly?  The backup node probably doesn't
>> want to share WRITE for the source anymore, as there is no real point in
>> doing so.  And for the target, the only problem may be to share
>> CONSISTENT_READ.  It is OK to share that in the fleecing case, but in
>> other cases maybe it isn't.  But that's easy enough to distinguish in
>> the driver.
>>
>> The main issue I could see is that the overlay (the fleecing target)
>> might not share write permissions on its backing file (the fleecing
>> source)...  But your diagram shows (and bdrv_format_default_perms() as
>> well) that this is not the case; when the overlay is writable, the
>> backing file may be written to, too.
>
> Hm, actually the overlay could share the write permission for clusters
> which are already saved in the overlay, or which are not needed (if we
> have a dirty bitmap for incremental backup)... But we don't have such a
> permission kind, and it looks hard to implement... And it may be too
> expensive in operation overhead.
>
>>
>>> (ha, you can look at the picture in "[PATCH v2 0/3] block nodes
>>> graph visualization")
>> :-)
>>
>>>>> 3. If we move to a filter node instead of the write_notifier, the
>>>>> block job is not actually needed for fleecing, and it would be good
>>>>> to drop it from the fleecing scheme, to simplify it, to make it more
>>>>> clear and transparent.
>>>> If that's possible, why not.  But again, I'm not sure whether that's
>>>> enough of a reason for the endeavour, because whether you start a
>>>> block job or do some graph manipulation yourself is not really a
>>>> difference in complexity.
>>> Not "or" but "and": in the current fleecing scheme we do both graph
>>> manipulations and block-job start/cancel.
>> Hm!  Interesting.  I didn't know blockdev-backup didn't set the target's
>> backing file.  It makes sense, but I didn't think about it.
>>
>> Well, still, my point was whether you do a blockdev-backup +
>> block-job-cancel, or a blockdev-add + blockdev-reopen + blockdev-reopen
>> + blockdev-del...  If there is a difference, the former is going to be
>> simpler, probably.
>>
>> (But if there are things you can't do with the current blockdev-backup,
>> then, well, that doesn't help you.)
>>
>>> Yes, I agree that there is no real benefit in difficulty. I just think
>>> that if we have a filter node which performs the "CBW" operations,
>>> block-job backup(sync=none) becomes actually empty, it will do nothing.
>> On the code side, yes, that's true.
>>
>>>> But it's mostly your call, since I suppose you'd be doing most of the
>>>> work.
>>>>
>>>>> And finally, we will have a unified filter-node-based scheme for
>>>>> backup and fleecing, modular and customisable.
>>>> [...]
>>>>
>>>>>>> Benefits, or, what can be done:
>>>>>>>
>>>>>>> 1. We can implement a special fleecing cache filter driver, which
>>>>>>> will be a real cache: it will store some recently written clusters
>>>>>>> in RAM, it can have a backing (or file?) qcow2 child, to flush some
>>>>>>> clusters to the disk, etc. So, for each cluster of the active disk
>>>>>>> we will have the following characteristics:
>>>>>>>
>>>>>>> - changed (changed in the active disk since backup start)
>>>>>>> - copy (we need this cluster for the fleecing user. For example, in
>>>>>>> the RFC patch all clusters are "copy": cow_bitmap is initialized to
>>>>>>> all ones. We can use some existing bitmap to initialize cow_bitmap,
>>>>>>> and it will provide an "incremental" fleecing (for use in
>>>>>>> incremental backup push or pull))
>>>>>>> - cached in RAM
>>>>>>> - cached in disk
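To make these per-cluster characteristics concrete, the shared state
could be expressed as plain bitmaps. HBitmap is the existing QEMU
primitive; the struct and the helper are made up, just to illustrate the
semantics:

  typedef struct FleecingState {
      HBitmap *changed;   /* changed in the active disk since backup start */
      HBitmap *copy;      /* still needed by the fleecing user */
      HBitmap *in_ram;    /* old data currently held in the RAM cache */
      HBitmap *on_disk;   /* old data flushed to the disk cache */
  } FleecingState;

  /* CBW is needed only for the first guest write to a cluster that the
   * fleecing user still wants. */
  static bool cluster_needs_cbw(FleecingState *s, int64_t cluster)
  {
      return hbitmap_get(s->copy, cluster) &&
             !hbitmap_get(s->changed, cluster);
  }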
>>>>>> Would it be possible to implement such a filter driver that could
>>>>>> just be used as a backup target?
>>>>> For internal backup we need the backup job anyway, and we will be
>>>>> able to create different schemes.
>>>>> One of my goals is the scheme where we store the old data from CBW
>>>>> operations in a local cache, while the backup target is a remote,
>>>>> relatively slow NBD node. In this case, the cache is the backup
>>>>> source, not the target.
>>>> Sorry, my question was badly worded.  My main point was whether you
>>>> could implement the filter driver in such a generic way that it
>>>> wouldn't depend on the fleecing-hook.
>>> Yes, I want my filter nodes to be self-sufficient entities. However, it
>>> may be more effective to have some shared data between them, for
>>> example, dirty bitmaps specifying drive clusters, to know which
>>> clusters are cached, which are changed, etc.
>> I suppose having global dirty bitmaps may make sense.
>>
>>>> Judging from your answer and from the fact that you proposed calling
>>>> the filter node backup-filter and just using it for all backups, I
>>>> suppose the answer is "yes".  So that's good.
>>>>
>>>> (Though I didn't quite understand why in your example the cache would
>>>> be the backup source, when the target is the slow node...)
>>> The cache is a point-in-time view of the active disk (the actual
>>> source) for fleecing. So, we can start a backup job to copy data from
>>> the cache to the target.
>> But wouldn't the cache need to be the immediate fleecing target for
>> this?  (And then you'd run another backup/mirror from it to copy the
>> whole disk to the real target.)
>
> Yes, the cache is the immediate fleecing target.
>
>>
>>>>>>> On top of these characteristics we can implement the following
>>>>>>> features:
>>>>>>>
>>>>>>> 1. COR: we can cache clusters not only on writes but on reads too,
>>>>>>> if we have free space in the RAM cache (and if not, do not cache at
>>>>>>> all, don't write to the disk cache). It may be done like
>>>>>>> bdrv_write(..., BDRV_REQ_UNNECESSARY)
>>>>>> You can do the same with backup by just putting a fast overlay
>>>>>> between the source and the backup, if your source is so slow, and
>>>>>> then do COR, i.e.:
>>>>>>
>>>>>> slow source --> fast overlay --> COR node --> backup filter
>>>>> How will we check the RAM-cache size to make COR optional in this
>>>>> scheme?
>>>> Yes, well, if you have a caching driver already, I suppose you can
>>>> just use that.
>>>>
>>>> You could either write it a bit simpler to only cache on writes and
>>>> then put a COR node on top if desired; or you implement the read cache
>>>> functionality directly in the node, which may make it a bit more
>>>> complicated, but probably also faster.
>>>>
>>>> (I guess you indeed want to go for faster when already writing a RAM
>>>> cache driver...)
>>>>
>>>> (I don't really understand what BDRV_REQ_UNNECESSARY is supposed to
>>>> do, though.)
>>> When we do "CBW", we _must_ save the data before the guest write, so we
>>> write this data to the cache (or directly to the target, like in the
>>> current approach).
>>> When we do "COR", we _may_ save the data to our RAM cache. It's safe
>>> not to save it, as we can still read it from the active disk (the data
>>> is not changed yet).
>>> BDRV_REQ_UNNECESSARY is a proposed interface for writing this optional
>>> data to the cache: if the RAM cache is full, the cache will simply skip
>>> the write.
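A sketch of how the cache's write path could tell the two kinds of data
apart. Note that BDRV_REQ_UNNECESSARY itself and the ram_cache_* helpers
do not exist anywhere yet; this only illustrates the proposed semantics:

  static int coroutine_fn
  fleecing_cache_co_pwritev(BlockDriverState *bs, uint64_t offset,
                            uint64_t bytes, QEMUIOVector *qiov, int flags)
  {
      FleecingCacheState *s = bs->opaque;

      if ((flags & BDRV_REQ_UNNECESSARY) && ram_cache_is_full(s)) {
          /* COR data: safe to drop, the active disk still has it */
          return 0;
      }

      /* CBW data: must be stored; spill to the disk cache if the RAM
       * cache is full */
      return ram_cache_store(s, offset, bytes, qiov);
  }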
>> Hm, OK...  But deciding for each request how much priority it should
>> get in a potential cache node seems like an awful lot of work.  Well, I
>> don't even know what kind of requests you would deem unnecessary.  If
>> it has something to do with the state of a dirty bitmap, then having
>> global dirty bitmaps might remove the need for such a request flag.
>
> Yes, if we have some "shared fleecing object", accessible by the
> fleecing-hook filter, the fleecing-cache filter (and the backup job, if
> it is an internal backup), we don't need such a flag.
>
>>
>> [...]
>>
>>>> Hm.  So what you want here is a special block driver or at least a
>>>> special interface that can give information to an outside tool, namely
>>>> the information you listed above.
>>>>
>>>> If you want information about RAM-cached clusters, well, you can only
>>>> get that information from the RAM cache driver.  It probably would be
>>>> allocation information; do we have any way of getting that out?
>>>>
>>>> It seems you can get all of that (zero information and allocation
>>>> information) over NBD.  Would that be enough?
>>> It's the most generic and clean way, but I'm not sure that it will be
>>> effective performance-wise.
>> Intuitively I'd agree, but I suppose if NBD is written right, such a
>> request should be very fast and the response basically just consists of
>> the allocation information, so I don't suspect it can be much faster
>> than that.
>>
>> (Unless you want some form of interrupts.  I suppose NBD would be the
>> wrong interface, then.)
>
> Yes, for external backup through NBD it's OK to get block status, but
> for internal backup it seems faster to access the shared fleecing object
> (or global bitmaps, etc.).
>
> However, if we have some shared fleecing object, it's not a problem to
> export it as block-status metadata through an NBD export.
>
>>
>> [...]
>>
>>>>> I need several features which are hard to implement using the current
>>>>> scheme.
>>>>>
>>>>> 1. The scheme where we have a local cache as the COW target and a
>>>>> slow remote backup target.
>>>>> How to do it now? Using two backups, one with sync=none... Not sure
>>>>> that this is the right way.
>>>> If it works...
>>>>
>>>> (I'd rather build simple building blocks that you can put together
>>>> than something complicated that works for a specific solution)
>>> Exactly, I want to implement simple building blocks = filter nodes,
>>> instead of implementing all the features in the backup job.
>> Good, good. :-)
>>
>>>>> 3. Then,
>>>>> we'll need a possibility for backup(sync=none) to
>>>>> not COW clusters which are already copied to the backup, and so on.
>>>> Isn't that the same as 2?
>>> We can use one bitmap for 2 and 3, and drop bits from it when the
>>> external tool has read the corresponding cluster from the NBD fleecing
>>> export.
>> Oh, right, it needs to be modifiable from the outside.  I suppose that
>> would be possible in NBD, too.  (But I don't know exactly.)
>
> I think it's natural to implement this through a discard operation on
> the fleecing-cache node: if the fleecing user discards something, it
> will not read it again, so we can drop it from the cache and clear the
> bit in the shared bitmap.
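In the cache driver that could look roughly like this (cluster_size, the
fleecing pointer and ram_cache_drop() are again hypothetical; the sketch
only shows the intended semantics):

  static int coroutine_fn
  fleecing_cache_co_pdiscard(BlockDriverState *bs, int64_t offset, int bytes)
  {
      FleecingCacheState *s = bs->opaque;
      int64_t cluster = offset / s->cluster_size;
      int64_t end = DIV_ROUND_UP(offset + bytes, s->cluster_size);

      for ( ; cluster < end; cluster++) {
          /* the fleecing user is done with this cluster: forget the old
           * data and tell the hook to stop doing CBW for it */
          hbitmap_reset(s->fleecing->copy, cluster, 1);
          ram_cache_drop(s, cluster);
      }

      return 0;
  }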
>
> Then we can improve it by creating a READ_ONCE flag for each READ
> command or for the whole connection, to discard the data after each
> read... Or pass this flag down to bdrv_read, to handle it in one
> command...
>
>>
>> [...]
>>
>>>>>> I don't think that will be any simpler.
>>>>>>
>>>>>> I mean, it would make blockdev-copy simpler, because we could
>>>>>> immediately replace backup by mirror, and then we just have mirror,
>>>>>> which would then automatically become blockdev-copy...
>>>>>>
>>>>>> But it's not really going to be simpler, because whether you put the
>>>>>> copy-before-write logic into a dedicated block driver, or into the
>>>>>> backup filter driver, doesn't really make it simpler either way.
>>>>>> Well, adding a new driver always is a bit more complicated, so
>>>>>> there's that.
>>>>> What is the difference between a separate filter driver and a backup
>>>>> filter driver?
>>>> I thought we already had a backup filter node, so you wouldn't have
>>>> had to create a new driver in that case.
>>>>
>>>> But we don't, so there really is no difference.  Well, apart from
>>>> being able to share state more easily when the driver is in the same
>>>> file as the job.
>>> But if we make it separate, it will be a separate "building block" to
>>> be reused in different schemes.
>> Absolutely true.
>>
>>>>>>> it should not care about guest writes, it copies clusters from a
>>>>>>> kind of snapshot which is not changing in time. This job should
>>>>>>> follow the recommendations from the fleecing scheme [7].
>>>>>>>
>>>>>>> What about the target?
>>>>>>>
>>>>>>> We can use a separate node as the target, and copy from the
>>>>>>> fleecing cache to the target.
>>>>>>> If we have only a RAM cache, it would be equal to the current
>>>>>>> approach (data is copied directly to the target, even on COW). If
>>>>>>> we have both RAM and disk caches, it's a cool solution for a slow
>>>>>>> target: instead of making the guest wait for a long write to the
>>>>>>> backup target (when the RAM cache is full), we can write to the
>>>>>>> disk cache, which is local and fast.
>>>>>> Or you back up to a fast overlay over a slow target, and run a live
>>>>>> commit on the side.
>>>>> I think that will lead to larger I/O overhead: all clusters will go
>>>>> through the overlay, not only the guest-written clusters which we did
>>>>> not have time to copy.
>>>> Well, and it probably makes sense to have some form of RAM-cache
>>>> driver.  Then that'd be your fast overlay.
>>> But there is no reason to copy all the data through the cache: we need
>>> it only for CBW.
>> Well, if there were a RAM-cache driver, you could use it for anything
>> that seems useful (I seem to remember there were some patches on the
>> list like three or four years ago...).
>>
>>> Anyway, I think it will be good if both schemes are possible.
>>>
>>>>>>> Another option is to combine the fleecing cache and the target
>>>>>>> somehow (I didn't really think about this).
>>>>>>>
>>>>>>> Finally, with one - two (three?) special filters we can implement
>>>>>>> all the current fleecing/backup schemes in a unified and very
>>>>>>> configurable way, and do a lot more cool features and
>>>>>>> possibilities.
>>>>>>>
>>>>>>> What do you think?
>>>>>> I think adding a specific fleecing target filter makes sense because
>>>>>> you gave many reasons for interesting new use cases that could
>>>>>> emerge from that.
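Right. And to summarize, the two building blocks could then be declared
like any other block driver; a sketch with an approximate field set,
reusing the callbacks sketched earlier in this thread:

  static BlockDriver bdrv_fleecing_hook = {
      .format_name      = "fleecing-hook",
      .instance_size    = sizeof(FleecingHookState),
      .bdrv_co_preadv   = fleecing_hook_co_preadv,  /* pass through */
      .bdrv_co_pwritev  = fleecing_hook_co_pwritev, /* CBW, then pass on */
      .bdrv_child_perm  = fleecing_hook_child_perm,
      .is_filter        = true,
  };

  static BlockDriver bdrv_fleecing_cache = {
      .format_name      = "fleecing-cache",
      .instance_size    = sizeof(FleecingCacheState),
      .bdrv_co_preadv   = fleecing_cache_co_preadv, /* cache, else backing */
      .bdrv_co_pwritev  = fleecing_cache_co_pwritev,
      .bdrv_co_pdiscard = fleecing_cache_co_pdiscard,
  };

  static void bdrv_fleecing_init(void)
  {
      bdrv_register(&bdrv_fleecing_hook);
      bdrv_register(&bdrv_fleecing_cache);
  }
  block_init(bdrv_fleecing_init);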
>>>>>>
>>>>>> But I think adding a new fleecing-hook driver just means moving the
>>>>>> implementation from backup to that new driver.
>>>>> But at the same time you say that it's OK to create a backup filter
>>>>> (instead of the write_notifier) and make it insertable via QAPI? So,
>>>>> if I implement it in block/backup, is that OK? Why not do it
>>>>> separately?
>>>> Because I thought we had it already.  But we don't.  So feel free to
>>>> do it separately. :-)
>>> Ok, that's good :) . Then I'll try to reuse the filter in backup
>>> instead of the write-notifiers, and figure out whether we really need
>>> the internal state of the backup block job or not.
>>>
>>>> Max
>>>>
>>> PS: in the background, I have unpublished work aimed at parallelizing
>>> the backup job into several coroutines (like it is done for mirror and
>>> the qemu-img clone cmd). And it's really hard. It creates queues of
>>> requests with different priorities, to handle CBW requests in the
>>> common pipeline; it's mostly a rewrite of block/backup. If we split
>>> CBW from backup into a separate filter node, backup becomes a very
>>> simple thing (copy clusters from constant storage) and its
>>> parallelization becomes simpler.
>> If CBW is split from backup, maybe mirror could replace backup
>> immediately.  You'd fleece to a RAM cache target and then mirror from
>> there.
>
> Hmm, good option. It would be just one mirror iteration.
> But then I'll need to teach mirror to copy clusters with some
> priorities, to avoid RAM-cache overload (and guest I/O hangs).
> It may be better to have a separate, simple (a lot simpler than mirror)
> block job for it, or to use backup. Anyway, it's a separate building
> block; a performance comparison will show the better candidate.
>
>>
>> (To be precise: The exact replacement would be an active mirror, so a
>> mirror with copy-mode=write-blocking, so it immediately writes the old
>> block to the target when it is changed in the source, and thus the RAM
>> cache could stay effectively empty.)
>
> Hmm, or this way. So, actually, for such a thing we need a cache node
> which does absolutely nothing; the writes will actually be handled by
> the mirror job. But in this case we can't control the size of the
> actual RAM cache: if the target is slow, we will accumulate unfinished
> bdrv_mirror_top_pwritev calls, which have allocated memory and are
> waiting in a queue for their mirror coroutine to be created.

Oh, sorry, no: active mirror copies data synchronously on write, so it
really should be the same copy pattern as in backup. (Though if we ever
do want a queue in between, something like the throttling sketch below
could bound its memory.)

>
>>
>>> I don't say throw the backup away, but I have several ideas which may
>>> alter the current approach. They may live in parallel with the current
>>> backup path, or replace it in the future, if they turn out to be more
>>> effective.
>> Thing is, contrary to the impression I've probably given, we do want to
>> throw away backup sooner or later.  We want a single block job
>> (blockdev-copy) that unifies mirror, backup, and commit.
>>
>> (mirror already basically supersedes commit, with live commit just
>> being exactly mirror; the main problem is integrating backup.  But with
>> a fleecing node and a RAM cache target, that would suddenly be really
>> simple, I assume.)
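Returning to the RAM-control problem above: one simple way to bound the
memory held by in-flight CBW data would be a byte budget with a
coroutine queue, so that writers wait when the budget is exhausted.
Purely illustrative, built on the existing CoQueue API:

  typedef struct CbwThrottle {
      uint64_t in_flight_bytes;
      uint64_t limit_bytes;
      CoQueue waiters;      /* initialized with qemu_co_queue_init() */
  } CbwThrottle;

  static void coroutine_fn cbw_throttle_get(CbwThrottle *t, uint64_t bytes)
  {
      /* wait until the budget has room for this request */
      while (t->in_flight_bytes + bytes > t->limit_bytes) {
          qemu_co_queue_wait(&t->waiters, NULL);
      }
      t->in_flight_bytes += bytes;
  }

  static void coroutine_fn cbw_throttle_put(CbwThrottle *t, uint64_t bytes)
  {
      t->in_flight_bytes -= bytes;
      qemu_co_queue_restart_all(&t->waiters);
  }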
>>
>> ((All that's missing is sync=top, where the mirror would need to not
>> only check its source (which would be the RAM cache), but also its
>> backing file; and sync=incremental, which just isn't there with mirror
>> at all.  OTOH, it may be possible to implement both modes simply in the
>> fleecing/backup node, so it only copies the respective data to the
>> target and the mirror simply sees nothing else.))
>
> Good idea. If we have the fleecing-cache node as a "view" or "export",
> we can export only selected portions of the data, marking the rest as
> unallocated. Or we need to share bitmaps (global bitmaps, shared
> fleecing state, etc.) with the block job.
>
>>
>> Max
>>
>
>

-- 
Best regards,
Vladimir