* RFC: Qemu backup interface plans
@ 2021-05-17 12:07 Vladimir Sementsov-Ogievskiy
  2021-05-18 16:39 ` Max Reitz
  2021-05-21 22:05 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 2 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-05-17 12:07 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Eric Blake, John Snow, Kevin Wolf, Max Reitz,
	Denis V. Lunev, Nikolay Shirokovskiy, Dmitry Mishin, Igor Sukhih,
	yur, Peter Krempa, libvir-list

Hi all!

I'd like to share and discuss some plans I have for the Qemu backup interface. (Actually, I presented most of this at KVM Forum long ago, but now I'm a lot closer to realizing it. :)


I'll start with a reminder about image fleecing:

We have the image fleecing scheme to export a point-in-time state of the active
disk (see iotest 222):


                                       backup(sync=none)
                      ┌───────────────────────────────────────┐
                      ▼                                       │
┌────────────┐     ┌────────────────┐  backing             ┌─────────────┐
│ NBD export │ ─── │ temp qcow2 img │ ───────────────────▶ │ active disk │
└────────────┘     └────────────────┘                      └─────────────┘
                                                              ▲
┌────────────┐                                               │
│ guest blk  │ ──────────────────────────────────────────────┘
└────────────┘
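For reference, setting up this scheme through QMP looks roughly as follows (modelled on iotest 222; node names, the filename and the NBD address are only examples, and the temp qcow2 is assumed to be created beforehand with qemu-img):

# Python sketch: each dict is one QMP command sent to the monitor.
fleecing_setup = [
    # Attach the pre-created temp overlay, backed by the active-disk node.
    {"execute": "blockdev-add", "arguments": {
        "driver": "qcow2", "node-name": "tmp",
        "file": {"driver": "file", "filename": "fleece.qcow2"},
        "backing": "disk0"}},
    # sync=none backup: no bulk copy, only copy-before-write into tmp.
    {"execute": "blockdev-backup", "arguments": {
        "job-id": "fleece", "device": "disk0",
        "target": "tmp", "sync": "none"}},
    # Export the temp node for the external backup tool.
    {"execute": "nbd-server-start", "arguments": {
        "addr": {"type": "inet",
                 "data": {"host": "0.0.0.0", "port": "10809"}}}},
    {"execute": "nbd-server-add", "arguments": {
        "device": "tmp", "name": "fleece"}},
]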


Actually, the backup job inserts a backup-top filter, so in detail it looks
like this:

                                       backup(sync=none)
                      ┌───────────────────────────────────────┐
                      ▼                                       │
┌────────────┐     ┌────────────────┐  backing             ┌─────────────┐
│ NBD export │ ─── │ temp qcow2 img │ ───────────────────▶ │ active disk │
└────────────┘     └────────────────┘                      └─────────────┘
                      ▲                                       ▲
                      │ target                                │
                      │                                       │
┌────────────┐     ┌────────────────┐  backing               │
│ guest blk  │ ──▶ │   backup-top   │ ───────────────────────┘
└────────────┘     └────────────────┘
  

This scheme is also called external backup or pull backup. It allows an external tool to write the data to the actual backup, while Qemu only provides that data.

We also support incremental external backup: Qemu can manage dirty bitmaps in any way the user wants, and we can export bitmaps through the NBD protocol. So, a client of the NBD export can get the bitmap and read only the "dirty" regions of the exported image.
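In QMP terms, the incremental case adds roughly the following on top of the fleecing sketch above (names are examples, and the exact export command depends on the QEMU version — newer QEMU would use block-export-add with a 'bitmaps' list instead of the 'bitmap' option of nbd-server-add):

# Python sketch: QMP commands for exporting a dirty bitmap with the fleecing node.
incremental_setup = [
    # Created once, before the previous (full) backup, so that it has tracked
    # all guest writes since then.
    {"execute": "block-dirty-bitmap-add", "arguments": {
        "node": "disk0", "name": "bm0", "persistent": True}},
    # Attach the bitmap to the export; the client sees it through the
    # "qemu:dirty-bitmap:bm0" NBD meta context.
    {"execute": "nbd-server-add", "arguments": {
        "device": "tmp", "name": "fleece", "bitmap": "bm0"}},
]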

What we lack in this scheme:

1. Handling a dirty bitmap in the backup-top filter: backup-top does a copy-before-write operation on every guest write, when for incremental backup we are actually interested only in "dirty" regions.

A probable solution would be to allow specifying a bitmap for the sync=none mode of backup, but I think what I propose below is better.

2. [Actually a tricky part of 1]: the possibility to skip copy-before-write operations for regions that were already copied to the final backup. With a normal Qemu backup job, this is achieved by the fact that the block-copy state, with its internal bitmap, is shared between the backup job and the copy-before-write filter.

3. Not a real problem but a fact: the backup block-job does nothing in this scheme; the whole job is done by the filter. So, it would be nice to be able to simply insert/remove the filter and avoid block-job creation and management entirely for external backup. (I'd like to send another RFC on how to insert/remove filters; let's not discuss it here.)


Next, consider internal backup. It has one drawback too:
4. If the target is remote with a slow connection, copy-before-write operations will slow down guest writes appreciably.

It may be solved with the help of image fleecing: we create a temporary qcow2 image and set up the fleecing scheme, but instead of exporting the temp image through NBD we start a second backup with source = temporary image and target = the real backup target (NBD, for example). Still, with such a solution the same problems [1, 2] remain, and 3 becomes worse:

5. We'll have two jobs and two automatically inserted filters, when actually one filter and one job are enough (the first job is needed only to insert a filter, and the second job doesn't need a filter at all).

Note also that this (starting two backup jobs to make a push backup with fleecing) doesn't work now; op-blockers will object. It's simple to fix (in Virtuozzo we live with a downstream-only patch which allows push backup with fleecing based on starting two backup jobs).. But I never sent a patch, as I have a better plan, which will solve all the listed problems.


So, what I propose:

1. We make the backup-top filter public, so that it can be inserted/removed wherever the user wants through QMP (I'll post another RFC on how to properly insert/remove filters, as backup-top is not the only filter that can be usefully inserted somewhere). For this first step I've sent a series today:

   subject: [PATCH 00/21] block: publish backup-top filter
   id: <20210517064428.16223-1-vsementsov@virtuozzo.com>
   patchew: https://patchew.org/QEMU/20210517064428.16223-1-vsementsov@virtuozzo.com/

(Note that one of the things in this series is the rename s/backup-top/copy-before-write/; still, I call it backup-top in this letter.)

This solves [3]. [4, 5] are solved partly: we still have one extra filter, created by the backup block job, and I haven't tested whether this works; probably some op-blockers or permissions have to be tuned. So, let that be step 2:

2. Test that we can start a backup job with source = (target of the backup-top filter), so that we have "push backup with fleecing". Add an option for backup to start without a filter when we don't need copy-before-write operations, so as not to create an extra superfluous filter.

3. Support a bitmap in the backup-top filter, to solve [1].

3.1 And make it possible to modify the bitmap externally, so that the consumer of fleecing can tell the backup-top filter: "I've already copied these blocks, don't bother copying them to the temp image". This is to solve [2].

Still, how will the consumer of fleecing reset the shared bitmap after copying blocks? I have the following idea: we make a "discard-bitmap-filter" filter driver that owns some bitmap and, on a discard request, unsets the corresponding bits. Also, on read, if the read is from a region with unset bits, EINVAL is returned immediately. This way both consumers (backup job and NBD client) can use this interface:

The backup job can simply call discard on its source; we can add an option for this.
An external backup tool will send a TRIM request after reading a region. This way disk space is freed and no extra copy-before-write operations are done. I also have a side idea that we could implement a READ_ONCE flag, so that READ and TRIM can be done in one NBD command. But that works only for clients that don't want to implement any kind of retrying.
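To make the client side concrete, here is a rough sketch of such an external pull-backup tool using libnbd's Python bindings (export and bitmap names are the ones from the examples above, the export is assumed to permit TRIM, and the exact block_status callback signature may vary between libnbd versions):

import nbd  # libnbd Python bindings

BITMAP_CTX = "qemu:dirty-bitmap:bm0"
MAX_REQ = 1 << 20          # keep single NBD requests modest

h = nbd.NBD()
h.add_meta_context(BITMAP_CTX)
h.connect_uri("nbd://localhost:10809/fleece")

dirty = []                 # (offset, length) extents dirty in the bitmap

def collect(metacontext, offset, entries, err):
    # entries is a flat [length, flags, length, flags, ...] list;
    # for the dirty-bitmap context, bit 0 set means "dirty".
    if metacontext != BITMAP_CTX:
        return
    for i in range(0, len(entries), 2):
        length, flags = entries[i], entries[i + 1]
        if flags & 1:
            dirty.append((offset, length))
        offset += length

# One call for the whole disk is a simplification; a real tool would loop,
# since the server may truncate the reply.
h.block_status(h.get_size(), 0, collect)

with open("incremental.backup", "wb") as out:
    for offset, length in dirty:
        pos, end = offset, offset + length
        while pos < end:
            n = min(MAX_REQ, end - pos)
            out.seek(pos)
            out.write(h.pread(n, pos))
            pos += n
        # Done with this area: free space in the temp image and stop further
        # copy-before-write for it (the TRIM described above).
        h.trim(length, offset)

h.shutdown()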



So, finally, here is how it will look (I call backup-top by its new name here, and the "file" child is used instead of "backing", as I propose this change in "[PATCH 00/21] block: publish backup-top filter"):

Pull backup:

┌────────────────────────────────────┐
│             NBD export             │
└────────────────────────────────────┘
   │
   │
┌────────────────────────────────────┐  file   ┌───────────────────────────────────────┐  backing   ┌─────────────┐
│ discard-bitmap filter (bitmap=bm0) │ ──────▶ │            temp qcow2 img             │ ─────────▶ │ active disk │
└────────────────────────────────────┘         └───────────────────────────────────────┘            └─────────────┘
                                                  ▲                                                    ▲
                                                  │ target                                             │
                                                  │                                                    │
┌────────────────────────────────────┐         ┌───────────────────────────────────────┐  file        │
│             guest blk              │ ──────▶ │ copy-before-write filter (bitmap=bm0) │ ─────────────┘
└────────────────────────────────────┘         └───────────────────────────────────────┘



Operations:

- Create temp qcow2 image
- blockdev-add to add the new image, set up its backing, and add the two filters
- some command to actually set the backup-top filter as a child of guest blk. That's the "point in time" of the backup; it should be done during fs-freeze.
- start the NBD export on top of the discard filter (and we can export bitmap bm0 as well, for the client)
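Roughly, in QMP terms (all of this sketches the *proposed* interface: the copy-before-write options, the discard-bitmap driver and the filter-insertion command don't exist in this form today, and the names are only placeholders):

# Python sketch: QMP command sequence for the proposed pull-backup setup.
pull_backup_setup = [
    # Attach the pre-created temp qcow2, backed by the active disk.
    {"execute": "blockdev-add", "arguments": {
        "driver": "qcow2", "node-name": "tmp",
        "file": {"driver": "file", "filename": "fleece.qcow2"},
        "backing": "disk0"}},
    # Proposed copy-before-write filter, limited by bitmap bm0.
    {"execute": "blockdev-add", "arguments": {
        "driver": "copy-before-write", "node-name": "cbw",
        "file": "disk0", "target": "tmp", "bitmap": "bm0"}},
    # Proposed discard-bitmap filter sharing the same bitmap.
    {"execute": "blockdev-add", "arguments": {
        "driver": "discard-bitmap", "node-name": "fleece",
        "file": "tmp", "bitmap": "bm0"}},
    # The "some command" above: splice cbw in under guest blk (done during
    # fs-freeze).  That command is the subject of a separate RFC, so the name
    # here is purely a placeholder.
    {"execute": "x-blockdev-insert-filter", "arguments": {
        "device": "guest-blk", "filter": "cbw"}},
    # Export the discard filter with the bitmap; writable so that the client
    # may send DISCARD (nbd-server-start as in the earlier sketch).
    {"execute": "nbd-server-add", "arguments": {
        "device": "fleece", "name": "fleece",
        "bitmap": "bm0", "writable": True}},
]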

Now NBD client (our external backup tool) can:

  - import the bitmap
  - READ the data
  - send DISCARD requests on already handled areas to save disk space and avoid extra copy-before-write operations on host node


Push backup with fleecing:

┌─────────────────────┐
│ final backup target │
└─────────────────────┘
   ▲
   │ backup job (bitmap=bm0, insert-filter=False, discard-source=True)
   │
┌────────────────────────────────────┐  file   ┌───────────────────────────────────────┐  backing   ┌─────────────┐
│ discard-bitmap filter (bitmap=bm0) │ ──────▶ │            temp qcow2 img             │ ─────────▶ │ active disk │
└────────────────────────────────────┘         └───────────────────────────────────────┘            └─────────────┘
                                                  ▲                                                    ▲
                                                  │ target                                             │
                                                  │                                                    │
┌────────────────────────────────────┐         ┌───────────────────────────────────────┐  file        │
│             guest blk              │ ──────▶ │ copy-before-write filter (bitmap=bm0) │ ─────────────┘
└────────────────────────────────────┘         └───────────────────────────────────────┘


Note that the whole fleecing part is the same; we only need to run a backup job instead of an NBD export.
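In QMP terms, the only change against the pull setup is to start a job instead of the export; the insert-filter and discard-source options are the ones proposed in this letter (taken from the diagram labels), not existing QEMU options:

# Python sketch: the proposed push-backup-with-fleecing job start.
push_backup = {"execute": "blockdev-backup", "arguments": {
    "job-id": "push-backup",
    "device": "fleece",           # the discard-bitmap filter node
    "target": "backup-target",    # e.g. an NBD node added with blockdev-add
    "sync": "incremental",
    "bitmap": "bm0",
    "insert-filter": False,       # proposed: don't create a second CBW filter
    "discard-source": True}}      # proposed: discard from the temp img after copy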



An additional future idea. Why do we need push backup with fleecing? To handle cases with a slow connection to the backup target. In any case where writing to the remote target is slower than writing to a local file, push-backup-with-fleecing will disturb the running guest less than a simple backup job. But this is not free:

1. We need additional disk space on the source. There is no way to fix that (storing data locally is the core idea of the scheme), but the discard-source option for the backup job will help.

2. If the connection is not too slow, then probably part of the CBW (copy-before-write) operations could go to the final target immediately. But in the scheme above, all CBW operations go to the temporary qcow2 image. This can be solved with the help of a ram-cache format driver (to be implemented, of course):


┌─────────────────────┐
│ final backup target │
└─────────────────────┘
   ▲
   │ backup job (bitmap=bm0, insert-filter=False, discard-source=True)
   │
┌────────────────────────────────────┐         ┌───────────────────────────────────────┐  backing   ┌─────────────┐
│ discard-bitmap filter (bitmap=bm0) │         │            temp qcow2 img             │ ─────────▶ │ active disk │
└────────────────────────────────────┘         └───────────────────────────────────────┘            └─────────────┘
                                  │               ▲                                                    ▲
                                  │               │ file                                               │
                                  │               │                                                    │
                                  │      file   ┌───────────┐                                          │
                                  └───────────▶ │ ram-cache │                                          │
                                                └───────────┘                                          │
                                                  ▲                                                    │
                                                  │ target                                             │
                                                  │                                                    │
┌────────────────────────────────────┐         ┌───────────────────────────────────────┐  file        │
│             guest blk              │ ──────▶ │ copy-before-write filter (bitmap=bm0) │ ─────────────┘
└────────────────────────────────────┘         └───────────────────────────────────────┘

This way data from the copy-before-write filter goes first to ram-cache, and the backup job can read it from RAM. ram-cache automatically flushes data to the temp qcow2 image when the RAM-usage limit is reached. We'll also need a way to tell the backup job that it should first read clusters that are cached in RAM, and only then other clusters. So, we'll have a priority for clusters to be copied by block-copy:
1. clusters in ram-cache
2. clusters not in temp img (to avoid copy-before-write operations in future)
3. clusters in temp img.

This will be a kind of block_status() thing that allows a block driver to give recommendations on the reading sequence to make it efficient. Note also that there is another benefit to such a thing: we'll implement this callback in the qcow2 driver, so that backup reads clusters not in guest cluster order but in host cluster order, to read more sequentially, which should bring better performance on rotating disks.
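As a toy illustration of that priority (plain Python, nothing to do with the real block-copy code; the three sets are assumed to be maintained elsewhere):

def next_cluster_to_copy(dirty, in_ram_cache, in_temp_img):
    """Pick the next cluster for block-copy to process; the arguments are
    sets of cluster indices."""
    for cluster in dirty:
        if cluster in in_ram_cache:
            return cluster            # 1: cheap to read, and frees RAM
    for cluster in dirty:
        if cluster not in in_temp_img:
            return cluster            # 2: avoids a future copy-before-write
    return next(iter(dirty), None)    # 3: whatever is left in the temp image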


-- 
Best regards,
Vladimir


* Re: RFC: Qemu backup interface plans
  2021-05-17 12:07 RFC: Qemu backup interface plans Vladimir Sementsov-Ogievskiy
@ 2021-05-18 16:39 ` Max Reitz
  2021-05-19  6:11   ` Vladimir Sementsov-Ogievskiy
  2021-05-21 22:05 ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 9+ messages in thread
From: Max Reitz @ 2021-05-18 16:39 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Peter Krempa, libvir-list, Dmitry Mishin,
	Igor Sukhih, qemu-devel, yur, Nikolay Shirokovskiy,
	Denis V. Lunev, John Snow

Hi,

Your proposal sounds good to me in general.  Some small independent
building blocks that seem to make sense to me.


On 17.05.21 14:07, Vladimir Sementsov-Ogievskiy wrote:

[...]

> What we lack in this scheme:
> 
> 1. handling dirty bitmap in backup-top filter: backup-top does 
> copy-before-write operation on any guest write, when actually we are 
> interested only in "dirty" regions for incremental backup
> 
> Probable solution would allowing specifying bitmap for sync=none mode of 
> backup, but I think what I propose below is better.
> 
> 2. [actually it's a tricky part of 1]: possibility to not do 
> copy-before-write operations for regions that was already copied to 
> final backup. With normal Qemu backup job, this is achieved by the fact 
> that block-copy state with its internal bitmap is shared between backup 
> job and copy-before-write filter.
> 
> 3. Not a real problem but fact: backup block-job does nothing in the 
> scheme, the whole job is done by filter. So, it would be interesting to 
> have a possibility to simply insert/remove the filter, and avoid 
> block-job creation and managing at all for external backup. (and I'd 
> like to send another RFC on how to insert/remove filters, let's not 
> discuss it here).
> 
> 
> Next. Think about internal backup. It has one drawback too:
> 4. If target is remote with slow connection, copy-before-write 
> operations will slow down guest writes appreciably.
> 
> It may be solved with help of image fleecing: we create temporary qcow2 
> image, setup fleecing scheme, and instead of exporting temp image 
> through NBD we start a second backup with source = temporary image and 
> target would be real backup target (NBD for example).

How would a second backup work here?  Wouldn’t one want a mirror job to 
copy the data off to the real target?

(Because I see backup as something intrinsically synchronous, whereas 
mirror by default is rather lazy.)

[Note from future me where I read more below: I see you acknowledge that 
you’ll need to modify backup to do what you need here, i.e. not do any 
CBW operations.  So it’s effectively the same as a mirror that ignores 
new dirty areas.  Which could work without changing mirror if block-copy 
were to set BDRV_REQ_WRITE_UNCHANGED for the fleecing case, and 
bdrv_co_write_req_finish() would skip bdrv_set_dirty() for such writes.]

I mean, still has the problem that the mirror job can’t tell the CBW 
filter which areas are already copied off and so don’t need to be 
preserved anymore, but...

> Still, with such 
> solution there are same [1,2] problems, 3 becomes worse:

Not sure how 3 can become worse when you said above it isn’t a real 
problem (to which I agree).

> 5. We'll have two jobs and two automatically inserted filters, when 
> actually one filter and one job are enough (as first job is needed only 
> to insert a filter, second job doesn't need a filter at all).
> 
> Note also, that this (starting two backup jobs to make push backup with 
> fleecing) doesn't work now, op-blockers will be against. It's simple to 
> fix (and in Virtuozzo we live with downstream-only patch, which allows 
> push backup with fleecing, based on starting two backup jobs).. But I 
> never send a patch, as I have better plan, which will solve all listed 
> problems.
> 
> 
> So, what I propose:
> 
> 1. We make backup-top filter public, so that it could be 
> inserted/removed where user wants through QMP (how to properly 
> insert/remove filter I'll post another RFC, as backup-top is not the 
> only filter that can be usefully inserted somewhere). For this first 
> step I've sent a series today:
> 
>    subject: [PATCH 00/21] block: publish backup-top filter
>    id: <20210517064428.16223-1-vsementsov@virtuozzo.com>
>    patchew: 
> https://patchew.org/QEMU/20210517064428.16223-1-vsementsov@virtuozzo.com/
> 
> (note, that one of things in this series is rename 
> s/backup-top/copy-before-write/, still, I call it backup-top in this 
> letter)
> 
> This solves [3]. [4, 5] are solved partly: we still have one extra 
> filter, created by backup block jobs, and also I didn't test does this 
> work, probably some op-blockers or permissions should be tuned. So, let 
> it be step 2:
> 
> 2. Test, that we can start backup job with source = (target of 
> backup-top filter), so that we have "push backup with fleecing". Make an 
> option for backup to start without a filter, when we don't need 
> copy-before-write operations, to not create extra superfluous filter.

OK, so the backup job is not really a backup job, but just anything that 
copies data.

> 3. Support bitmap in backup-top filter, to solve [1]
> 
> 3.1 and make it possible to modify the bitmap externally, so that 
> consumer of fleecing can say to backup-top filter: I've already copied 
> these blocks, don't bother with copying them to temp image". This is to 
> solve [2].
> 
> Still, how consumer of fleecing will reset shared bitmap after copying 
> blocks? I have the following idea: we make a "discard-bitmap-filter" 
> filter driver, that own some bitmap and on discard request unset 
> corresponding bits. Also, on read, if read from the region with unset 
> bits the EINVAL returned immediately. This way both consumers (backup 
> job and NBD client) are able to use this interface:

Sounds almost like a 'bitmap' protocol block driver that, given some 
dirty bitmap, basically just represents that bitmap as a block device. 
*shrug*

Anyway.  I think I’m wrong, it’s something very different, and that’s 
clear when I turn your proposal around:  What this “filter” would do 
primarily is to restrict access to its filtered node based on the bitmap 
given to it, i.e. only dirty areas can be read.  (I say “filter” because 
I’m not sure it’s a filter if it restricts the data that can be read.) 
Secondarily, the bitmap can be cleared by sending discards.

You know what, that would allow implementing backing files for formats that
don’t support it.  Like, the overlay and the backing file are both 
children of a FIFO quorum node, where the overlay has the bitmap filter 
on top, and is the first child.  If writing to the bitmap filter then 
also marks the bitmap dirty there (which it logically should, I 
think)...  Don’t know if that’s good or not. :)

> Backup job can simply call discard on source, we can add an option for 
> this.

Hm.  I would have expected the most straightforward solution would be to 
share the job’s (backup or mirror, doesn’t matter) dirty bitmap with the 
CBW node, so that the latter only copies what the former still considers 
dirty.  Is the bitmap filter really still necessary then?

Oh, I see, discarding also helps to save disk space.  Makes sense then.

> External backup tool will send TRIM request after reading some region. 
> This way disk space will be freed and no extra copy-before-write 
> operations will be done. I also have a side idea that we can implement 
> READ_ONCE flag, so that READ and TRIM can be done in one NBD command. 
> But this works only for clients that don't want to implement any kind of 
> retrying.

[...]

> This way data from copy-before-write filter goes first to ram-cache, and 
> backup job could read it from ram. ram-cache will automatically flush 
> data to temp qcow2 image, when ram-usage limit is reached. We'll also 
> need a way to say backup-job that it should first read clusters that are 
> cached in ram, and only then other clusters. So, we'll have a priory for 
> clusters to be copied by block-copy:
> 1. clusters in ram-cache
> 2. clusters not in temp img (to avoid copy-before-write operations in 
> future)
> 3. clusters in temp img.
> 
> This will be a kind of block_status() thing, that allows a block driver 
> to give recommendations on sequence of reading to be effective.

You mean block_status should give that recommendation?  Is that really 
necessary?  I think this is a rather special case, so block-copy could 
figure that out itself.  All it needs to do is for any dirty area 
determine how deep in the backing chain it is: Is it in the ram-cache, 
is it in temp image, or is it below both?  It should be able to figure 
that out with the *file information that block_status returns.

> Not 
> also, that there is another benefit of such thing: we'll implement this 
> callback in qcow2 driver, so that backup will read clusters not in guest 
> cluster order, but in host cluster order, to read more sequentially, 
> which should bring better performance on rotating disks.

I’m not exactly sure how you envision this to work, but block_status 
also already gives you the host offset in *map.

Max



* Re: RFC: Qemu backup interface plans
  2021-05-18 16:39 ` Max Reitz
@ 2021-05-19  6:11   ` Vladimir Sementsov-Ogievskiy
  2021-05-19 11:20     ` Kevin Wolf
  2021-05-25  8:50     ` Max Reitz
  0 siblings, 2 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-05-19  6:11 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: qemu-devel, Eric Blake, John Snow, Kevin Wolf, Denis V. Lunev,
	Nikolay Shirokovskiy, Dmitry Mishin, Igor Sukhih, yur,
	Peter Krempa, libvir-list

18.05.2021 19:39, Max Reitz wrote:
> Hi,
> 
> Your proposal sounds good to me in general.  Some small independent building blocks that seems to make sense to me.

Thanks! I hope it's not too difficult to read and understand my English.

> 
> 
> On 17.05.21 14:07, Vladimir Sementsov-Ogievskiy wrote:
> 
> [...]
> 
>> What we lack in this scheme:
>>
>> 1. handling dirty bitmap in backup-top filter: backup-top does copy-before-write operation on any guest write, when actually we are interested only in "dirty" regions for incremental backup
>>
>> Probable solution would allowing specifying bitmap for sync=none mode of backup, but I think what I propose below is better.
>>
>> 2. [actually it's a tricky part of 1]: possibility to not do copy-before-write operations for regions that was already copied to final backup. With normal Qemu backup job, this is achieved by the fact that block-copy state with its internal bitmap is shared between backup job and copy-before-write filter.
>>
>> 3. Not a real problem but fact: backup block-job does nothing in the scheme, the whole job is done by filter. So, it would be interesting to have a possibility to simply insert/remove the filter, and avoid block-job creation and managing at all for external backup. (and I'd like to send another RFC on how to insert/remove filters, let's not discuss it here).
>>
>>
>> Next. Think about internal backup. It has one drawback too:
>> 4. If target is remote with slow connection, copy-before-write operations will slow down guest writes appreciably.
>>
>> It may be solved with help of image fleecing: we create temporary qcow2 image, setup fleecing scheme, and instead of exporting temp image through NBD we start a second backup with source = temporary image and target would be real backup target (NBD for example).
> 
> How would a second backup work here?  Wouldn’t one want a mirror job to copy the data off to the real target?
> 
> (Because I see backup as something intrinsically synchronous, whereas mirror by default is rather lazy.)
> 
> [Note from future me where I read more below: I see you acknowledge that you’ll need to modify backup to do what you need here, i.e. not do any CBW operations.  So it’s effectively the same as a mirror that ignores new dirty areas.  Which could work without changing mirror if block-copy were to set BDRV_REQ_WRITE_UNCHANGED for the fleecing case, and bdrv_co_write_req_finish() would skip bdrv_set_dirty() for such writes.]

I just feel closer to the backup block-job than to mirror :) In the end, yes, there is no real difference in the interface. But in the implementation, I prefer to continue developing block-copy. I hope that eventually all jobs and img-convert will work through block-copy.

(and I'll need BDRV_REQ_WRITE_UNCHANGED anyway for fleecing, so the user can use mirror or backup)

> 
> I mean, still has the problem that the mirror job can’t tell the CBW filter which areas are already copied off and so don’t need to be preserved anymore, but...
> 
>> Still, with such solution there are same [1,2] problems, 3 becomes worse:
> 
> Not sure how 3 can become worse when you said above it isn’t a real problem (to which I agree).

It's my perfectionism :) Yes, it still isn't a problem, but the number of extra user-visible objects in the architecture increases, which I think is not good.

> 
>> 5. We'll have two jobs and two automatically inserted filters, when actually one filter and one job are enough (as first job is needed only to insert a filter, second job doesn't need a filter at all).
>>
>> Note also, that this (starting two backup jobs to make push backup with fleecing) doesn't work now, op-blockers will be against. It's simple to fix (and in Virtuozzo we live with downstream-only patch, which allows push backup with fleecing, based on starting two backup jobs).. But I never send a patch, as I have better plan, which will solve all listed problems.
>>
>>
>> So, what I propose:
>>
>> 1. We make backup-top filter public, so that it could be inserted/removed where user wants through QMP (how to properly insert/remove filter I'll post another RFC, as backup-top is not the only filter that can be usefully inserted somewhere). For this first step I've sent a series today:
>>
>>    subject: [PATCH 00/21] block: publish backup-top filter
>>    id: <20210517064428.16223-1-vsementsov@virtuozzo.com>
>>    patchew: https://patchew.org/QEMU/20210517064428.16223-1-vsementsov@virtuozzo.com/
>>
>> (note, that one of things in this series is rename s/backup-top/copy-before-write/, still, I call it backup-top in this letter)
>>
>> This solves [3]. [4, 5] are solved partly: we still have one extra filter, created by backup block jobs, and also I didn't test does this work, probably some op-blockers or permissions should be tuned. So, let it be step 2:
>>
>> 2. Test, that we can start backup job with source = (target of backup-top filter), so that we have "push backup with fleecing". Make an option for backup to start without a filter, when we don't need copy-before-write operations, to not create extra superfluous filter.
> 
> OK, so the backup job is not really a backup job, but just anything that copies data.

Not quite. For backup without a filter we should protect the source from changing, by unsharing the WRITE permission on it.

I'll try to avoid adding an option. The logic should work like in the commit job: if there are current writers on the source, we create a filter. If there are no writers, we just unshare writes and go without a filter. And for this, the copy-before-write filter should be able to do WRITE_UNCHANGED in the fleecing case.

> 
>> 3. Support bitmap in backup-top filter, to solve [1]
>>
>> 3.1 and make it possible to modify the bitmap externally, so that consumer of fleecing can say to backup-top filter: I've already copied these blocks, don't bother with copying them to temp image". This is to solve [2].
>>
>> Still, how consumer of fleecing will reset shared bitmap after copying blocks? I have the following idea: we make a "discard-bitmap-filter" filter driver, that own some bitmap and on discard request unset corresponding bits. Also, on read, if read from the region with unset bits the EINVAL returned immediately. This way both consumers (backup job and NBD client) are able to use this interface:
> 
> Sounds almost like a 'bitmap' protocol block driver that, given some dirty bitmap, basically just represents that bitmap as a block device. *shrug*
> 
> Anyway.  I think I’m wrong, it’s something very different, and that’s clear when I turn your proposal around:  What this “filter” would do primarily is to restrict access to its filtered node based on the bitmap given to it, i.e. only dirty areas can be read.  (I say “filter” because I’m not sure it’s a filter if it restricts the data that can be read.) Secondarily, the bitmap can be cleared by sending discards.

What about rethinking filters as "drivers that don't contain any data, so a filter can always be removed from the chain without losing any data"? And allowing filters to restrict access to data areas. Or even to change the data (and the raw format becomes a filter).

Note that the backup-top filter may already restrict access: if the copy-before-write operation fails, the filter doesn't propagate the write operation to the file child but returns an error to the guest.

> 
> You know what, that would allow implement backing files for formats that don’t support it.  Like, the overlay and the backing file are both children of a FIFO quorum node, where the overlay has the bitmap filter on top, and is the first child.  If writing to the bitmap filter then also marks the bitmap dirty there (which it logically should, I think)...  Don’t know if that’s good or not. :)

That's interesting. Still, I've never heard of anyone requesting such a feature.. But it may influence what to call this filter. Maybe not discard-bitmap-filter, but just bitmap-filter. And we can add different options to set up how the filter handles requests.

For example, I need the following:

+-------+-------------------+---------------------------+-------------------+
|       | read              | write                     | discard           |
+-------+-------------------+---------------------------+-------------------+
| clear | EINVAL            | propagate to file         | EINVAL            |
|       |                   | mark dirty                |                   |
|       |                   | (or may be better EINVAL) |                   |
+-------+-------------------+---------------------------+-------------------+
| dirty | propagate to file | propagate to file         | propagate to file |
|       |                   | (or may be better EINVAL) | mark clean        |
+-------+-------------------+---------------------------+-------------------+

And to realize backing behavior:

+-------+----------------------+-------------------+--------------------+
|       | read                 | write             | discard            |
+-------+----------------------+-------------------+--------------------+
| clear | propagate to backing | propagate to file | maybe zeroize file |
|       |                      | mark dirty        | mark dirty         |
+-------+----------------------+-------------------+--------------------+
| dirty | propagate to file    | propagate to file | propagate to file  |
+-------+----------------------+-------------------+--------------------+
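A toy model of the first table (plain Python, only to pin down the semantics; per-cluster granularity and errno handling are simplifications):

import errno

class BitmapFilter:
    """Fleecing behaviour from the first table: bitmap is a set of dirty
    cluster indices, file is the filtered child node."""
    def __init__(self, bitmap, file):
        self.bitmap, self.file = bitmap, file

    def read(self, cluster):
        if cluster not in self.bitmap:
            raise OSError(errno.EINVAL, "read of clean cluster")
        return self.file.read(cluster)      # dirty: propagate to file

    def write(self, cluster, data):
        self.file.write(cluster, data)      # propagate to file
        self.bitmap.add(cluster)            # mark dirty

    def discard(self, cluster):
        if cluster not in self.bitmap:
            raise OSError(errno.EINVAL, "discard of clean cluster")
        self.file.discard(cluster)          # propagate to file
        self.bitmap.discard(cluster)        # mark clean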

> 
>> Backup job can simply call discard on source, we can add an option for this.
> 
> Hm.  I would have expected the most straightforward solution would be to share the job’s (backup or mirror, doesn’t matter) dirty bitmap with the CBW node, so that the latter only copies what the former still considers dirty.  Is the bitmap filter really still necessary then?

Yes, I think the user-given bitmap should be shared between all the consumers. Still, the internal bitmaps of the block-copy entities would be different:

For example, the second block-copy, which copies to the final target, clears bits at the start of an operation and may set them back to 1 if the copy of that operation fails. If the guest writes in the meantime, the copy-before-write filter should still do its block-copy, and it doesn't matter that the bits are zero in the second block-copy.

Or another: the first block-copy (in the copy-before-write filter) clears bits of its internal bitmap when a cluster is copied. But that bit has no relation to whether the second block-copy should copy this cluster or not.

Currently the internal bitmap of block-copy is just initialized from the user-given bitmap, and then the user-given bitmap is unchanged (except at backup job finalization, if the options say to transactionally update the user-given bitmap, but I think that mode is not for our case). We'll need a way to modify the user-given bitmap so that it influences block-copy. So block-copy will have to consider both bitmaps and take the AND of them.. Or something like that. We'll see how I implement this :)

> 
> Oh, I see, discarding also helps to save disk space.  Makes sense then.

Note also that I want the fleecing scheme to work the same way for push-backup-with-fleecing and for pull-backup, so that the user implements it once. Of course, block-copy could have simpler options than adding a filter and additional discard logic. But for external backup this seems the most straightforward solution. So, let's just reuse it for push-backup-with-fleecing.

> 
>> External backup tool will send TRIM request after reading some region. This way disk space will be freed and no extra copy-before-write operations will be done. I also have a side idea that we can implement READ_ONCE flag, so that READ and TRIM can be done in one NBD command. But this works only for clients that don't want to implement any kind of retrying.
> 
> [...]
> 
>> This way data from copy-before-write filter goes first to ram-cache, and backup job could read it from ram. ram-cache will automatically flush data to temp qcow2 image, when ram-usage limit is reached. We'll also need a way to say backup-job that it should first read clusters that are cached in ram, and only then other clusters. So, we'll have a priory for clusters to be copied by block-copy:
>> 1. clusters in ram-cache
>> 2. clusters not in temp img (to avoid copy-before-write operations in future)
>> 3. clusters in temp img.
>>
>> This will be a kind of block_status() thing, that allows a block driver to give recommendations on sequence of reading to be effective.
> 
> You mean block_status should give that recommendation?  Is that really necessary?  I think this is a rather special case, so block-copy could figure that out itself.  All it needs to do is for any dirty area determine how deep in the backing chain it is: Is it in the ram-cache, is it in temp image, or is it below both?  It should be able to figure that out with the *file information that block_status returns.

No, I don't propose to extend block_status(); it should be a separate interface.

> 
>> Not also, that there is another benefit of such thing: we'll implement this callback in qcow2 driver, so that backup will read clusters not in guest cluster order, but in host cluster order, to read more sequentially, which should bring better performance on rotating disks.
> 
> I’m not exactly sure how you envision this to work, but block_status also already gives you the host offset in *map.
> 

But block-status doesn't provide a way to read sequentially. For this, the user would have to call block-status several times until the whole disk is covered, then sort the segments by host offset. I wonder whether it could be implemented as some iterator, like

read_iter = bdrv_get_sequential_read_iter(source)

while (extents = bdrv_read_next(read_iter)):
   for ext in extents:
     start_writing_task(target, ext.offset, ext.bytes, ext.qiov)

where bdrv_read_next would read guest data in host-cluster order..


-- 
Best regards,
Vladimir


* Re: RFC: Qemu backup interface plans
  2021-05-19  6:11   ` Vladimir Sementsov-Ogievskiy
@ 2021-05-19 11:20     ` Kevin Wolf
  2021-05-19 11:49       ` Vladimir Sementsov-Ogievskiy
  2021-05-25  8:50     ` Max Reitz
  1 sibling, 1 reply; 9+ messages in thread
From: Kevin Wolf @ 2021-05-19 11:20 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Peter Krempa, qemu-block, libvir-list, Dmitry Mishin,
	Igor Sukhih, qemu-devel, Max Reitz, yur, Nikolay Shirokovskiy,
	Denis V. Lunev, John Snow

Am 19.05.2021 um 08:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > 2. Test, that we can start backup job with source = (target of
> > > backup-top filter), so that we have "push backup with fleecing".
> > > Make an option for backup to start without a filter, when we don't
> > > need copy-before-write operations, to not create extra superfluous
> > > filter.
> > 
> > OK, so the backup job is not really a backup job, but just anything
> > that copies data.
> 
> Not quite. For backup without a filter we should protect source from
> changing, by unsharing WRITE permission on it.
> 
> I'll try to avoid adding an option. The logic should work like in
> commit job: if there are current writers on source we create filter.
> If there no writers, we just unshare writes and go without a filter.
> And for this copy-before-write filter should be able to do
> WRITE_UNCHANGED in case of fleecing.

If we ever get to the point where we would make a block-copy job visible
to the user, I wouldn't copy the restrictions from the current jobs, but
keep it really generic to cover all cases.

There is no way for the QMP command starting the job to know what the
user is planning to do with the image in the future. Even if it's
currently read-only, the user may want to add a writer later.

I think this means that we want to always add a filter node, and then
possibly dynamically switch between modes if being read-only provides a
significant advantage for the job.

Kevin



* Re: RFC: Qemu backup interface plans
  2021-05-19 11:20     ` Kevin Wolf
@ 2021-05-19 11:49       ` Vladimir Sementsov-Ogievskiy
  2021-05-19 12:00         ` Kevin Wolf
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-05-19 11:49 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Max Reitz, qemu-block, qemu-devel, Eric Blake, John Snow,
	Denis V. Lunev, Nikolay Shirokovskiy, Dmitry Mishin, Igor Sukhih,
	yur, Peter Krempa, libvir-list

19.05.2021 14:20, Kevin Wolf wrote:
> Am 19.05.2021 um 08:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 2. Test, that we can start backup job with source = (target of
>>>> backup-top filter), so that we have "push backup with fleecing".
>>>> Make an option for backup to start without a filter, when we don't
>>>> need copy-before-write operations, to not create extra superfluous
>>>> filter.
>>>
>>> OK, so the backup job is not really a backup job, but just anything
>>> that copies data.
>>
>> Not quite. For backup without a filter we should protect source from
>> changing, by unsharing WRITE permission on it.
>>
>> I'll try to avoid adding an option. The logic should work like in
>> commit job: if there are current writers on source we create filter.
>> If there no writers, we just unshare writes and go without a filter.
>> And for this copy-before-write filter should be able to do
>> WRITE_UNCHANGED in case of fleecing.
> 
> If we ever get to the point where we would make a block-copy job visible
> to the user, I wouldn't copy the restrictions from the current jobs, but
> keep it really generic to cover all cases.
> 
> There is no way for the QMP command starting the job to know what the
> user is planning to do with the image in the future. Even if it's
> currently read-only, the user may want to add a writer later.
> 
> I think this means that we want to always add a filter node, and then
> possibly dynamically switch between modes if being read-only provides a
> significant advantage for the job.
> 
> Kevin
> 

Still, in the push-backup-with-fleecing scheme we really don't need the second filter, so why insert an extra thing into the block graph?

I do see your point that the user may want to add a writer later. Still, I'd be surprised if such use cases exist now.

What about the following:

add a tri-state source-mode parameter for backup:

auto: insert a filter iff there are existing writers [default]
filtered: insert a filter unconditionally
immutable: don't insert a filter; fail if there are existing writers, and creating writers during the block job becomes impossible
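In other words, something like this (a sketch of the proposed parameter only; none of it exists in QEMU today):

def backup_source_setup(source_mode, source_has_writers):
    """Decide how the backup job treats its source node."""
    if source_mode == "filtered":
        return "insert-filter"
    if source_mode == "immutable":
        if source_has_writers:
            raise RuntimeError("source already has writers")
        return "unshare-write"    # also prevents attaching writers later
    # "auto" (default): filter only if somebody currently writes
    return "insert-filter" if source_has_writers else "unshare-write"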

-- 
Best regards,
Vladimir


* Re: RFC: Qemu backup interface plans
  2021-05-19 11:49       ` Vladimir Sementsov-Ogievskiy
@ 2021-05-19 12:00         ` Kevin Wolf
  0 siblings, 0 replies; 9+ messages in thread
From: Kevin Wolf @ 2021-05-19 12:00 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: Peter Krempa, qemu-block, libvir-list, Dmitry Mishin,
	Igor Sukhih, qemu-devel, Max Reitz, yur, Nikolay Shirokovskiy,
	Denis V. Lunev, John Snow

Am 19.05.2021 um 13:49 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 19.05.2021 14:20, Kevin Wolf wrote:
> > Am 19.05.2021 um 08:11 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > > 2. Test, that we can start backup job with source = (target of
> > > > > backup-top filter), so that we have "push backup with fleecing".
> > > > > Make an option for backup to start without a filter, when we don't
> > > > > need copy-before-write operations, to not create extra superfluous
> > > > > filter.
> > > > 
> > > > OK, so the backup job is not really a backup job, but just anything
> > > > that copies data.
> > > 
> > > Not quite. For backup without a filter we should protect source from
> > > changing, by unsharing WRITE permission on it.
> > > 
> > > I'll try to avoid adding an option. The logic should work like in
> > > commit job: if there are current writers on source we create filter.
> > > If there no writers, we just unshare writes and go without a filter.
> > > And for this copy-before-write filter should be able to do
> > > WRITE_UNCHANGED in case of fleecing.
> > 
> > If we ever get to the point where we would make a block-copy job visible
> > to the user, I wouldn't copy the restrictions from the current jobs, but
> > keep it really generic to cover all cases.
> > 
> > There is no way for the QMP command starting the job to know what the
> > user is planning to do with the image in the future. Even if it's
> > currently read-only, the user may want to add a writer later.
> > 
> > I think this means that we want to always add a filter node, and then
> > possibly dynamically switch between modes if being read-only provides a
> > significant advantage for the job.
> 
> Still, in push-backup-with-fleecing scheme we really don't need the
> second filter, so why to insert extra thing to block graph?
> 
> I see your point still, that user may want to add writer later. Still,
> I'd be surprised if such use-cases exist now.
> 
> What about the following:
> 
> add some source-mode tri-state parameter for backup:
> 
> auto: insert filter iff there are existing writers [default]
> filtered: insert filter unconditionally
> immutable: don't insert filter. will fail if there are existing
> writers, and creating writers during block-job would be impossible

Yes, that's an option, too.

Kevin



* Re: RFC: Qemu backup interface plans
  2021-05-17 12:07 RFC: Qemu backup interface plans Vladimir Sementsov-Ogievskiy
  2021-05-18 16:39 ` Max Reitz
@ 2021-05-21 22:05 ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-05-21 22:05 UTC (permalink / raw)
  To: qemu-block
  Cc: qemu-devel, Eric Blake, John Snow, Kevin Wolf, Max Reitz,
	Denis V. Lunev, Nikolay Shirokovskiy, Dmitry Mishin, Igor Sukhih,
	yur, Peter Krempa, libvir-list

17.05.2021 15:07, Vladimir Sementsov-Ogievskiy wrote:
> 3.1 and make it possible to modify the bitmap externally, so that consumer of fleecing can say to backup-top filter: I've already copied these blocks, don't bother with copying them to temp image". This is to solve [2].
> 
> Still, how consumer of fleecing will reset shared bitmap after copying blocks? I have the following idea: we make a "discard-bitmap-filter" filter driver, that own some bitmap and on discard request unset corresponding bits. Also, on read, if read from the region with unset bits the EINVAL returned immediately. This way both consumers (backup job and NBD client) are able to use this interface:
> 
> Backup job can simply call discard on source, we can add an option for this.
> External backup tool will send TRIM request after reading some region. This way disk space will be freed and no extra copy-before-write operations will be done. I also have a side idea that we can implement READ_ONCE flag, so that READ and TRIM can be done in one NBD command. But this works only for clients that don't want to implement any kind of retrying.
> 
> 
> 
> So, finally, how will it look (here I call backup-top with a new name, and "file" child is used instead of "backing", as this change I propose in "[PATCH 00/21] block: publish backup-top filter"):
> 
> Pull backup:
> 
> ┌────────────────────────────────────┐
> │             NBD export             │
> └────────────────────────────────────┘
>    │
>    │
> ┌────────────────────────────────────┐  file   ┌───────────────────────────────────────┐  backing   ┌─────────────┐
> │ discard-bitmap filter (bitmap=bm0) │ ──────▶ │            temp qcow2 img             │ ─────────▶ │ active disk │
> └────────────────────────────────────┘         └───────────────────────────────────────┘            └─────────────┘
>                                                   ▲                                                    ▲
>                                                   │ target                                             │
>                                                   │                                                    │
> ┌────────────────────────────────────┐         ┌───────────────────────────────────────┐  file        │
> │             guest blk              │ ──────▶ │ copy-before-write filter (bitmap=bm0) │ ─────────────┘
> └────────────────────────────────────┘         └───────────────────────────────────────┘


Now I think that discard-bitmap is not a really good idea:

1. If it is separate from the internal BlockCopyState::copy_bitmap, then we'll have to consider both bitmaps inside the block-copy code, which is not a small complication.

2. Using BlockCopyState::copy_bitmap directly seems possible:

The user creates copy_bitmap by hand and passes it to the copy-before-write filter as an option; block-copy will use this bitmap directly.

The same bitmap is passed to the discard-bitmap filter, so that it can clear bits on discards.

Pros:
  - we don't make block-copy code more complicated

Note then that BlockCopyState::copy_bitmap is used in the block-copy process as follows:

  - copy tasks are created to copy dirty areas
  - when a task is created, the corresponding dirty area is cleared (so that tasks can't be created on the same area twice)
  - if a task fails, the corresponding area becomes dirty again (so that it can be retried)

So we have
Cons:
  - if a discard comes while a task is in flight, the bits are already clean. Then, if the task fails, the bits become dirty again. That's wrong. Not very bad, and not worth the complications of [1], but still, keep it in mind [*]
  - we have to use a separate bitmap in the discard-filter to fail reads from non-dirty areas (because BlockCopyState::copy_bitmap is cleared by the block-copy process, not only by discards). That is worse.. It just means that the discard-filter is a strange thing and we don't have a good understanding of what it should do. Good documentation will help, of course, but...

...this all leads me to a new idea:

Let's go without the discard-bitmap filter, with this simple logic:

If a discard is done on the block-copy target, let's inform block-copy about it. So block-copy can:

  - clear the corresponding area in its internal bitmap
  - [not really important, but it addresses [*]]: cancel in-flight tasks in the discarded area.

Pros: we avoid the extra filter.
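A sketch of that logic (plain Python, not the real block-copy code; the bitmap and task objects stand in for the internal structures):

def block_copy_target_discarded(copy_bitmap, in_flight, offset, nbytes):
    """Called when a discard lands on the block-copy target."""
    copy_bitmap.clear(offset, nbytes)      # nothing left to copy here
    for task in list(in_flight):
        if task.offset < offset + nbytes and offset < task.offset + task.nbytes:
            task.cancel()                  # addresses [*] above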

-- 
Best regards,
Vladimir


* Re: RFC: Qemu backup interface plans
  2021-05-19  6:11   ` Vladimir Sementsov-Ogievskiy
  2021-05-19 11:20     ` Kevin Wolf
@ 2021-05-25  8:50     ` Max Reitz
  2021-05-25  9:19       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 9+ messages in thread
From: Max Reitz @ 2021-05-25  8:50 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, Peter Krempa, libvir-list, Dmitry Mishin,
	Igor Sukhih, qemu-devel, yur, Nikolay Shirokovskiy,
	Denis V. Lunev, John Snow

On 19.05.21 08:11, Vladimir Sementsov-Ogievskiy wrote:
> 18.05.2021 19:39, Max Reitz wrote:

[...]

>> On 17.05.21 14:07, Vladimir Sementsov-Ogievskiy wrote:

[...]

>>> Not also, that there is another benefit of such thing: we'll 
>>> implement this callback in qcow2 driver, so that backup will read 
>>> clusters not in guest cluster order, but in host cluster order, to 
>>> read more sequentially, which should bring better performance on 
>>> rotating disks.
>>
>> I’m not exactly sure how you envision this to work, but block_status 
>> also already gives you the host offset in *map.
>>
> 
> But block-status doesn't give a possibility to read sequentially. For 
> this, user should call block-status several times until the whole disk 
> covered, then sort the segments by host offset. I wonder, could it be 
> implemented as some iterator, like
> 
> read_iter = bdrv_get_sequential_read_iter(source)
> 
> while (extents = bdrv_read_next(read_iter)):
>    for ext in extents:
>      start_writing_task(target, ext.offset, ext.bytes, ext.qiov)
> 
> where bdrv_read_next will read guest data in host-cluster-sequence..

How would you implement this, though?  qcow2 doesn’t have a reverse 
mapping either, so it too would need to read all L2 table entries and 
sort them, wouldn’t it?

Max



* Re: RFC: Qemu backup interface plans
  2021-05-25  8:50     ` Max Reitz
@ 2021-05-25  9:19       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-05-25  9:19 UTC (permalink / raw)
  To: Max Reitz, qemu-block
  Cc: qemu-devel, Eric Blake, John Snow, Kevin Wolf, Denis V. Lunev,
	Nikolay Shirokovskiy, Dmitry Mishin, Igor Sukhih, yur,
	Peter Krempa, libvir-list

25.05.2021 11:50, Max Reitz wrote:
> On 19.05.21 08:11, Vladimir Sementsov-Ogievskiy wrote:
>> 18.05.2021 19:39, Max Reitz wrote:
> 
> [...]
> 
>>> On 17.05.21 14:07, Vladimir Sementsov-Ogievskiy wrote:
> 
> [...]
> 
>>>> Not also, that there is another benefit of such thing: we'll implement this callback in qcow2 driver, so that backup will read clusters not in guest cluster order, but in host cluster order, to read more sequentially, which should bring better performance on rotating disks.
>>>
>>> I’m not exactly sure how you envision this to work, but block_status also already gives you the host offset in *map.
>>>
>>
>> But block-status doesn't give a possibility to read sequentially. For this, user should call block-status several times until the whole disk covered, then sort the segments by host offset. I wonder, could it be implemented as some iterator, like
>>
>> read_iter = bdrv_get_sequential_read_iter(source)
>>
>> while (extents = bdrv_read_next(read_iter)):
>>    for ext in extents:
>>      start_writing_task(target, ext.offset, ext.bytes, ext.qiov)
>>
>> where bdrv_read_next will read guest data in host-cluster-sequence..
> 
> How would you implement this, though?

I don't know :) That's why I wrote "I wonder".. Anyway, I have enough work with all the previous steps.

> qcow2 doesn’t have a reverse mapping either, so it too would need to read all L2 table entries and sort them, wouldn’t it?
> 

Hmm, yes. With the current qcow2, that seems to be the only way. And we'll be limited by memory, so probably: read several L2 tables, sort, return the sorted extents, read the next bunch of L2 tables, and so on.

And then, I agree, we can just implement it with the help of the current block_status() implementation.

Or perhaps we could implement a reverse mapping in qcow2 as an extension. But I doubt it's worth the complexity.. Still, it depends on how much extra I/O it would cost (almost nothing, as it should work through the qcow2 cache) and how much performance benefit it would bring (no idea, it should be measured).

-- 
Best regards,
Vladimir

