* backup_calculate_cluster_size does not consider source
@ 2019-11-05 10:02 Dietmar Maurer
  2019-11-06  8:32 ` Stefan Hajnoczi
  0 siblings, 1 reply; 15+ messages in thread
From: Dietmar Maurer @ 2019-11-05 10:02 UTC (permalink / raw)
  To: qemu-devel

Example: Backup from ceph disk (rbd_cache=false) to local disk:

backup_calculate_cluster_size returns 64K (correct for my local .raw image)

Then the backup job starts to read 64K blocks from ceph.

But ceph always reads 4M blocks, so this is incredibly slow and produces
way too much network traffic.

Why does backup_calculate_cluster_size not consider the block size of
the source disk?

cluster_size = MAX(block_size_source, block_size_target)
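
A minimal standalone sketch of this heuristic (illustrative only, not a patch
against the actual QEMU code; the helper name is made up and the 64K floor
mirrors the default mentioned above):

    #include <inttypes.h>
    #include <stdio.h>

    /* current default copy granularity of the backup job (64 KiB) */
    #define BACKUP_CLUSTER_SIZE_DEFAULT (64 * 1024)

    /* proposed: take the larger of the source and target cluster sizes,
     * but never go below the existing default */
    static int64_t proposed_cluster_size(int64_t source_cluster,
                                         int64_t target_cluster)
    {
        int64_t size = source_cluster > target_cluster ? source_cluster
                                                       : target_cluster;
        return size > BACKUP_CLUSTER_SIZE_DEFAULT ? size
                                                  : BACKUP_CLUSTER_SIZE_DEFAULT;
    }

    int main(void)
    {
        /* e.g. rbd source with 4M objects, local raw target with 64K clusters */
        printf("%" PRId64 "\n",
               proposed_cluster_size(4 * 1024 * 1024, 64 * 1024)); /* 4194304 */
        return 0;
    }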




* Re: backup_calculate_cluster_size does not consider source
  2019-11-05 10:02 backup_calculate_cluster_size does not consider source Dietmar Maurer
@ 2019-11-06  8:32 ` Stefan Hajnoczi
  2019-11-06  9:37   ` Max Reitz
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Hajnoczi @ 2019-11-06  8:32 UTC (permalink / raw)
  To: Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel, qemu-block, Max Reitz


On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
> Example: Backup from ceph disk (rbd_cache=false) to local disk:
> 
> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
> 
> Then the backup job starts to read 64K blocks from ceph.
> 
> But ceph always reads 4M blocks, so this is incredibly slow and produces
> way too much network traffic.
> 
> Why does backup_calculate_cluster_size not consider the block size of
> the source disk?
> 
> cluster_size = MAX(block_size_source, block_size_target)

CCing block maintainers so they see your email and you get a response
more quickly.

Stefan


* Re: backup_calculate_cluster_size does not consider source
  2019-11-06  8:32 ` Stefan Hajnoczi
@ 2019-11-06  9:37   ` Max Reitz
  2019-11-06 10:18     ` Dietmar Maurer
  2019-11-06 10:34     ` Wolfgang Bumiller
  0 siblings, 2 replies; 15+ messages in thread
From: Max Reitz @ 2019-11-06  9:37 UTC (permalink / raw)
  To: Stefan Hajnoczi, Dietmar Maurer; +Cc: Kevin Wolf, qemu-devel, qemu-block



On 06.11.19 09:32, Stefan Hajnoczi wrote:
> On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
>> Example: Backup from ceph disk (rbd_cache=false) to local disk:
>>
>> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
>>
>> Then the backup job starts to read 64K blocks from ceph.
>>
>> But ceph always reads 4M blocks, so this is incredibly slow and produces
>> way too much network traffic.
>>
>> Why does backup_calculate_cluster_size not consider the block size of
>> the source disk?
>>
>> cluster_size = MAX(block_size_source, block_size_target)

So Ceph always transmits 4 MB over the network, no matter what is
actually needed?  That sounds, well, interesting.

backup_calculate_cluster_size() doesn’t consider the source cluster size because
to my knowledge there is no other medium that behaves this way.  So I
suppose the assumption was always that the block size of the source
doesn’t matter, because a partial read is always possible (without
having to read everything).


What would make sense to me is to increase the buffer size in general.
I don’t think we need to copy clusters at a time, and
0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
backup writes that are triggered by guest writes.  We haven’t yet
increased the copy size for background writes, though.  We can do that,
of course.  (And probably should.)

The thing is, it just seems unnecessary to me to take the source cluster
size into account in general.  It seems weird that a medium only allows
4 MB reads, because, well, guests aren’t going to take that into account.

Max



* Re: backup_calculate_cluster_size does not consider source
  2019-11-06  9:37   ` Max Reitz
@ 2019-11-06 10:18     ` Dietmar Maurer
  2019-11-06 10:37       ` Max Reitz
  2019-11-06 10:34     ` Wolfgang Bumiller
  1 sibling, 1 reply; 15+ messages in thread
From: Dietmar Maurer @ 2019-11-06 10:18 UTC (permalink / raw)
  To: Max Reitz, Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel, qemu-block

> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general.  It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.

Maybe it is strange, but it is quite obvious that there is an optimal cluster
size for each storage type (4M in case of ceph)...




* Re: backup_calculate_cluster_size does not consider source
  2019-11-06  9:37   ` Max Reitz
  2019-11-06 10:18     ` Dietmar Maurer
@ 2019-11-06 10:34     ` Wolfgang Bumiller
  2019-11-06 10:42       ` Max Reitz
  1 sibling, 1 reply; 15+ messages in thread
From: Wolfgang Bumiller @ 2019-11-06 10:34 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Stefan Hajnoczi, Dietmar Maurer, qemu-block, qemu-devel

On Wed, Nov 06, 2019 at 10:37:04AM +0100, Max Reitz wrote:
> On 06.11.19 09:32, Stefan Hajnoczi wrote:
> > On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
> >> Example: Backup from ceph disk (rbd_cache=false) to local disk:
> >>
> >> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
> >>
> >> Then the backup job starts to read 64K blocks from ceph.
> >>
> >> But ceph always reads 4M blocks, so this is incredibly slow and produces
> >> way too much network traffic.
> >>
> >> Why does backup_calculate_cluster_size not consider the block size of
> >> the source disk?
> >>
> >> cluster_size = MAX(block_size_source, block_size_target)
> 
> So Ceph always transmits 4 MB over the network, no matter what is
> actually needed?  That sounds, well, interesting.

Or at least it generates that much I/O - in the end, it can slow down
the backup by up to a multi-digit factor...

> backup_calculate_cluster_size() doesn’t consider the source cluster size because
> to my knowledge there is no other medium that behaves this way.  So I
> suppose the assumption was always that the block size of the source
> doesn’t matter, because a partial read is always possible (without
> having to read everything).

Unless you enable qemu-side caching, this only works until the
block/cluster size of the source exceeds that of the target.

> What would make sense to me is to increase the buffer size in general.
> I don’t think we need to copy clusters at a time, and
> 0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
> backup writes that are triggered by guest writes.  We haven’t yet
> increased the copy size for background writes, though.  We can do that,
> of course.  (And probably should.)
> 
> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general.  It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.

But guests usually have a page cache, which is why in many setups qemu
(and thereby the backup process) often doesn't.




* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 10:18     ` Dietmar Maurer
@ 2019-11-06 10:37       ` Max Reitz
  0 siblings, 0 replies; 15+ messages in thread
From: Max Reitz @ 2019-11-06 10:37 UTC (permalink / raw)
  To: Dietmar Maurer, Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel, qemu-block



On 06.11.19 11:18, Dietmar Maurer wrote:
>> The thing is, it just seems unnecessary to me to take the source cluster
>> size into account in general.  It seems weird that a medium only allows
>> 4 MB reads, because, well, guests aren’t going to take that into account.
> 
> Maybe it is strange, but it is quite obvious that there is an optimal cluster
> size for each storage type (4M in case of ceph)...

Sure, but usually one can always read sub-cluster ranges; at least, if
the cluster size is larger than 4 kB.  (For example, it’s perfectly fine
to read any bit of data from a qcow2 file with whatever cluster size it
has.  The same applies to filesystems.  The only limitation is what the
storage itself allows (with O_DIRECT), but that alignment is generally
not greater than 4 kB.)

As I said, I wonder how that even works when you attach such a volume to
a VM and let the guest read from it.  Surely it won’t issue just 4 MB
requests, so the network overhead must be tremendous?

Max



* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 10:34     ` Wolfgang Bumiller
@ 2019-11-06 10:42       ` Max Reitz
  2019-11-06 11:18         ` Dietmar Maurer
  0 siblings, 1 reply; 15+ messages in thread
From: Max Reitz @ 2019-11-06 10:42 UTC (permalink / raw)
  To: Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, Dietmar Maurer, qemu-block, qemu-devel



On 06.11.19 11:34, Wolfgang Bumiller wrote:
> On Wed, Nov 06, 2019 at 10:37:04AM +0100, Max Reitz wrote:
>> On 06.11.19 09:32, Stefan Hajnoczi wrote:
>>> On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
>>>> Example: Backup from ceph disk (rbd_cache=false) to local disk:
>>>>
>>>> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
>>>>
>>>> Then the backup job starts to read 64K blocks from ceph.
>>>>
>>>> But ceph always reads 4M blocks, so this is incredibly slow and produces
>>>> way too much network traffic.
>>>>
>>>> Why does backup_calculate_cluster_size not consider the block size of
>>>> the source disk?
>>>>
>>>> cluster_size = MAX(block_size_source, block_size_target)
>>
>> So Ceph always transmits 4 MB over the network, no matter what is
>> actually needed?  That sounds, well, interesting.
> 
> Or at least it generates that much I/O - in the end, it can slow down
> the backup by up to a multi-digit factor...

Oh, so I understand ceph internally resolves the 4 MB block and then
transmits the subcluster range.  That makes sense.

>> backup_calculate_cluster_size() doesn’t consider the source cluster size because
>> to my knowledge there is no other medium that behaves this way.  So I
>> suppose the assumption was always that the block size of the source
>> doesn’t matter, because a partial read is always possible (without
>> having to read everything).
> 
> Unless you enable qemu-side caching, this only works until the
> block/cluster size of the source exceeds that of the target.
> 
>> What would make sense to me is to increase the buffer size in general.
>> I don’t think we need to copy clusters at a time, and
>> 0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
>> backup writes that are triggered by guest writes.  We haven’t yet
>> increased the copy size for background writes, though.  We can do that,
>> of course.  (And probably should.)
>>
>> The thing is, it just seems unnecessary to me to take the source cluster
>> size into account in general.  It seems weird that a medium only allows
>> 4 MB reads, because, well, guests aren’t going to take that into account.
> 
> But guests usually have a page cache, which is why in many setups qemu
> (and thereby the backup process) often doesn't.

But this still doesn’t make sense to me.  Linux doesn’t issue 4 MB
requests to pre-fill the page cache, does it?

And if it issues a smaller request, there is no way for a guest device
to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
around it, maybe you’d like to take that as well...?”

I understand wanting to increase the backup buffer size, but I don’t
quite understand why we’d want it to increase to the source cluster size
when the guest also has no idea what the source cluster size is.

Max



* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 10:42       ` Max Reitz
@ 2019-11-06 11:18         ` Dietmar Maurer
  2019-11-06 11:22           ` Max Reitz
  0 siblings, 1 reply; 15+ messages in thread
From: Dietmar Maurer @ 2019-11-06 11:18 UTC (permalink / raw)
  To: Max Reitz, Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

> And if it issues a smaller request, there is no way for a guest device
> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
> around it, maybe you’d like to take that as well...?”
> 
> I understand wanting to increase the backup buffer size, but I don’t
> quite understand why we’d want it to increase to the source cluster size
> when the guest also has no idea what the source cluster size is.

Because it is more efficient.




* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 11:18         ` Dietmar Maurer
@ 2019-11-06 11:22           ` Max Reitz
  2019-11-06 11:37             ` Max Reitz
  0 siblings, 1 reply; 15+ messages in thread
From: Max Reitz @ 2019-11-06 11:22 UTC (permalink / raw)
  To: Dietmar Maurer, Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block



On 06.11.19 12:18, Dietmar Maurer wrote:
>> And if it issues a smaller request, there is no way for a guest device
>> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
>> around it, maybe you’d like to take that as well...?”
>>
>> I understand wanting to increase the backup buffer size, but I don’t
>> quite understand why we’d want it to increase to the source cluster size
>> when the guest also has no idea what the source cluster size is.
> 
> Because it is more efficient.

For rbd.

Max



* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 11:22           ` Max Reitz
@ 2019-11-06 11:37             ` Max Reitz
  2019-11-06 13:09               ` Dietmar Maurer
  0 siblings, 1 reply; 15+ messages in thread
From: Max Reitz @ 2019-11-06 11:37 UTC (permalink / raw)
  To: Dietmar Maurer, Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block



On 06.11.19 12:22, Max Reitz wrote:
> On 06.11.19 12:18, Dietmar Maurer wrote:
>>> And if it issues a smaller request, there is no way for a guest device
>>> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
>>> around it, maybe you’d like to take that as well...?”
>>>
>>> I understand wanting to increase the backup buffer size, but I don’t
>>> quite understand why we’d want it to increase to the source cluster size
>>> when the guest also has no idea what the source cluster size is.
>>
>> Because it is more efficient.
> 
> For rbd.

Let me elaborate: Yes, a cluster size generally means that it is most
“efficient” to access the storage at that size.  But there’s a tradeoff.
 At some point, reading the data takes sufficiently long that reading a
bit of metadata doesn’t matter anymore (usually, that is).

There is a bit of a problem with making the backup copy size rather
large, and that is the fact that backup’s copy-before-write causes guest
writes to stall.  So if the guest just writes a bit of data, a 4 MB
buffer size may mean that in the background it will have to wait for 4
MB of data to be copied.[1]

Hm.  OTOH, we have the same problem already with the target’s cluster
size, which can of course be 4 MB as well.  But I can imagine it to
actually be important for the target, because otherwise there might be
read-modify-write cycles.

But for the source, I still don’t quite understand why rbd has such a
problem with small read requests.  I don’t doubt that it has (as you
explained), but again, how is it then even possible to use rbd as the
backend for a guest that has no idea of this requirement?  Does Linux
really prefill the page cache with 4 MB of data for each read?

Max


[1] I suppose what we could do is decouple the copy buffer size from the
bitmap granularity, but that would be more work than just a MAX() in
backup_calculate_cluster_size().



* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 11:37             ` Max Reitz
@ 2019-11-06 13:09               ` Dietmar Maurer
  2019-11-06 13:17                 ` Max Reitz
  0 siblings, 1 reply; 15+ messages in thread
From: Dietmar Maurer @ 2019-11-06 13:09 UTC (permalink / raw)
  To: Max Reitz, Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

> Let me elaborate: Yes, a cluster size generally means that it is most
> “efficient” to access the storage at that size.  But there’s a tradeoff.
>  At some point, reading the data takes sufficiently long that reading a
> bit of metadata doesn’t matter anymore (usually, that is).

Any network storage suffers from long network latencies, so it always
matters if you do more IOs than necessary.
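
(A rough illustration with assumed numbers: copying 1 GiB in 64 KiB requests
means 16384 round trips; at 1 ms of network latency per request that is over
16 seconds spent on latency alone, while 4 MiB requests need only 256 round
trips, i.e. roughly a quarter of a second.)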

> There is a bit of a problem with making the backup copy size rather
> large, and that is the fact that backup’s copy-before-write causes guest
> writes to stall. So if the guest just writes a bit of data, a 4 MB
> buffer size may mean that in the background it will have to wait for 4
> MB of data to be copied.[1]

We have been using this for several years now in production, and it is not a problem.
(Ceph storage is mostly on 10G (or faster) network equipment).

> Hm.  OTOH, we have the same problem already with the target’s cluster
> size, which can of course be 4 MB as well.  But I can imagine it to
> actually be important for the target, because otherwise there might be
> read-modify-write cycles.
> 
> But for the source, I still don’t quite understand why rbd has such a
> problem with small read requests.  I don’t doubt that it has (as you
> explained), but again, how is it then even possible to use rbd as the
> backend for a guest that has no idea of this requirement?  Does Linux
> really prefill the page cache with 4 MB of data for each read?

No idea. I just observed that upstream qemu backups with ceph are 
quite unusable this way.




* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 13:09               ` Dietmar Maurer
@ 2019-11-06 13:17                 ` Max Reitz
  2019-11-06 13:34                   ` Dietmar Maurer
  0 siblings, 1 reply; 15+ messages in thread
From: Max Reitz @ 2019-11-06 13:17 UTC (permalink / raw)
  To: Dietmar Maurer, Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block



On 06.11.19 14:09, Dietmar Maurer wrote:
>> Let me elaborate: Yes, a cluster size generally means that it is most
>> “efficient” to access the storage at that size.  But there’s a tradeoff.
>>  At some point, reading the data takes sufficiently long that reading a
>> bit of metadata doesn’t matter anymore (usually, that is).
> 
> Any network storage suffers from long network latencies, so it always
> matters if you do more IOs than necessary.

Yes, exactly, that’s why I’m saying it makes sense to me to increase the
buffer size from the measly 64 kB that we currently have.  I just don’t
see the point of increasing it exactly to the source cluster size.

>> There is a bit of a problem with making the backup copy size rather
>> large, and that is the fact that backup’s copy-before-write causes guest
>> writes to stall. So if the guest just writes a bit of data, a 4 MB
>> buffer size may mean that in the background it will have to wait for 4
>> MB of data to be copied.[1]
> 
> We have been using this for several years now in production, and it is not a problem.
> (Ceph storage is mostly on 10G (or faster) network equipment).

So you mean for cases where backup already chooses a 4 MB buffer size
because the target has that cluster size?

>> Hm.  OTOH, we have the same problem already with the target’s cluster
>> size, which can of course be 4 MB as well.  But I can imagine it to
>> actually be important for the target, because otherwise there might be
>> read-modify-write cycles.
>>
>> But for the source, I still don’t quite understand why rbd has such a
>> problem with small read requests.  I don’t doubt that it has (as you
>> explained), but again, how is it then even possible to use rbd as the
>> backend for a guest that has no idea of this requirement?  Does Linux
>> really prefill the page cache with 4 MB of data for each read?
> 
> No idea. I just observed that upstream qemu backups with ceph are 
> quite unusable this way.

Hm, OK.

Max



* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 13:17                 ` Max Reitz
@ 2019-11-06 13:34                   ` Dietmar Maurer
  2019-11-06 13:52                     ` Max Reitz
  0 siblings, 1 reply; 15+ messages in thread
From: Dietmar Maurer @ 2019-11-06 13:34 UTC (permalink / raw)
  To: Max Reitz, Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block


> On 6 November 2019 14:17 Max Reitz <mreitz@redhat.com> wrote:
> 
>  
> On 06.11.19 14:09, Dietmar Maurer wrote:
> >> Let me elaborate: Yes, a cluster size generally means that it is most
> >> “efficient” to access the storage at that size.  But there’s a tradeoff.
> >>  At some point, reading the data takes sufficiently long that reading a
> >> bit of metadata doesn’t matter anymore (usually, that is).
> > 
> > Any network storage suffers from long network latencies, so it always
> > matters if you do more IOs than necessary.
> 
> Yes, exactly, that’s why I’m saying it makes sense to me to increase the
> buffer size from the measly 64 kB that we currently have.  I just don’t
> see the point of increasing it exactly to the source cluster size.
> 
> >> There is a bit of a problem with making the backup copy size rather
> >> large, and that is the fact that backup’s copy-before-write causes guest
> >> writes to stall. So if the guest just writes a bit of data, a 4 MB
> >> buffer size may mean that in the background it will have to wait for 4
> >> MB of data to be copied.[1]
> > 
> > We have been using this for several years now in production, and it is not a problem.
> > (Ceph storage is mostly on 10G (or faster) network equipment).
> 
> So you mean for cases where backup already chooses a 4 MB buffer size
> because the target has that cluster size?

To make it clear: backups from Ceph as the source are slow.

That is why we use a patched qemu version, which uses:

cluster_size = Max_Block_Size(source, target)

(I guess this only triggers for ceph)




* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 13:34                   ` Dietmar Maurer
@ 2019-11-06 13:52                     ` Max Reitz
  2019-11-06 14:39                       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 15+ messages in thread
From: Max Reitz @ 2019-11-06 13:52 UTC (permalink / raw)
  To: Dietmar Maurer, Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block



On 06.11.19 14:34, Dietmar Maurer wrote:
> 
>> On 6 November 2019 14:17 Max Reitz <mreitz@redhat.com> wrote:
>>
>>  
>> On 06.11.19 14:09, Dietmar Maurer wrote:
>>>> Let me elaborate: Yes, a cluster size generally means that it is most
>>>> “efficient” to access the storage at that size.  But there’s a tradeoff.
>>>>  At some point, reading the data takes sufficiently long that reading a
>>>> bit of metadata doesn’t matter anymore (usually, that is).
>>>
>>> Any network storage suffers from long network latencies, so it always
>>> matters if you do more IOs than necessary.
>>
>> Yes, exactly, that’s why I’m saying it makes sense to me to increase the
>> buffer size from the measly 64 kB that we currently have.  I just don’t
>> see the point of increasing it exactly to the source cluster size.
>>
>>>> There is a bit of a problem with making the backup copy size rather
>>>> large, and that is the fact that backup’s copy-before-write causes guest
>>>> writes to stall. So if the guest just writes a bit of data, a 4 MB
>>>> buffer size may mean that in the background it will have to wait for 4
>>>> MB of data to be copied.[1]
>>>
>>> We have been using this for several years now in production, and it is not a problem.
>>> (Ceph storage is mostly on 10G (or faster) network equipment).
>>
>> So you mean for cases where backup already chooses a 4 MB buffer size
>> because the target has that cluster size?
> 
> To make it clear: backups from Ceph as the source are slow.

Yep, but if the target were another ceph instance, the backup buffer
size would be chosen to be 4 MB (AFAIU), so I was wondering whether you
are referring to this effect, or to...

> That is why we use a patched qemu version, which uses:
> 
> cluster_size = Max_Block_Size(source, target)

...this.

The main problem with the stall I mentioned is that I think one of the
main use cases of backup is having a fast source and a slow (off-site)
target.  In such cases, I suppose it becomes annoying if some guest
writes (which were fast before the backup started) take a long time
because the backup needs to copy quite a bit of data to off-site storage.

(And blindly taking the source cluster size would mean that such things
could happen if you use local qcow2 files with 2 MB clusters.)


So I’d prefer decoupling the backup buffer size and the bitmap
granularity, and then set the buffer size to maybe the MAX of source and
target cluster sizes.  But I don’t know when I can get around to doing that.

And then probably also cap it at 4 MB or 8 MB, because that happens to
be what you need, but I’d prefer for it not to use tons of memory.  (The
mirror job uses 1 MB per request, for up to 16 parallel requests; and
the backup copy-before-write implementation currently (on master) copies
1 MB at a time (per concurrent request), and the whole memory usage of
backup is limited to 128 MB.)

(OTOH, the minimum should probably be 1 MB.)
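
A sketch of that policy (illustrative only; the function name is made up and
the clamp values are simply the numbers mentioned above):

    #include <stdint.h>

    #define MiB (1024 * 1024)

    /* pick the copy-buffer size independently of the bitmap granularity:
     * the larger of the two cluster sizes, clamped to 1 MiB .. 8 MiB */
    static int64_t backup_copy_buffer_size(int64_t source_cluster,
                                           int64_t target_cluster)
    {
        int64_t size = source_cluster > target_cluster ? source_cluster
                                                       : target_cluster;
        if (size < 1 * MiB) {
            size = 1 * MiB;    /* suggested minimum */
        }
        if (size > 8 * MiB) {
            size = 8 * MiB;    /* cap to limit memory per in-flight copy */
        }
        return size;
    }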

Max



* Re: backup_calculate_cluster_size does not consider source
  2019-11-06 13:52                     ` Max Reitz
@ 2019-11-06 14:39                       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 15+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2019-11-06 14:39 UTC (permalink / raw)
  To: Max Reitz, Dietmar Maurer, Wolfgang Bumiller
  Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

06.11.2019 16:52, Max Reitz wrote:
> On 06.11.19 14:34, Dietmar Maurer wrote:
>>
>>> On 6 November 2019 14:17 Max Reitz <mreitz@redhat.com> wrote:
>>>
>>>   
>>> On 06.11.19 14:09, Dietmar Maurer wrote:
>>>>> Let me elaborate: Yes, a cluster size generally means that it is most
>>>>> “efficient” to access the storage at that size.  But there’s a tradeoff.
>>>>>   At some point, reading the data takes sufficiently long that reading a
>>>>> bit of metadata doesn’t matter anymore (usually, that is).
>>>>
>>>> Any network storage suffers from long network latencies, so it always
>>>> matters if you do more IOs than necessary.
>>>
>>> Yes, exactly, that’s why I’m saying it makes sense to me to increase the
>>> buffer size from the measly 64 kB that we currently have.  I just don’t
>>> see the point of increasing it exactly to the source cluster size.
>>>
>>>>> There is a bit of a problem with making the backup copy size rather
>>>>> large, and that is the fact that backup’s copy-before-write causes guest
>>>>> writes to stall. So if the guest just writes a bit of data, a 4 MB
>>>>> buffer size may mean that in the background it will have to wait for 4
>>>>> MB of data to be copied.[1]
>>>>
>>>> We have been using this for several years now in production, and it is not a problem.
>>>> (Ceph storage is mostly on 10G (or faster) network equipment).
>>>
>>> So you mean for cases where backup already chooses a 4 MB buffer size
>>> because the target has that cluster size?
>>
>> To make it clear: backups from Ceph as the source are slow.
> 
> Yep, but if the target were another ceph instance, the backup buffer
> size would be chosen to be 4 MB (AFAIU), so I was wondering whether you
> are referring to this effect, or to...
> 
>> That is why we use a patched qemu version, which uses:
>>
>> cluster_size = Max_Block_Size(source, target)
> 
> ...this.
> 
> The main problem with the stall I mentioned is that I think one of the
> main use cases of backup is having a fast source and a slow (off-site)
> target.  In such cases, I suppose it becomes annoying if some guest
> writes (which were fast before the backup started) take a long time
> because the backup needs to copy quite a bit of data to off-site storage.
> 
> (And blindly taking the source cluster size would mean that such things
> could happen if you use local qcow2 files with 2 MB clusters.)
> 
> 
> So I’d prefer decoupling the backup buffer size and the bitmap
> granularity, and then set the buffer size to maybe the MAX of source and
> target cluster sizes.  But I don’t know when I can get around to doing that.

Note that the problem is not only in copy-before-write operations: if we have a big
in-flight backup request from the backup job itself, all new guest writes
to this area will have to wait.

> 
> And then probably also cap it at 4 MB or 8 MB, because that happens to
> be what you need, but I’d prefer for it not to use tons of memory.  (The
> mirror job uses 1 MB per request, for up to 16 parallel requests; and
> the backup copy-before-write implementation currently (on master) copies
> 1 MB at a time (per concurrent request), and the whole memory usage of
> backup is limited to 128 MB.)
> 
> (OTOH, the minimum should probably be 1 MB.)
> 

Hmmm, I am preparing a patch set for backup which includes increasing
the copied chunk size... and somehow it leads to performance degradation on my
HDD.


===


What about the following solution: add an empty qcow2 with cluster_size = 4M (ohh,
2M is the maximum, unfortunately) above ceph, enable copy-on-read on this node, and
start the backup from it? The qcow2 node would be a local cache, which would solve
both problems: unaligned reads from ceph and the copy-before-write time.

-- 
Best regards,
Vladimir


Thread overview: 15+ messages
2019-11-05 10:02 backup_calculate_cluster_size does not consider source Dietmar Maurer
2019-11-06  8:32 ` Stefan Hajnoczi
2019-11-06  9:37   ` Max Reitz
2019-11-06 10:18     ` Dietmar Maurer
2019-11-06 10:37       ` Max Reitz
2019-11-06 10:34     ` Wolfgang Bumiller
2019-11-06 10:42       ` Max Reitz
2019-11-06 11:18         ` Dietmar Maurer
2019-11-06 11:22           ` Max Reitz
2019-11-06 11:37             ` Max Reitz
2019-11-06 13:09               ` Dietmar Maurer
2019-11-06 13:17                 ` Max Reitz
2019-11-06 13:34                   ` Dietmar Maurer
2019-11-06 13:52                     ` Max Reitz
2019-11-06 14:39                       ` Vladimir Sementsov-Ogievskiy
