* backup_calculate_cluster_size does not consider source
@ 2019-11-05 10:02 Dietmar Maurer

From: Dietmar Maurer
To: qemu-devel

Example: Backup from a ceph disk (rbd_cache=false) to a local disk:

backup_calculate_cluster_size returns 64K (correct for my local .raw image).

Then the backup job starts to read 64K blocks from ceph.

But ceph always reads 4M blocks, so this is incredibly slow and produces
way too much network traffic.

Why does backup_calculate_cluster_size not consider the block size of
the source disk?

cluster_size = MAX(block_size_source, block_size_target)
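For reference, the difference can be sketched in Python; the function names, the 64K default, and the calling convention here are illustrative assumptions, not QEMU's actual implementation:

```python
# Illustrative sketch of the proposed cluster-size calculation.
# Names and the 64K default are assumptions, not QEMU's actual code.

BACKUP_CLUSTER_SIZE_DEFAULT = 64 * 1024  # 64K

def calculate_cluster_size(source_block_size, target_block_size):
    """Current behavior: only the target is considered."""
    return max(BACKUP_CLUSTER_SIZE_DEFAULT, target_block_size)

def calculate_cluster_size_proposed(source_block_size, target_block_size):
    """Proposed: cluster_size = MAX(block_size_source, block_size_target)."""
    return max(BACKUP_CLUSTER_SIZE_DEFAULT,
               source_block_size, target_block_size)

# With a 4M ceph source and a 64K local raw target:
print(calculate_cluster_size(4 << 20, 64 << 10))           # 65536
print(calculate_cluster_size_proposed(4 << 20, 64 << 10))  # 4194304
```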
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06  8:32 Stefan Hajnoczi

From: Stefan Hajnoczi
To: Dietmar Maurer
Cc: Kevin Wolf, qemu-devel, qemu-block, Max Reitz

On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
> Example: Backup from a ceph disk (rbd_cache=false) to a local disk:
>
> backup_calculate_cluster_size returns 64K (correct for my local .raw image).
>
> Then the backup job starts to read 64K blocks from ceph.
>
> But ceph always reads 4M blocks, so this is incredibly slow and produces
> way too much network traffic.
>
> Why does backup_calculate_cluster_size not consider the block size of
> the source disk?
>
> cluster_size = MAX(block_size_source, block_size_target)

CCing block maintainers so they see your email and you get a response
more quickly.

Stefan
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06  9:37 Max Reitz

From: Max Reitz
To: Stefan Hajnoczi, Dietmar Maurer
Cc: Kevin Wolf, qemu-devel, qemu-block

On 06.11.19 09:32, Stefan Hajnoczi wrote:
> On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
>> Example: Backup from a ceph disk (rbd_cache=false) to a local disk:
>>
>> backup_calculate_cluster_size returns 64K (correct for my local .raw image).
>>
>> Then the backup job starts to read 64K blocks from ceph.
>>
>> But ceph always reads 4M blocks, so this is incredibly slow and produces
>> way too much network traffic.
>>
>> Why does backup_calculate_cluster_size not consider the block size of
>> the source disk?
>>
>> cluster_size = MAX(block_size_source, block_size_target)

So Ceph always transmits 4 MB over the network, no matter what is
actually needed?  That sounds, well, interesting.

backup_calculate_cluster_size() doesn’t consider the source cluster size
because to my knowledge there is no other medium that behaves this way.
So I suppose the assumption was always that the block size of the source
doesn’t matter, because a partial read is always possible (without
having to read everything).

What would make sense to me is to increase the buffer size in general.
I don’t think we need to copy clusters at a time, and commit
0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
backup writes that are triggered by guest writes.  We haven’t yet
increased the copy size for background writes, though.  We can do that,
of course.  (And probably should.)

The thing is, it just seems unnecessary to me to take the source cluster
size into account in general.  It seems weird that a medium only allows
4 MB reads, because, well, guests aren’t going to take that into
account.

Max
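To make the cost concrete, here is a rough back-of-the-envelope model (all numbers assumed) of how much backend I/O a 64K-granularity copy generates against a backend that always performs 4M reads:

```python
# Simplified model, not QEMU code: total backend I/O generated when
# copying a disk in fixed-size chunks, if the backend turns every
# request into a min_io-sized read.

def backend_io(disk_size, copy_chunk, min_io):
    reads = -(-disk_size // copy_chunk)       # ceil division
    return reads * max(copy_chunk, min_io)

GiB = 1 << 30
# 10 GiB disk, 64K copy chunks, backend that reads 4M at a time:
print(backend_io(10 * GiB, 64 << 10, 4 << 20) // GiB)   # 640 GiB of backend I/O
# Same disk copied in 4M chunks:
print(backend_io(10 * GiB, 4 << 20, 4 << 20) // GiB)    # 10 GiB
```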
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 10:18 Dietmar Maurer

From: Dietmar Maurer
To: Max Reitz, Stefan Hajnoczi
Cc: Kevin Wolf, qemu-devel, qemu-block

> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general.  It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.

Maybe it is strange, but it is quite obvious that there is an optimal
cluster size for each storage type (4M in the case of ceph)...
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 10:37 Max Reitz

From: Max Reitz
To: Dietmar Maurer, Stefan Hajnoczi
Cc: Kevin Wolf, qemu-devel, qemu-block

On 06.11.19 11:18, Dietmar Maurer wrote:
>> The thing is, it just seems unnecessary to me to take the source cluster
>> size into account in general.  It seems weird that a medium only allows
>> 4 MB reads, because, well, guests aren’t going to take that into account.
>
> Maybe it is strange, but it is quite obvious that there is an optimal
> cluster size for each storage type (4M in the case of ceph)...

Sure, but usually one can always read sub-cluster ranges; at least, if
the cluster size is larger than 4 kB.

(For example, it’s perfectly fine to read any bit of data from a qcow2
file, whatever cluster size it has.  The same applies to filesystems.
The only limitation is what the storage itself allows (with O_DIRECT),
but that alignment is generally not greater than 4 kB.)

As I said, I wonder how that even works when you attach such a volume to
a VM and let the guest read from it.  Surely it won’t issue just 4 MB
requests, so the network overhead must be tremendous?

Max
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 10:34 Wolfgang Bumiller

From: Wolfgang Bumiller
To: Max Reitz
Cc: Kevin Wolf, Stefan Hajnoczi, Dietmar Maurer, qemu-block, qemu-devel

On Wed, Nov 06, 2019 at 10:37:04AM +0100, Max Reitz wrote:
> On 06.11.19 09:32, Stefan Hajnoczi wrote:
> > On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
> >> Example: Backup from a ceph disk (rbd_cache=false) to a local disk:
> >>
> >> backup_calculate_cluster_size returns 64K (correct for my local .raw image).
> >>
> >> Then the backup job starts to read 64K blocks from ceph.
> >>
> >> But ceph always reads 4M blocks, so this is incredibly slow and produces
> >> way too much network traffic.
> >>
> >> Why does backup_calculate_cluster_size not consider the block size of
> >> the source disk?
> >>
> >> cluster_size = MAX(block_size_source, block_size_target)
>
> So Ceph always transmits 4 MB over the network, no matter what is
> actually needed?  That sounds, well, interesting.

Or at least it generates that much I/O - in the end, it can slow down
the backup by up to a multi-digit factor...

> backup_calculate_cluster_size() doesn’t consider the source cluster size
> because to my knowledge there is no other medium that behaves this way.
> So I suppose the assumption was always that the block size of the source
> doesn’t matter, because a partial read is always possible (without
> having to read everything).

Unless you enable qemu-side caching, this only works until the
block/cluster size of the source exceeds that of the target.

> What would make sense to me is to increase the buffer size in general.
> I don’t think we need to copy clusters at a time, and commit
> 0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
> backup writes that are triggered by guest writes.  We haven’t yet
> increased the copy size for background writes, though.  We can do that,
> of course.  (And probably should.)
>
> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general.  It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into
> account.

But guests usually have a page cache, which is why in many setups qemu
(and thereby the backup process) often doesn't.
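Wolfgang's point about the source granularity exceeding the target's can be illustrated with a small sketch (a simplified model, not QEMU code): without a qemu-side cache, every sub-block copy triggers a full source-block read.

```python
# Simplified model: number of backend reads needed to copy one
# source-sized block when copying in smaller, copy_cluster-sized
# chunks with no cache in between.

def reads_per_source_block(source_block, copy_cluster):
    return max(1, source_block // max(copy_cluster, 1))

# Copying a 4M ceph block in 64K chunks re-reads it 64 times:
print(reads_per_source_block(4 << 20, 64 << 10))  # 64
print(reads_per_source_block(4 << 20, 4 << 20))   # 1
```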
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 10:42 Max Reitz

From: Max Reitz
To: Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, Dietmar Maurer, qemu-block, qemu-devel

On 06.11.19 11:34, Wolfgang Bumiller wrote:
> On Wed, Nov 06, 2019 at 10:37:04AM +0100, Max Reitz wrote:
>> So Ceph always transmits 4 MB over the network, no matter what is
>> actually needed?  That sounds, well, interesting.
>
> Or at least it generates that much I/O - in the end, it can slow down
> the backup by up to a multi-digit factor...

Oh, so I understand ceph internally resolves the 4 MB block and then
transmits the subcluster range.  That makes sense.

>> backup_calculate_cluster_size() doesn’t consider the source cluster size
>> because to my knowledge there is no other medium that behaves this way.
>> So I suppose the assumption was always that the block size of the source
>> doesn’t matter, because a partial read is always possible (without
>> having to read everything).
>
> Unless you enable qemu-side caching, this only works until the
> block/cluster size of the source exceeds that of the target.
>
>> The thing is, it just seems unnecessary to me to take the source cluster
>> size into account in general.  It seems weird that a medium only allows
>> 4 MB reads, because, well, guests aren’t going to take that into
>> account.
>
> But guests usually have a page cache, which is why in many setups qemu
> (and thereby the backup process) often doesn't.

But this still doesn’t make sense to me.  Linux doesn’t issue 4 MB
requests to pre-fill the page cache, does it?

And if it issues a smaller request, there is no way for a guest device
to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
around it, maybe you’d like to take that as well...?”

I understand wanting to increase the backup buffer size, but I don’t
quite understand why we’d want to increase it to the source cluster
size when the guest also has no idea what the source cluster size is.

Max
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 11:18 Dietmar Maurer

From: Dietmar Maurer
To: Max Reitz, Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

> And if it issues a smaller request, there is no way for a guest device
> to tell it “OK, here’s your data, but note we have a whole 4 MB chunk
> around it, maybe you’d like to take that as well...?”
>
> I understand wanting to increase the backup buffer size, but I don’t
> quite understand why we’d want to increase it to the source cluster
> size when the guest also has no idea what the source cluster size is.

Because it is more efficient.
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 11:22 Max Reitz

From: Max Reitz
To: Dietmar Maurer, Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

On 06.11.19 12:18, Dietmar Maurer wrote:
>> I understand wanting to increase the backup buffer size, but I don’t
>> quite understand why we’d want to increase it to the source cluster
>> size when the guest also has no idea what the source cluster size is.
>
> Because it is more efficient.

For rbd.

Max
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 11:37 Max Reitz

From: Max Reitz
To: Dietmar Maurer, Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

On 06.11.19 12:22, Max Reitz wrote:
> On 06.11.19 12:18, Dietmar Maurer wrote:
>> Because it is more efficient.
>
> For rbd.

Let me elaborate: Yes, a cluster size generally means that it is most
“efficient” to access the storage at that size.  But there’s a tradeoff.
At some point, reading the data takes sufficiently long that reading a
bit of metadata doesn’t matter anymore (usually, that is).

There is a bit of a problem with making the backup copy size rather
large, and that is the fact that backup’s copy-before-write causes guest
writes to stall.  So if the guest just writes a bit of data, a 4 MB
buffer size may mean that in the background it will have to wait for
4 MB of data to be copied.[1]

Hm.  OTOH, we have the same problem already with the target’s cluster
size, which can of course be 4 MB as well.  But I can imagine it to
actually be important for the target, because otherwise there might be
read-modify-write cycles.

But for the source, I still don’t quite understand why rbd has such a
problem with small read requests.  I don’t doubt that it has (as you
explained), but again, how is it then even possible to use rbd as the
backend for a guest that has no idea of this requirement?  Does Linux
really prefill the page cache with 4 MB of data for each read?

Max

[1] I suppose what we could do is decouple the copy buffer size from the
bitmap granularity, but that would be more work than just a MAX() in
backup_calculate_cluster_size().
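The stall described above can be expressed as a toy model (assumed behavior, not the actual block/backup.c logic): a guest write that touches a not-yet-copied cluster must wait until the whole cluster has been copied out.

```python
# Toy model of the copy-before-write stall: bytes that must be copied
# to the backup target before a guest write may proceed, assuming the
# write touches a single uncopied cluster.

def cbw_stall_bytes(guest_write_len, cluster_size):
    return max(guest_write_len, cluster_size)

# A 4K guest write with today's 64K granularity vs. a 4M one:
print(cbw_stall_bytes(4096, 64 << 10))  # 65536
print(cbw_stall_bytes(4096, 4 << 20))   # 4194304
```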
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 13:09 Dietmar Maurer

From: Dietmar Maurer
To: Max Reitz, Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

> Let me elaborate: Yes, a cluster size generally means that it is most
> “efficient” to access the storage at that size.  But there’s a tradeoff.
> At some point, reading the data takes sufficiently long that reading a
> bit of metadata doesn’t matter anymore (usually, that is).

Any network storage suffers from long network latencies, so it always
matters if you do more I/Os than necessary.

> There is a bit of a problem with making the backup copy size rather
> large, and that is the fact that backup’s copy-before-write causes guest
> writes to stall.  So if the guest just writes a bit of data, a 4 MB
> buffer size may mean that in the background it will have to wait for
> 4 MB of data to be copied.[1]

We have used this for several years now in production, and it is not a
problem.  (Ceph storage is mostly on 10G (or faster) network equipment.)

> Hm.  OTOH, we have the same problem already with the target’s cluster
> size, which can of course be 4 MB as well.  But I can imagine it to
> actually be important for the target, because otherwise there might be
> read-modify-write cycles.
>
> But for the source, I still don’t quite understand why rbd has such a
> problem with small read requests.  I don’t doubt that it has (as you
> explained), but again, how is it then even possible to use rbd as the
> backend for a guest that has no idea of this requirement?  Does Linux
> really prefill the page cache with 4 MB of data for each read?

No idea.  I just observed that upstream qemu backups with ceph are quite
unusable this way.
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 13:17 Max Reitz

From: Max Reitz
To: Dietmar Maurer, Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

On 06.11.19 14:09, Dietmar Maurer wrote:
> Any network storage suffers from long network latencies, so it always
> matters if you do more I/Os than necessary.

Yes, exactly, that’s why I’m saying it makes sense to me to increase the
buffer size from the measly 64 kB that we currently have.  I just don’t
see the point of increasing it exactly to the source cluster size.

> We have used this for several years now in production, and it is not a
> problem.  (Ceph storage is mostly on 10G (or faster) network equipment.)

So you mean for cases where backup already chooses a 4 MB buffer size
because the target has that cluster size?

>> But for the source, I still don’t quite understand why rbd has such a
>> problem with small read requests.  I don’t doubt that it has (as you
>> explained), but again, how is it then even possible to use rbd as the
>> backend for a guest that has no idea of this requirement?  Does Linux
>> really prefill the page cache with 4 MB of data for each read?
>
> No idea.  I just observed that upstream qemu backups with ceph are quite
> unusable this way.

Hm, OK.

Max
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 13:34 Dietmar Maurer

From: Dietmar Maurer
To: Max Reitz, Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

> On 6 November 2019 14:17 Max Reitz <mreitz@redhat.com> wrote:
>
> Yes, exactly, that’s why I’m saying it makes sense to me to increase the
> buffer size from the measly 64 kB that we currently have.  I just don’t
> see the point of increasing it exactly to the source cluster size.
>
> So you mean for cases where backup already chooses a 4 MB buffer size
> because the target has that cluster size?

To make it clear: backups with Ceph as the source are slow.

That is why we use a patched qemu version, which uses:

cluster_size = Max_Block_Size(source, target)

(I guess this only triggers for ceph)
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 13:52 Max Reitz

From: Max Reitz
To: Dietmar Maurer, Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

On 06.11.19 14:34, Dietmar Maurer wrote:
>> So you mean for cases where backup already chooses a 4 MB buffer size
>> because the target has that cluster size?
>
> To make it clear: backups with Ceph as the source are slow.

Yep, but if the target were another ceph instance, the backup buffer
size would be chosen to be 4 MB (AFAIU), so I was wondering whether you
are referring to this effect, or to...

> That is why we use a patched qemu version, which uses:
>
> cluster_size = Max_Block_Size(source, target)

...this.

The main problem with the stall I mentioned is that I think one of the
main use cases of backup is having a fast source and a slow (off-site)
target.  In such cases, I suppose it becomes annoying if some guest
writes (which were fast before the backup started) take a long time
because the backup needs to copy quite a bit of data to off-site
storage.

(And blindly taking the source cluster size would mean that such things
could happen if you use local qcow2 files with 2 MB clusters.)

So I’d prefer decoupling the backup buffer size and the bitmap
granularity, and then setting the buffer size to maybe the MAX of the
source and target cluster sizes.  But I don’t know when I can get around
to doing that.

And then probably also cap it at 4 MB or 8 MB, because that happens to
be what you need, but I’d prefer for it not to use tons of memory.  (The
mirror job uses 1 MB per request, for up to 16 parallel requests; and
the backup copy-before-write implementation currently (on master) copies
1 MB at a time (per concurrent request), and the whole memory usage of
backup is limited to 128 MB.)

(OTOH, the minimum should probably be 1 MB.)

Max
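The policy outlined above — a buffer size decoupled from the bitmap granularity, taken as the MAX of source and target cluster sizes, and clamped to bound memory use — might look like this; the 1 MB/4 MB limits come from the discussion, everything else is an assumption:

```python
# Sketch of the proposed buffer-size policy, not QEMU code.

MiB = 1 << 20

def backup_buffer_size(source_cluster, target_cluster,
                       minimum=1 * MiB, maximum=4 * MiB):
    # MAX of the two cluster sizes, clamped to [minimum, maximum]
    return min(max(source_cluster, target_cluster, minimum), maximum)

print(backup_buffer_size(4 * MiB, 64 << 10))   # 4194304: ceph source
print(backup_buffer_size(64 << 10, 64 << 10))  # 1048576: local raw on both sides
print(backup_buffer_size(8 * MiB, 64 << 10))   # 4194304: capped at 4 MB
```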
* Re: backup_calculate_cluster_size does not consider source
@ 2019-11-06 14:39 Vladimir Sementsov-Ogievskiy

From: Vladimir Sementsov-Ogievskiy
To: Max Reitz, Dietmar Maurer, Wolfgang Bumiller
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, qemu-block

06.11.2019 16:52, Max Reitz wrote:
> So I’d prefer decoupling the backup buffer size and the bitmap
> granularity, and then setting the buffer size to maybe the MAX of the
> source and target cluster sizes.  But I don’t know when I can get around
> to doing that.

Note that the problem is not only in copy-before-write operations: if we
have a big in-flight backup request from the backup job itself, all new
upcoming guest writes to this area will have to wait.

> And then probably also cap it at 4 MB or 8 MB, because that happens to
> be what you need, but I’d prefer for it not to use tons of memory.  (The
> mirror job uses 1 MB per request, for up to 16 parallel requests; and
> the backup copy-before-write implementation currently (on master) copies
> 1 MB at a time (per concurrent request), and the whole memory usage of
> backup is limited to 128 MB.)
>
> (OTOH, the minimum should probably be 1 MB.)

Hmmm, I am preparing a patch set for backup which includes increasing
the copied chunk size, and somehow it leads to performance degradation
on my HDD.

===

What about the following solution: add an empty qcow2 with cluster_size
= 4M (ohh, 2M is the maximum, unfortunately) above ceph, enable
copy-on-read on this node, and start the backup from it?  The qcow2 node
would be a local cache, which would solve both the problem of unaligned
reads from ceph and the copy-before-write time.

-- 
Best regards,
Vladimir
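Vladimir's suggestion can be modeled in a few lines (a toy simulation, not the actual qcow2/copy-on-read implementation) to show why it helps: the first small read of a region pulls in one whole cluster, and all later small reads in that region are served from the local cache.

```python
# Toy simulation of a copy-on-read cache above a ceph node: each
# cache-miss cluster costs exactly one full-cluster backend read,
# and subsequent reads of that cluster cost nothing.

class CorCache:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.cached = set()
        self.backend_reads = 0

    def read(self, offset, length):
        first = offset // self.cluster_size
        last = (offset + length - 1) // self.cluster_size
        for cluster in range(first, last + 1):
            if cluster not in self.cached:
                self.backend_reads += 1  # one full-cluster read from ceph
                self.cached.add(cluster)

cache = CorCache(2 << 20)                # 2M qcow2 clusters
for off in range(0, 2 << 20, 64 << 10):  # backup reads 64K at a time
    cache.read(off, 64 << 10)
print(cache.backend_reads)  # 1: a single 2M backend read instead of 32
```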