From: "Denis V. Lunev" <den@openvz.org>
To: Kevin Wolf <kwolf@redhat.com>
Cc: John Snow <jsnow@redhat.com>,
	qemu-block@nongnu.org, qemu-devel@nongnu.org,
	Max Reitz <mreitz@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Tue, 18 Apr 2017 20:30:17 +0300
Message-ID: <7e278fc7-675a-e42f-94d9-bff3f347a2c4@openvz.org>
In-Reply-To: <20170418112215.GC9236@noname.redhat.com>

On 04/18/2017 02:22 PM, Kevin Wolf wrote:
> Am 14.04.2017 um 06:17 hat Denis V. Lunev geschrieben:
>> [skipped...]
>>
>>> Hi Denis,
>>>
>>> I've read this entire thread now and I really like Berto's summary which
>>> I think is one of the best recaps of existing qcow2 problems and this
>>> discussion so far.
>>>
>>> I understand your opinion that we should focus on compatible changes
>>> before incompatible ones, and I also understand that you are very
>>> concerned about physical fragmentation for reducing long-term IO.
>>>
>>> What I don't understand is why you think that subclusters will increase
>>> fragmentation. If we admit that fragmentation is a problem now, surely
>>> increasing cluster sizes to 1 or 2 MB will only help to *reduce*
>>> physical fragmentation, right?
>>>
>>> Subclusters as far as I am understanding them will not actually allow
>>> subclusters to be located at virtually disparate locations, we will
>>> continue to allocate clusters as normal -- we'll just keep track of
>>> which portions of the cluster we've written to to help us optimize COW*.
>>>
>>> So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
>>> we write just the first subcluster, we'll have a map like:
>>>
>>> X---------------
>>>
>>> Whatever actually happens to exist in this space, whether it be a hole
>>> we punched via fallocate or literal zeroes, this space is known to the
>>> filesystem to be contiguous.
>>>
>>> If we write to the last subcluster, we'll get:
>>>
>>> X--------------X
>>>
>>> And again, maybe the dashes are a fallocate hole, maybe they're zeroes,
>>> but the last subcluster is located virtually exactly 15 subclusters
>>> after the first; they're not physically contiguous. We've saved the
>>> space between them. Future out-of-order writes won't contribute to any
>>> fragmentation, at least at this level.
>>>
>>> You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
>>> the subclusters right, we'll have *zero*, won't we?
>>>
>>> As far as I can tell, this lets us do a lot of good things all at once:
>>>
>>> (1) We get some COW optimizations (for reasons Berto and Kevin have
>>> detailed)
>> Yes, COW itself is fine. But suppose that after the COW the guest
>> issues a read covering the entire cluster, in the situation
>>
>> X--------------X
>>
>> with a backing store. This is possible even with a 1-2 MB cluster
>> size; I have seen 4-5 MB requests from guests in real life. In this
>> case we will need 3 IOPs:
>>     read the left X area, read from the backing file, read the right X area.
>> This is the real drawback of the approach if the subcluster size is
>> small enough, which it should be for optimal COW: we get random I/O
>> on the host instead of the sequential I/O issued by the guest. In
>> other words, we have optimized COW at the cost of long-term read
>> performance. This is what worries me, since there can be a lot of
>> such reads before the data changes again.
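
To illustrate the read pattern described above, here is a rough
standalone sketch (not actual QEMU code; the 2 MB / 64 KB sizes and
all names are only illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define CLUSTER_SIZE    (2 * 1024 * 1024)
    #define SUBCLUSTER_SIZE (64 * 1024)
    #define SUBCLUSTERS     (CLUSTER_SIZE / SUBCLUSTER_SIZE)  /* 32 */

    /* Count contiguous runs of allocated/unallocated subclusters in a
     * cluster; a full-cluster guest read needs one host I/O per run
     * (allocated runs come from this image, unallocated ones from the
     * backing file). */
    static int host_iops_for_full_read(uint32_t bitmap)
    {
        int iops = 0, prev = -1;
        int i;

        for (i = 0; i < SUBCLUSTERS; i++) {
            int cur = (bitmap >> i) & 1;
            if (cur != prev) {  /* run boundary: a new host request starts */
                iops++;
                prev = cur;
            }
        }
        return iops;
    }

    int main(void)
    {
        /* "X--------------X": only the first and the last subcluster
         * have been written locally. */
        uint32_t bitmap = 1u | (1u << (SUBCLUSTERS - 1));

        printf("%d\n", host_iops_for_full_read(bitmap));  /* prints 3 */
        return 0;
    }
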
> So just to avoid misunderstandings about what you're comparing here:
> You get these 3 iops for 2 MB clusters with 64k subclusters, whereas you
> would get only a single operation for 2 MB clusters without subclusters.
> Today's 64k clusters without subclusters behave no better than the
> 2M/64k version, but that's not what you're comparing.
>
> Yes, you are correct about this observation. But it is a tradeoff that
> you're intentionally making when using backing files. In the extreme,
> there is an alternative that performs so much better: Instead of using a
> backing file, use 'qemu-img convert' to copy (and defragment) the whole
> image upfront. No COW whatsoever, no fragmentation, fast reads. The
> downside is that it takes a while to copy the whole image upfront, and
> it also costs quite a bit of disk space.
In general, for production environments, this is a total pain. We
have a lot of customers with TB-sized images, and free space is also
a real problem for them.
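
(For reference, the upfront copy Kevin describes can be done with
qemu-img convert, which by default reads through the whole backing
chain and writes out a standalone, defragmented image; the file names
below are only examples:

    qemu-img convert -O qcow2 overlay.qcow2 standalone.qcow2

With a multi-TB image this is exactly the copy-time and disk-space
cost discussed above.)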


> So once we acknowledge that we're dealing with a tradeoff here, it
> becomes obvious that neither the extreme setup for performance (copy the
> whole image upfront) nor the extreme setup for sparseness (COW on a
> sector level) are the right default for the average case, nor is
> optimising one-sidedly a good idea. It is good if we can provide
> solutions for extreme cases, but by default we need to cater for the
> average case, which cares both about reasonable performance and disk
> usage.
Yes, I agree. But a 64 KB cluster size by default for big images (not
for backups!) is another extreme ;) Who cares about a few extra MBs
with a 1 TB or 10 TB image?

Please note that 1 MB is a better default cluster size: at that size,
random writes perform about as well as sequential writes on non-SSD
disks.
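
For example, the cluster size can be set at image creation time (the
file name and sizes below are only illustrative; qcow2 accepts power-
of-two cluster sizes up to 2 MB):

    qemu-img create -f qcow2 -o cluster_size=1M disk.qcow2 1T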

Den
