From: "Denis V. Lunev" <den@openvz.org>
To: John Snow <jsnow@redhat.com>, Kevin Wolf <kwolf@redhat.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org,
	Max Reitz <mreitz@redhat.com>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] [RFC] Proposed qcow2 extension: subcluster allocation
Date: Fri, 14 Apr 2017 07:17:52 +0300
Message-ID: <6727667f-5422-13d7-be70-b8515f3ff589@openvz.org>
In-Reply-To: <d34a20e9-8cb7-3b83-e96f-4631f91045cd@redhat.com>

[skipped...]

> Hi Denis,
>
> I've read this entire thread now and I really like Berto's summary which
> I think is one of the best recaps of existing qcow2 problems and this
> discussion so far.
>
> I understand your opinion that we should focus on compatible changes
> before incompatible ones, and I also understand that you are very
> concerned about physical fragmentation for reducing long-term IO.
>
> What I don't understand is why you think that subclusters will increase
> fragmentation. If we admit that fragmentation is a problem now, surely
> increasing cluster sizes to 1 or 2 MB will only help to *reduce*
> physical fragmentation, right?
>
> Subclusters as far as I am understanding them will not actually allow
> subclusters to be located at virtually disparate locations, we will
> continue to allocate clusters as normal -- we'll just keep track of
> which portions of the cluster we've written to to help us optimize COW*.
>
> So if we have a 1MB cluster with 64k subclusters as a hypothetical, if
> we write just the first subcluster, we'll have a map like:
>
> X---------------
>
> Whatever actually happens to exist in this space, whether it be a hole
> we punched via fallocate or literal zeroes, this space is known to the
> filesystem to be contiguous.
>
> If we write to the last subcluster, we'll get:
>
> X--------------X
>
> And again, maybe the dashes are a fallocate hole, maybe they're zeroes.
> but the last subcluster is located virtually exactly 15 subclusters
> behind the first, they're not physically contiguous. We've saved the
> space between them. Future out-of-order writes won't contribute to any
> fragmentation, at least at this level.
>
> You might be able to reduce COW from 5 IOPs to 3 IOPs, but if we tune
> the subclusters right, we'll have *zero*, won't we?
>
> As far as I can tell, this lets us do a lot of good things all at once:
>
> (1) We get some COW optimizations (for reasons Berto and Kevin have
> detailed)
Yes. We are fine with COW. Now assume that, after the COW, the guest
issues a read of the entire cluster in the situation

X--------------X

with a backing store. This is possible even with a 1-2 MB cluster size.
I have seen 4-5 MB requests from the guest in real life. In this
case we will have 3 IOPs:
    read left X area, read backing, read right X.
This is the real drawback of the approach when the sub-cluster size is
small, which it has to be for optimal COW. So we will have random I/O
on the host in place of a sequential read in the guest. In other words,
we have optimized COW at the cost of long-term reads. This is what
worries me, as there can be a lot of such reads before any further
change to the data.
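
To make the counting concrete, here is a minimal sketch (not QEMU code).
It assumes each maximal run of subclusters in the same state costs one
host request; the function name and the 'X'/'-' map encoding are only
for illustration.

```python
from itertools import groupby


def count_read_requests(subcluster_map: str) -> int:
    """Count host requests needed to read one whole cluster.

    'X' marks a subcluster allocated in the active image, '-' marks one
    that has to be read from the backing store. Each maximal run of
    equal markers is assumed to cost exactly one request.
    """
    return sum(1 for _ in groupby(subcluster_map))


print(count_read_requests("X--------------X"))  # left X, backing, right X
print(count_read_requests("XXXXXXXXXXXXXXXX"))  # fully allocated cluster
```

Under that assumption the map above costs 3 requests, while a fully
allocated cluster costs 1.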

With real holes the situation is even worse. If we have a real hole
(obtained with truncate), we are in the case mentioned by Roman.
Virtually the space is contiguous, but on the host we have
fragmentation at sub-cluster granularity.

We are in a tough position even with COW. If the sub-cluster size is
not equal to the guest filesystem block size, we still need
to do COW. The only win is in the amount of data copied; the cost
in IOPs is the same. On the other hand, to get a
real win we need properly aligned guest partitions, we need to know
the exact guest filesystem block size, and so on. This is not as
easy as it may seem.
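
A minimal sketch of the alignment argument (again not QEMU code). It
assumes the write is non-empty and lands in unallocated space; the
helper name and return shape are hypothetical.

```python
def cow_ranges(offset: int, length: int, granule: int):
    """Return (head, tail) byte ranges to copy around a guest write.

    `granule` is the allocation unit (cluster or subcluster size).
    Each range is a (start, end) tuple, or None if no copy is needed.
    Assumes length > 0 and that the touched granules are unallocated.
    """
    end_of_write = offset + length
    start = offset - offset % granule             # round down
    end = -(-end_of_write // granule) * granule   # round up
    head = (start, offset) if offset > start else None
    tail = (end_of_write, end) if end > end_of_write else None
    return head, tail


MiB = 1024 * 1024
print(cow_ranges(4096, 4096, MiB))        # 1 MB cluster, no subclusters
print(cow_ranges(4096, 4096, 64 * 1024))  # 64k subclusters
print(cow_ranges(4096, 4096, 4096))       # granule == guest block size
```

For a 4k write at offset 4k, both the 1 MB and the 64k granule give one
head copy and one tail copy. Only the number of bytes copied changes.
The copies go away only when the granule matches the guest block size
and the write is aligned to it.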

> (2) We can increase our cluster size
> (3) L2 cache can cover bigger files with less memory
> (4) Fragmentation decreases.
Yes. I like (2)-(4) a lot. But in my opinion we could stick to 1 MB
clusters without sub-clusters and still get (2)-(4).

>
> Is this not a win all around? We can improve throughput for a lot of
> different reasons all at once, here. Have I misunderstood the discussion
> so far, anyone?
>
> Please correct me where I am wrong and _ELI5_.
I have tried to, sorry for my illiteracy ;)

Den
