qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Max Reitz <mreitz@redhat.com>
To: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	qemu-block@nongnu.org
Cc: fam@euphon.net, kwolf@redhat.com, qemu-devel@nongnu.org,
	armbru@redhat.com, nsoffer@redhat.com, stefanha@redhat.com,
	den@openvz.org
Subject: Re: [PATCH v5 07/10] block: introduce preallocate filter
Date: Wed, 26 Aug 2020 11:58:39 +0200	[thread overview]
Message-ID: <78c778e7-ee15-027e-aab5-4d7e9935e01d@redhat.com> (raw)
In-Reply-To: <09281829-fbe3-7007-ca03-b57b6581c3bc@virtuozzo.com>


[-- Attachment #1.1: Type: text/plain, Size: 9071 bytes --]

On 26.08.20 11:15, Vladimir Sementsov-Ogievskiy wrote:
> 26.08.2020 11:52, Max Reitz wrote:
>> On 26.08.20 08:49, Vladimir Sementsov-Ogievskiy wrote:
>>> 25.08.2020 18:11, Max Reitz wrote:
>>>> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote:
>>>>> It's intended to be inserted between format and protocol nodes to
>>>>> preallocate additional space (expanding protocol file) on writes
>>>>> crossing EOF. It improves performance for file-systems with slow
>>>>> allocation.
>>>>>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>> ---
>>>>>    docs/system/qemu-block-drivers.rst.inc |  26 +++
>>>>>    qapi/block-core.json                   |  20 +-
>>>>>    block/preallocate.c                    | 291
>>>>> +++++++++++++++++++++++++
>>>>>    block/Makefile.objs                    |   1 +
>>>>>    4 files changed, 337 insertions(+), 1 deletion(-)
>>>>>    create mode 100644 block/preallocate.c
>>
>> [...]
>>
>>>>> diff --git a/block/preallocate.c b/block/preallocate.c
>>>>> new file mode 100644
>>>>> index 0000000000..bdf54dbd2f
>>>>> --- /dev/null
>>>>> +++ b/block/preallocate.c
>>>>> @@ -0,0 +1,291 @@
>>>>> +/*
>>>>> + * preallocate filter driver
>>>>> + *
>>>>> + * The driver performs preallocate operation: it is injected above
>>>>> + * some node, and before each write over EOF it does additional
>>>>> preallocating
>>>>> + * write-zeroes request.
>>>>> + *
>>>>> + * Copyright (c) 2020 Virtuozzo International GmbH.
>>>>> + *
>>>>> + * Author:
>>>>> + *  Sementsov-Ogievskiy Vladimir <vsementsov@virtuozzo.com>
>>>>> + *
>>>>> + * This program is free software; you can redistribute it and/or
>>>>> modify
>>>>> + * it under the terms of the GNU General Public License as
>>>>> published by
>>>>> + * the Free Software Foundation; either version 2 of the License, or
>>>>> + * (at your option) any later version.
>>>>> + *
>>>>> + * This program is distributed in the hope that it will be useful,
>>>>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>>>>> + * GNU General Public License for more details.
>>>>> + *
>>>>> + * You should have received a copy of the GNU General Public License
>>>>> + * along with this program. If not, see
>>>>> <http://www.gnu.org/licenses/>.
>>>>> + */
>>>>> +
>>>>> +#include "qemu/osdep.h"
>>>>> +
>>>>> +#include "qapi/error.h"
>>>>> +#include "qemu/module.h"
>>>>> +#include "qemu/option.h"
>>>>> +#include "qemu/units.h"
>>>>> +#include "block/block_int.h"
>>>>> +
>>>>> +
>>>>> +typedef struct BDRVPreallocateState {
>>>>> +    int64_t prealloc_size;
>>>>> +    int64_t prealloc_align;
>>>>> +
>>>>> +    /*
>>>>> +     * Filter is started as not-active, so it doesn't do any
>>>>> preallocations nor
>>>>> +     * requires BLK_PERM_RESIZE on its child. This is needed to
>>>>> create filter
>>>>> +     * above another node-child and then do bdrv_replace_node to
>>>>> insert the
>>>>> +     * filter.
>>>>
>>>> A bit weird, but seems fair.  It’s basically just a cache for whether
>>>> this node has a writer on it or not.
>>>>
>>>> Apart from the weirdness, I don’t understand the “another node-child”.
>>>> Say you have “format” -> “proto”, and then you want to insert
>>>> “prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then
>>>> blockdev-replace format.file=prealloc?
>>>
>>> Yes something like this. Actually, I'm about inserting the filter
>>> automatically from block layer code, but your reasoning is about same
>>> thing and is better.
>>>
>>>> So what is that “other node-child”?
>>>
>>> Hmm, just my misleading wording. At least '-' in wrong place. Just
>>> "other node child", or "child of another node"..
>>
>> OK.
>>
>>>>> +     *
>>>>> +     * Filter becomes active the first time it detects that its
>>>>> parents have
>>>>> +     * BLK_PERM_RESIZE on it.
>>>>> +     * Filter becomes active forever: it doesn't lose active status
>>>>> if parents
>>>>> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink
>>>>> the file on
>>>>> +     * filter close.
>>>>
>>>> Oh, the file is shrunk?  That wasn’t clear from the documentation.
>>>>
>>>> Hm.  Seems like useful behavior.  I just wonder if keeping the
>>>> permission around indefinitely makes sense.  Why not just truncate it
>>>> when the permission is revoked?
>>>
>>> How? Parent is closed earlier, so on close we will not have the
>>> permission. So, we force-keep the permission up to our close().
>>
>> I thought that preallocate_child_perm() would be invoked when the parent
>> is detached, so we could do the truncate there, before relinquishing
>> preallocate’s RESIZE permission.
>>
> 
> Hmm, I can check it. I just a bit afraid of doing something serious like
> truncation in .bdrv_child_perm() handler, which doesn't seem to imply
> such usage.

I’m a bit conflicted.  On one hand, I share your concern.  On the other,
I find it completely reasonable that drivers might want to do I/O when
permissions change.

Usually, this is done as part of reopen, like in qcow2 when a drive
changes from R/W to RO and caches need to be flushed.  But I actually
think it makes more sense as part of the permission system, because of
course a reopen is not the only time when permissions can change.

So that would be another alternative, to implement
.bdrv_reopen_prepare(), and to check reopen_state->perm there.  If
RESIZE is about to go away, then we truncate.  We could keep track of
whether we did any preallocations after the last such truncate, and if
we did, do another truncate on close.

This would cover reopen at least.  Seems better than nothing, but, well...

(One big problem I see with truncating in .bdrv_child_perm() is that
that function is in no way committing.  It’s just a request: “If your
parents need this of you, what do you need of your children?”
But I think that could be addressed by doing it in .bdrv_set_perm()
instead.  Of note is that file-posix actually does do I/O in its
raw_set_perm() function, in that it commits to file locks.)

[...]

>>>>> +static int coroutine_fn
>>>>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset,
>>>>> +                        bool exact, PreallocMode prealloc,
>>>>> +                        BdrvRequestFlags flags, Error **errp)
>>>>> +{
>>>>> +    BDRVPreallocateState *s = bs->opaque;
>>>>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc,
>>>>> flags, errp);
>>>>> +
>>>>> +    /* s->data_end may become negative here, which means unknown
>>>>> data end */
>>>>> +    s->data_end = bdrv_getlength(bs->file->bs);
>>>>
>>>> What would be the problem with just setting data_end = offset?  We only
>>>> use data_end to truncate down on close, so basically repeat the
>>>> bdrv_co_truncate() call here, which seems correct.
>>>
>>> We can use offset only if ret >= 0 && exact is true..
>>
>> Well, on close, we ignore any errors from truncate() anyway.  So even if
>> we used exact=false here, it shouldn’t matter.
> 
> Here we handle truncate from user. If I understand "exact" correctly the
> following is possible:
> 
> 1. parent does truncate 1M -> 2M, exact=false
> 2. driver decides to truncate to 5M (exact condition is sutisfied)
> 3. but we remember s->data_end = 2M, and on close will shrink to 2M

In theory, yes; in practice, exact=false just means to ignore failures
due to unsupported sizes.  So in this example, the driver would have
resized to 5M because 2M is an impossible size, and thus the shrink on
close would just fail.

> Doesn't seem a real problem.. So, we just consider the preallocation
> done by driver as our preallocation to be dropped on close().
> 
> Could there be problems on user shrink?
> 
> 1. parent does truncate 2M -> 1M, exact=false
> 2. driver decides to do notihng (why not)
> 3. on close() we'll shrink to 1M..
> 
> Again, seems no real problems.

Same as above, in practice the only difference between exact=false and
exact=true is how failures are reported.

> But in case when bdrv_co_truncate failed, we should call bdrv_getlength
> anyway, as we don't know the result of failed truncation. Or we can just
> set s->data_end = -1, and not care too much about fail-scenarios.
> 
> So, probably we can do
> s->data_end = ret < 0 : ret : offset;
> 
> But I just feel safer with additional bdrv_getlength().

Yes, let’s just leave it as you wrote it.

(It’s a bit silly of me to argue based on how drivers handle exact=false
in practice right now.  It can always change.  Also, I shouldn’t pretend
calling bdrv_getlength() would be an issue whatsoever, because it just
isn’t.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2020-08-26  9:59 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-21 14:11 [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 01/10] block: simplify comment to BDRV_REQ_SERIALISING Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 02/10] block/io.c: drop assertion on double waiting for request serialisation Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 03/10] block/io: split out bdrv_find_conflicting_request Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 04/10] block/io: bdrv_wait_serialising_requests_locked: drop extra bs arg Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 05/10] block: bdrv_mark_request_serialising: split non-waiting function Vladimir Sementsov-Ogievskiy
2020-08-25 12:43   ` Max Reitz
2020-08-21 14:11 ` [PATCH v5 06/10] block: introduce BDRV_REQ_NO_WAIT flag Vladimir Sementsov-Ogievskiy
2020-08-25 13:10   ` Max Reitz
2020-08-26  6:26     ` Vladimir Sementsov-Ogievskiy
2020-08-26  8:36       ` Max Reitz
2020-08-26  8:57         ` Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 07/10] block: introduce preallocate filter Vladimir Sementsov-Ogievskiy
2020-08-24 17:52   ` Vladimir Sementsov-Ogievskiy
2020-08-26  7:06     ` Vladimir Sementsov-Ogievskiy
2020-08-25 15:11   ` Max Reitz
2020-08-26  6:49     ` Vladimir Sementsov-Ogievskiy
2020-08-26  8:52       ` Max Reitz
2020-08-26  9:15         ` Vladimir Sementsov-Ogievskiy
2020-08-26  9:58           ` Max Reitz [this message]
2020-08-26 11:29             ` Vladimir Sementsov-Ogievskiy
2020-08-26 13:51     ` David Edmondson
2020-08-27  9:19       ` Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 08/10] iotests.py: add verify_o_direct helper Vladimir Sementsov-Ogievskiy
2020-08-21 17:29   ` Nir Soffer
2020-08-21 14:11 ` [PATCH v5 09/10] iotests.py: add filter_img_check Vladimir Sementsov-Ogievskiy
2020-08-21 14:11 ` [PATCH v5 10/10] iotests: add 298 to test new preallocate filter driver Vladimir Sementsov-Ogievskiy
2020-08-27 21:08 ` [PATCH v5 00/10] preallocate filter Vladimir Sementsov-Ogievskiy
2020-09-01 15:07   ` Max Reitz
2020-09-17  9:00     ` Vladimir Sementsov-Ogievskiy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=78c778e7-ee15-027e-aab5-4d7e9935e01d@redhat.com \
    --to=mreitz@redhat.com \
    --cc=armbru@redhat.com \
    --cc=den@openvz.org \
    --cc=fam@euphon.net \
    --cc=kwolf@redhat.com \
    --cc=nsoffer@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=vsementsov@virtuozzo.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).