On 26.08.20 08:49, Vladimir Sementsov-Ogievskiy wrote: > 25.08.2020 18:11, Max Reitz wrote: >> On 21.08.20 16:11, Vladimir Sementsov-Ogievskiy wrote: >>> It's intended to be inserted between format and protocol nodes to >>> preallocate additional space (expanding protocol file) on writes >>> crossing EOF. It improves performance for file-systems with slow >>> allocation. >>> >>> Signed-off-by: Vladimir Sementsov-Ogievskiy >>> --- >>>   docs/system/qemu-block-drivers.rst.inc |  26 +++ >>>   qapi/block-core.json                   |  20 +- >>>   block/preallocate.c                    | 291 +++++++++++++++++++++++++ >>>   block/Makefile.objs                    |   1 + >>>   4 files changed, 337 insertions(+), 1 deletion(-) >>>   create mode 100644 block/preallocate.c [...] >>> diff --git a/block/preallocate.c b/block/preallocate.c >>> new file mode 100644 >>> index 0000000000..bdf54dbd2f >>> --- /dev/null >>> +++ b/block/preallocate.c >>> @@ -0,0 +1,291 @@ >>> +/* >>> + * preallocate filter driver >>> + * >>> + * The driver performs preallocate operation: it is injected above >>> + * some node, and before each write over EOF it does additional >>> preallocating >>> + * write-zeroes request. >>> + * >>> + * Copyright (c) 2020 Virtuozzo International GmbH. >>> + * >>> + * Author: >>> + *  Sementsov-Ogievskiy Vladimir >>> + * >>> + * This program is free software; you can redistribute it and/or modify >>> + * it under the terms of the GNU General Public License as published by >>> + * the Free Software Foundation; either version 2 of the License, or >>> + * (at your option) any later version. >>> + * >>> + * This program is distributed in the hope that it will be useful, >>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the >>> + * GNU General Public License for more details. >>> + * >>> + * You should have received a copy of the GNU General Public License >>> + * along with this program. If not, see . >>> + */ >>> + >>> +#include "qemu/osdep.h" >>> + >>> +#include "qapi/error.h" >>> +#include "qemu/module.h" >>> +#include "qemu/option.h" >>> +#include "qemu/units.h" >>> +#include "block/block_int.h" >>> + >>> + >>> +typedef struct BDRVPreallocateState { >>> +    int64_t prealloc_size; >>> +    int64_t prealloc_align; >>> + >>> +    /* >>> +     * Filter is started as not-active, so it doesn't do any >>> preallocations nor >>> +     * requires BLK_PERM_RESIZE on its child. This is needed to >>> create filter >>> +     * above another node-child and then do bdrv_replace_node to >>> insert the >>> +     * filter. >> >> A bit weird, but seems fair.  It’s basically just a cache for whether >> this node has a writer on it or not. >> >> Apart from the weirdness, I don’t understand the “another node-child”. >> Say you have “format” -> “proto”, and then you want to insert >> “prealloc”.  Wouldn’t you blockdev-add prealloc,file=proto and then >> blockdev-replace format.file=prealloc? > > Yes something like this. Actually, I'm about inserting the filter > automatically from block layer code, but your reasoning is about same > thing and is better. > >> So what is that “other node-child”? > > Hmm, just my misleading wording. At least '-' in wrong place. Just > "other node child", or "child of another node".. OK. >>> +     * >>> +     * Filter becomes active the first time it detects that its >>> parents have >>> +     * BLK_PERM_RESIZE on it. >>> +     * Filter becomes active forever: it doesn't lose active status >>> if parents >>> +     * lose BLK_PERM_RESIZE, otherwise we'll not be able to shrink >>> the file on >>> +     * filter close. >> >> Oh, the file is shrunk?  That wasn’t clear from the documentation. >> >> Hm.  Seems like useful behavior.  I just wonder if keeping the >> permission around indefinitely makes sense.  Why not just truncate it >> when the permission is revoked? > > How? Parent is closed earlier, so on close we will not have the > permission. So, we force-keep the permission up to our close(). I thought that preallocate_child_perm() would be invoked when the parent is detached, so we could do the truncate there, before relinquishing preallocate’s RESIZE permission. [...] >>> +static void preallocate_close(BlockDriverState *bs) >>> +{ >>> +    BDRVPreallocateState *s = bs->opaque; >>> + >>> +    if (s->active && s->data_end >= 0 && >>> +        bdrv_getlength(bs->file->bs) > s->data_end) >>> +    { >>> +        bdrv_truncate(bs->file, s->data_end, true, >>> PREALLOC_MODE_OFF, 0, NULL); >> >> Now that I think more about it...  What if there are other writers on >> bs->file?  This may throw data away. > > Good point. Actually, if bs->file has other writing parents, the logic > of the filter > around "data_end" is broken. So we must unshare WRITE and RESIZE > permissions. That’s certainly a heavy hammer, but it’d work. >>  Maybe BDS.wr_highest_offset can >> help to make a better call? > > Anyway, we'll have to use wr_highest_offset of the filter not the child > node > (in the child wr_highest_offset will track preallocations as well), That’s true. > so we want to unshare WRITE/RESIZE permissions. [...] >>> +static int coroutine_fn >>> +preallocate_co_truncate(BlockDriverState *bs, int64_t offset, >>> +                        bool exact, PreallocMode prealloc, >>> +                        BdrvRequestFlags flags, Error **errp) >>> +{ >>> +    BDRVPreallocateState *s = bs->opaque; >>> +    int ret = bdrv_co_truncate(bs->file, offset, exact, prealloc, >>> flags, errp); >>> + >>> +    /* s->data_end may become negative here, which means unknown >>> data end */ >>> +    s->data_end = bdrv_getlength(bs->file->bs); >> >> What would be the problem with just setting data_end = offset?  We only >> use data_end to truncate down on close, so basically repeat the >> bdrv_co_truncate() call here, which seems correct. > > We can use offset only if ret >= 0 && exact is true.. Well, on close, we ignore any errors from truncate() anyway. So even if we used exact=false here, it shouldn’t matter. > But simpler is > just call > bdrv_getlength() anyway. Certainly simpler than thinking about potential edge cases, true. Max