From: Max Reitz <mreitz@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
	Eric Blake <eblake@redhat.com>,
	qemu-devel@nongnu.org, qemu-block@nongnu.org
Subject: Re: [PATCH v2 2/6] block: block-status cache for data regions
Date: Mon, 12 Jul 2021 09:45:49 +0200	[thread overview]
Message-ID: <c36ba38a-2346-8289-f01c-1ccc812a30f5@redhat.com> (raw)
In-Reply-To: <YOSNB2PHpkA4Je0S@redhat.com>

On 06.07.21 19:04, Kevin Wolf wrote:
> Am 23.06.2021 um 17:01 hat Max Reitz geschrieben:
>> As we have attempted before
>> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
>> "file-posix: Cache lseek result for data regions";
>> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
>> "file-posix: Cache next hole"), this patch seeks to reduce the number of
>> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
>> main difference is that this time it is implemented as part of the
>> general block layer code.
>>
>> The problem we face is that on some filesystems or in some
>> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given that the
>> implementation is outside of qemu, there is little we can do about its
>> performance.
>>
>> We have already introduced the want_zero parameter to
>> bdrv_co_block_status() to skip SEEK_DATA/HOLE calls unless we really
>> want zero information; but sometimes we do want that information,
>> because for files that consist largely of zero areas, special-casing
>> those areas can give large performance boosts.  The real problem,
>> then, is files that consist largely of data: for them, inquiring the
>> block status gains us little performance, but the inquiry itself
>> takes a lot of time.
>>
>> To address this, we want to cache data regions.  Most of the time, when
>> bad performance is reported, it is in places where the image is iterated
>> over from start to end (qemu-img convert or the mirror job), so a simple
>> yet effective solution is to cache only the current data region.
>>
>> (Note that only caching data regions but not zero regions means that
>> returning false information from the cache is not catastrophic: Treating
>> zeroes as data is fine.  While we try to invalidate the cache on zero
>> writes and discards, such inconsistencies may still occur when there are
>> other processes writing to the image.)
>>
>> We only use the cache for nodes without children (i.e. protocol nodes),
>> because that is where the problem lies: such drivers rely on
>> block-status implementations outside of qemu (e.g. SEEK_DATA/HOLE).
>>
>> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
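
(For context, since the hunks below only show the call sites: the cache
itself is a single data region tracked per BDS.  Roughly, and eliding
the RCU details, the idea is something like this; the field names here
are illustrative, not necessarily what the final patch uses:

    /* Sketch only: one cached data region per BlockDriverState */
    typedef struct BdrvBlockStatusCache {
        bool valid;          /* is a data region currently cached? */
        int64_t data_start;  /* start of the cached data region */
        int64_t data_end;    /* end (exclusive) of that region */
    } BdrvBlockStatusCache;

bdrv_bsc_is_data() then checks whether the given offset falls into
[data_start, data_end) and clamps *pnum accordingly, and
bdrv_bsc_fill() stores the most recently returned data region.  Zero
writes and discards invalidate the cache.)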
> Since you indicated that you'll respin the patch, I'll add my minor
> comments:
>
>> @@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>>       aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>>   
>>       if (bs->drv->bdrv_co_block_status) {
>> -        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
>> -                                            aligned_bytes, pnum, &local_map,
>> -                                            &local_file);
>> +        bool from_cache = false;
>> +
>> +        /*
>> +         * Use the block-status cache only for protocol nodes: Format
>> +         * drivers are generally quick to inquire the status, but protocol
>> +         * drivers often need to get information from outside of qemu, so
>> +         * we do not have control over the actual implementation.  There
>> +         * have been cases where inquiring the status took an unreasonably
>> +         * long time, and we can do nothing in qemu to fix it.
>> +         * This is especially problematic for images with large data areas,
>> +         * because finding the few holes in them and giving them special
>> +         * treatment does not gain much performance.  Therefore, we try to
>> +         * cache the last-identified data region.
>> +         *
>> +         * Second, limiting ourselves to protocol nodes allows us to assume
>> +         * the block status for data regions to be DATA | OFFSET_VALID, and
>> +         * that the host offset is the same as the guest offset.
>> +         *
>> +         * Note that it is possible that external writers zero parts of
>> +         * the cached regions without the cache being invalidated, and so
>> +         * we may report zeroes as data.  This is not catastrophic,
>> +         * however, because reporting zeroes as data is fine.
>> +         */
>> +        if (QLIST_EMPTY(&bs->children)) {
>> +            if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
>> +                ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
>> +                local_file = bs;
>> +                local_map = aligned_offset;
>> +
>> +                from_cache = true;
>> +            }
>> +        }
>> +
>> +        if (!from_cache) {
> Is having a separate variable from_cache really useful? This looks like
> it could just be:
>
>      if (QLIST_EMPTY() && bdrv_bsc_is_data()) {
>          // The code above
>      } else {
>          // The code below
>      }

Oh, yes.

(I guess this was mainly an artifact from v1 where there was a mutex 
around the bdrv_bsc_is_data() block.  Now it’s better to just roll both 
conditions into one, yes.)
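
I.e. something like this for v3 (just a sketch, modulo the exact brace
style):

    if (QLIST_EMPTY(&bs->children) &&
        bdrv_bsc_is_data(bs, aligned_offset, pnum))
    {
        /* Cache hit: protocol nodes return data at the same offset */
        ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
        local_file = bs;
        local_map = aligned_offset;
    } else {
        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
                                            aligned_bytes, pnum, &local_map,
                                            &local_file);
        /* ... and fill the cache on a pure data result, as below ... */
    }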

>> +            ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
>> +                                                aligned_bytes, pnum, &local_map,
>> +                                                &local_file);
>> +
>> +            /*
>> +             * Note that checking QLIST_EMPTY(&bs->children) is also done when
>> +             * the cache is queried above.  Technically, we do not need to check
>> +             * it here; the worst that can happen is that we fill the cache for
>> +             * non-protocol nodes, and then it is never used.  However, filling
>> +             * the cache requires an RCU update, so double check here to avoid
>> +             * such an update if possible.
>> +             */
>> +            if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
>> +                QLIST_EMPTY(&bs->children))
>> +            {
> Would it be worth asserting that local_map == aligned_offset, because
> otherwise with a buggy protocol driver, the result from the cache could
> be different from the first call without us noticing?

I think it would be indeed.
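
Something like this in the cache-filling branch, then (sketch):

    if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
        QLIST_EMPTY(&bs->children))
    {
        /*
         * On protocol nodes, the host offset is expected to equal the
         * guest offset; catch buggy drivers before their result is
         * cached and silently served from the cache later.
         */
        assert(local_map == aligned_offset);
        bdrv_bsc_fill(bs, aligned_offset, *pnum);
    }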

Max

>> +                bdrv_bsc_fill(bs, aligned_offset, *pnum);
>> +            }
>> +        }
>>       } else {
>>           /* Default code for filters */
> Kevin
>


