From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D48B3C433ED for ; Wed, 19 May 2021 13:47:26 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 30490611AE for ; Wed, 19 May 2021 13:47:26 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 30490611AE Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kamp.de Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:43142 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1ljMXl-00004x-BM for qemu-devel@archiver.kernel.org; Wed, 19 May 2021 09:47:25 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:40190) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ljMBz-0000LZ-UY for qemu-devel@nongnu.org; Wed, 19 May 2021 09:24:57 -0400 Received: from kerio.kamp.de ([195.62.97.192]:37355) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ljMBv-0002SQ-Lk for qemu-devel@nongnu.org; Wed, 19 May 2021 09:24:55 -0400 X-Footer: a2FtcC5kZQ== Received: from submission.kamp.de ([195.62.97.28]) by kerio.kamp.de with ESMTPS (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256 bits)) for qemu-devel@nongnu.org; Wed, 19 May 2021 15:24:39 +0200 Received: (qmail 18859 invoked from network); 19 May 2021 13:24:41 -0000 Received: from ac40.vpn.kamp-intra.net (HELO ?172.20.250.40?) (pl@kamp.de@::ffff:172.20.250.40) by submission.kamp.de with ESMTPS (DHE-RSA-AES128-SHA encrypted) ESMTPA; 19 May 2021 13:24:41 -0000 Subject: Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection To: Vladimir Sementsov-Ogievskiy , Kevin Wolf References: <20210415152214.279844-1-kwolf@redhat.com> <20210415152214.279844-3-kwolf@redhat.com> <06e1910c-102a-e41d-116f-00458f41243c@virtuozzo.com> From: Peter Lieven Message-ID: Date: Wed, 19 May 2021 15:24:40 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Content-Language: en-US Received-SPF: pass client-ip=195.62.97.192; envelope-from=pl@kamp.de; helo=kerio.kamp.de X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, NICE_REPLY_A=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=unavailable autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org, mreitz@redhat.com Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Am 20.04.21 um 18:52 schrieb Vladimir Sementsov-Ogievskiy: > 20.04.2021 18:04, Kevin Wolf wrote: >> Am 20.04.2021 um 16:31 hat Vladimir Sementsov-Ogievskiy geschrieben: >>> 15.04.2021 18:22, Kevin Wolf wrote: >>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed areas >>>> like non-zero data if the end of the checked area isn't aligned. This >>>> can improve the efficiency of the conversion and was introduced in >>>> commit 8dcd3c9b91a. >>>> >>>> However, it comes with a correctness problem: qemu-img convert is >>>> supposed to sparsify areas that contain only zeros, which it doesn't do >>>> any more. It turns out that this even happens when not only the >>>> unaligned area is zeroed, but also the blocks before and after it. In >>>> the bug report, conversion of a fragmented 10G image containing only >>>> zeros resulted in an image consuming 2.82 GiB even though the expected >>>> size is only 4 KiB. >>>> >>>> As a tradeoff between both, let's ignore zeroed sectors only after >>>> non-zero data to fix the alignment, but if we're only looking at zeros, >>>> keep them as such, even if it may mean additional RMW cycles. >>>> >>> >>> Hmm.. If I understand correctly, we are going to do unaligned >>> write-zero. And that helps. >> >> This can happen (mostly raw images on block devices, I think?), but >> usually it just means skipping the write because we know that the target >> image is already zeroed. >> >> What it does mean is that if the next part is data, we'll have an >> unaligned data write. >> >>> Doesn't that mean that alignment is wrongly detected? >> >> The problem is that you can have bdrv_block_status_above() return the >> same allocation status multiple times in a row, but *pnum can be >> unaligned for the conversion. >> >> We only look at a single range returned by it when detecting the >> alignment, so it could be that we have zero buffers for both 0-11 and >> 12-16 and detect two misaligned ranges, when both together are a >> perfectly aligned zeroed range. >> >> In theory we could try to do some lookahead and merge ranges where >> possible, which should give us the perfect result, but it would make the >> code considerably more complicated. (Whether we want to merge them >> doesn't only depend on the block status, but possibly also on the >> content of a DATA range.) >> >> Kevin >> > > Oh, I understand now the problem, thanks for explanation. > > Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors must not align it down, to be possibly "merged" with next chunk if it is zero too. > > But it's still good to align zeroes down, if data starts somewhere inside the buf, isn't it? > > what about something like this: > > diff --git a/qemu-img.c b/qemu-img.c > index babb5573ab..d1704584a0 100644 > --- a/qemu-img.c > +++ b/qemu-img.c > @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum, >          } >      } >   > +    if (i == n) { > +        /* > +         * The whole buf is the same. > +         * > +         * if it's data, just return it. It's the old behavior. > +         * > +         * if it's zero, just return too. It will work good if target is alredy > +         * zeroed. And if next chunk is zero too we'll have no RMW and no reason > +         * to write data. > +         */ > +        *pnum = i; > +        return !is_zero; > +    } > + >      tail = (sector_num + i) & (alignment - 1); >      if (tail) { >          if (is_zero && i <= tail) { > -            /* treat unallocated areas which only consist > -             * of a small tail as allocated. */ > +            /* > +             * For sure next sector after i is data, and it will rewrite this > +             * tail anyway due to RMW. So, let's just write data now. > +             */ >              is_zero = false; >          } >          if (!is_zero) { > -            /* align up end offset of allocated areas. */ > +            /* If possible, align up end offset of allocated areas. */ >              i += alignment - tail; >              i = MIN(i, n); >          } else { > -            /* align down end offset of zero areas. */ > +            /* > +             * For sure next sector after i is data, and it will rewrite this > +             * tail anyway due to RMW. Better is avoid RMW and write zeroes up > +             * to aligned bound. > +             */ >              i -= tail; >          } >      } > > I think we forgot to follow up on this. Has anyone tested this suggestion? Otherwise, I would try to rerun the tests I did with the my old and Kevins suggestion. Peter