From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:57903) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bPnfN-00054a-VX for qemu-devel@nongnu.org; Wed, 20 Jul 2016 05:19:47 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bPnfJ-0004vT-Du for qemu-devel@nongnu.org; Wed, 20 Jul 2016 05:19:45 -0400 Date: Wed, 20 Jul 2016 05:19:37 -0400 (EDT) From: Paolo Bonzini Message-ID: <1796238868.8815050.1469006377577.JavaMail.zimbra@redhat.com> In-Reply-To: <20160720073836.GF10539@ad.usersys.redhat.com> References: <1468901281-22858-1-git-send-email-eblake@redhat.com> <578E4708.5080308@redhat.com> <20160720033402.GA7641@ad.usersys.redhat.com> <578EF446.70202@redhat.com> <20160720043709.GA10539@ad.usersys.redhat.com> <913397c9-6edc-2561-3d2e-e32032f9db22@redhat.com> <20160720073836.GF10539@ad.usersys.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Subject: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC (was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server) List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Fam Zheng Cc: Eric Blake , Kevin Wolf , qemu-block@nongnu.org, qemu-devel@nongnu.org, Max Reitz , Lukas Czerner , Dave Chinner , P@draigBrady.com, Niels de Vos Adding ext4 and XFS guys (Lukas and Dave respectively). As a quick recap, the issue here is the semantics of FIEMAP and SEEK_HOLE/SEEK_DATA, which we use in "qemu-img map". This command prints metadata about a virtual disk image---which in the case of a raw image amounts to detecting holes and unwritten extents. First, it seems like SEEK_HOLE and SEEK_DATA really should be "SEEK_NONZERO" and "SEEK_ZERO", on both ext4 and XFS. You can see that unwritten extents are reported by "qemu-img map" as holes: $ dd if=/dev/urandom of=test.img bs=1M count=100 $ fallocate -z -o 10M -l 10M test.img $ du -h test.img $ qemu-img map --output=json test.img [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0}, { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 10485760}, { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}] On the second line, zero=true data=false identifies a hole. The right output would either have zero=true data=true (unwritten extent) or just [{ "start": 0, "length": 104857600, "depth": 0, "zero": false, "data": true, "offset": 0}, since the zero flag is advisory (it doesn't try to detect zeroes beyond what the filesystem says). This leads to the second question, which is about FIEMAP and FIEMAP_FLAG_SYNC in particular. Until 2014, QEMU used FIEMAP to implement "qemu-img map". I resurrected that cod eand indeed it works: $ dd if=/dev/urandom of=test.img bs=1M count=100 $ du -h test.img $ fallocate -z -o 10M -l 10M test.img $ ./qemu-img map --output=json test.img [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0}, { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": true, "offset": 10485760}, { "start": 20971520, "length": 83886080, "depth": 0, "zero": false, "data": true, "offset": 20971520}] This time qemu-img correctly reports an unwritten extent on the second line. It reports correctly holes too; continuing the previous example: $ fallocate -p -o 20M -l 10M test.img $ ./qemu-img map --output=json test.img [{ "start": 0, "length": 10485760, "depth": 0, "zero": false, "data": true, "offset": 0}, { "start": 10485760, "length": 10485760, "depth": 0, "zero": true, "data": true, "offset": 10485760}, { "start": 20971520, "length": 10485760, "depth": 0, "zero": true, "data": false, "offset": 20971520}, { "start": 31457280, "length": 73400320, "depth": 0, "zero": false, "data": true, "offset": 31457280}] Notice that you have data=true on the second line (unwritten extent) but data=false (hole) on the third. The reason why we disabled FIEMAP was a combination of a corruption and performance issue. The data corruption bug was at https://bugs.launchpad.net/qemu/+bug/1368815 and it was reported on Ubuntu Trusty (kernel 3.13 based on the release notes at https://wiki.ubuntu.com/TrustyTahr/ReleaseNotes). We corrected that by using FIEMAP_FLAG_SYNC, based on a similar patch to coreutils. This turned out to be too slow, so we dropped FIEMAP altogether. However, today I found kernel commit 91dd8c114499 ("ext4: prevent race while walking extent tree for fiemap", 2012-11-28) whose commit message says: Moreover the extent currently in delayed allocation might be allocated after we search the extent tree and before we search extent status tree delayed buffers resulting in those delayed buffers being completely missed, even though completely written and allocated. This seems pretty much like our data corruption bug; it would mean that using FIEMAP_FLAG_SYNC was only working around a bug and delayed allocations _should_ be reported as usual by FIEMAP. Except that the commit went in kernel 3.8 and as said above Trusty had 3.13. So either there are other bugs, or my understanding of the commit is not correct. So the questions for Lukas and Dave are: 1) is it expected that SEEK_HOLE skips unwritten extents? If not, would it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which would be similar to what SEEK_HOLE/SEEK_DATA do now? 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC? And if not, for what filesystems and kernel releases is it really not needed? Thanks, Paolo