From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:37738)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1bQDIF-0001JT-9C for qemu-devel@nongnu.org;
	Thu, 21 Jul 2016 08:41:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1bQDID-0004LC-4A for qemu-devel@nongnu.org;
	Thu, 21 Jul 2016 08:41:34 -0400
Date: Thu, 21 Jul 2016 22:41:19 +1000
From: Dave Chinner
Message-ID: <20160721124119.GR2031@devil.localdomain>
References: <1468901281-22858-1-git-send-email-eblake@redhat.com>
	<20160720033402.GA7641@ad.usersys.redhat.com>
	<578EF446.70202@redhat.com>
	<20160720043709.GA10539@ad.usersys.redhat.com>
	<913397c9-6edc-2561-3d2e-e32032f9db22@redhat.com>
	<20160720073836.GF10539@ad.usersys.redhat.com>
	<1796238868.8815050.1469006377577.JavaMail.zimbra@redhat.com>
	<20160720123025.GO2031@devil.localdomain>
	<360732077.8875393.1469022006074.JavaMail.zimbra@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <360732077.8875393.1469022006074.JavaMail.zimbra@redhat.com>
Subject: Re: [Qemu-devel] semantics of FIEMAP without FIEMAP_FLAG_SYNC
	(was Re: [PATCH v5 13/14] nbd: Implement NBD_CMD_WRITE_ZEROES on server)
To: Paolo Bonzini
Cc: Fam Zheng, Eric Blake, Kevin Wolf, qemu-block@nongnu.org,
	qemu-devel@nongnu.org, Max Reitz, Lukas Czerner, P@draigBrady.com,
	Niels de Vos

On Wed, Jul 20, 2016 at 09:40:06AM -0400, Paolo Bonzini wrote:
> > > 1) is it expected that SEEK_HOLE skips unwritten extents?
> >
> > There are multiple answers to this, all of which are correct depending
> > on current context and state:
> >
> > 1. No - some filesystems will report clean unwritten extents as holes.
> >
> > 2. Yes - some filesystems will report clean unwritten extents as data.
> >
> > 3. Maybe - if there is written data in memory over the unwritten
> >    extent on disk (i.e. hasn't been flushed to disk), it will be
> >    considered a data region with non-zero data. (FIEMAP will still
> >    report it as unwritten)
>
> Ok, I thought it would return FIEMAP_EXTENT_UNKNOWN|FIEMAP_EXTENT_DELALLOC
> in this case (not FIEMAP_EXTENT_UNWRITTEN).

No. FIEMAP only returns the known extent state at the given file
offset. "delalloc" extents exist in memory, indicating the space
has already been accounted for over that offset, but the extent has
not been physically allocated. Like all other types of extents,
there may or may not be valid data over a delayed allocation extent.

IOWs, fiemap only gives you a snapshot of extent state, not the
ranges of valid data in the file.

> > > If not, would
> > > it be acceptable to introduce Linux-specific SEEK_ZERO/SEEK_NONZERO, which
> > > would be similar to what SEEK_HOLE/SEEK_DATA do now?
> >
> > To solve what problem? You haven't explained what problem you are
> > trying to solve yet.
> >
> > > 2) for FIEMAP do we really need FIEMAP_FLAG_SYNC? And if not, for what
> > > filesystems and kernel releases is it really not needed?
> >
> > I can't answer this question, either, because I don't know what
> > you want the fiemap information for.
>
> The answer is the same no matter if we use both lseek and FIEMAP, so
> I'll answer just once. We want to do two things:
>
> 1) avoid copying zero data, to keep the copy process efficient. For this,
> SEEK_HOLE/SEEK_DATA are enough.
>
> 2) copy file contents while preserving the allocation state of the file's extents.

Which is /very difficult/ to do safely and reliably. We do actually
do reliable, safe, exact hole and preallocation layout duplication
with xfs_fsr, but that uses kernel provided cookies (from
XFS_IOC_BULKSTAT) to detect that data in the source file has not
changed while it was being copied before executing the final defrag
operation in the kernel (XFS_IOC_SWAPEXT) that makes the new copy of
the data user visible.

I.e. the use of fiemap to duplicate the exact layout of a file from
userspace is only possible if you can /guarantee/ the source file
has not changed in any way during the copy operation at the point in
time you finalise the destination data copy.

> There can be various reasons why the user has preallocated the file (because they
> don't want an ENOSPC to happen while the VM runs; on some filesystems, to
> minimize cases where io_submit is very un-asynchronous; or just because someone
> had a reason to do a BLKZEROOUT ioctl on the virtual disk). We want to preserve
> these while converting or otherwise moving the file around.

Sure, there are many reasons for using prealloc/punch/zero. The real
difference to other file operations is that they interface with low
level filesystem structure, not the data contained within the
extents. That's what makes them problematic for duplication:
userspace cannot serialise against low level filesystem structure
modifications.

Optimising file copies safely is one of the reasons the
copy_file_range() syscall has been introduced (in 4.5). While we
haven't implemented anything special in XFS yet, it will internally
use splice to do a zero-copy data transfer from source to
destination file. Optimising for exact layout copies is precisely
the sort of thing this syscall is intended for. It's also intended
to enable applications to take advantage of hardware acceleration of
data copying (e.g. server side copies to avoid round trips as has
been implemented for NFS, or storage array offload of data copying)
when such support is provided by the kernel.
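
To make the copy_file_range() suggestion concrete, here is a minimal
sketch of a copy loop built on it. It is not code from this thread.
It assumes Python 3.8+ on Linux, where os.copy_file_range() wraps the
syscall. The helper name copy_file, the chunk size and the set of
errnos that trigger the read()/write() fallback are assumptions made
for the example. Whether holes or preallocation end up preserved is
left entirely to the kernel and filesystem; the sketch makes no
attempt at that itself.

```python
import errno
import os

# Errnos treated as "copy_file_range() is not usable here" (an assumption
# made for this sketch); the copy then continues with plain read()/write().
FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EINVAL, errno.EOPNOTSUPP}


def copy_file(src_path, dst_path, chunk=1 << 20):
    """Copy src_path to dst_path; return the number of bytes copied."""
    use_cfr = hasattr(os, "copy_file_range")  # Linux, Python 3.8+
    copied = 0
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                if use_cfr:
                    try:
                        # No explicit offsets, so the kernel advances both
                        # file positions by the amount it copied.
                        n = os.copy_file_range(src, dst, chunk)
                    except OSError as err:
                        if err.errno not in FALLBACK_ERRNOS:
                            raise
                        use_cfr = False
                        continue
                else:
                    buf = os.read(src, chunk)
                    n = len(buf)
                    view = memoryview(buf)
                    while len(view) > 0:
                        # write() may be short; loop until buf is written.
                        view = view[os.write(dst, view):]
                if n == 0:  # end of source file
                    break
                copied += n
        finally:
            os.close(dst)
    finally:
        os.close(src)
    return copied
```

Something like copy_file("disk.img", "disk-copy.img") would then let
the kernel pick the best mechanism it has, and only move data through
userspace when the syscall is unavailable or refuses the file pair.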

IOWs, I think you should be looking to optimise file copies by using
copy_file_range() and getting filesystems to do exactly what you
need. Using FIEMAP, fallocate and moving data through userspace
won't ever be reliable without special filesystem help (that only
exists for XFS right now), nor will it enable the application to
transparently use smart storage protocols and hardware when it is
present on user systems...

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com