Date: Sun, 1 Jul 2018 10:10:05 +1000
From: Dave Chinner
To: Steve French
Cc: linux-fsdevel
Subject: Re: Copy tools on Linux
Message-ID: <20180701001005.GR19934@dastard>

On Fri, Jun 29, 2018 at 09:37:27PM -0500, Steve French wrote:
> I have been looking at i/o patterns from various copy tools on Linux,
> and it is pretty discouraging - I am hoping that I am forgetting an
> important one that someone can point me to ...
>
> Some general problems:
> 1) if source and target on the same file system it would be nice to
> call the copy_file_range syscall (AFAIK only test tools call that),
> although in some cases at least cp can do it for --reflink

copy_file_range() should be made to do the right thing in as many
scenarios as we can document, and then userspace should be switched
over to using it at all times. Aggregate all the knowledge in one
place, where we know what the filesystem implementations are and can
get hints to do the right thing.

> 2) if source and target on different file systems there are multiple
> problems
>    a) smaller i/o (rsync e.g. maxes at 128K!)
>    b) no async parallelized writes sent down to the kernel so writes
> get serialized (either through page cache, or some fs offer option to
> disable it - but it still is one thread at a time)

That's because, historically, copying data into the page cache for
buffering has been orders of magnitude faster than actually doing the
IO. These days, with PCIe based storage, not so much. Indeed, for
bulk data copy on NVMe based storage I wonder if we even need
buffered IO anymore...
> c) sparse file support is mediocre (although cp has some support
> for it, and can call fiemap in some cases)

Using fiemap for this is broken and will lead to data corruption,
especially if you start parallelising IO to individual files.
SEEK_DATA/SEEK_HOLE should be used instead.

> d) for file systems that prefer setting the file size first (to
> avoid metadata penalties with multiple extending writes) - AFAIK only
> rsync offers that, but rsync is one of the slowest tools otherwise

We don't want to do this for most local filesystems, as it defeats
things like speculative preallocation, which is used to optimise IO
patterns and prevent file fragmentation when concurrent parallel
writes are done.

> I have looked at cp, dd, scp, rsync, gio, gcp ... are there others?
>
> What I am looking for (and maybe we just need to patch cp and rsync
> etc.) is more like what you see with other OS ...
> 1) options for large i/o sizes (network latencies in network/cluster
> fs can be large, so prefer larger 1M or 8M in some cases I/Os)

The -o largeio mount option on XFS will expose the stripe unit as the
minimum efficient IO size returned in stat. IIRC there's another
option combination that makes it emit the stripe width rather than
the stripe unit.

> 2) parallelizing writes so not just one write in flight at a time

That won't make buffered IO any faster - writeback will still be the
bottleneck, and it already allows many IOs to be in flight at once.

> 3) options to turn off the page cache (large number of large file
> copies are not going to benefit from reuse of pages in the page cache
> so going through the page cache may be suboptimal in that case)

If you do this, you really need to use AIO+DIO to avoid blocking
(i.e. userspace can still be single threaded!), and the filesystem
needs to be able to tell the app what the optimal DIO sizes are (e.g.
XFS_IOC_DIOINFO has historically been used for this).

> 4) option to set the file size first, and then fill in writes (so
> non-extending writes)

This must only be an option, as it will cause problems with buffered
IO: it defeats all the extending-write optimisations that local
filesystems do. The same goes for using fallocate() to preallocate
files - that defeats all the anti-fragmentation and cross-file
allocation packing optimisations we do at writeback time, which are
enabled by delayed allocation.

In general, fine-grained control of extent allocation in local
filesystems from userspace is a recipe for rapidly aging and
degrading filesystem performance. We want to avoid that as much as
possible - it sets us back at least 25 years in terms of file layout
and allocation algorithm sophistication.

> 5) sparse file support
> (and it would also be nice to support copy_file_range syscall ... but
> that is unrelated to the above)

Make copy_file_range() handle that properly.

You also forgot all the benefits we'd get from parallelising
recursive directory walks and stat()ing inodes along the way (i.e.
the dir walk that rsync does to work out what it needs to copy).
Also, sorting the files to be copied by something like inode number,
rather than just operating in readdir order, can substantially
improve copying of large numbers of files. Chris Mason demonstrated
this years ago:

https://oss.oracle.com/~mason/acp/

> Am I missing some magic tool? Seems like Windows has various options
> for copy tools - but looking at Linux i/o patterns from these tools
> was pretty depressing - I am hoping that there are other choices.

Not that I know of. The Linux IO tools need to be dragged kicking and
screaming out of the 1980s. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
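On the SEEK_DATA/SEEK_HOLE point, a minimal sparse-aware copy loop
might look like the sketch below (Python over the raw fd syscalls;
the whence values are Linux-specific, and a filesystem that doesn't
track holes simply reports one big data extent, so the copy is still
correct). Reading each extent in one pread() is a simplification - a
real tool would chunk large extents.

```python
import os

def sparse_copy(src_path, dst_path):
    """Copy only the data extents of a sparse file, found with
    lseek(SEEK_DATA)/lseek(SEEK_HOLE); holes in the destination are
    created by ftruncate, not by writing zeroes."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src).st_size
        os.ftruncate(dst, size)          # set final length; rest stays holes
        pos = 0
        while pos < size:
            try:
                data = os.lseek(src, pos, os.SEEK_DATA)
            except OSError:              # ENXIO: no data past pos
                break
            # EOF counts as an implicit hole, so this never overruns.
            hole = os.lseek(src, data, os.SEEK_HOLE)
            buf = os.pread(src, hole - data, data)
            os.pwrite(dst, buf, data)
            pos = hole
    finally:
        os.close(src)
        os.close(dst)
```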
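And on the inode-ordering point, the acp trick is essentially the
sketch below: stat everything during the directory walk, then copy in
inode-number order rather than readdir order. (A real tool would
overlap the walk with the copies instead of building the whole list
first.)

```python
import os

def files_in_inode_order(root):
    """Walk a directory tree and return regular files sorted by
    inode number, approximating on-disk allocation order."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            entries.append((os.lstat(path).st_ino, path))
    entries.sort()                # inode order ~ allocation order
    return [path for _ino, path in entries]
```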