Date: Sun, 1 Jul 2018 10:10:05 +1000
From: Dave Chinner
To: Steve French
Cc: linux-fsdevel
Subject: Re: Copy tools on Linux
Message-ID: <20180701001005.GR19934@dastard>

On Fri, Jun 29, 2018 at 09:37:27PM -0500, Steve French wrote:
> I have been looking at i/o patterns from various copy tools on Linux,
> and it is pretty discouraging - I am hoping that I am forgetting an
> important one that someone can point me to ...
>
> Some general problems:
> 1) if source and target on the same file system it would be nice to
> call the copy_file_range syscall (AFAIK only test tools call that),
> although in some cases at least cp can do it for --reflink

copy_file_range() should be made to do the right thing in as many
scenarios as we can document, and then userspace should be switched
over to using it at all times. Aggregate all the knowledge in one
place, where we know what the filesystem implementations are and can
get hints to do the right thing.

> 2) if source and target on different file systems there are multiple
> problems
>    a) smaller i/o (rsync e.g. maxes at 128K!)
>    b) no async parallelized writes sent down to the kernel so writes
> get serialized (either through page cache, or some fs offer option to
> disable it - but it still is one thread at a time)

That's because, historically, copying data into the page cache for
buffering has been orders of magnitude faster than actually doing the
IO. These days, with PCIe based storage, not so much. Indeed, for
bulk data copy on NVMe based storage I wonder if we even need
buffered IO anymore...
> c) sparse file support is mediocre (although cp has some support
> for it, and can call fiemap in some cases)

Using fiemap for this is broken and will lead to data corruption,
especially if you start parallelising IO to individual files.
SEEK_DATA/SEEK_HOLE should be used instead.

> d) for file systems that prefer setting the file size first (to
> avoid metadata penalties with multiple extending writes) - AFAIK only
> rsync offers that, but rsync is one of the slowest tools otherwise

We don't want to do this for most local filesystems, as it defeats
things like speculative preallocation, which is used to optimise IO
patterns and prevent file fragmentation when concurrent parallel
writes are done.

> I have looked at cp, dd, scp, rsync, gio, gcp ... are there others?
>
> What I am looking for (and maybe we just need to patch cp and rsync
> etc.) is more like what you see with other OS ...
> 1) options for large i/o sizes (network latencies in network/cluster
> fs can be large, so prefer larger 1M or 8M in some cases I/Os)

The -o largeio mount option on XFS will expose the stripe unit as the
minimum efficient IO size returned in stat. IIRC there's another
option combination that makes it emit the stripe width rather than
the stripe unit.

> 2) parallelizing writes so not just one write in flight at a time

That won't make buffered IO any faster - writeback will still be the
bottleneck, and it already allows many IOs to be in flight at once.

> 3) options to turn off the page cache (large number of large file
> copies are not going to benefit from reuse of pages in the page cache
> so going through the page cache may be suboptimal in that case)

If you do this, you really need to use AIO+DIO to avoid blocking
(i.e. userspace can still be single threaded!), and the filesystem
needs to be able to tell the app what the optimal DIO sizes are (e.g.
XFS_IOC_DIOINFO has historically been used for this).

> 4) option to set the file size first, and then fill in writes (so
> non-extending writes)

This must only be an option, as it will cause problems with buffered
IO: it defeats all the extending-write optimisations that local
filesystems do. The same goes for using fallocate() to preallocate
files - that defeats all the anti-fragmentation and cross-file
allocation packing optimisations we do at writeback time, which are
enabled by delayed allocation.

In general, fine-grained control of extent allocation in local
filesystems from userspace is a recipe for rapidly aging and
degrading filesystem performance. We want to avoid that as much as
possible - it sets us back at least 25 years in terms of file layout
and allocation algorithm sophistication.

> 5) sparse file support
> (and it would also be nice to support copy_file_range syscall ... but
> that is unrelated to the above)

Make copy_file_range() handle that properly.

You also forgot all the benefits we'd get from parallelising
recursive directory walks and stat()ing inodes along the way (i.e.
the dir walk that rsync does to work out what it needs to copy).
Also, sorting the files to be copied by something like inode number,
rather than just operating in readdir order, can substantially
improve copying of large numbers of files. Chris Mason demonstrated
this years ago:

https://oss.oracle.com/~mason/acp/

> Am I missing some magic tool? Seems like Windows has various options
> for copy tools - but looking at Linux i/o patterns from these tools
> was pretty depressing - I am hoping that there are other choices.

Not that I know of. The Linux IO tools need to be dragged kicking and
screaming out of the 1980s. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
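On the SEEK_DATA/SEEK_HOLE point, a minimal sparse-aware copy loop
might look like the sketch below (Python over the raw fd syscalls;
the whence values are Linux-specific, and a filesystem that doesn't
track holes simply reports one big data extent, so the copy is still
correct). Reading each extent in one pread() is a simplification - a
real tool would chunk large extents.

```python
import os

def sparse_copy(src_path, dst_path):
    """Copy only the data extents of a sparse file, found with
    lseek(SEEK_DATA)/lseek(SEEK_HOLE); holes in the destination are
    created by ftruncate, not by writing zeroes."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src).st_size
        os.ftruncate(dst, size)          # set final length; rest stays holes
        pos = 0
        while pos < size:
            try:
                data = os.lseek(src, pos, os.SEEK_DATA)
            except OSError:              # ENXIO: no data past pos
                break
            # EOF counts as an implicit hole, so this never overruns.
            hole = os.lseek(src, data, os.SEEK_HOLE)
            buf = os.pread(src, hole - data, data)
            os.pwrite(dst, buf, data)
            pos = hole
    finally:
        os.close(src)
        os.close(dst)
```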
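And on the inode-ordering point, the acp trick is essentially the
sketch below: stat everything during the directory walk, then copy in
inode-number order rather than readdir order. (A real tool would
overlap the walk with the copies instead of building the whole list
first.)

```python
import os

def files_in_inode_order(root):
    """Walk a directory tree and return regular files sorted by
    inode number, approximating on-disk allocation order."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            entries.append((os.lstat(path).st_ino, path))
    entries.sort()                # inode order ~ allocation order
    return [path for _ino, path in entries]
```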