From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f195.google.com ([209.85.192.195]:36436 "EHLO
        mail-pf0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751060AbeF3OMW (ORCPT );
        Sat, 30 Jun 2018 10:12:22 -0400
Received: by mail-pf0-f195.google.com with SMTP id u16-v6so5493344pfh.3
        for ; Sat, 30 Jun 2018 07:12:22 -0700 (PDT)
MIME-Version: 1.0
References: <20180630131351.ks64ekwpcvu45yqw@merlin>
In-Reply-To: <20180630131351.ks64ekwpcvu45yqw@merlin>
From: Steve French
Date: Sat, 30 Jun 2018 09:12:10 -0500
Message-ID:
Subject: Re: Copy tools on Linux
To: Goldwyn Rodrigues
Cc: linux-fsdevel
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On Sat, Jun 30, 2018 at 8:13 AM Goldwyn Rodrigues wrote:
>
> Hi Steve,
>
> On 06-29 21:37, Steve French wrote:
> > I have been looking at i/o patterns from various copy tools on Linux,
> > and it is pretty discouraging - I am hoping that I am forgetting an
> > important one that someone can point me to ...
> >
> > Some general problems:
> > 1) if source and target on the same file system it would be nice to
> > call the copy_file_range syscall (AFAIK only test tools call that),
> > although in some cases at least cp can do it for --reflink
>
> I have submitted a patch set for copy_file_range() across filesystems
> which can atleast use splice() [1] as a part of enabling holes in
> copy_file_range(), but it has not been incorporated so far.

Do you have a link to the patch?

> > 1) options for large i/o sizes (network latencies in network/cluster
> > fs can be large, so prefer larger 1M or 8M in some cases I/Os)
>
> Unfortunately tools derive I/O size from stat.st_blksize which may be
> pretty small for performing "efficient" I/O. However, the tools such as
> cp also determine series of zeros to convert into holes. So for that
> reason it works well. OTOH, that is not the most common case of tools,
> which I agree could be made faster.

dd is nice in that you can set the i/o size (as can rsync), but it
seems sane to allow rsize/wsize to be configurable.

> > 2) parallelizing writes so not just one write in flight at a time
>
> What would the resultant file be in case of errors? Should the
> destination file be considered partially copied? man cp does not cover
> the case errors but currently it is assumed the file is partially copied
> and correct until the point of error.

Whether parallel i/o on one file, or multiple files, either will be a
huge help. Just did a quick google search on the topic and it pointed
to a sysadmin article discussing one of the more common copy tools on
Windows, robocopy:

"Perhaps the most important switch to pay attention is /MT, which is a
feature that enables Robocopy to copy files in multi-threaded mode...
with multi-threaded enabled, you can copy multiple files at the same
time better utilizing the bandwidth and significantly speeding up the
process. If you don’t set a number when using the /MT switch, then the
default number will be 8, which means that Robocopy will try to copy
eight files at the same time. However, Robocopy supports 1 to 128
threads."

This seems sane - even if cp can't do it, having a tool that can
reasonably get at least four i/o in flight (perhaps for different
files, with only one i/o per file) would be a huge help.

> > 4) option to set the file size first, and then fill in writes (so
> > non-extending writes)
>
> File size or file allocation? How would you determine what file
> size to set? Consider the case the source file is sparse. It can be
> calculated, but needs more thought.

The goal here is to allow a copy option (as rsync does) for target
file systems where metadata sync is expensive, or where expensive
locking is needed for setting end-of-file: set the file size early so
it doesn't get reset 100s of times on extending writes.

--
Thanks,

Steve
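
For illustration only, the ideas in points 1 and 4 (plus the larger
i/o size) can be sketched roughly as below. This is a minimal sketch,
not how cp, dd or rsync behave today. It assumes Python 3.8+ on Linux
for os.copy_file_range(); the function name copy_presized and the 8M
default are hypothetical choices made for the example. It does not
preserve holes in sparse files, and it keeps only one i/o in flight.

```python
import os

# Assumed default, taken from the "1M or 8M" suggestion in the thread.
CHUNK = 8 * 1024 * 1024


def copy_presized(src_path, dst_path, chunk=CHUNK):
    """Copy src to dst, setting the target size once before writing data.

    Hypothetical helper: returns the number of bytes copied.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        sfd, dfd = src.fileno(), dst.fileno()
        size = os.fstat(sfd).st_size
        # Set end-of-file up front so the writes below never extend the file.
        os.ftruncate(dfd, size)

        # Prefer the in-kernel copy where the platform offers it.
        use_cfr = hasattr(os, "copy_file_range")
        offset = 0
        while offset < size:
            want = min(chunk, size - offset)
            if use_cfr:
                try:
                    n = os.copy_file_range(sfd, dfd, want, offset, offset)
                except OSError:
                    # Not supported here (e.g. across file systems on some
                    # kernels): fall back to plain reads and writes, which
                    # will raise again if there is a real I/O error.
                    use_cfr = False
                    continue
            else:
                data = os.pread(sfd, want, offset)
                n = os.pwrite(dfd, data, offset) if data else 0
            if n == 0:
                break  # source shrank while we were copying
            offset += n
        return offset
```

Calling copy_presized("a.img", "b.img", chunk=1024 * 1024) would copy
in 1M requests instead. Getting several requests in flight at once
would need threads or async i/o on top of this, and raises the error
handling question discussed above.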