From: Patrick Goetz <pgoetz@math.utexas.edu>
To: Daire Byrne <daire@dneg.com>
Cc: Bruce Fields <bfields@fieldses.org>,
	Chuck Lever III <chuck.lever@oracle.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: parallel file create rates (+high latency)
Date: Tue, 25 Jan 2022 17:01:08 -0600
Message-ID: <a5627c80-4b03-29f2-1432-6e0f0b5197ef@math.utexas.edu>
In-Reply-To: <CAPt2mGNMGjq+i=k_6oYBYPFPCTR2UdeEtWfyeTU9uUC0OC=T4w@mail.gmail.com>



On 1/25/22 16:41, Daire Byrne wrote:
> On Tue, 25 Jan 2022 at 22:11, Patrick Goetz <pgoetz@math.utexas.edu> wrote:
>>
>> IDK, 4000 images per collection, with hundreds of collections on disk?
>> Say at least 500,000 files?  Maybe a million? With most files about 1GB
>> in size.  I was trying to just rsync it all from the data server to a
>> ZFS-based backup server in our data center, but the backup started
>> failing constantly because the filesystem would change after rsync had
>> already constructed an index. Even after an initial copy, a backup like
>> that runs for over a week.  The strategy I'm about to try and implement
>> is to NFS mount the data server's data partition to the backup server
>> and then have a script walk through the directory hierarchy, rsyncing
>> collections one at a time.  ZFS send/receive would probably be better,
>> but the data server isn't configured with ZFS.
> 
> We've strayed slightly off topic (even if we are talking about file
> creates over NFS) because you can get good parallel performance
> (creates, reads, writes, etc.) over NFS with simultaneous copies using
> lots of processes if distributed across lots of directories.
> 
> Well "good" being subjective. I get 1,500 creates/s in a single
> directory on a LAN NFS server from a single client and 160 creates/s
> aggregate over my extreme 200ms latency using 10 clients & 10 different
> directories. It seems fair all things considered.
> 
> But seeing as I do a lot of these kinds of big data moves (TBs) across
> both the LAN and WAN, I can perhaps offer some advice from experience
> that might be useful:
> 
> * walk the filesystem (locally) first to build a file list, split it
> and then use rsync --files-from (e.g. https://github.com/jbd/msrsync)
> to feed multiple simultaneous rsyncs.
> * avoid NFS and use rsyncd directly between the servers (no ssh) so
> filesystem walks are "local".
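
A minimal sketch of the first suggestion above (walk locally, split the list, feed each piece to its own rsync via --files-from). The function names, the round-robin split, and the destination string are hypothetical choices for illustration; msrsync itself may bucket files differently.

```python
import os


def build_file_list(root):
    """Walk root locally and return sorted file paths relative to root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full, root))
    return sorted(paths)


def split_round_robin(paths, n):
    """Deal paths into n lists so each rsync gets a similar file count."""
    return [paths[i::n] for i in range(n)]


def write_chunks(chunks, out_dir):
    """Write one list file per chunk and return the list file names."""
    names = []
    for i, chunk in enumerate(chunks):
        name = os.path.join(out_dir, "chunk-%03d.txt" % i)
        with open(name, "w") as fh:
            fh.write("".join(p + "\n" for p in chunk))
        names.append(name)
    return names


def rsync_command(list_file, src_root, dest):
    """Build one rsync invocation for a single list file.

    dest is a placeholder (assumption), e.g. an rsyncd URL of the form
    rsync://HOST/MODULE/ when talking to a daemon directly.
    """
    return ["rsync", "-a", "--files-from=" + list_file, src_root, dest]
```

Each command returned by rsync_command() can then be started as its own process, one per chunk. Splitting by file count is the simplest option; with files of very uneven size, splitting by total bytes would balance the work better.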


Thanks for this suggestion! This option didn't even occur to me.  The 
only downside is that this server gets really busy during image 
processing, so I'm a bit worried about loading it down with dozens of 
simultaneous rsync processes. Also, the biggest performance problem in 
this system (which includes multiple GPU-laden workstations and 2 other 
NFS servers) is always I/O bottlenecks.  I suppose the solution is to 
nice all the rsync processes to 19.
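
A rough sketch of what "nice to 19" could look like when the rsyncs are launched from a script. The helper names are hypothetical. Since nice only lowers CPU scheduling priority, the optional ionice wrapper is an extra assumption of this sketch, not something proposed in the thread, and it only has an effect with I/O schedulers that honour I/O priorities.

```python
import subprocess


def low_priority(cmd, niceness=19, idle_io=False):
    """Return cmd prefixed so that it runs at the given nice level.

    idle_io=True additionally wraps the command in `ionice -c 3`
    (idle I/O class). This is an assumption beyond the original message.
    """
    prefix = ["nice", "-n", str(niceness)]
    if idle_io:
        prefix = ["ionice", "-c", "3"] + prefix
    return prefix + list(cmd)


def run_low_priority(cmd, **kwargs):
    """Run cmd at low priority and return its exit status."""
    return subprocess.run(low_priority(cmd, **kwargs)).returncode
```

For example, low_priority(["rsync", "-a", "src/", "dst/"]) yields a command list starting with nice -n 19.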

Question: given that I usually run backups from cron, and given that 
they can take a long time, how does msrsync avoid stepping on itself?
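
For reference, one generic way a cron-driven job can avoid overlapping a previous run that is still going is an advisory lock file. This is only a sketch of that general technique on Linux; whether msrsync does anything like it is not established here, and the function name and lock path are hypothetical.

```python
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive, non-blocking lock on path.

    Returns an open file descriptor that holds the lock, or None if
    another run already holds it. The lock is released when the
    descriptor is closed or the process exits.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A backup script would call try_lock() first and exit quietly if it returns None, so a run started by cron while the previous one is still copying simply does nothing.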




> 
> The advantage of rsync is that it will do the filesystem walks at both
> ends locally and compare the directory trees as it goes along. The
> other nice thing it does is open a connection between sender and
> receiver and stream all the file data down it so it works really well
> even for lists of small files. The TCP connection and window scaling
> can sit at its maximum without any slow remote file metadata latency
> disrupting it. Avoid the encapsulation of ssh and use rsyncd instead
> as it just speeds everything up.
> 
> And as always with any WAN connection, large buffers, window scaling,
> no firewall DPI and maybe some fancy congestion control like BBR/2
> helps.
> 
> Daire
