All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Wong <e@80x24.org>
To: Kevin Wern <kevin.m.wern@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 00/11] Resumable clone
Date: Tue, 27 Sep 2016 21:51:43 +0000	[thread overview]
Message-ID: <20160927215143.GA32622@starla> (raw)
In-Reply-To: <1473984742-12516-1-git-send-email-kevin.m.wern@gmail.com>

Kevin Wern <kevin.m.wern@gmail.com> wrote:
> Hey, all,
> 
> It's been a while (sent a very short patch in May), but I've
> still been working on the resumable clone feature and checking up on
> the mailing list for any updates. After submitting the prime-clone
> service alone, I figured implementing the whole thing would be the best
> way to understand the full scope of the problem (this is my first real
> contribution here, and learning while working on such an involved
> feature has not been easy). 

Thank you for working on this.  I'm hugely interested in this
feature as both a cheapskate sysadmin and as a client with
unreliable connectivity (and I've barely been connected this month).

> This is a functional implementation handling a direct http/ftp URI to a
> single, fully connected packfile (i.e. the link is a direct path to the
> file, not a prefix or guess). My hope is that this acts as a bare
> minimum cross-section spanning the full requirments that can expand in
> width as more cases are added (.info file, split bundle, daemon
> download service). This is certainly not perfect, but I think it at
> least prototypes each component involved in the workflow.
> 
> This patch series is based on jc/bundle, because the logic to find the
> tips of a pack's history already exists there (I call index-pack
> --clone-bundle on the downloaded file, and read the file to write the
> references to a temporary directory). If I need to re-implement this
> logic or base it on another branch, let me know. For ease of pulling
> and testing, I included the branch here:
> 
> https://github.com/kevinwern/git/tree/feature/prime-clone

Am I correct this imposes no additional storage burden for servers?

(unlike the current .bundle dance used by kernel.org:
  https://www.kernel.org/cloning-linux-from-a-bundle.html )

That would be great!

> Although there are a few changes internally from the last patch,
> the "alternate resource" url to download is configured on the
> server side in exactly the same way:
> 
> [primeclone]
> 	url = http://location/pack-$NAME.pack
> 	filetype = pack

If unconfigured, I wonder if a primeclone pack can be inferred by
the existence of a pack bitmap (or merely being the biggest+oldest
pack for dumb HTTP).

> The prime-clone service simply outputs the components as:
> 
> ####url filetype
> 0000
> 
> On the client side, the transport_prime_clone and
> transport_download_primer APIs are built to be more robust (i.e. read
> messages without dying due to protocol errors), so that git clone can
> always try them without being dependent on the capability output of
> git-upload-pack. transport_download_primer is dependent on the success
> of transport_prime_clone, but transport_prime_clone is always run on an
> initial clone. Part of achieving this robustness involves adding
> *_gentle functions to pkt_line, so that prime_clone can fail silently
> without dying.
> 
> The transport_download_primer function uses a resumable download,
> which is applicable to both automatic and manual resuming. Automatic
> is programmatically reconnecting to the resource after being
> interrupted (up to a set number of times). Manual is using a newly
> taught --resume option on the command line:
> 
> git clone --resume <resumable_work_or_git_dir>

I think calling "git fetch" should resume, actually.
It would reduce the learning curve and seems natural to me:
"fetch" is jabout grabbing whatever else appeared since the
last clone/fetch happened.

> Right now, a manually resumable directory is left behind only if the
> *client* is interrupted while a new junk mode, JUNK_LEAVE_RESUMABLE,
> is set (right before the download). For an initial clone, if the
> connection fails after automatic resuming, the client erases the
> partial resources and falls through to a normal clone. However, once a
> resumable directory is left behind by the program, it is NEVER
> deleted/abandoned after it is continued with --resume.

I'm not sure if erasing partial resources should ever be done
automatically.  Perhaps a note to the user explaining the
situation and potential ways to correct/resume it.

> I think determining when a resource is "unsalvageable" should be more
> nuanced. Especially in a case where a connection is perpetually poor
> and the user wishes to resume over a long period of time. The timeout
> logic itself *definitely* needs more nuance than "repeat 5 times", such
> as expanding wait times and using earlier successes when deciding to
> try again. Right now, I think the most important part of this patch is
> that these two paths (falling through after a failed download, exiting
> to be manually resumed later) exist.
> 
> Off the top of my head, outstanding issues/TODOs inlcude:
> 	- The above issue of determining when to fall through, when to
> 	  reattempt, and when to write the resumable info and exit
> 	  in git clone.

My current (initial) reaction is: you're overthinking this.

I think it's less surprising to a user to always write resumable
info and let them know how to resume (or abort); rather than
trying to second-guess their intent.

Going by the zero-one-infinity rule, I'd probably attempt an
auto-retry once on socket errors before saving state and bailing
with instructions on how to resume.

If they hit Ctrl-C manually, then just tell them they can
either resume or "rm -r" the directory.

> 	- Creating git-daemon service to download a resumable resource.
> 	  Pretty straightforward, I think, especially if
> 	  http.getanyfile already exists. This falls more under
> 	  "haven't gotten to yet" than dilemma.

I think this could be handled natively by git-daemon for
trickling data to slow clients in the existing event loop (and
expanded to use epoll/kqueue).  Similar to how X-Sendfile works
with (Apache|lighttpd) or X-Accel in nginx.

This would be cheaper than wasting a process (or thread) to
trickle to low-bandwidth clients.  But this may be an
optimization we defer until we've ironed out other parts.

> 	- Logic for git clone to determine when a full clone would
> 	  be superior, such as when a clone is local or a reference is
> 	  given.
> 	- Configuring prime-clone for multiple resources, in two
> 	  dimensions: (a) resources to choose from (e.g. fall back to
> 	  a second resource if the first one doesn't work) and (b)
> 	  resources to be downloaded together or in sequence (e.g.
> 	  download http://host/this, then http://host/that). Maybe
> 	  prime-clone could also handle client preferences in terms of
> 	  filetype or protocol. For this, I just have to re-read a few
> 	  discussions about the filetypes we use to see if there are
> 	  any outliers that aren't representable in this way. I think
> 	  this is another "haven't gotten to yet".

Perhaps using the existing http-alternates (and automatic
primeclone pack inference I wrote about above) can be done.

<snip>

> 	- Creating the logic to guess a packfile, and append that to a
> 	  prefix specified by the admin. Additionally, allowing the
> 	  admin to use a custom script to use their own logic to
> 	  output the URL.

Yes :) Though I'm not sure if the custom script is necessary.

  parent reply	other threads:[~2016-09-27 21:51 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-09-16  0:12 [PATCH 00/11] Resumable clone Kevin Wern
2016-09-16  0:12 ` [PATCH 01/11] Resumable clone: create service git-prime-clone Kevin Wern
2016-09-16 20:53   ` Junio C Hamano
2016-09-28  4:40     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 02/11] Resumable clone: add prime-clone endpoints Kevin Wern
2016-09-19 13:15   ` Duy Nguyen
2016-09-28  4:43     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 03/11] pkt-line: create gentle packet_read_line functions Kevin Wern
2016-09-16 22:17   ` Junio C Hamano
2016-09-28  4:42     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 04/11] Resumable clone: add prime-clone to remote-curl Kevin Wern
2016-09-19 13:52   ` Duy Nguyen
2016-09-28  6:45     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 05/11] Resumable clone: add output parsing to connect.c Kevin Wern
2016-09-16  0:12 ` [PATCH 06/11] Resumable clone: implement transport_prime_clone Kevin Wern
2016-09-16  0:12 ` [PATCH 07/11] Resumable clone: add resumable download to http/curl Kevin Wern
2016-09-16 22:45   ` Junio C Hamano
2016-09-28  6:41     ` Kevin Wern
2016-09-16  0:12 ` [PATCH 08/11] Resumable clone: create transport_download_primer Kevin Wern
2016-09-16  0:12 ` [PATCH 09/11] path: add resumable marker Kevin Wern
2016-09-19 13:24   ` Duy Nguyen
2016-09-16  0:12 ` [PATCH 10/11] run command: add RUN_COMMAND_NO_STDOUT Kevin Wern
2016-09-16 23:07   ` Junio C Hamano
2016-09-18 19:22     ` Johannes Schindelin
2016-09-28  4:46     ` Kevin Wern
2016-09-28 17:54       ` Junio C Hamano
2016-09-28 18:06         ` Kevin Wern
2016-09-16  0:12 ` [PATCH 11/11] Resumable clone: implement primer logic in git-clone Kevin Wern
2016-09-16 23:32   ` Junio C Hamano
2016-09-28  5:49     ` Kevin Wern
2016-09-19 14:04   ` Duy Nguyen
2016-09-19 17:16     ` Junio C Hamano
2016-09-28  4:44     ` Kevin Wern
2016-09-16 20:47 ` [PATCH 00/11] Resumable clone Junio C Hamano
2016-09-27 21:51 ` Eric Wong [this message]
2016-09-27 22:07   ` Junio C Hamano
2016-09-28 17:32     ` Junio C Hamano
2016-09-28 18:22       ` Junio C Hamano
2016-09-28 20:46     ` Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160927215143.GA32622@starla \
    --to=e@80x24.org \
    --cc=git@vger.kernel.org \
    --cc=kevin.m.wern@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.