git.vger.kernel.org archive mirror
From: Jeff Hostetler <git@jeffhostetler.com>
To: Jonathan Tan <jonathantanmy@google.com>, git@vger.kernel.org
Subject: Re: [RFC PATCH 1/3] promised-blob, fsck: introduce promised blobs
Date: Wed, 12 Jul 2017 13:29:11 -0400
Message-ID: <890a36fc-be16-83bc-fec6-94e21d0f7d0a@jeffhostetler.com>
In-Reply-To: <f9c7d4b3f800ea31e85e4897ee7048fec1e3c2f0.1499800530.git.jonathantanmy@google.com>



On 7/11/2017 3:48 PM, Jonathan Tan wrote:
> Currently, Git does not support repos with very large numbers of blobs
> or repos that wish to minimize manipulation of certain blobs (for
> example, because they are very large) very well, even if the user
> operates mostly on part of the repo, because Git is designed on the
> assumption that every blob referenced by a tree object is available
> somewhere in the repo storage.
> 
> As a first step to reducing this problem, introduce the concept of
> promised blobs. Each Git repo can contain a list of promised blobs and
> their sizes at $GIT_DIR/objects/promisedblob. This patch contains
> functions to query them; functions for creating and modifying that file
> will be introduced in later patches.


As part of my ongoing work on partial/narrow clone/fetch, I've also
looked at how to represent the set of omitted objects, and at whether
we should even try.

My primary concern is scale and managing the list of objects over time.

My fear is that this list will be quite large.  If we only want to omit
the very large blobs, then maybe not.  But if we want to expand that
scope to also omit other objects (such as a clone synchronized with a
sparse checkout), then that list will get large on large repos.  For
example, on the Windows repo we have (conservatively) 100M+ blobs (and
growing).  At 28 bytes per entry, that gives a 2.8GB list to manipulate.
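
A back-of-the-envelope sketch of that estimate follows (Python, for
brevity).  The split of the 28 bytes into a 20-byte SHA-1 plus an 8-byte
size is an assumption made for illustration; the names are hypothetical.

```python
# Rough size of a flat promised-blob list.
# Assumption: each entry is a 20-byte SHA-1 plus an 8-byte size (28 bytes).
HASH_BYTES = 20
SIZE_BYTES = 8
ENTRY_BYTES = HASH_BYTES + SIZE_BYTES


def list_size_bytes(num_blobs, entry_bytes=ENTRY_BYTES):
    """Return the size of a flat list with one fixed-width entry per blob."""
    return num_blobs * entry_bytes


windows_repo_blobs = 100_000_000  # "100M+ blobs", taken as a lower bound
total = list_size_bytes(windows_repo_blobs)
print(f"{total / 1e9:.1f} GB")  # prints "2.8 GB" (decimal gigabytes)
```

Since the blob count is a lower bound and still growing, the real list
would only be larger than this.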

If I understand your proposal, newly-omitted blobs would need to be
merged into the promised-blob list after each fetch.  The fetch itself
may not have that many new entries, but inserting them into the existing
list will be slow.  Also, mmap'ing and bsearch'ing a file of that size
will likely have issues.  And there's likely to be a very expensive step
to remove entries from the list as new blobs are received (or locally
created).
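
The cost pattern described above can be sketched as follows.  This is
an in-memory model, not the patch's code: it assumes a sorted list of
fixed-width entries (20-byte hash followed by an 8-byte big-endian
size), and all function names are hypothetical.  Lookup is a binary
search, but adding or removing even one entry means visiting the whole
list to produce the new one.

```python
import bisect
import heapq

HASH_BYTES = 20  # assumed SHA-1 object id


def pack_entry(oid, size):
    """Build one assumed 28-byte entry: hash, then 8-byte big-endian size."""
    return oid + size.to_bytes(8, "big")


def lookup_size(entries, oid):
    """Binary-search a list sorted by hash; return the size, or None."""
    i = bisect.bisect_left(entries, oid)
    if i < len(entries) and entries[i][:HASH_BYTES] == oid:
        return int.from_bytes(entries[i][HASH_BYTES:], "big")
    return None


def merge_after_fetch(existing, new_entries, received_oids):
    """Return a new sorted list: existing plus new, minus objects now local.

    Every existing entry is visited once, so the work is proportional to
    the size of the whole list even when the fetch adds only a few entries.
    """
    merged = heapq.merge(existing, sorted(new_entries))
    return [e for e in merged if e[:HASH_BYTES] not in received_oids]


def make_oid(n):
    """Fake object id for the demo."""
    return n.to_bytes(HASH_BYTES, "big")


existing = [pack_entry(make_oid(1), 10), pack_entry(make_oid(5), 50)]
updated = merge_after_fetch(
    existing,
    [pack_entry(make_oid(3), 30)],  # newly omitted by this fetch
    {make_oid(1)},                  # received, so no longer promised
)
```

On disk, the same merge implies rewriting a multi-gigabyte file after
each fetch unless the format is split into several smaller files.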

In such a "sparse clone", it would be nice to omit unneeded tree objects
as well as blobs.  I say that because we are finding with GVFS on the
Windows repo that, even with commits-and-trees-only filtering, the
number of tree objects is overwhelming.  So I'm also concerned about
limiting the list to just blobs.  If we need to have this list, it
should be able to contain any object.  (I'd suggest including an object
type in each entry.)
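
One possible shape for such an entry is sketched below.  The layout is
hypothetical (hash first so the list can still be sorted and searched by
object id, then a 1-byte type, then an 8-byte size); only the type
numbers are taken from Git's existing enum object_type.

```python
import struct

# Hypothetical 29-byte record, big-endian with no padding:
#   20-byte object id, 1-byte object type, 8-byte size.
ENTRY = struct.Struct(">20sBQ")

# Same numbering as Git's enum object_type.
OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG = 1, 2, 3, 4


def pack_entry(oid, obj_type, size):
    """Serialize one promised-object entry."""
    return ENTRY.pack(oid, obj_type, size)


def unpack_entry(raw):
    """Return (oid, obj_type, size) from one serialized entry."""
    return ENTRY.unpack(raw)


tree_oid = bytes(range(20))  # fake id for the demo
raw = pack_entry(tree_oid, OBJ_TREE, 4096)
```

The extra byte per entry makes the scale concern above slightly worse,
which is part of the trade-off.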

I assume that we'll also need a promised-blob.lock file to control
access during list manipulation.  This is already a sore spot with the
index; I'd hate to create another one.

I also have to wonder about the need to have a complete list of omitted
blobs up front.  It may be better to just relax the consistency checks
and assume a missing blob is "intentionally missing" rather than
indicating a corruption somewhere.  And then let the client do a later
round-trip to either demand-load the object -or- demand-load the
existence/size info if/when it really matters.
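
The relaxed model can be sketched like this.  The class and both
callbacks are hypothetical stand-ins for whatever the client would use
to talk to the server; the point is only that no up-front list exists,
and a round-trip happens when an object or its size is actually needed.

```python
class LazyObjectStore:
    """Toy object store that treats missing objects as intentionally omitted."""

    def __init__(self, fetch_object, fetch_size):
        self._local = {}
        self._fetch_object = fetch_object  # hypothetical: oid -> bytes or None
        self._fetch_size = fetch_size      # hypothetical: oid -> int or None
        self.round_trips = 0

    def add(self, oid, data):
        self._local[oid] = data

    def check_reference(self, oid):
        """Connectivity check: absence is not reported as corruption."""
        return "present" if oid in self._local else "assumed-omitted"

    def size(self, oid):
        """Return the object's size, asking the remote only if it is missing."""
        if oid in self._local:
            return len(self._local[oid])
        self.round_trips += 1
        return self._fetch_size(oid)

    def read(self, oid):
        """Return the object's content, demand-loading it on first use."""
        if oid not in self._local:
            self.round_trips += 1
            data = self._fetch_object(oid)
            if data is None:
                raise KeyError("object not available locally or remotely")
            self._local[oid] = data
        return self._local[oid]


remote = {b"a" * 20: b"hello"}  # stand-in for the server's object store
store = LazyObjectStore(
    remote.get,
    lambda oid: len(remote[oid]) if oid in remote else None,
)
status_before = store.check_reference(b"a" * 20)
size_only = store.size(b"a" * 20)    # one round-trip, no content transferred
content = store.read(b"a" * 20)      # second round-trip, content now cached
```

The trade-off is that a genuinely corrupt repository is only detected
when the remote cannot supply the object either.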

Maybe we should add a verb to your new fetch-blob endpoint to just get
the size of one or more objects to help with this.

Thanks,
Jeff

