* Partial clone design (with connectivity check for locally-created objects)
@ 2017-08-04 21:51 Jonathan Tan
  2017-08-04 22:51 ` Junio C Hamano
  2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
  0 siblings, 2 replies; 18+ messages in thread
From: Jonathan Tan @ 2017-08-04 21:51 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jonathan Nieder, peartben, christian.couder

After some discussion in [1] (in particular, about preserving the
functionality of the connectivity check as much as possible) and some
in-office discussion, here's an updated design.

Overview
========

This is an update of the design in [1].

The main difference between this and other related work [1] [2] [3] is
that we can still check connectivity between locally-created objects
without having to consult a remote server for any information.

In addition, the object loader writes to an incomplete packfile. This
(i) ensures that Git has immediate access to the object, (ii) ensures
that not too many files are written during a single Git invocation, and
(iii) prevents some unnecessary copies (compared to, for example,
transmitting entire objects through the protocol).

Local repo layout
=================

Objects in the local repo are further divided into "homegrown" and
"imported" objects.

"Imported" objects must be in a packfile that has a "<pack name>.remote"
file with arbitrary text (similar to the ".keep" file). They come from
clones, fetches, and the object loader (see below).

"Homegrown" objects are every other object.

Object loader
=============

The object loader is a process that can obtain objects from elsewhere,
given their hashes, and write their packed representation to a
client-given file.

The first time a missing object is needed during an invocation of Git,
Git creates a temporary packfile and writes the header with a
placeholder number of objects. Then, it starts the object loader,
passing in the name of that temporary packfile.

Whenever a missing object is needed, Git sends the hash of the missing
object and expects the loader to append (with O_APPEND) the object to
that packfile. Git keeps track of the object offsets as it goes, and Git
can use the contents of that incomplete packfile. This is similar to
what "git fast-import" does.

When Git exits, it writes the actual number of objects into the header,
appends the packfile checksum, moves the packfile to its final location,
and writes a .idx and a .remote file.
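
To make the packfile lifecycle concrete, here is a sketch in C of the
placeholder-header technique described above, using Git's existing pack
format definitions and the fixup_pack_header_footer() helper (the same
one "git fast-import" uses); error handling is omitted:

	static void start_loader_pack(int fd)
	{
		struct pack_header hdr;

		hdr.hdr_signature = htonl(PACK_SIGNATURE); /* "PACK" */
		hdr.hdr_version = htonl(PACK_VERSION);
		hdr.hdr_entries = htonl(0); /* placeholder object count */
		write_in_full(fd, &hdr, sizeof(hdr));
		/* the loader then appends objects with O_APPEND */
	}

	static void finish_loader_pack(int fd, const char *path,
				       uint32_t nr_objects)
	{
		unsigned char sha1[20];

		/* rewrite the count, rehash, and append the checksum */
		fixup_pack_header_footer(fd, sha1, path, nr_objects,
					 NULL, 0);
	}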

Connectivity check
==================

An object walk is performed as usual from the tips (see the
documentation for fsck etc. for which tips they use).

A "homegrown" object is valid if each object it references:
 1. is a "homegrown" object,
 2. is an "imported" object, or
 3. is referenced by an "imported" object.

The references of an "imported" object are not checked.
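
Expressed as code, the rule for a single reference amounts to the
following sketch (is_promise() stands for rule 3, "referenced by an
"imported" object"; a function of that name appears in the RFC patch
later in this thread):

	/*
	 * A reference held by a "homegrown" object is acceptable if we
	 * have the referent ourselves (rules 1 and 2) or if some
	 * "imported" object refers to it (rule 3).
	 */
	static int reference_ok(const struct object_id *referent)
	{
		return has_object_file(referent) || is_promise(referent);
	}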

Performance notes
-----------------

Because of rule 3 above, iteration through every "imported" object (or,
at least, every "imported" object of a certain type) is sometimes
required.

For fsck, this should be fine because (i) this is not a regression since
currently all objects must be iterated through anyway, and (ii) fsck
prioritizes correctness over speed.

For fetch, the speed of the connectivity check is immaterial; the
connectivity check no longer needs to be performed because all objects
obtained from the remote are, by definition, "imported" objects.

Connectivity checks might also be run during other commands like
"receive-pack". I don't expect partial clones to use such commands
often. They will still work, but their performance is a secondary
concern in this design.

Impact on other tools
=====================

"git gc" will need to not do anything to an "imported" object, even if
it is unreachable, without ensuring that the connectivity check will
succeed in that object's absence. (Special attention to rule 3 under
"Connectivity check".)

If this design stands, the initial patch set will probably have "git gc"
not touch "imported" packs at all, trivially satisfying the above. In
the future, "git gc" will need either to expel such objects into loose
objects (as is currently done for normal packs), treating them as
"homegrown" objects (they are unreachable, so they won't interfere with
future connectivity checks), or to delete them outright - though there
may be race conditions to think through.

"git repack" will need to differentiate between packs with ".remote" and
packs without.

[1] https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
[2] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
[3] https://public-inbox.org/git/20170803091926.1755-1-chriscool@tuxfamily.org/



* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-04 21:51 Partial clone design (with connectivity check for locally-created objects) Jonathan Tan
@ 2017-08-04 22:51 ` Junio C Hamano
  2017-08-05  0:21   ` Jonathan Tan
  2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-08-04 22:51 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Jonathan Nieder, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> "Imported" objects must be in a packfile that has a "<pack name>.remote"
> file with arbitrary text (similar to the ".keep" file). They come from
> clones, fetches, and the object loader (see below).
> ...
> A "homegrown" object is valid if each object it references:
>  1. is a "homegrown" object,
>  2. is an "imported" object, or
>  3. is referenced by an "imported" object.

Overall it captures what was discussed, and I think it is a good
start.

I doubt you want to treat all fetches/clones the same way as the
"lazy object" loading, though.  You may be critically rely on the
corporate central server that will give the objects it "promised"
when you cloned from it lazily (i.e. it may have given you a commit,
but not its parents or objects contained in its tree--you still know
that the parents and the tree and its contents will later be
available and rely on that fact).  You trust that and build on top,
so the packfile you obtained when you cloned from such a server
should count as "imported".  But if you exchanged wip changes with
your colleagues by fetching or pushing peer-to-peer, without the
corporate central server knowing, you would want to treat objects in
packs (or loose objects) you obtained that way as "not imported".

Also I think "imported" vs "homegrown" may be a bit of a misnomer; the
idea to split objects into two camps sounds like a good idea, and
"imported" probably is an OK name to use for the category that is a
group of objects that you know/trust are backed by your lazy
loader.  But the other one does not have to be "home"-grown.

Well, the names are not that important, but I think the line between
the two classes should not be "everything that came from clone and
fetch is imported", which is a more important point I am trying to
make.

Thanks.


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-04 22:51 ` Junio C Hamano
@ 2017-08-05  0:21   ` Jonathan Tan
  2017-08-07 19:12     ` Ben Peart
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Tan @ 2017-08-05  0:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jonathan Nieder, peartben, christian.couder

On Fri, 04 Aug 2017 15:51:08 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > "Imported" objects must be in a packfile that has a "<pack name>.remote"
> > file with arbitrary text (similar to the ".keep" file). They come from
> > clones, fetches, and the object loader (see below).
> > ...
> > A "homegrown" object is valid if each object it references:
> >  1. is a "homegrown" object,
> >  2. is an "imported" object, or
> >  3. is referenced by an "imported" object.
> 
> Overall it captures what was discussed, and I think it is a good
> start.
> 
> I doubt you want to treat all fetches/clones the same way as the
> "lazy object" loading, though.  You may be critically rely on the
> corporate central server that will give the objects it "promised"
> when you cloned from it lazily (i.e. it may have given you a commit,
> but not its parents or objects contained in its tree--you still know
> that the parents and the tree and its contents will later be
> available and rely on that fact).  You trust that and build on top,
> so the packfile you obtained when you cloned from such a server
> should count as "imported".  But if you exchanged wip changes with
> your colleages by fetching or pushing peer-to-peer, without the
> corporate central server knowing, you would want to treat objects in
> packs (or loose objects) you obtained that way as "not imported".

That's true. I discussed this with a teammate and we might need to make
extensions.lazyObject be the name of the "corporate central server"
remote instead, and have a "loader" setting within that remote, so that
we can distinguish that objects from this server are "imported" but
objects from other servers are not.
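
As a sketch of how that could look, with a hypothetical
remote.<name>.loader config key and extensions.lazyObject holding the
remote's name, the loader command could be looked up like this:

	static char *get_lazy_loader(void)
	{
		struct strbuf key = STRBUF_INIT;
		char *loader = NULL;

		/* repository_format_lazy_object: the remote's name */
		strbuf_addf(&key, "remote.%s.loader",
			    repository_format_lazy_object);
		git_config_get_string(key.buf, &loader);
		strbuf_release(&key);
		return loader;
	}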

The connectivity check shouldn't be slow in this case because fetches
are usually onto tips that we have (so we don't hit case 3).

> Also I think "imported" vs "homegrown" may be a bit of a misnomer; the
> idea to split objects into two camps sounds like a good idea, and
> "imported" probably is an OK name to use for the category that is a
> group of objects that you know/trust are backed by your lazy
> loader.  But the other one does not have to be "home"-grown.
> 
> Well, the names are not that important, but I think the line between
> the two classes should not be "everything that came from clone and
> fetch is imported", which is a more important point I am trying to
> make.
> 
> Thanks.

Maybe "imported" vs "non-imported" would be better. I agree that the
objects in the non-"imported" group could still be obtained from
elsewhere.

Thanks for your comments.


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-05  0:21   ` Jonathan Tan
@ 2017-08-07 19:12     ` Ben Peart
  2017-08-07 19:21       ` Jonathan Nieder
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Ben Peart @ 2017-08-07 19:12 UTC (permalink / raw)
  To: Jonathan Tan, Junio C Hamano; +Cc: git, Jonathan Nieder, christian.couder



On 8/4/2017 8:21 PM, Jonathan Tan wrote:
> On Fri, 04 Aug 2017 15:51:08 -0700
> Junio C Hamano <gitster@pobox.com> wrote:
> 
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>> file with arbitrary text (similar to the ".keep" file). They come from
>>> clones, fetches, and the object loader (see below).
>>> ...
>>> A "homegrown" object is valid if each object it references:
>>>   1. is a "homegrown" object,
>>>   2. is an "imported" object, or
>>>   3. is referenced by an "imported" object.
>>
>> Overall it captures what was discussed, and I think it is a good
>> start.

I missed the offline discussion and so am trying to piece together what 
this latest design is trying to do.  Please let me know if I'm not 
understanding something correctly.

From what I can tell, objects are going to be segmented into two
"types" - those that were fetched from a remote source that allows 
partial clones/fetches (lazyobject/imported) and those that come from 
"regular" remote sources (homegrown) that requires all objects to exist 
locally.

FWIW, the names here are not making things clearer for me. If I'm 
correct perhaps "partial" and "normal" would be better to indicate the 
type of the source? Anyway...

Once the objects are segmented into the 2 types, the fsck connectivity 
check code is updated to ignore missing objects from "partial" remotes 
but still expect/validate them from "normal" remotes.

This compromise seems reasonable - don't generate errors for missing 
objects for remotes that returned a partial clone but do generate errors 
for missing objects from normal clones as a missing object is always an 
error in this case.

This segmentation is what is driving the need for the object loader to 
build a new local pack file for every command that has to fetch a 
missing object.  For example, we can't just write a tree object from a 
"partial" clone into the loose object store as we have no way for fsck 
to treat them differently and ignore any missing objects referenced by 
that tree object.

My concern with this proposal is the combination of 1) writing a new 
pack file for every git command that ends up bringing down a missing 
object and 2) gc not compressing those pack files into a single pack file.

We all know that git doesn't scale well with a lot of pack files as it 
has to do a linear search through all the pack files when attempting to 
find an object.  I can see that very quickly, there would be a lot of 
pack files generated and with gc ignoring "partial" pack files, this 
would never get corrected.

In our usage scenarios, _all_ of the objects come from "partial" clones 
so all of our objects would end up in a series of "partial" pack files 
and would have pretty poor performance as a result.

I wondered if it is possible to flag a specific remote as "partial" and 
have fsck be able to track any given object back to the remote and then 
properly handle the fact that it was missing based on that. I couldn't 
think of a good way to do that without some additional data structure 
that would have to be built/maintained (i.e. promises).

That thinking did lead me back to wondering again if we could live with 
a repo specific flag.  If any clone/fetch was "partial" the flag is set 
and fsck ignore missing objects whether they came from a "partial" 
remote or not.

I'll admit it isn't as robust if someone is mixing and matching remotes 
from different servers some of which are partial and some of which are 
not.  I'm not sure how often that would actually happen but I _am_ 
certain a single repo specific flag is a _much_ simpler model than 
anything else we've come up with so far.

>>
>> I doubt you want to treat all fetches/clones the same way as the
>> "lazy object" loading, though.  You may be critically rely on the
>> corporate central server that will give the objects it "promised"
>> when you cloned from it lazily (i.e. it may have given you a commit,
>> but not its parents or objects contained in its tree--you still know
>> that the parents and the tree and its contents will later be
>> available and rely on that fact).  You trust that and build on top,
>> so the packfile you obtained when you cloned from such a server
>> should count as "imported".  But if you exchanged wip changes with
>> your colleagues by fetching or pushing peer-to-peer, without the
>> corporate central server knowing, you would want to treat objects in
>> packs (or loose objects) you obtained that way as "not imported".
> 
> That's true. I discussed this with a teammate and we might need to make
> extensions.lazyObject be the name of the "corporate central server"
> remote instead, and have a "loader" setting within that remote, so that
> we can distinguish that objects from this server are "imported" but
> objects from other servers are not.
> 
> The connectivity check shouldn't be slow in this case because fetches
> are usually onto tips that we have (so we don't hit case 3).
> 
>> Also I think "imported" vs "homegrown" may be a bit of a misnomer; the
>> idea to split objects into two camps sounds like a good idea, and
>> "imported" probably is an OK name to use for the category that is a
>> group of objects that you know/trust are backed by your lazy
>> loader.  But the other one does not have to be "home"-grown.
>>
>> Well, the names are not that important, but I think the line between
>> the two classes should not be "everything that came from clone and
>> fetch is imported", which is a more important point I am trying to
>> make.
>>
>> Thanks.
> 
> Maybe "imported" vs "non-imported" would be better. I agree that the
> objects in the non-"imported" group could still be obtained from
> elsewhere.
> 
> Thanks for your comments.
> 


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:12     ` Ben Peart
@ 2017-08-07 19:21       ` Jonathan Nieder
  2017-08-08 14:18         ` Ben Peart
  2017-08-07 19:41       ` Junio C Hamano
  2017-08-07 23:10       ` Jonathan Tan
  2 siblings, 1 reply; 18+ messages in thread
From: Jonathan Nieder @ 2017-08-07 19:21 UTC (permalink / raw)
  To: Ben Peart; +Cc: Jonathan Tan, Junio C Hamano, git, christian.couder

Hi,

Ben Peart wrote:
>> On Fri, 04 Aug 2017 15:51:08 -0700
>> Junio C Hamano <gitster@pobox.com> wrote:
>>> Jonathan Tan <jonathantanmy@google.com> writes:

>>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>>> file with arbitrary text (similar to the ".keep" file). They come from
>>>> clones, fetches, and the object loader (see below).
>>>> ...
>>>>
>>>> A "homegrown" object is valid if each object it references:
>>>>  1. is a "homegrown" object,
>>>>  2. is an "imported" object, or
>>>>  3. is referenced by an "imported" object.
>>>
>>> Overall it captures what was discussed, and I think it is a good
>>> start.
>
> I missed the offline discussion and so am trying to piece together
> what this latest design is trying to do.  Please let me know if I'm
> not understanding something correctly.

I believe
https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
and the surrounding thread (especially
https://public-inbox.org/git/xmqqefsudjqk.fsf@gitster.mtv.corp.google.com/)
is the discussion Junio is referring to.

[...]
> This segmentation is what is driving the need for the object loader
> to build a new local pack file for every command that has to fetch a
> missing object.  For example, we can't just write a tree object from
> a "partial" clone into the loose object store as we have no way for
> fsck to treat them differently and ignore any missing objects
> referenced by that tree object.

That's related and how it got lumped into this proposal, but it's not
the only motivation.

Other aspects:

 1. using pack files instead of loose objects means we can use deltas.
    This is the primary motivation.

 2. pack files can use reachability bitmaps (I realize there are
    obstacles to getting benefit out of this because git's bitmap
    format currently requires a pack to be self-contained, but I
    thought it was worth mentioning for completeness).

 3. existing git servers are oriented around pack files; they can
    more cheaply serve objects from pack files in pack format,
    including reusing deltas from them.

 4. file systems cope better with a few large files than with many
    small files.

[...]
> We all know that git doesn't scale well with a lot of pack files as
> it has to do a linear search through all the pack files when
> attempting to find an object.  I can see that very quickly, there
> would be a lot of pack files generated and with gc ignoring
> "partial" pack files, this would never get corrected.

Yes, that's an important point.  Regardless of this proposal, we need
to get more aggressive about concatenating pack files (e.g. by
implementing exponential rollup in "git gc --auto").

> In our usage scenarios, _all_ of the objects come from "partial"
> clones so all of our objects would end up in a series of "partial"
> pack files and would have pretty poor performance as a result.

Can you say more about this?  Why would the pack files (or loose
objects, for that matter) never end up being consolidated into few
pack files?

[...]
> That thinking did lead me back to wondering again if we could live
> with a repo specific flag.  If any clone/fetch was "partial" the
> flag is set and fsck ignore missing objects whether they came from a
> "partial" remote or not.
>
> I'll admit it isn't as robust if someone is mixing and matching
> remotes from different servers some of which are partial and some of
> which are not.  I'm not sure how often that would actually happen
> but I _am_ certain a single repo specific flag is a _much_ simpler
> model than anything else we've come up with so far.

The primary motivation in this thread is locally-created objects, not
objects obtained from other remotes.  Objects obtained from other
remotes are more of an edge case.

Thanks for your thoughtful comments.

Jonathan


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:12     ` Ben Peart
  2017-08-07 19:21       ` Jonathan Nieder
@ 2017-08-07 19:41       ` Junio C Hamano
  2017-08-08 16:45         ` Ben Peart
  2017-08-07 23:10       ` Jonathan Tan
  2 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-08-07 19:41 UTC (permalink / raw)
  To: Ben Peart; +Cc: Jonathan Tan, git, Jonathan Nieder, christian.couder

Ben Peart <peartben@gmail.com> writes:

> My concern with this proposal is the combination of 1) writing a new
> pack file for every git command that ends up bringing down a missing
> object and 2) gc not compressing those pack files into a single pack
> file.

Your noticing these is a sign that you read the outline of the
design correctly, I think.  

The basic idea is that the local fsck should tolerate missing
objects when they are known to be obtainable from that external
service, but should still be able to diagnose missing objects that
the external service is not known to have, especially the ones that
have been newly created locally and not yet made available to the
service by pushing them back.

So we need a way to tell if an object that we do not have (but we
know about) can later be obtained from the external service.
Maintaining an explicit list of such objects obviously is one way,
but we can get the moral equivalent by using pack files.  After
receiving a pack file that has a commit from such an external
service, if the commit refers to its parent commit that we do not
have locally, the design proposes us to consider that the parent
commit that is missing is available at the external service that
gave the pack to us.  Similarly for missing trees, blobs, and any
objects that are supposed to be "reachable" from objects in such a
packfile.  

We can extend the approach to cover loose objects if we wanted to;
just define an alternate object store used internally for this
purpose and drop loose objects obtained from such an external
service in that object store.

Because we do not want to leave too many loose objects and small
packfiles lying around, we will need a new way of packing these.
Just enumerate these objects known to have come from the external
service (by being in packfiles marked as such or being loose objects
in the dedicated alternate object store), and create a single larger
packfile, which is marked as "holding the objects that are known to
be in the external service".  We do not have such a mode of gc, and
that is a new development that needs to happen, but we know that is
doable.
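
A sketch of that enumeration, in terms of the
FOR_EACH_OBJECT_IMPORTED_ONLY flag from the RFC patch later in this
thread (the callback and the oidset destination are illustrative only):

	static int collect_imported(const struct object_id *oid,
				    struct packed_git *pack,
				    uint32_t pos, void *data)
	{
		oidset_insert((struct oidset *)data, oid);
		return 0;
	}

	/*
	 * for_each_packed_object(collect_imported, &set,
	 *			  FOR_EACH_OBJECT_IMPORTED_ONLY);
	 *
	 * ... then feed the collected objects to pack-objects and
	 * mark the resulting pack as imported.
	 */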

> That thinking did lead me back to wondering again if we could live
> with a repo specific flag.  If any clone/fetch was "partial" the flag
> is set and fsck ignore missing objects whether they came from a
> "partial" remote or not.

The only reason people run "git fsck" is to make sure that their
local repository is sound and they can rely on the objects they have
as the base of building new stuff on top of.  That is why we are
trying to find a way to make sure "fsck" can be used to detect
broken or missing objects that cannot be obtained from the
lazy-object store, without incurring undue overhead for the normal
codepath (i.e. outside fsck).

It is OK to go back to wondering again, but I think that essentially
tosses "git fsck" out of the window and declares that it is OK to
hope that local objects will never go bad.  We can make such an
declaration anytime, but I do not want to see us doing so without
first trying to solve the issue without punting.


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:12     ` Ben Peart
  2017-08-07 19:21       ` Jonathan Nieder
  2017-08-07 19:41       ` Junio C Hamano
@ 2017-08-07 23:10       ` Jonathan Tan
  2 siblings, 0 replies; 18+ messages in thread
From: Jonathan Tan @ 2017-08-07 23:10 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, git, Jonathan Nieder, christian.couder

On Mon, 7 Aug 2017 15:12:11 -0400
Ben Peart <peartben@gmail.com> wrote:

> I missed the offline discussion and so am trying to piece together what 
> this latest design is trying to do.  Please let me know if I'm not 
> understanding something correctly.
> 
> From what I can tell, objects are going to be segmented into two
> "types" - those that were fetched from a remote source that allows 
> partial clones/fetches (lazyobject/imported) and those that come from 
> "regular" remote sources (homegrown) that requires all objects to exist 
> locally.
> 
> FWIW, the names here are not making things clearer for me. If I'm 
> correct perhaps "partial" and "normal" would be better to indicate the 
> type of the source? Anyway...

That's right. As for names, I'm leaning now towards "imported" and
"non-imported". "Partial" is a bit strange because such an object is
fully available; it's just that the objects that it references are
promised by the server.

> Once the objects are segmented into the 2 types, the fsck connectivity 
> check code is updated to ignore missing objects from "partial" remotes 
> but still expect/validate them from "normal" remotes.
> 
> This compromise seems reasonable - don't generate errors for missing 
> objects for remotes that returned a partial clone but do generate errors 
> for missing objects from normal clones as a missing object is always an 
> error in this case.

Yes. In addition, the references of "imported" objects are also
potentially used when connectivity-checking "non-imported" objects - if
a "non-imported" object refers to an object that an "imported" object
refers to, that is fine, even though we don't have that object.

> This segmentation is what is driving the need for the object loader to 
> build a new local pack file for every command that has to fetch a 
> missing object.  For example, we can't just write a tree object from a 
> "partial" clone into the loose object store as we have no way for fsck 
> to treat them differently and ignore any missing objects referenced by 
> that tree object.
> 
> My concern with this proposal is the combination of 1) writing a new 
> pack file for every git command that ends up bringing down a missing 
> object and 2) gc not compressing those pack files into a single pack file.
> 
> We all know that git doesn't scale well with a lot of pack files as it 
> has to do a linear search through all the pack files when attempting to 
> find an object.  I can see that very quickly, there would be a lot of 
> pack files generated and with gc ignoring "partial" pack files, this 
> would never get corrected.
> 
> In our usage scenarios, _all_ of the objects come from "partial" clones 
> so all of our objects would end up in a series of "partial" pack files 
> and would have pretty poor performance as a result.

One possible solution...would support for annotating loose objects with
".remote" be sufficient? (That is, for each loose object file created,
create another of the same name but with ".remote" appended.) This means
that a loose-object-creating lazy loader would need to create 2 files
per object instead of one.

The lazy loader protocol will thus be updated to something resembling a
prior version with the loader writing objects directly to the object
database, but now the loader is also responsible for creating the
".remote" files.  (In the Android use case, we probably won't need the
writing-to-partial-packfile mechanism anymore since only comparatively
few and large blobs will go in there.)
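
For concreteness, creating the marker could be as simple as this sketch
(mark_loose_object_remote() is a hypothetical helper; sha1_file_name()
is Git's existing path-computing function):

	static void mark_loose_object_remote(const unsigned char *sha1)
	{
		struct strbuf path = STRBUF_INIT;
		int fd;

		strbuf_addf(&path, "%s.remote", sha1_file_name(sha1));
		fd = open(path.buf, O_WRONLY | O_CREAT, 0444);
		if (fd >= 0)
			close(fd);
		strbuf_release(&path);
	}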


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:21       ` Jonathan Nieder
@ 2017-08-08 14:18         ` Ben Peart
  0 siblings, 0 replies; 18+ messages in thread
From: Ben Peart @ 2017-08-08 14:18 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, Junio C Hamano, git, christian.couder



On 8/7/2017 3:21 PM, Jonathan Nieder wrote:
> Hi,
> 
> Ben Peart wrote:
>>> On Fri, 04 Aug 2017 15:51:08 -0700
>>> Junio C Hamano <gitster@pobox.com> wrote:
>>>> Jonathan Tan <jonathantanmy@google.com> writes:
> 
>>>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>>>> file with arbitrary text (similar to the ".keep" file). They come from
>>>>> clones, fetches, and the object loader (see below).
>>>>> ...
>>>>>
>>>>> A "homegrown" object is valid if each object it references:
>>>>>   1. is a "homegrown" object,
>>>>>   2. is an "imported" object, or
>>>>>   3. is referenced by an "imported" object.
>>>>
>>>> Overall it captures what was discussed, and I think it is a good
>>>> start.
>>
>> I missed the offline discussion and so am trying to piece together
>> what this latest design is trying to do.  Please let me know if I'm
>> not understanding something correctly.
> 
> I believe
> https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
> and the surrounding thread (especially
> https://public-inbox.org/git/xmqqefsudjqk.fsf@gitster.mtv.corp.google.com/)
> is the discussion Junio is referring to.
> 
> [...]
>> This segmentation is what is driving the need for the object loader
>> to build a new local pack file for every command that has to fetch a
>> missing object.  For example, we can't just write a tree object from
>> a "partial" clone into the loose object store as we have no way for
>> fsck to treat them differently and ignore any missing objects
>> referenced by that tree object.
> 
> That's related and how it got lumped into this proposal, but it's not
> the only motivation.
> 
> Other aspects:
> 
>   1. using pack files instead of loose objects means we can use deltas.
>      This is the primary motivation.
> 
>   2. pack files can use reachability bitmaps (I realize there are
>      obstacles to getting benefit out of this because git's bitmap
>      format currently requires a pack to be self-contained, but I
>      thought it was worth mentioning for completeness).
> 
>   3. existing git servers are oriented around pack files; they can
>      more cheaply serve objects from pack files in pack format,
>      including reusing deltas from them.
> 
>   4. file systems cope better with a few large files than with many
>      small files.
> 
> [...]
>> We all know that git doesn't scale well with a lot of pack files as
>> it has to do a linear search through all the pack files when
>> attempting to find an object.  I can see that very quickly, there
>> would be a lot of pack files generated and with gc ignoring
>> "partial" pack files, this would never get corrected.
> 
> Yes, that's an important point.  Regardless of this proposal, we need
> to get more aggressive about concatenating pack files (e.g. by
> implementing exponential rollup in "git gc --auto").
> 
>> In our usage scenarios, _all_ of the objects come from "partial"
>> clones so all of our objects would end up in a series of "partial"
>> pack files and would have pretty poor performance as a result.
> 
> Can you say more about this?  Why would the pack files (or loose
> objects, for that matter) never end up being consolidated into few
> pack files?
> 

Our initial clone is very sparse - we only pull down the commit we are 
about to checkout and none of the blobs. All missing objects are then 
downloaded on demand (and in this proposal, would end up in a "partial" 
pack file).  For performance reasons, we also (by default) download a 
server-computed pack file of commits and trees to pre-populate the local
cache.

Without modification, fsck, repack, prune, and gc would trigger the
download of every missing object in the repo.  We punted for now and
just block those commands, but eventually they need to be aware of
missing objects so that they do not cause them to be downloaded.
Jonathan is already working on this for fsck in another patch series.

> [...]
>> That thinking did lead me back to wondering again if we could live
>> with a repo specific flag.  If any clone/fetch was "partial" the
>> flag is set and fsck ignore missing objects whether they came from a
>> "partial" remote or not.
>>
>> I'll admit it isn't as robust if someone is mixing and matching
>> remotes from different servers some of which are partial and some of
>> which are not.  I'm not sure how often that would actually happen
>> but I _am_ certain a single repo specific flag is a _much_ simpler
>> model than anything else we've come up with so far.
> 
> The primary motivation in this thread is locally-created objects, not
> objects obtained from other remotes.  Objects obtained from other
> remotes are more of an edge case.
> 

Thank you - that helps me to better understand the requirements of the 
problem we're trying to solve.  In short, that means what we really need 
is a way to identify locally created objects so that fsck can do a 
complete connectivity check on them.  I'll have to think about a good 
way to do that - we've talked about a few but each has a different set 
of trade-offs and none of them are great (yet :)).

> Thanks for your thoughtful comments.
> 
> Jonathan
> 


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:41       ` Junio C Hamano
@ 2017-08-08 16:45         ` Ben Peart
  2017-08-08 17:03           ` Jonathan Nieder
  0 siblings, 1 reply; 18+ messages in thread
From: Ben Peart @ 2017-08-08 16:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, Jonathan Nieder, christian.couder



On 8/7/2017 3:41 PM, Junio C Hamano wrote:
> Ben Peart <peartben@gmail.com> writes:
> 
>> My concern with this proposal is the combination of 1) writing a new
>> pack file for every git command that ends up bringing down a missing
>> object and 2) gc not compressing those pack files into a single pack
>> file.
> 
> Your noticing these is a sign that you read the outline of the
> design correctly, I think.
> 
> The basic idea is that the local fsck should tolerate missing
> objects when they are known to be obtainable from that external
> service, but should still be able to diagnose missing objects that
> the external service is not known to have, especially the ones that
> have been newly created locally and not yet made available to the
> service by pushing them back.
> 

This helps me a lot as now I think I understand the primary requirement 
we're trying to solve for.  Let me rephrase it and see if this makes sense:

We need to be able to identify whether an object was created locally 
(and should pass more strict fsck/connectivity tests) or whether it came 
from a remote (and so any missing objects could presumably be fetched 
from the server).

I agree it would be nice to solve this (and not just punt on fsck - even
if it is an opt-in behavior).

We've discussed a couple of different possible solutions, each of which 
has different tradeoffs.  Let me try to summarize here and perhaps
suggest some other possibilities:


Promised list
-------------
This provides an external data structure that allows us to flag objects
that came from a remote server (vs. created locally).

The biggest drawback is that this data structure can get very large and 
become difficult/expensive to generate/transfer/maintain.

It also (at least in one proposal) required protocol and server side 
changes to support it.


Annotated via filename
----------------------
This idea is to annotate the file names of objects that came from a 
remote server (pack files and loose objects) with a unique file 
extension (.remote) that indicates whether they are locally created or not.

To make this work, git must understand about both types of loose objects 
and pack files and search in both locations when looking for objects.

Another drawback of this is that commands (repack, gc) that optimize 
loose objects and pack files must now be aware of the different 
extensions and handle both while not merging remote and non-remote objects.

In short, we're creating separate object stores - one for locally 
created objects and one for everything else.


Now a couple of different ideas:

Annotated via flags
-------------------
The fundamental idea here is that we add the ability to flag locally 
created objects on the object itself.

Given that, at its core, "Git is a simple key-value data store", can we
take advantage of that fact and include a "locally created" bit as a
property on every object?

I could not think of a good way to accomplish this as it is ultimately 
changing the object format which creates rapidly expanding ripples of 
change.

For example, the object header currently includes the type, a space, the
length, and a NUL.  Even if we could add a "local" property (either by
adding a fifth item, taking over the space, creating new object types,
etc.), the fact that the header is included in the sha1 means that push
would become problematic, as flipping the bit would change the sha1 of
the object and of the trees and commits that reference it.
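
For reference, here is a sketch of the hashing I'm describing, modeled
on Git's hash_sha1_file(); it shows why the header cannot carry a
mutable "local" bit:

	static void object_name_sketch(enum object_type type,
				       const void *buf, unsigned long len,
				       unsigned char *sha1)
	{
		git_SHA_CTX c;
		char hdr[32];
		int hdrlen;

		/* header: "<type> <size>" plus the trailing NUL */
		hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %lu",
				   typename(type), len) + 1;

		git_SHA1_Init(&c);
		git_SHA1_Update(&c, hdr, hdrlen); /* header is hashed */
		git_SHA1_Update(&c, buf, len);    /* then the payload */
		git_SHA1_Final(sha1, &c);
	}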


Local list
----------
Given the number of locally created objects is usually very small in 
comparison to the total number of objects (even just due to history), it 
makes more sense to track locally created objects instead of 
promised/remote objects.

The biggest advantage of this over the "promised list" is that the 
"local list" being maintained is _significantly_ smaller (often orders 
of magnitude smaller).

Another advantage over the "promised list" solution is that it doesn't 
require any server side or protocol changes.

On the client when objects are created (write_loose_object?) the new 
objects are added to the "local list" and in the connectivity check 
(fsck) if the object is not in the "local list," the connectivity check 
can be skipped as any missing object can presumably be retrieved from 
the server.

A simple file format could be used (header + list of SHA1 values) and 
write_loose_object could do a trivial append. In fsck, the file could be 
loaded into a hashmap to make for fast existence checks.

Entries could be removed from the "local list" for objects later fetched 
from a server (though I had a hard time contriving a scenario where this 
would happen so I consider this optional).

On the surface, this seems like the simplest solution that meets the 
stated requirements.
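
A sketch of the two halves (the file name and one-hex-name-per-line
format are hypothetical):

	/* at object creation time, e.g. from write_loose_object() */
	static void local_list_append(const struct object_id *oid)
	{
		FILE *fp = fopen(git_path("local-list"), "a");

		if (fp) {
			fprintf(fp, "%s\n", oid_to_hex(oid));
			fclose(fp);
		}
	}

	/* in fsck, load once into an oidset for fast existence checks */
	static void local_list_load(struct oidset *set)
	{
		struct strbuf line = STRBUF_INIT;
		struct object_id oid;
		FILE *fp = fopen(git_path("local-list"), "r");

		if (!fp)
			return;
		while (strbuf_getline(&line, fp) != EOF)
			if (!get_oid_hex(line.buf, &oid))
				oidset_insert(set, &oid);
		fclose(fp);
		strbuf_release(&line);
	}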


Object DB
---------
This is a different way of providing separate object stores than the 
"Annotated via filename" proposal. It should be a cleaner/more elegant 
solution that enables several other capabilities but it is also more 
work to implement (isn't that always the case?).

We create an object store abstraction layer that enables multiple object 
store providers to exist. The order that they are called should be 
configurable based on the command (esp have/read vs create/write). This 
enables features like tiered storage: in memory, pack, loose, alternate, 
large, remote.

The connectivity check in fsck would then only traverse and validate 
objects that existed via the local object store providers.

While I like the flexibility of this design and hope we can obtain it in 
the long term for its other benefits, it's a bit overkill for this
specific problem. The big drawback of this model is the cost to design 
and implement it.



* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-08 16:45         ` Ben Peart
@ 2017-08-08 17:03           ` Jonathan Nieder
  0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Nieder @ 2017-08-08 17:03 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, Jonathan Tan, git, christian.couder

Hi,

Ben Peart wrote:

> We've discussed a couple of different possible solutions, each of
> which has different tradeoffs.  Let me try to summarize here and
> perhaps suggest some other possibilities:

Thanks for this.  Some comments below.

> Promised list
> -------------
> This provides an external data structure that allows us to flag
> objects that came from a remote server (vs. created locally).
>
> The biggest drawback is that this data structure can get very large
> and become difficult/expensive to generate/transfer/maintain.

Agreed.  Using a single immutable file to maintain this data with
lock-and-rename update means that I/O when updating it can be a
bottleneck and that contention can be a problem.

> It also (at least in one proposal) required protocol and server side
> changes to support it.

I don't think that's a very big problem.  This is the Git project: we
control the protocol and the server.  Partial clone requires changing
the protocol and server already.

> Annotated via filename
> ----------------------
> This idea is to annotate the file names of objects that came from a
> remote server (pack files and loose objects) with a unique file
> extension (.remote) that indicates whether they are locally created
> or not.
>
> To make this work, git must understand about both types of loose
> objects and pack files and search in both locations when looking for
> objects.

I don't understand the drawback you're describing here.  To avoid a
number of serious problems, Git already needs to be aware of partial
clone.  I don't think anyone has been proposing adding partial clone
to upstream Git without a repository format extension (see
Documentation/technical/repository-version.txt) to prevent older
versions of Git from being confused about such repositories.

If you don't do this, some problems include
- confusing messages due to missing objects
- errors over the wire protocol from trying to serve fetches and
  getting confused
- "git gc" running and not knowing which objects are safe to be
  deleted

So the relevant issue couldn't be that Git has to be changed at all:
it would be that a change is excessively invasive.

But it's not clear to me that the change you are describing is very
invasive.

> Another drawback of this is that commands (repack, gc) that optimize
> loose objects and pack files must now be aware of the different
> extensions and handle both while not merging remote and non-remote
> objects.
>
> In short, we're creating separate object stores - one for locally
> created objects and one for everything else.

These also seem like non-issues.

Some examples of problems I could imagine:

- is writing multiple files when writing a loose object a problem for
  your setup?
- is one of the operations described (repack, prune, fsck) too slow?

Do you forsee either of those being an issue?

> Now a couple of different ideas:
>
> Annotated via flags
> -------------------
> The fundamental idea here is that we add the ability to flag locally
> created objects on the object itself.

Do you mean changing the underlying object format that produces an
object's object id?  Or do you mean changing the container format?

Changing the container format is exactly what was described in the
previous example ("Annotated via filename").  There are other ways to
change the container format: e.g. if writing multiple files when
writing a loose object is a problem, we could add a field that does
not affect the object id to the loose object format.

[...]
> Local list
> ----------
> Given the number of locally created objects is usually very small in
> comparison to the total number of objects (even just due to
> history), it makes more sense to track locally created objects
> instead of promised/remote objects.
>
> The biggest advantage of this over the "promised list" is that the
> "local list" being maintained is _significantly_ smaller (often
> orders of magnitude smaller).
[...]
> On the surface, this seems like the simplest solution that meets the
> stated requirements.

This has the same problems as the list of promised objects: excessive
I/O and contention when updating the list.

Moreover, it doesn't bring one of the main benefits of the list of
promised objects.  Promised objects are not present in the local
repository, so the list of promises provided a way to maintain some
information about them (e.g., object size).  Locally created objects
are present in the local repository so they don't need such metadata.

> Object DB
> ---------

If I understand correctly, this is pushing the issues described in the
other cases into a hook and making them not upstream Git's problem.

But it is still someone's problem.  It just means upstream Git doesn't
benefit from their solution to it.

I don't see a need to give up in that way just yet.

I'm also available on #git-devel on freenode.net for real-time
conversation.  Logs are at http://bit.ly/aLzrmv.  You can prepend a
message with "[off]" to prevent it from showing up in logs.  I'm also
happy to try to summarize any conversation that happens there here.

Thanks,
Jonathan


* [RFC PATCH] Updated "imported object" design
  2017-08-04 21:51 Partial clone design (with connectivity check for locally-created objects) Jonathan Tan
  2017-08-04 22:51 ` Junio C Hamano
@ 2017-08-16  0:32 ` Jonathan Tan
  2017-08-16 20:32   ` Junio C Hamano
  2017-08-17 20:07   ` Ben Peart
  1 sibling, 2 replies; 18+ messages in thread
From: Jonathan Tan @ 2017-08-16  0:32 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, jrnieder, peartben, gitster

This patch is based on an updated version of my refactoring of
pack-related functions [1].

This corresponds to patch 1 of my previous series [2]. I'm sending this
patch out (before I update the rest) for 2 reasons:
 * to provide a demonstration of how the feature could be implemented,
   in the hope of restarting the discussion
 * to obtain comments about this patch to see if I'm heading in the
   right direction

In an earlier e-mail [3], I suggested that loose objects can also be
marked as ".imported" (formerly ".remote" - after looking at the code, I
decided to use "imported" throughout, since "remote" can be easily
confused as the opposite of "local", used to represent objects in the
local store as opposed to an alternate store).

However, I have only implemented the imported packed objects part -
imported loose objects can be added later.

It still remains to be discussed whether we should mark the imported
objects or the non-imported objects as the source of promises, but I
still think that we should mark the imported objects. In this way, the
new properties (the provision of promises and the mark) coincide on the
same object, and the same things (locally created objects, fetches from
non-lazy-object-serving remotes) behave in the same way regardless of
whether extensions.lazyObject is set (allowing, for example, a repo to
be converted into a promise-enabled one solely through modifying the
configuration).

Also, let me know if there's a better way to send out these patches for
review. Some of the code here has been reviewed before, for example.

[1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@google.com/

[2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@google.com/

[3] https://public-inbox.org/git/20170807161031.7c4eae50@twelve2.svl.corp.google.com/
-- 8< --
environment, fsck: introduce lazyobject extension

Currently, Git does not cope well with repos that have very large
numbers of objects, or with repos that wish to minimize manipulation of
certain blobs (for example, because they are very large), even if the
user operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage. In such an arrangement, the full set of objects is usually
available in remote storage, ready to be lazily downloaded.

Introduce the concept of promises, objects believed by the local repo to
be downloadable from remote storage. An object is a promise if it is
referred to by an object in a specially marked packfile. Any such
promise can be validly referred to, even if the object itself is not in
the local repo.

This functionality is guarded behind a new repository extension option
`extensions.lazyObject`. The value of `extensions.lazyObject` must be a
string. The meaning of this string will be defined in a subsequent
commit.

Teach fsck about the new state of affairs. In this commit, teach fsck
that promises referenced from the reflog are not an error case; in
future commits, fsck will be taught about other cases.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/repository-version.txt |  8 +++
 builtin/fsck.c                                 | 64 +++++++++++++++++++-
 cache.h                                        |  5 +-
 environment.c                                  |  1 +
 pack.h                                         |  1 +
 packfile.c                                     | 13 +++-
 setup.c                                        |  7 ++-
 t/t0410-lazy-object.sh                         | 84 ++++++++++++++++++++++++++
 8 files changed, 177 insertions(+), 6 deletions(-)
 create mode 100755 t/t0410-lazy-object.sh

diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986..71cb3bfee 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,11 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`lazyObject`
+~~~~~~~~~~~~
+
+When the config key `extensions.lazyObject` is set, Git does not treat
+missing objects as errors. The value of `extensions.lazyObject` must be
+a string. NEEDSWORK: define what this string contains when the
+appropriate functionality is implemented.
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 99dea7adf..25265b1fe 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
 #include "progress.h"
 #include "streaming.h"
 #include "decorate.h"
+#include "oidset.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -43,6 +44,67 @@ static int name_objects;
 #define ERROR_PACK 04
 #define ERROR_REFS 010
 
+/*
+ * Objects that are believed to be loadable by the lazy loader, because
+ * they are referred to by an imported object. If an object that we have
+ * refers to such an object even though we don't have that object, it is
+ * not an error.
+ */
+static struct oidset promises;
+static int promises_prepared;
+
+static int add_promise(const struct object_id *oid, struct packed_git *pack,
+		       uint32_t pos, void *data)
+{
+	struct object *obj = parse_object(oid);
+	if (!obj)
+		/*
+		 * Error messages are given when packs are verified, so
+		 * do not print any here.
+		 */
+		return 0;
+
+	/*
+	 * If this is a tree, commit, or tag, the objects it refers
+	 * to are promises. (Blobs refer to no objects.)
+	 */
+	if (obj->type == OBJ_TREE) {
+		struct tree *tree = (struct tree *) obj;
+		struct tree_desc desc;
+		struct name_entry entry;
+		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
+			/*
+			 * Error messages are given when packs are
+			 * verified, so do not print any here.
+			 */
+			return 0;
+		while (tree_entry_gently(&desc, &entry))
+			oidset_insert(&promises, entry.oid);
+	} else if (obj->type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit *) obj;
+		struct commit_list *parents = commit->parents;
+
+		oidset_insert(&promises, &commit->tree->object.oid);
+		for (; parents; parents = parents->next)
+			oidset_insert(&promises, &parents->item->object.oid);
+	} else if (obj->type == OBJ_TAG) {
+		struct tag *tag = (struct tag *) obj;
+		oidset_insert(&promises, &tag->tagged->oid);
+	}
+	return 0;
+}
+
+static int is_promise(const struct object_id *oid)
+{
+	if (!promises_prepared) {
+		if (repository_format_lazy_object)
+			for_each_packed_object(add_promise, NULL,
+					       FOR_EACH_OBJECT_IMPORTED_ONLY);
+		promises_prepared = 1;
+	}
+	return oidset_contains(&promises, oid);
+}
+
 static const char *describe_object(struct object *obj)
 {
 	static struct strbuf buf = STRBUF_INIT;
@@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
 					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
 			obj->used = 1;
 			mark_object_reachable(obj);
-		} else {
+		} else if (!is_promise(oid)) {
 			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
 			errors_found |= ERROR_REACHABLE;
 		}
diff --git a/cache.h b/cache.h
index b15645672..f529096c8 100644
--- a/cache.h
+++ b/cache.h
@@ -853,10 +853,12 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_lazy_object;
 
 struct repository_format {
 	int version;
 	int precious_objects;
+	char *lazy_object;
 	int is_bare;
 	char *work_tree;
 	struct string_list unknown_extensions;
@@ -1584,7 +1586,8 @@ extern struct packed_git {
 	unsigned pack_local:1,
 		 pack_keep:1,
 		 freshened:1,
-		 do_not_close:1;
+		 do_not_close:1,
+		 pack_imported:1;
 	unsigned char sha1[20];
 	struct revindex_entry *revindex;
 	/* something like ".git/objects/pack/xxxxx.pack" */
diff --git a/environment.c b/environment.c
index 3fd4b1084..cd8ef2897 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_lazy_object;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/pack.h b/pack.h
index 6aae1a7c3..7c196c6cd 100644
--- a/pack.h
+++ b/pack.h
@@ -229,6 +229,7 @@ extern int has_pack_index(const unsigned char *sha1);
  * repository and any alternates repositories (unless the
  * FOR_EACH_OBJECT_LOCAL_ONLY flag, defined in cache.h, is set).
  */
+#define FOR_EACH_OBJECT_IMPORTED_ONLY 2
 typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
 				  uint32_t pos,
diff --git a/packfile.c b/packfile.c
index 2f008ede7..90028f9af 100644
--- a/packfile.c
+++ b/packfile.c
@@ -637,10 +637,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 		return NULL;
 
 	/*
-	 * ".pack" is long enough to hold any suffix we're adding (and
+	 * ".imported" is long enough to hold any suffix we're adding (and
 	 * the use xsnprintf double-checks that)
 	 */
-	alloc = st_add3(path_len, strlen(".pack"), 1);
+	alloc = st_add3(path_len, strlen(".imported"), 1);
 	p = alloc_packed_git(alloc);
 	memcpy(p->pack_name, path, path_len);
 
@@ -648,6 +648,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_keep = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".imported");
+	if (!access(p->pack_name, F_OK))
+		p->pack_imported = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -775,7 +779,8 @@ static void prepare_packed_git_one(char *objdir, int local)
 		if (ends_with(de->d_name, ".idx") ||
 		    ends_with(de->d_name, ".pack") ||
 		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep"))
+		    ends_with(de->d_name, ".keep") ||
+		    ends_with(de->d_name, ".imported"))
 			string_list_append(&garbage, path.buf);
 		else
 			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
@@ -1893,6 +1898,8 @@ int for_each_packed_object(each_packed_object_fn cb, void *data, unsigned flags)
 	for (p = packed_git; p; p = p->next) {
 		if ((flags & FOR_EACH_OBJECT_LOCAL_ONLY) && !p->pack_local)
 			continue;
+		if ((flags & FOR_EACH_OBJECT_IMPORTED_ONLY) && !p->pack_imported)
+			continue;
 		if (open_pack_index(p)) {
 			pack_errors = 1;
 			continue;
diff --git a/setup.c b/setup.c
index 860507e1f..94cfde3cc 100644
--- a/setup.c
+++ b/setup.c
@@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 			;
 		else if (!strcmp(ext, "preciousobjects"))
 			data->precious_objects = git_config_bool(var, value);
-		else
+		else if (!strcmp(ext, "lazyobject")) {
+			if (!value)
+				return config_error_nonbool(var);
+			data->lazy_object = xstrdup(value);
+		} else
 			string_list_append(&data->unknown_extensions, ext);
 	} else if (strcmp(var, "core.bare") == 0) {
 		data->is_bare = git_config_bool(var, value);
@@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 	}
 
 	repository_format_precious_objects = candidate.precious_objects;
+	repository_format_lazy_object = candidate.lazy_object;
 	string_list_clear(&candidate.unknown_extensions, 0);
 	if (!has_common) {
 		if (candidate.is_bare != -1) {
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
new file mode 100755
index 000000000..2368150d4
--- /dev/null
+++ b/t/t0410-lazy-object.sh
@@ -0,0 +1,84 @@
+#!/bin/sh
+
+test_description='lazy object'
+
+. ./test-lib.sh
+
+delete_object () {
+	rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
+}
+
+pack_imported_object () {
+	printf "%s\n" "$1" | git -C repo pack-objects .git/objects/pack/pack &&
+	(
+		cd repo/.git/objects/pack &&
+		>"$(basename *.pack .pack).imported"
+	)
+}
+
+test_expect_success 'missing reflog object referred to by imported commit passes fsck' '
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	C=$(git -C repo commit-tree -m c -p $A HEAD^{tree}) &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	# Designate $C, which refers to $A, as an imported object
+	pack_imported_object "$C" &&
+
+	# Normally, it fails
+	test_must_fail git -C repo fsck &&
+
+	# But with the extension, it succeeds
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_expect_success 'missing reflog object referred to by imported tag passes fsck' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	git -C repo tag -a -m d my_tag_name $A &&
+	T=$(git -C repo rev-parse my_tag_name) &&
+	git -C repo tag -d my_tag_name &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	# Designate $T, which refers to $A, as an imported object
+	pack_imported_object "$T" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_expect_success 'missing reflog object alone fails fsck, even with extension set' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	test_must_fail git -C repo fsck
+'
+
+test_done
-- 
2.14.1.480.gb18f417b89-goog



* Re: [RFC PATCH] Updated "imported object" design
  2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
@ 2017-08-16 20:32   ` Junio C Hamano
  2017-08-16 21:35     ` Jonathan Tan
  2017-08-17 20:07   ` Ben Peart
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-08-16 20:32 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, jrnieder, peartben

Jonathan Tan <jonathantanmy@google.com> writes:

> Also, let me know if there's a better way to send out these patches for
> review. Some of the code here has been reviewed before, for example.
>
> [1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@google.com/
>
> [2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@google.com/
>
> [3] https://public-inbox.org/git/20170807161031.7c4eae50@twelve2.svl.corp.google.com/

... and some of the code exists only in the list archive, so we
don't know which other topic if any we may want to eject tentatively
if we wanted to give precedence to move this topic forward over
others.  I'll worry about it later but help from others is also
appreciated.

As to the contents of this patch, overall, everything makes sense,
except for one thing that makes me wonder.  It's not that I see
something specifically incorrect--it is just that I do not yet quite
fathom its implications.

> +/*
> + * Objects that are believed to be loadable by the lazy loader, because
> + * they are referred to by an imported object. If an object that we have
> + * refers to such an object even though we don't have that object, it is
> + * not an error.
> + */
> +static struct oidset promises;
> +static int promises_prepared;
> +
> +static int add_promise(const struct object_id *oid, struct packed_git *pack,
> +		       uint32_t pos, void *data)
> +{
> +	struct object *obj = parse_object(oid);
> +	if (!obj)
> +		/*
> +		 * Error messages are given when packs are verified, so
> +		 * do not print any here.
> +		 */
> +		return 0;
> +	
> +	/*
> +	 * If this is a tree, commit, or tag, the objects it refers
> +	 * to are promises. (Blobs refer to no objects.)
> +	 */
> +	if (obj->type == OBJ_TREE) {
> +		struct tree *tree = (struct tree *) obj;
> +		struct tree_desc desc;
> +		struct name_entry entry;
> +		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
> +			/*
> +			 * Error messages are given when packs are
> +			 * verified, so do not print any here.
> +			 */
> +			return 0;
> +		while (tree_entry_gently(&desc, &entry))
> +			oidset_insert(&promises, entry.oid);
> +	} else if (obj->type == OBJ_COMMIT) {
> +		struct commit *commit = (struct commit *) obj;
> +		struct commit_list *parents = commit->parents;
> +
> +		oidset_insert(&promises, &commit->tree->object.oid);
> +		for (; parents; parents = parents->next)
> +			oidset_insert(&promises, &parents->item->object.oid);
> +	} else if (obj->type == OBJ_TAG) {
> +		struct tag *tag = (struct tag *) obj;
> +		oidset_insert(&promises, &tag->tagged->oid);
> +	}
> +	return 0;
> +}

This collects names of the objects that are _directly_ referred to
by imported objects.  An imported pack may have a commit, whose
top-level tree may or may not appear in the same pack, or the tree
may exist locally but not in the same pack.  Or the tree may not be
locally available at all.  In any of these four cases, the top-level
tree is listed in the "promises" set.  Same for trees and tags.

I wonder if all of the calls to oidset_insert() in this function
want to be guarded by "mark it as promised only when the referent
is *not* locally available" to keep the promises set minimally
populated.  The only change needed to fsck in order to make it
refrain from treating a missing but promised object as an error
would be:

        -       if (object is missing)
        +       if (object is missing && object is not promised)
                        error("that object must be there but missing");

so there is no point in throwing something that we know we locally
have in this oidset, right?

On the other hand, cost of such additional checks in this function
may outweigh the savings of both memory pressure and look-up cost,
so I do not know how the tradeoff would turn out.
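
For illustration, the guard would just be an existence check in
front of each insertion, along these lines (an untested sketch; I
am assuming has_object_file() is the right check to use here):

        while (tree_entry_gently(&desc, &entry))
                if (!has_object_file(entry.oid))
                        oidset_insert(&promises, entry.oid);

and the same for the commit and tag cases.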

> +static int is_promise(const struct object_id *oid)
> +{
> +	if (!promises_prepared) {
> +		if (repository_format_lazy_object)
> +			for_each_packed_object(add_promise, NULL,
> +					       FOR_EACH_OBJECT_IMPORTED_ONLY);
> +		promises_prepared = 1;
> +	}
> +	return oidset_contains(&promises, oid);
> +}

Somehow I'm tempted to call this function "is_promised()" but that
is a minor naming issue.

>  static const char *describe_object(struct object *obj)
>  {
>  	static struct strbuf buf = STRBUF_INIT;
> @@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
>  					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
>  			obj->used = 1;
>  			mark_object_reachable(obj);
> -		} else {
> +		} else if (!is_promise(oid)) {
>  			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
>  			errors_found |= ERROR_REACHABLE;
>  		}

This is almost certainly one place we want to check if the missing
object is OK, but I would be a bit surprised if this were the only place.

Don't we need "while trying to follow all the outgoing links from
this tree object, and we found this object is not available locally;
normally we would mark it as an error but it turns out that the
missing one is in the promised set of objects, so it is OK" for the
normal connectivity traversal codepaths, for example?
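
For instance, wherever the traversal discovers that an outgoing
link points at an object we do not have, something of this shape
(illustrative only; I did not check which function this would
actually land in):

        obj = parse_object(oid);
        if (!obj) {
                if (is_promise(oid))
                        return 0; /* missing but promised, so OK */
                return error("broken link to %s", oid_to_hex(oid));
        }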



* Re: [RFC PATCH] Updated "imported object" design
  2017-08-16 20:32   ` Junio C Hamano
@ 2017-08-16 21:35     ` Jonathan Tan
  2017-08-17 20:50       ` Ben Peart
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Tan @ 2017-08-16 21:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, jrnieder, peartben

On Wed, 16 Aug 2017 13:32:23 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > Also, let me know if there's a better way to send out these patches for
> > review. Some of the code here has been reviewed before, for example.
> >
> > [1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@google.com/
> >
> > [2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@google.com/
> >
> > [3] https://public-inbox.org/git/20170807161031.7c4eae50@twelve2.svl.corp.google.com/
> 
> ... and some of the code exists only in the list archive, so we
> don't know which other topic if any we may want to eject tentatively
> if we wanted to give precedence to move this topic forward over
> others.  I'll worry about it later but help from others is also
> appreciated.

Thanks - I can help take a look when it is time to move the code in.

I think the issue here is whether we want to move this topic forward or
not, that is, if this (special ".imported" objects) is the best way to
solve (at least partially) the connectivity check part of tolerating
missing objects. I hope that we can continue to talk about it.

> This collects names of the objects that are _directly_ referred to
> by imported objects.  An imported pack may have a commit, whose
> top-level tree may or may not appear in the same pack, or the tree
> may exist locally but not in the same pack.  Or the tree may not be
> locally available at all.  In any of these four cases, the top-level
> tree is listed in the "promises" set.  Same for trees and tags.
> 
> I wonder if all of the calls to oidset_insert() in this function
> want to be guarded by "mark it as promised only when the referent
> is *not* locally available" to keep the promises set minimally
> populated.  The only change needed to fsck in order to make it
> refrain from treating a missing but promised object as an error
> would be:
> 
>         -       if (object is missing)
>         +       if (object is missing && object is not promised)
>                         error("that object must be there but missing");
> 
> so there is no point in throwing something that we know we locally
> have in this oidset, right?
> 
> On the other hand, cost of such additional checks in this function
> may outweigh the savings of both memory pressure and look-up cost,
> so I do not know how the tradeoff would turn out.

I also don't know how the tradeoff would turn out, so I leaned towards
the slightly simpler solution of not doing the check. In the future,
maybe a t/perf test can be done to decide between the two.

> > +static int is_promise(const struct object_id *oid)
> > +{
> > +	if (!promises_prepared) {
> > +		if (repository_format_lazy_object)
> > +			for_each_packed_object(add_promise, NULL,
> > +					       FOR_EACH_OBJECT_IMPORTED_ONLY);
> > +		promises_prepared = 1;
> > +	}
> > +	return oidset_contains(&promises, oid);
> > +}
> 
> Somehow I'm tempted to call this function "is_promised()" but that
> is a minor naming issue.

I was trying to be consistent in using the name "promise" instead of
"promised object/tag/commit/tree/blob" everywhere, but we can switch if
need be (for example, if we don't want to limit the generic name
"promise" to merely objects).

> >  static const char *describe_object(struct object *obj)
> >  {
> >  	static struct strbuf buf = STRBUF_INIT;
> > @@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
> >  					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
> >  			obj->used = 1;
> >  			mark_object_reachable(obj);
> > -		} else {
> > +		} else if (!is_promise(oid)) {
> >  			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
> >  			errors_found |= ERROR_REACHABLE;
> >  		}
> 
> This is almost certainly one place we want to check if the missing
> object is OK, but I would be a bit surprised if this were the only place.
> 
> Don't we need "while trying to follow all the outgoing links from
> this tree object, and we found this object is not available locally;
> normally we would mark it as an error but it turns out that the
> missing one is in the promised set of objects, so it is OK" for the
> normal connectivity traversal codepaths, for example?

That's right. The places to make this change are the same as those in
some earlier patches I sent (patches 2-4 in [1]).

[1] https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/


* Re: [RFC PATCH] Updated "imported object" design
  2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
  2017-08-16 20:32   ` Junio C Hamano
@ 2017-08-17 20:07   ` Ben Peart
  1 sibling, 0 replies; 18+ messages in thread
From: Ben Peart @ 2017-08-17 20:07 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: jrnieder, gitster



On 8/15/2017 8:32 PM, Jonathan Tan wrote:
> This patch is based on an updated version of my refactoring of
> pack-related functions [1].
> 
> This corresponds to patch 1 of my previous series [2]. I'm sending this
> patch out (before I update the rest) for 2 reasons:
>   * to provide a demonstration of how the feature could be implemented,
>     in the hope of restarting the discussion
>   * to obtain comments about this patch to see if I'm heading in the
>     right direction
> 
> In an earlier e-mail [3], I suggested that loose objects can also be
> marked as ".imported" (formerly ".remote" - after looking at the code, I
> decided to use "imported" throughout, since "remote" can be easily
> confused as the opposite of "local", used to represent objects in the
> local store as opposed to an alternate store).
> 
> However, I have only implemented the imported packed objects part -
> imported loose objects can be added later.
> 
> It still remains to be discussed whether we should mark the imported
> objects or the non-imported objects as the source of promises, but I
> still think that we should mark the imported objects. In this way, the
> new properties (the provision of promises and the mark) coincide on the
> same object, and the same things (locally created objects, fetches from
> non-lazy-object-serving remotes) behave in the same way regardless of
> whether extensions.lazyObject is set (allowing, for example, a repo to
> be converted into a promise-enabled one solely through modifying the
> configuration).
> 

This illustrates another place we need to resolve the naming/vocabulary. 
  We should at least be consistent to make it easier to discuss/explain. 
  We obviously went with "virtual" when building GVFS but I'm OK with 
"lazy" as long as we're consistent.  Some examples of how the naming can 
clarify or confuse:

'Promise-enable your repo by setting the "extensions.lazyObject" flag'

'Enable your repo to lazily fetch objects by setting the 
"extensions.lazyObject"'

'Virtualize your repo by setting the "extensions.virtualize" flag'

We may want to carry the same name into the filename we use to mark the 
(virtualized/lazy/promised/imported) objects.

(This reminds me that there are only 2 hard problems in computer 
science...) ;)


It is true that converting a normal repo to a 
virtualized/lazy/promise-enabled repo wouldn't have to munge the 
object marking, but since this only happens once, I wouldn't make that 
the primary decision maker (and switching back would require munging 
the objects anyway).  I think we can make either work (ie tagging 
local vs non-local/remote/imported/promised/lazy/virtual).

I'm fine leaving "how" the objects are marked as an implementation 
detail - the design requires them to be marked; the person writing the 
code can figure out the fastest way to accomplish that.

In the current patch/design, the access time to check for the existence 
of an "imported" file is going to be dwarfed by the cost of opening and 
parsing every object (loose and packed) to get its oid, so that isn't 
really an issue.

> +/*
> + * Objects that are believed to be loadable by the lazy loader, because
> + * they are referred to by an imported object. If an object that we have
> + * refers to such an object even though we don't have that object, it is
> + * not an error.
> + */
> +static struct oidset promises;
> +static int promises_prepared;
> +
> +static int add_promise(const struct object_id *oid, struct packed_git *pack,
> +		       uint32_t pos, void *data)
> +{
> +	struct object *obj = parse_object(oid);
> +	if (!obj)
> +		/*
> +		 * Error messages are given when packs are verified, so
> +		 * do not print any here.
> +		 */
> +		return 0;
> +	
> +	/*
> +	 * If this is a tree, commit, or tag, the objects it refers
> +	 * to are promises. (Blobs refer to no objects.)
> +	 */
> +	if (obj->type == OBJ_TREE) {
> +		struct tree *tree = (struct tree *) obj;
> +		struct tree_desc desc;
> +		struct name_entry entry;
> +		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
> +			/*
> +			 * Error messages are given when packs are
> +			 * verified, so do not print any here.
> +			 */
> +			return 0;
> +		while (tree_entry_gently(&desc, &entry))
> +			oidset_insert(&promises, entry.oid);
> +	} else if (obj->type == OBJ_COMMIT) {
> +		struct commit *commit = (struct commit *) obj;
> +		struct commit_list *parents = commit->parents;
> +
> +		oidset_insert(&promises, &commit->tree->object.oid);
> +		for (; parents; parents = parents->next)
> +			oidset_insert(&promises, &parents->item->object.oid);
> +	} else if (obj->type == OBJ_TAG) {
> +		struct tag *tag = (struct tag *) obj;
> +		oidset_insert(&promises, &tag->tagged->oid);
> +	}
> +	return 0;
> +}
> +
> +static int is_promise(const struct object_id *oid)
> +{
> +	if (!promises_prepared) {
> +		if (repository_format_lazy_object)
> +			for_each_packed_object(add_promise, NULL,
> +					       FOR_EACH_OBJECT_IMPORTED_ONLY);
> +		promises_prepared = 1;
> +	}
> +	return oidset_contains(&promises, oid);
> +}
> +

I think this all works and would meet the requirements we've been 
discussing.  The big trade off here vs what we first discussed with 
promises is that we are generating the list of promises on the fly when 
they are needed rather than downloading and maintaining a list locally.

My biggest concern with this model is the cost of opening and parsing 
every imported object (loose and pack for local and alternates) to build 
the oidset of promises.

In fsck this probably won't be an issue as it already focuses on 
correctness at the expense of speed.  I'm more worried about when we add 
the same/similar logic into check_connected.  That impacts fetch, clone, 
and receive_pack.

I guess the only way we can know for sure is to do a perf test and 
measure the impact.


* Re: [RFC PATCH] Updated "imported object" design
  2017-08-16 21:35     ` Jonathan Tan
@ 2017-08-17 20:50       ` Ben Peart
  2017-08-17 21:39         ` Jonathan Tan
  0 siblings, 1 reply; 18+ messages in thread
From: Ben Peart @ 2017-08-17 20:50 UTC (permalink / raw)
  To: Jonathan Tan, Junio C Hamano; +Cc: git, jrnieder



On 8/16/2017 5:35 PM, Jonathan Tan wrote:
> On Wed, 16 Aug 2017 13:32:23 -0700
> Junio C Hamano <gitster@pobox.com> wrote:
> 
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> Also, let me know if there's a better way to send out these patches for
>>> review. Some of the code here has been reviewed before, for example.
>>>
>>> [1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@google.com/
>>>
>>> [2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@google.com/
>>>
>>> [3] https://public-inbox.org/git/20170807161031.7c4eae50@twelve2.svl.corp.google.com/
>>
>> ... and some of the code exists only in the list archive, so we
>> don't know which other topic if any we may want to eject tentatively
>> if we wanted to give precedence to move this topic forward over
>> others.  I'll worry about it later but help from others is also
>> appreciated.
> 
> Thanks - I can help take a look when it is time to move the code in.
> 

I agree that having this depend on patches elsewhere in the list archive 
makes it more difficult to review.  I know I like to see things in 
context to get a better picture.

> I think the issue here is whether we want to move this topic forward or
> not, that is, if this (special ".imported" objects) is the best way to
> solve (at least partially) the connectivity check part of tolerating
> missing objects. I hope that we can continue to talk about it.
> 

I think this topic should continue to move forward so that we can 
provide reasonable connectivity tests for fsck and check_connected in 
the face of partial clones.  I'm not sure the prototype implementation 
of reading/parsing all imported objects to build the promised oidset is 
the most performant model but we can continue to investigate the best 
options.

>> This collects names of the objects that are _directly_ referred to
>> by imported objects.  An imported pack may have a commit, whose
>> top-level tree may or may not appear in the same pack, or the tree
>> may exist locally but not in the same pack.  Or the tree may not be
>> locally available at all.  In any of these four cases, the top-level
>> tree is listed in the "promises" set.  Same for trees and tags.
>>
>> I wonder if all of the calls to oidset_insert() in this function
>> want to be guarded by "mark it as promised only when the referent
>> is *not* locally available" to keep the promises set minimally
>> populated.  The only change needed to fsck in order to make it
>> refrain from treating a missing but promised object as an error
>> would be:
>>
>>          -       if (object is missing)
>>          +       if (object is missing && object is not promised)
>>                          error("that object must be there but missing");
>>
>> so there is no point in throwing something that we know we locally
>> have in this oidset, right?
>>
>> On the other hand, cost of such additional checks in this function
>> may outweigh the savings of both memory pressure and look-up cost,
>> so I do not know how the tradeoff would turn out.
> 
> I also don't know how the tradeoff would turn out, so I leaned towards
> the slightly simpler solution of not doing the check. In the future,
> maybe a t/perf test can be done to decide between the two.
> 
>>> +static int is_promise(const struct object_id *oid)
>>> +{
>>> +	if (!promises_prepared) {
>>> +		if (repository_format_lazy_object)
>>> +			for_each_packed_object(add_promise, NULL,
>>> +					       FOR_EACH_OBJECT_IMPORTED_ONLY);
>>> +		promises_prepared = 1;
>>> +	}
>>> +	return oidset_contains(&promises, oid);
>>> +}
>>
>> Somehow I'm tempted to call this function "is_promised()" but that
>> is a minor naming issue.
> 

Given all we need is an existence check for a given oid, I wonder if it 
would be faster overall to do a binary search through the list of 
imported idx files + an existence test for an imported loose object.

Especially in the check_connected case which isn't verifying every 
object, that should be a lot less IO than loading all the imported 
commits, trees and blobs and pre-computing an oidset of all possible 
objects.  The lookup for each object would be slower than a simple call 
to oidset_contains but we avoid the up front cost.

With some caching of idx files and threading, I suspect this could be 
made pretty fast.
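
For the packed half, the lookup could be as simple as this rough
sketch (pack_imported comes from this patch; find_pack_entry_one()
and prepare_packed_git() already exist; the loose-object test and
any caching are left out):

        static int is_in_imported_pack(const struct object_id *oid)
        {
                struct packed_git *p;

                prepare_packed_git();
                for (p = packed_git; p; p = p->next) {
                        if (!p->pack_imported)
                                continue;
                        if (find_pack_entry_one(oid->hash, p))
                                return 1;
                }
                return 0;
        }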

> I was trying to be consistent in using the name "promise" instead of
> "promised object/tag/commit/tree/blob" everywhere, but we can switch if
> need be (for example, if we don't want to limit the generic name
> "promise" to merely objects).
> 
>>>   static const char *describe_object(struct object *obj)
>>>   {
>>>   	static struct strbuf buf = STRBUF_INIT;
>>> @@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
>>>   					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
>>>   			obj->used = 1;
>>>   			mark_object_reachable(obj);
>>> -		} else {
>>> +		} else if (!is_promise(oid)) {
>>>   			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
>>>   			errors_found |= ERROR_REACHABLE;
>>>   		}
>>
>> This is almost certainly one place we want to check if the missing
>> object is OK, but I would be a bit surprised if this were the only place.
>>
>> Don't we need "while trying to follow all the outgoing links from
>> this tree object, and we found this object is not available locally;
>> normally we would mark it as an error but it turns out that the
>> missing one is in the promised set of objects, so it is OK" for the
>> normal connectivity traversal codepaths, for example?
> 
> That's right. The places to make this change are the same as those in
> some earlier patches I sent (patches 2-4 in [1]).
> 
> [1] https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
> 


* Re: [RFC PATCH] Updated "imported object" design
  2017-08-17 20:50       ` Ben Peart
@ 2017-08-17 21:39         ` Jonathan Tan
  2017-08-18 14:18           ` Ben Peart
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Tan @ 2017-08-17 21:39 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, git, jrnieder

Thanks for your comments. I'll reply to both your e-mails in this one
e-mail.

> This illustrates another place we need to resolve the
> naming/vocabulary.  We should at least be consistent to make it easier
> to discuss/explain.  We obviously went with "virtual" when building
> GVFS but I'm OK with "lazy" as long as we're consistent.  Some
> examples of how the naming can clarify or confuse:
> 
> 'Promise-enable your repo by setting the "extensions.lazyObject" flag'
> 
> 'Enable your repo to lazily fetch objects by setting the
> "extensions.lazyObject"'
> 
> 'Virtualize your repo by setting the "extensions.virtualize" flag'
> 
> We may want to carry the same name into the filename we use to mark
> the (virtualized/lazy/promised/imported) objects.
> 
> (This reminds me that there are only 2 hard problems in computer
> science...) ;)

Good point about the name. Maybe the 2nd one is the best? (Mainly
because I would expect a "virtualized" repo to have virtual refs too.)

But if there was a good way to refer to the "anti-projection" in a
virtualized system (that is, the "real" thing or "object" behind the
"virtual" thing or "image"), then maybe the "virtualized" language is
the best. (And I would gladly change - I'm having a hard time coming up
with a name for the "anti-projection" in the "lazy" language.)

Also, I should probably standardize on "lazily fetch" instead of "lazily
load". I didn't want to overlap with the existing fetching, but after
some thought, it's probably better to do that. The explanation would
thus be that you can either use the built-in Git fetcher (to be built,
although I have an old version here [1]) or supply a custom fetcher.

[1] https://github.com/jonathantanmy/git/commits/partialclone
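
In configuration terms, with the extension in this patch, turning
the feature on looks like:

        [core]
                repositoryformatversion = 1
        [extensions]
                lazyObject = <some string>

where the meaning of the value (currently an arbitrary string;
presumably, later, something identifying the fetcher) is still to
be settled.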

> I think this all works and would meet the requirements we've been
> discussing.  The big trade off here vs what we first discussed with
> promises is that we are generating the list of promises on the fly
> when they are needed rather than downloading and maintaining a list
> locally.
> 
> My biggest concern with this model is the cost of opening and parsing
> every imported object (loose and pack for local and alternates) to
> build the oidset of promises.
> 
> In fsck this probably won't be an issue as it already focuses on
> correctness at the expense of speed.  I'm more worried about when we
> add the same/similar logic into check_connected.  That impacts fetch,
> clone, and receive_pack.
> 
> I guess the only way we can know for sure is to do a perf test and
> measure the impact.

As for fetching from the main repo, the connectivity check does not need
to be performed at all because all objects are "imported", so the
performance of the connectivity check does not matter. Same for cloning.

This is not true if you're fetching from another repo or if you're using
receive-pack, but (1) I think these are not used as much in such a
situation, and (2) if you do use them, the slowness only "kicks in" if
you do not have the objects referred to (whether non-"imported" or
"imported") and thus have to check the references in all "imported"
objects.
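
(Mechanically, I would expect the fetch-from-the-main-repo case to
reduce to an early return somewhere in the connectivity check,
something like:

        if (repository_format_lazy_object)
                return 0; /* all objects just fetched are "imported" */

though that is not part of this patch.)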

> I think this topic should continue to move forward so that we can 
> provide reasonable connectivity tests for fsck and check_connected in 
> the face of partial clones.  I'm not sure the prototype implementation 
> of reading/parsing all imported objects to build the promised oidset is 
> the most performant model but we can continue to investigate the best 
> options.

Agreed - I think the most important thing here is settling on the API
(name of extension and the nature of the object mark).

> Given all we need is an existence check for a given oid,

This is true...

> I wonder if it 
> would be faster overall to do a binary search through the list of 
> imported idx files + an existence test for an imported loose object.

...but what we're checking is the existence of a reference, not the
existence of an object. For a concrete example, consider what happens if
we both have an "imported" tree and a non-"imported" tree that
references a blob that we do not have. When checking the non-"imported"
tree for connectivity, we have to iterate through all "imported" trees
to see if any can vouch for the existence of such a blob. We cannot
merely binary-search the .idx file.


* Re: [RFC PATCH] Updated "imported object" design
  2017-08-17 21:39         ` Jonathan Tan
@ 2017-08-18 14:18           ` Ben Peart
  2017-08-18 23:33             ` Jonathan Tan
  0 siblings, 1 reply; 18+ messages in thread
From: Ben Peart @ 2017-08-18 14:18 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git, jrnieder



On 8/17/2017 5:39 PM, Jonathan Tan wrote:
> Thanks for your comments. I'll reply to both your e-mails in this one
> e-mail.
> 
>> This illustrates another place we need to resolve the
>> naming/vocabulary.  We should at least be consistent to make it easier
>> to discuss/explain.  We obviously went with "virtual" when building
>> GVFS but I'm OK with "lazy" as long as we're consistent.  Some
>> examples of how the naming can clarify or confuse:
>>
>> 'Promise-enable your repo by setting the "extensions.lazyObject" flag'
>>
>> 'Enable your repo to lazily fetch objects by setting the
>> "extensions.lazyObject"'
>>
>> 'Virtualize your repo by setting the "extensions.virtualize" flag'
>>
>> We may want to carry the same name into the filename we use to mark
>> the (virtualized/lazy/promised/imported) objects.
>>
>> (This reminds me that there are only 2 hard problems in computer
>> science...) ;)
> 
> Good point about the name. Maybe the 2nd one is the best? (Mainly
> because I would expect a "virtualized" repo to have virtual refs too.)
> 
> But if there was a good way to refer to the "anti-projection" in a
> virtualized system (that is, the "real" thing or "object" behind the
> "virtual" thing or "image"), then maybe the "virtualized" language is
> the best. (And I would gladly change - I'm having a hard time coming up
> with a name for the "anti-projection" in the "lazy" language.)
> 

The most common "anti-virtual" language I'm familiar with is "physical." 
  Virtual machine <-> physical machine. Virtual world <-> physical 
world. Virtual repo, commit, tree, blob - physical repo, commit, tree, 
blob. I'm not thrilled but I think it works...

> Also, I should probably standardize on "lazily fetch" instead of "lazily
> load". I didn't want to overlap with the existing fetching, but after
> some thought, it's probably better to do that. The explanation would
> thus be that you can either use the built-in Git fetcher (to be built,
> although I have an old version here [1]) or supply a custom fetcher.
> 
> [1] https://github.com/jonathantanmy/git/commits/partialclone
> 
>> I think this all works and would meet the requirements we've been
>> discussing.  The big trade off here vs what we first discussed with
>> promises is that we are generating the list of promises on the fly
>> when they are needed rather than downloading and maintaining a list
>> locally.
>>
>> My biggest concern with this model is the cost of opening and parsing
>> every imported object (loose and pack for local and alternates) to
>> build the oidset of promises.
>>
>> In fsck this probably won't be an issue as it already focuses on
>> correctness at the expense of speed.  I'm more worried about when we
>> add the same/similar logic into check_connected.  That impacts fetch,
>> clone, and receive_pack.
>>
>> I guess the only way we can know for sure is to do a perf test and
>> measure the impact.
> 
> As for fetching from the main repo, the connectivity check does not need
> to be performed at all because all objects are "imported", so the
> performance of the connectivity check does not matter. Same for cloning.
> 

Very good point! I got stuck on the connectivity check in general, 
forgetting that we really only need to prevent sharing a corrupt repo.

> This is not true if you're fetching from another repo 

This isn't a case we've explicitly dealt with (multiple remotes into a 
virtualized repo).  Our behavior today would be that once you set the 
"virtual repo" flag on the repo (this happens at clone for us), all 
remotes are treated as virtual as well (ie we don't differentiate 
behavior based on which remote was used).  Our "custom fetcher" always 
uses "origin" and some custom settings for a cache-server saved in the 
.git/config file when asked to fetch missing objects.

This is probably a good model to stick with, at least initially, as 
trying to solve multiple possible "virtual" remotes, as well as 
mingling virtualized and non-virtualized remotes and all the mixed 
cases that can come up, makes my head hurt.  We should probably 
address that in a different thread. :)

> or if you're using
> receive-pack, but (1) I think these are not used as much in such a
> situation, and (2) if you do use them, the slowness only "kicks in" if
> you do not have the objects referred to (whether non-"imported" or
> "imported") and thus have to check the references in all "imported"
> objects.
> 

Is there any case where receive-pack is used on the client side?  I'm 
only aware of it being used on the server side to receive packs pushed 
from the client.  If it is not used in a virtualized client, then we 
would not need to do anything different for receive-pack.

>> I think this topic should continue to move forward so that we can
>> provide reasonable connectivity tests for fsck and check_connected in
>> the face of partial clones.  I'm not sure the prototype implementation
>> of reading/parsing all imported objects to build the promised oidset is
>> the most performant model but we can continue to investigate the best
>> options.
> 
> Agreed - I think the most important thing here is settling on the API
> (name of extension and the nature of the object mark).
> 
>> Given all we need is an existence check for a given oid,
> 
> This is true...
> 
>> I wonder if it
>> would be faster overall to do a binary search through the list of
>> imported idx files + an existence test for an imported loose object.
> 
> ...but what we're checking is the existence of a reference, not the
> existence of an object. For a concrete example, consider what happens if
> we have both an "imported" tree and a non-"imported" tree that
> references a blob that we do not have. When checking the non-"imported"
> tree for connectivity, we have to iterate through all "imported" trees
> to see if any can vouch for the existence of such a blob. We cannot
> merely binary-search the .idx file.
> 

That is another good point.  Given the discussion above about not 
needing to do the connectivity test for fetch/clone, the potential 
perf hit of loading/parsing all the various objects to build up the 
oidset is much less of an issue.



* Re: [RFC PATCH] Updated "imported object" design
  2017-08-18 14:18           ` Ben Peart
@ 2017-08-18 23:33             ` Jonathan Tan
  0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Tan @ 2017-08-18 23:33 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, git, jrnieder

On Fri, 18 Aug 2017 10:18:37 -0400
Ben Peart <peartben@gmail.com> wrote:

> > But if there was a good way to refer to the "anti-projection" in a
> > virtualized system (that is, the "real" thing or "object" behind the
> > "virtual" thing or "image"), then maybe the "virtualized" language is
> > the best. (And I would gladly change - I'm having a hard time coming up
> > with a name for the "anti-projection" in the "lazy" language.)
> > 
> 
> The most common "anti-virtual" language I'm familiar with is "physical." 
>   Virtual machine <-> physical machine. Virtual world <-> physical 
> world. Virtual repo, commit, tree, blob - physical repo, commit, tree, 
> blob. I'm not thrilled but I think it works...

I was thinking more along the lines of the "entity that projects the
virtualization", not the opposite of a "virtualization" - "physical"
might work for the latter but probably not the former.

After some in-office discussion, if we stick to the "promise" concept,
maybe we have something like this:

  In a partial clone, the origin acts as a promisor of objects. Every
  object obtained from the promisor also acts as a promise that any
  object directly or indirectly referenced from that object is fetchable
  from the promisor.

> > This is not true if you're fetching from another repo 
> 
> This isn't a case we've explicitly dealt with (multiple remotes into a 
> virtualized repo).  Our behavior today would be that once you set the 
> "virtual repo" flag on the repo (this happens at clone for us), all 
> remotes are treated as virtual as well (ie we don't differentiate 
> behavior based on which remote was used).  Our "custom fetcher" always 
> uses "origin" and some custom settings for a cache-server saved in the 
> .git/config file when asked to fetch missing objects.
> 
> This is probably a good model to stick with, at least initially, as 
> trying to solve multiple possible "virtual" remotes, as well as 
> mingling virtualized and non-virtualized remotes and all the mixed 
> cases that can come up, makes my head hurt.  We should probably 
> address that in a different thread. :)

OK, let's stick to the current model first then, whether our opinion on
other remotes is (1) "we won't have any other remotes so we don't care",
(2) "we have other remotes but it's fine to make sure that they don't
introduce any new missing objects", or (3) "we need other remotes to
introduce missing objects, but we can build that after this foundation
is laid".

> > or if you're using
> > receive-pack, but (1) I think these are not used as much in such a
> > situation, and (2) if you do use them, the slowness only "kicks in" if
> > you do not have the objects referred to (whether non-"imported" or
> > "imported") and thus have to check the references in all "imported"
> > objects.
> > 
> 
> Is there any case where receive-pack is used on the client side?  I'm 
> only aware of it being used on the server side to receive packs pushed 
> from the client.  If it is not used in a virtualized client, then we 
> would not need to do anything different for receive-pack.

This happens if another repo decides to push to the virtualized client,
which (as I wrote) I don't expect to happen often. My intention is to
ensure that receive-pack will still work.

> That is another good point.  Given the discussion above about not 
> needing to do the connectivity test for fetch/clone, the potential 
> perf hit of loading/parsing all the various objects to build up the 
> oidset is much less of an issue.

Agreed.
