Re: Partial clone design (with connectivity check for locally-created objects)

From: Ben Peart <peartben@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jonathan Tan <jonathantanmy@google.com>,
	git@vger.kernel.org, Jonathan Nieder <jrnieder@gmail.com>,
	christian.couder@gmail.com
Subject: Re: Partial clone design (with connectivity check for locally-created objects)
Date: Tue, 8 Aug 2017 12:45:40 -0400
Message-ID: <693596b8-7a84-bcc8-7eef-2d534293e14b@gmail.com>
In-Reply-To: <xmqqwp6fp3mk.fsf@gitster.mtv.corp.google.com>

On 8/7/2017 3:41 PM, Junio C Hamano wrote:
> Ben Peart <peartben@gmail.com> writes:
> 
>> My concern with this proposal is the combination of 1) writing a new
>> pack file for every git command that ends up bringing down a missing
>> object and 2) gc not compressing those pack files into a single pack
>> file.
> 
> Your noticing these is a sign that you read the outline of the
> design correctly, I think.
> 
> The basic idea is that the local fsck should tolerate missing
> objects when they are known to be obtainable from that external
> service, but should still be able to diagnose missing objects that
> we do not know if the external service has, especially the ones that
> have been newly created locally and not yet made available to them
> by pushing them back.
> 

This helps me a lot as now I think I understand the primary requirement 
we're trying to solve for.  Let me rephrase it and see if this makes sense:

We need to be able to identify whether an object was created locally 
(and should pass more strict fsck/connectivity tests) or whether it came 
from a remote (and so any missing objects could presumably be fetched 
from the server).

I agree it would be nice to solve this (and not just punt on fsck, even
if punting were an opt-in behavior).

We've discussed a couple of different possible solutions, each of which
has different tradeoffs.  Let me try to summarize here and perhaps
suggest some other possibilities:

Promised list
-------------
This provides an external data structure that allows us to flag objects
that came from a remote server (vs. created locally).

The biggest drawback is that this data structure can get very large and 
become difficult/expensive to generate/transfer/maintain.

It also (at least in one proposal) required protocol and server side 
changes to support it.

Annotated via filename
----------------------
This idea is to annotate the file names of objects that came from a
remote server (pack files and loose objects) with a unique file
extension (.remote) that marks them as not locally created.

To make this work, git must understand both types of loose objects and
pack files and search in both locations when looking for objects.

Another drawback of this is that commands (repack, gc) that optimize 
loose objects and pack files must now be aware of the different 
extensions and handle both while not merging remote and non-remote objects.

In short, we're creating separate object stores - one for locally 
created objects and one for everything else.
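
As a rough illustration of the "search in both locations" cost, here is a
minimal sketch for loose objects only. The two-level directory layout
mirrors how git names loose objects, but the helper names are hypothetical
and the payload is stored uncompressed to keep the example short (git
itself zlib-compresses loose objects).

```python
import os
import tempfile

REMOTE_SUFFIX = ".remote"  # the marker extension suggested in the proposal


def loose_path(objects_dir, oid, remote):
    """Path of a loose object: <objects>/<first 2 hex>/<other 38 hex>[.remote]."""
    name = oid[2:] + (REMOTE_SUFFIX if remote else "")
    return os.path.join(objects_dir, oid[:2], name)


def store_loose(objects_dir, oid, data, remote):
    """Hypothetical writer; stores raw bytes (no zlib) for brevity."""
    path = loose_path(objects_dir, oid, remote)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


def find_loose(objects_dir, oid):
    """Return 'local', 'remote', or None; every lookup probes both names."""
    if os.path.exists(loose_path(objects_dir, oid, remote=False)):
        return "local"
    if os.path.exists(loose_path(objects_dir, oid, remote=True)):
        return "remote"
    return None


with tempfile.TemporaryDirectory() as objects_dir:
    local_oid = "aa" + "1" * 38    # made-up object IDs
    remote_oid = "bb" + "2" * 38
    absent_oid = "cc" + "3" * 38
    store_loose(objects_dir, local_oid, b"made here", remote=False)
    store_loose(objects_dir, remote_oid, b"fetched", remote=True)
    results = [find_loose(objects_dir, o)
               for o in (local_oid, remote_oid, absent_oid)]
    print(results)
```

The same doubling would apply to pack files, and repack/gc would need an
equivalent check to avoid mixing the two kinds.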

Now a couple of different ideas:

Annotated via flags
-------------------
The fundamental idea here is that we add the ability to flag locally 
created objects on the object itself.

Given that, at the core, "Git is a simple key-value data store", can we
take advantage of that fact and include a "locally created" bit as a
property on every object?

I could not think of a good way to accomplish this as it is ultimately 
changing the object format which creates rapidly expanding ripples of 
change.

For example, the object header currently consists of the type, a space,
the length, and a NUL byte. Even if we could add a "local" property
(either by adding a 5th item, taking over the space, creating new object
types, etc.), the fact that the header is included in the SHA-1 means
that push would become problematic, as flipping the bit would change the
SHA-1 of the object and of the trees and commits that reference it.
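
To make the ripple concrete, here is a small sketch of how a blob ID is
derived from header plus content. The " local" field is purely
hypothetical (it stands in for any of the header changes listed above);
it is not part of git's object format.

```python
import hashlib


def git_object_id(obj_type, content, extra_header=b""):
    """SHA-1 over '<type> <length>' + NUL + content, as git computes IDs.

    extra_header is a hypothetical extra field used only to show what
    happens when anything is added to the header.
    """
    header = obj_type + b" " + str(len(content)).encode() + extra_header + b"\0"
    return hashlib.sha1(header + content).hexdigest()


content = b"hello\n"
plain = git_object_id(b"blob", content)
flagged = git_object_id(b"blob", content, extra_header=b" local")

print(plain)    # matches `echo hello | git hash-object --stdin`
print(flagged)  # a different ID for the same content
```

Because trees record the IDs of their entries and commits record the ID
of their tree, the changed blob ID would propagate all the way up.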

Local list
----------
Given the number of locally created objects is usually very small in 
comparison to the total number of objects (even just due to history), it 
makes more sense to track locally created objects instead of 
promised/remote objects.

The biggest advantage of this over the "promised list" is that the 
"local list" being maintained is _significantly_ smaller (often orders 
of magnitude smaller).

Another advantage over the "promised list" solution is that it doesn't 
require any server side or protocol changes.

On the client, when objects are created (write_loose_object?), the new
objects are added to the "local list". In the connectivity check (fsck),
if an object is not in the "local list", the check can be skipped for
it, as any missing object can presumably be retrieved from the server.
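
One possible reading of that rule, sketched over a toy in-memory object
graph: a missing object is reported only when it is in the "local list".
The function name and the dict-based graph are assumptions made for the
example; this is not how fsck is structured.

```python
def check_connectivity(roots, objects, local_list):
    """Walk from roots and return the missing object IDs that are errors.

    objects:    dict of object ID -> list of referenced IDs (what we have).
    local_list: set of IDs recorded as created locally.
    """
    errors = set()
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if oid not in objects:
            # Missing: an error only if we created it ourselves. Anything
            # else is presumed to be retrievable from the server.
            if oid in local_list:
                errors.add(oid)
            continue
        stack.extend(objects[oid])
    return sorted(errors)


objects = {
    "commit2": ["tree2", "commit1"],    # local commit on top of fetched history
    "tree2": ["blob_new", "blob_old"],  # local tree
    "commit1": ["tree1"],               # fetched commit, tree1 never downloaded
}
local_list = {"commit2", "tree2", "blob_new"}

errors = check_connectivity(["commit2"], objects, local_list)
print(errors)  # ['blob_new']
```

Here blob_old and tree1 are tolerated as fetchable, while blob_new is
flagged because it was created locally and has gone missing.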

A simple file format could be used (header + list of SHA1 values) and 
write_loose_object could do a trivial append. In fsck, the file could be 
loaded into a hashmap to make for fast existence checks.
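
A minimal sketch of such a file, assuming an 8-byte header (a made-up
"LOCL" signature plus a version number) followed by raw 20-byte SHA-1
values. The layout and function names are hypothetical.

```python
import os
import tempfile

MAGIC = b"LOCL"                   # hypothetical 4-byte signature
VERSION = (1).to_bytes(4, "big")  # hypothetical format version
HEADER = MAGIC + VERSION
OID_LEN = 20                      # size of a raw SHA-1 in bytes


def append_local(path, oid_hex):
    """Append one object ID, writing the header first if the file is new."""
    raw = bytes.fromhex(oid_hex)
    if len(raw) != OID_LEN:
        raise ValueError("expected a 40-character hex SHA-1")
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "ab") as f:
        if new_file:
            f.write(HEADER)
        f.write(raw)


def load_local(path):
    """Load every ID into a set for constant-time existence checks."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:len(HEADER)] != HEADER:
        raise ValueError("unrecognized local list header")
    body = data[len(HEADER):]
    if len(body) % OID_LEN:
        raise ValueError("truncated local list")
    return {body[i:i + OID_LEN].hex() for i in range(0, len(body), OID_LEN)}


with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "local-list")
    append_local(list_path, "ce013625030ba8dba906f756967f9e9ca394464a")
    append_local(list_path, "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
    loaded = load_local(list_path)
    file_size = os.path.getsize(list_path)

print(len(loaded), file_size)
```

An append-only file can accumulate duplicates; loading into a set hides
them from fsck, and a later rewrite could drop them.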

Entries could be removed from the "local list" for objects later fetched 
from a server (though I had a hard time contriving a scenario where this 
would happen so I consider this optional).

On the surface, this seems like the simplest solution that meets the 
stated requirements.

Object DB
---------
This is a different way of providing separate object stores than the 
"Annotated via filename" proposal. It should be a cleaner/more elegant 
solution that enables several other capabilities but it is also more 
work to implement (isn't that always the case?).

We create an object store abstraction layer that enables multiple object
store providers to exist. The order in which they are called should be
configurable based on the command (especially have/read vs.
create/write). This enables features like tiered storage: in memory,
pack, loose, alternate, large, remote.

The connectivity check in fsck would then only traverse and validate
objects that exist in the local object store providers.
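
A very small sketch of what such a layer could look like, using toy
dict-backed providers. All class and method names here are invented for
illustration; none of them correspond to existing git code.

```python
class DictStore:
    """Toy provider; 'local' marks stores whose objects fsck should validate."""

    def __init__(self, name, local):
        self.name = name
        self.local = local
        self.data = {}

    def has(self, oid):
        return oid in self.data

    def read(self, oid):
        return self.data.get(oid)

    def write(self, oid, content):
        self.data[oid] = content


class ObjectDB:
    """Dispatches to providers in an order chosen per kind of operation."""

    def __init__(self, read_order, write_order):
        self.read_order = read_order
        self.write_order = write_order

    def read(self, oid):
        # The first provider that has the object wins.
        for store in self.read_order:
            if store.has(oid):
                return store.name, store.read(oid)
        return None

    def write(self, oid, content):
        # New objects go to the first provider in the write order.
        self.write_order[0].write(oid, content)

    def local_object_ids(self):
        # What a connectivity check limited to local providers would cover.
        ids = set()
        for store in self.read_order:
            if store.local:
                ids.update(store.data)
        return ids


memory = DictStore("memory", local=True)
loose = DictStore("loose", local=True)
remote = DictStore("remote", local=False)
remote.data["r1"] = b"available from the server"

odb = ObjectDB(read_order=[memory, loose, remote], write_order=[loose])
odb.write("l1", b"created locally")

print(odb.read("l1"))
print(odb.read("r1"))
print(odb.local_object_ids())
```

A real remote provider would fetch on demand rather than hold a dict,
but the dispatch and the local/non-local split would look similar.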

While I like the flexibility of this design and hope we can obtain it in
the long term for its other benefits, it's a bit overkill for this
specific problem. The big drawback of this model is the cost to design
and implement it.