* Partial clone design (with connectivity check for locally-created objects)
@ 2017-08-04 21:51 Jonathan Tan
  2017-08-04 22:51 ` Junio C Hamano
  2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
  0 siblings, 2 replies; 18+ messages in thread
From: Jonathan Tan @ 2017-08-04 21:51 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jonathan Nieder, peartben, christian.couder

After some discussion in [1] (in particular, about preserving the
functionality of the connectivity check as much as possible) and some
in-office discussion, here's an updated design.

Overview
========

This is an update of the design in [1].

The main difference between this and other related work [1] [2] [3] is
that we can still check connectivity between locally-created objects
without having to consult a remote server for any information.

In addition, the object loader writes to an incomplete packfile. This
(i) ensures that Git has immediate access to the object, (ii) ensures
that not too many files are written during a single Git invocation, and
(iii) prevents some unnecessary copies (compared to, for example,
transmitting entire objects through the protocol).

Local repo layout
=================

Objects in the local repo are further divided into "homegrown" and
"imported" objects.

"Imported" objects must be in a packfile that has a "<pack name>.remote"
file with arbitrary text (similar to the ".keep" file). They come from
clones, fetches, and the object loader (see below).

"Homegrown" objects are every other object.

Object loader
=============

The object loader is a process that can obtain objects from elsewhere,
given their hashes, and write their packed representation to a
client-given file.

The first time a missing object is needed during an invocation of Git,
Git creates a temporary packfile and writes the header with a
placeholder number of objects. Then, it starts the object loader,
passing in the name of that temporary packfile.

Whenever a missing object is needed, Git sends the hash of the missing
object and expects the loader to append (with O_APPEND) the object to
that packfile. Git keeps track of the object offsets as it goes, and Git
can use the contents of that incomplete packfile. This is similar to
what "git fast-import" does.

When Git exits, it writes the actual number of objects into the header,
appends the packfile checksum, moves the packfile to its final location,
and writes a .idx and a .remote file.
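
To make the packfile lifecycle concrete, here is a sketch in C of the
placeholder-header technique described above, using Git's existing pack
format definitions and the fixup_pack_header_footer() helper (the same
one "git fast-import" uses); error handling is omitted:

	static void start_loader_pack(int fd)
	{
		struct pack_header hdr;

		hdr.hdr_signature = htonl(PACK_SIGNATURE); /* "PACK" */
		hdr.hdr_version = htonl(PACK_VERSION);
		hdr.hdr_entries = htonl(0); /* placeholder object count */
		write_in_full(fd, &hdr, sizeof(hdr));
		/* the loader then appends objects with O_APPEND */
	}

	static void finish_loader_pack(int fd, const char *path,
				       uint32_t nr_objects)
	{
		unsigned char sha1[20];

		/* rewrite the count, rehash, and append the checksum */
		fixup_pack_header_footer(fd, sha1, path, nr_objects,
					 NULL, 0);
	}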

Connectivity check
==================

An object walk is performed as usual from the tips (see the
documentation for fsck etc. for which tips they use).

A "homegrown" object is valid if each object it references:
 1. is a "homegrown" object,
 2. is an "imported" object, or
 3. is referenced by an "imported" object.

The references of an "imported" object are not checked.
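
Expressed as code, the rule for a single reference amounts to the
following sketch (is_promise() stands for rule 3, "referenced by an
"imported" object"; a function of that name appears in the RFC patch
later in this thread):

	/*
	 * A reference held by a "homegrown" object is acceptable if we
	 * have the referent ourselves (rules 1 and 2) or if some
	 * "imported" object refers to it (rule 3).
	 */
	static int reference_ok(const struct object_id *referent)
	{
		return has_object_file(referent) || is_promise(referent);
	}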

Performance notes
-----------------

Because of rule 3 above, iteration through every "imported" object (or,
at least, every "imported" object of a certain type) is sometimes
required.

For fsck, this should be fine because (i) this is not a regression since
currently all objects must be iterated through anyway, and (ii) fsck
prioritizes correctness over speed.

For fetch, the speed of the connectivity check is immaterial; the
connectivity check no longer needs to be performed because all objects
obtained from the remote are, by definition, "imported" objects.

Connectivity checks might also be run during other commands like
"receive-pack". I don't expect partial clones to use such commands
often. They will still work, but their performance is a secondary
concern in this design.

Impact on other tools
=====================

"git gc" will need to not do anything to an "imported" object, even if
it is unreachable, without ensuring that the connectivity check will
succeed in that object's absence. (Special attention to rule 3 under
"Connectivity check".)

If this design stands, the initial patch set will probably have "git gc"
not touch "imported" packs at all, trivially satisfying the above. In
the future, "git gc" will need either to expel such objects into loose
objects (as is currently done for normal packs), treating them as
"homegrown" objects (they are unreachable, so they won't interfere with
future connectivity checks), or to delete them outright - though there
may be race conditions to think through.

"git repack" will need to differentiate between packs with ".remote" and
packs without.

[1] https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
[2] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
[3] https://public-inbox.org/git/20170803091926.1755-1-chriscool@tuxfamily.org/



* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-04 21:51 Partial clone design (with connectivity check for locally-created objects) Jonathan Tan
@ 2017-08-04 22:51 ` Junio C Hamano
  2017-08-05  0:21   ` Jonathan Tan
  2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-08-04 22:51 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Jonathan Nieder, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> "Imported" objects must be in a packfile that has a "<pack name>.remote"
> file with arbitrary text (similar to the ".keep" file). They come from
> clones, fetches, and the object loader (see below).
> ...
> A "homegrown" object is valid if each object it references:
>  1. is a "homegrown" object,
>  2. is an "imported" object, or
>  3. is referenced by an "imported" object.

Overall it captures what was discussed, and I think it is a good
start.

I doubt you want to treat all fetches/clones the same way as the
"lazy object" loading, though.  You may be critically rely on the
corporate central server that will give the objects it "promised"
when you cloned from it lazily (i.e. it may have given you a commit,
but not its parents or objects contained in its tree--you still know
that the parents and the tree and its contents will later be
available and rely on that fact).  You trust that and build on top,
so the packfile you obtained when you cloned from such a server
should count as "imported".  But if you exchanged wip changes with
your colleagues by fetching or pushing peer-to-peer, without the
corporate central server knowing, you would want to treat objects in
packs (or loose objects) you obtained that way as "not imported".

Also I think "imported" vs "homegrown" may be a bit of a misnomer; the
idea to split objects into two camps sounds like a good idea, and
"imported" probably is an OK name to use for the category that is a
group of objects that you know/trust are backed by your lazy
loader.  But the other one does not have to be "home"-grown.

Well, the names are not that important, but I think the line between
the two classes should not be "everything that came from clone and
fetch is imported", which is a more important point I am trying to
make.

Thanks.


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-04 22:51 ` Junio C Hamano
@ 2017-08-05  0:21   ` Jonathan Tan
  2017-08-07 19:12     ` Ben Peart
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Tan @ 2017-08-05  0:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Jonathan Nieder, peartben, christian.couder

On Fri, 04 Aug 2017 15:51:08 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > "Imported" objects must be in a packfile that has a "<pack name>.remote"
> > file with arbitrary text (similar to the ".keep" file). They come from
> > clones, fetches, and the object loader (see below).
> > ...
> > A "homegrown" object is valid if each object it references:
> >  1. is a "homegrown" object,
> >  2. is an "imported" object, or
> >  3. is referenced by an "imported" object.
> 
> Overall it captures what was discussed, and I think it is a good
> start.
> 
> I doubt you want to treat all fetches/clones the same way as the
> "lazy object" loading, though.  You may be critically rely on the
> corporate central server that will give the objects it "promised"
> when you cloned from it lazily (i.e. it may have given you a commit,
> but not its parents or objects contained in its tree--you still know
> that the parents and the tree and its contents will later be
> available and rely on that fact).  You trust that and build on top,
> so the packfile you obtained when you cloned from such a server
> should count as "imported".  But if you exchanged wip changes with
> your colleages by fetching or pushing peer-to-peer, without the
> corporate central server knowing, you would want to treat objects in
> packs (or loose objects) you obtained that way as "not imported".

That's true. I discussed this with a teammate and we might need to make
extensions.lazyObject be the name of the "corporate central server"
remote instead, and have a "loader" setting within that remote, so that
we can distinguish that objects from this server are "imported" but
objects from other servers are not.
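
As a sketch of how that could look, with a hypothetical
remote.<name>.loader config key and extensions.lazyObject holding the
remote's name, the loader command could be looked up like this:

	static char *get_lazy_loader(void)
	{
		struct strbuf key = STRBUF_INIT;
		char *loader = NULL;

		/* repository_format_lazy_object: the remote's name */
		strbuf_addf(&key, "remote.%s.loader",
			    repository_format_lazy_object);
		git_config_get_string(key.buf, &loader);
		strbuf_release(&key);
		return loader;
	}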

The connectivity check shouldn't be slow in this case because fetches
are usually onto tips that we have (so we don't hit case 3).

> Also I think "imported" vs "homegrown" may be a bit of a misnomer; the
> idea to split objects into two camps sounds like a good idea, and
> "imported" probably is an OK name to use for the category that is a
> group of objects that you know/trust are backed by your lazy
> loader.  But the other one does not have to be "home"-grown.
> 
> Well, the names are not that important, but I think the line between
> the two classes should not be "everything that came from clone and
> fetch is imported", which is a more important point I am trying to
> make.
> 
> Thanks.

Maybe "imported" vs "non-imported" would be better. I agree that the
objects in the non-"imported" group could still be obtained from
elsewhere.

Thanks for your comments.


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-05  0:21   ` Jonathan Tan
@ 2017-08-07 19:12     ` Ben Peart
  2017-08-07 19:21       ` Jonathan Nieder
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Ben Peart @ 2017-08-07 19:12 UTC (permalink / raw)
  To: Jonathan Tan, Junio C Hamano; +Cc: git, Jonathan Nieder, christian.couder



On 8/4/2017 8:21 PM, Jonathan Tan wrote:
> On Fri, 04 Aug 2017 15:51:08 -0700
> Junio C Hamano <gitster@pobox.com> wrote:
> 
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>> file with arbitrary text (similar to the ".keep" file). They come from
>>> clones, fetches, and the object loader (see below).
>>> ...
>>> A "homegrown" object is valid if each object it references:
>>>   1. is a "homegrown" object,
>>>   2. is an "imported" object, or
>>>   3. is referenced by an "imported" object.
>>
>> Overall it captures what was discussed, and I think it is a good
>> start.

I missed the offline discussion and so am trying to piece together what 
this latest design is trying to do.  Please let me know if I'm not 
understanding something correctly.

From what I can tell, objects are going to be segmented into two
"types" - those that were fetched from a remote source that allows 
partial clones/fetches (lazyobject/imported) and those that come from 
"regular" remote sources (homegrown) that requires all objects to exist 
locally.

FWIW, the names here are not making things clearer for me. If I'm 
correct perhaps "partial" and "normal" would be better to indicate the 
type of the source? Anyway...

Once the objects are segmented into the 2 types, the fsck connectivity 
check code is updated to ignore missing objects from "partial" remotes 
but still expect/validate them from "normal" remotes.

This compromise seems reasonable - don't generate errors for missing 
objects for remotes that returned a partial clone but do generate errors 
for missing objects from normal clones as a missing object is always an 
error in this case.

This segmentation is what is driving the need for the object loader to 
build a new local pack file for every command that has to fetch a 
missing object.  For example, we can't just write a tree object from a 
"partial" clone into the loose object store as we have no way for fsck 
to treat them differently and ignore any missing objects referenced by 
that tree object.

My concern with this proposal is the combination of 1) writing a new 
pack file for every git command that ends up bringing down a missing 
object and 2) gc not compressing those pack files into a single pack file.

We all know that git doesn't scale well with a lot of pack files as it 
has to do a linear search through all the pack files when attempting to 
find an object.  I can see that very quickly, there would be a lot of 
pack files generated and with gc ignoring "partial" pack files, this 
would never get corrected.

In our usage scenarios, _all_ of the objects come from "partial" clones 
so all of our objects would end up in a series of "partial" pack files 
and would have pretty poor performance as a result.

I wondered if it is possible to flag a specific remote as "partial" and 
have fsck be able to track any given object back to the remote and then 
properly handle the fact that it was missing based on that. I couldn't 
think of a good way to do that without some additional data structure 
that would have to be built/maintained (i.e. promises).

That thinking did lead me back to wondering again if we could live with 
a repo specific flag.  If any clone/fetch was "partial" the flag is set 
and fsck ignore missing objects whether they came from a "partial" 
remote or not.

I'll admit it isn't as robust if someone is mixing and matching remotes 
from different servers some of which are partial and some of which are 
not.  I'm not sure how often that would actually happen but I _am_ 
certain a single repo specific flag is a _much_ simpler model than 
anything else we've come up with so far.

>>
>> I doubt you want to treat all fetches/clones the same way as the
>> "lazy object" loading, though.  You may be critically rely on the
>> corporate central server that will give the objects it "promised"
>> when you cloned from it lazily (i.e. it may have given you a commit,
>> but not its parents or objects contained in its tree--you still know
>> that the parents and the tree and its contents will later be
>> available and rely on that fact).  You trust that and build on top,
>> so the packfile you obtained when you cloned from such a server
>> should count as "imported".  But if you exchanged wip changes with
>> your colleagues by fetching or pushing peer-to-peer, without the
>> corporate central server knowing, you would want to treat objects in
>> packs (or loose objects) you obtained that way as "not imported".
> 
> That's true. I discussed this with a teammate and we might need to make
> extensions.lazyObject be the name of the "corporate central server"
> remote instead, and have a "loader" setting within that remote, so that
> we can distinguish that objects from this server are "imported" but
> objects from other servers are not.
> 
> The connectivity check shouldn't be slow in this case because fetches
> are usually onto tips that we have (so we don't hit case 3).
> 
>> Also I think "imported" vs "homegrown" may be a bit of a misnomer; the
>> idea to split objects into two camps sounds like a good idea, and
>> "imported" probably is an OK name to use for the category that is a
>> group of objects that you know/trust are backed by your lazy
>> loader.  But the other one does not have to be "home"-grown.
>>
>> Well, the names are not that important, but I think the line between
>> the two classes should not be "everything that came from clone and
>> fetch is imported", which is a more important point I am trying to
>> make.
>>
>> Thanks.
> 
> Maybe "imported" vs "non-imported" would be better. I agree that the
> objects in the non-"imported" group could still be obtained from
> elsewhere.
> 
> Thanks for your comments.
> 


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:12     ` Ben Peart
@ 2017-08-07 19:21       ` Jonathan Nieder
  2017-08-08 14:18         ` Ben Peart
  2017-08-07 19:41       ` Junio C Hamano
  2017-08-07 23:10       ` Jonathan Tan
  2 siblings, 1 reply; 18+ messages in thread
From: Jonathan Nieder @ 2017-08-07 19:21 UTC (permalink / raw)
  To: Ben Peart; +Cc: Jonathan Tan, Junio C Hamano, git, christian.couder

Hi,

Ben Peart wrote:
>> On Fri, 04 Aug 2017 15:51:08 -0700
>> Junio C Hamano <gitster@pobox.com> wrote:
>>> Jonathan Tan <jonathantanmy@google.com> writes:

>>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>>> file with arbitrary text (similar to the ".keep" file). They come from
>>>> clones, fetches, and the object loader (see below).
>>>> ...
>>>>
>>>> A "homegrown" object is valid if each object it references:
>>>>  1. is a "homegrown" object,
>>>>  2. is an "imported" object, or
>>>>  3. is referenced by an "imported" object.
>>>
>>> Overall it captures what was discussed, and I think it is a good
>>> start.
>
> I missed the offline discussion and so am trying to piece together
> what this latest design is trying to do.  Please let me know if I'm
> not understanding something correctly.

I believe
https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
and the surrounding thread (especially
https://public-inbox.org/git/xmqqefsudjqk.fsf@gitster.mtv.corp.google.com/)
is the discussion Junio is referring to.

[...]
> This segmentation is what is driving the need for the object loader
> to build a new local pack file for every command that has to fetch a
> missing object.  For example, we can't just write a tree object from
> a "partial" clone into the loose object store as we have no way for
> fsck to treat them differently and ignore any missing objects
> referenced by that tree object.

That's related and how it got lumped into this proposal, but it's not
the only motivation.

Other aspects:

 1. using pack files instead of loose objects means we can use deltas.
    This is the primary motivation.

 2. pack files can use reachability bitmaps (I realize there are
    obstacles to getting benefit out of this because git's bitmap
    format currently requires a pack to be self-contained, but I
    thought it was worth mentioning for completeness).

 3. existing git servers are oriented around pack files; they can
    more cheaply serve objects from pack files in pack format,
    including reusing deltas from them.

 4. file systems cope better with a few large files than with many
    small files.

[...]
> We all know that git doesn't scale well with a lot of pack files as
> it has to do a linear search through all the pack files when
> attempting to find an object.  I can see that very quickly, there
> would be a lot of pack files generated and with gc ignoring
> "partial" pack files, this would never get corrected.

Yes, that's an important point.  Regardless of this proposal, we need
to get more aggressive about concatenating pack files (e.g. by
implementing exponential rollup in "git gc --auto").

> In our usage scenarios, _all_ of the objects come from "partial"
> clones so all of our objects would end up in a series of "partial"
> pack files and would have pretty poor performance as a result.

Can you say more about this?  Why would the pack files (or loose
objects, for that matter) never end up being consolidated into few
pack files?

[...]
> That thinking did lead me back to wondering again if we could live
> with a repo specific flag.  If any clone/fetch was "partial" the
> flag is set and fsck ignore missing objects whether they came from a
> "partial" remote or not.
>
> I'll admit it isn't as robust if someone is mixing and matching
> remotes from different servers some of which are partial and some of
> which are not.  I'm not sure how often that would actually happen
> but I _am_ certain a single repo specific flag is a _much_ simpler
> model than anything else we've come up with so far.

The primary motivation in this thread is locally-created objects, not
objects obtained from other remotes.  Objects obtained from other
remotes are more of an edge case.

Thanks for your thoughtful comments.

Jonathan


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:12     ` Ben Peart
  2017-08-07 19:21       ` Jonathan Nieder
@ 2017-08-07 19:41       ` Junio C Hamano
  2017-08-08 16:45         ` Ben Peart
  2017-08-07 23:10       ` Jonathan Tan
  2 siblings, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-08-07 19:41 UTC (permalink / raw)
  To: Ben Peart; +Cc: Jonathan Tan, git, Jonathan Nieder, christian.couder

Ben Peart <peartben@gmail.com> writes:

> My concern with this proposal is the combination of 1) writing a new
> pack file for every git command that ends up bringing down a missing
> object and 2) gc not compressing those pack files into a single pack
> file.

Your noticing these is a sign that you read the outline of the
design correctly, I think.  

The basic idea is that the local fsck should tolerate missing
objects when they are known to be obtainable from that external
service, but should still be able to diagnose missing objects that
the external service is not known to have, especially the ones that
have been newly created locally and not yet made available to the
service by pushing them back.

So we need a way to tell if an object that we do not have (but we
know about) can later be obtained from the external service.
Maintaining an explicit list of such objects obviously is one way,
but we can get the moral equivalent by using pack files.  After
receiving a pack file that has a commit from such an external
service, if the commit refers to its parent commit that we do not
have locally, the design proposes us to consider that the parent
commit that is missing is available at the external service that
gave the pack to us.  Similarly for missing trees, blobs, and any
objects that are supposed to be "reachable" from objects in such a
packfile.  

We can extend the approach to cover loose objects if we wanted to;
just define an alternate object store used internally for this
purpose and drop loose objects obtained from such an external
service in that object store.

Because we do not want to leave too many loose objects and small
packfiles lying around, we will need a new way of packing these.
Just enumerate these objects known to have come from the external
service (by being in packfiles marked as such or being loose objects
in the dedicated alternate object store), and create a single larger
packfile, which is marked as "holding the objects that are known to
be in the external service".  We do not have such a mode of gc, and
that is a new development that needs to happen, but we know that is
doable.
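
A sketch of that enumeration, in terms of the
FOR_EACH_OBJECT_IMPORTED_ONLY flag from the RFC patch later in this
thread (the callback and the oidset destination are illustrative only):

	static int collect_imported(const struct object_id *oid,
				    struct packed_git *pack,
				    uint32_t pos, void *data)
	{
		oidset_insert((struct oidset *)data, oid);
		return 0;
	}

	/*
	 * for_each_packed_object(collect_imported, &set,
	 *			  FOR_EACH_OBJECT_IMPORTED_ONLY);
	 *
	 * ... then feed the collected objects to pack-objects and
	 * mark the resulting pack as imported.
	 */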

> That thinking did lead me back to wondering again if we could live
> with a repo specific flag.  If any clone/fetch was "partial" the flag
> is set and fsck ignore missing objects whether they came from a
> "partial" remote or not.

The only reason people run "git fsck" is to make sure that their
local repository is sound and they can rely on the objects they have
as the base of building new stuff on top of.  That is why we are
trying to find a way to make sure "fsck" can be used to detect
broken or missing objects that cannot be obtained from the
lazy-object store, without incurring undue overhead for the normal
codepath (i.e. outside fsck).

It is OK to go back to wondering again, but I think that essentially
tosses "git fsck" out of the window and declares that it is OK to
hope that local objects will never go bad.  We can make such an
declaration anytime, but I do not want to see us doing so without
first trying to solve the issue without punting.


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:12     ` Ben Peart
  2017-08-07 19:21       ` Jonathan Nieder
  2017-08-07 19:41       ` Junio C Hamano
@ 2017-08-07 23:10       ` Jonathan Tan
  2 siblings, 0 replies; 18+ messages in thread
From: Jonathan Tan @ 2017-08-07 23:10 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, git, Jonathan Nieder, christian.couder

On Mon, 7 Aug 2017 15:12:11 -0400
Ben Peart <peartben@gmail.com> wrote:

> I missed the offline discussion and so am trying to piece together what 
> this latest design is trying to do.  Please let me know if I'm not 
> understanding something correctly.
> 
> From what I can tell, objects are going to be segmented into two
> "types" - those that were fetched from a remote source that allows 
> partial clones/fetches (lazyobject/imported) and those that come from 
> "regular" remote sources (homegrown) that requires all objects to exist 
> locally.
> 
> FWIW, the names here are not making things clearer for me. If I'm 
> correct perhaps "partial" and "normal" would be better to indicate the 
> type of the source? Anyway...

That's right. As for names, I'm leaning now towards "imported" and
"non-imported". "Partial" is a bit strange because such an object is
fully available; it's just that the objects that it references are
promised by the server.

> Once the objects are segmented into the 2 types, the fsck connectivity 
> check code is updated to ignore missing objects from "partial" remotes 
> but still expect/validate them from "normal" remotes.
> 
> This compromise seems reasonable - don't generate errors for missing 
> objects for remotes that returned a partial clone but do generate errors 
> for missing objects from normal clones as a missing object is always an 
> error in this case.

Yes. In addition, the references of "imported" objects are also
potentially used when connectivity-checking "non-imported" objects - if
a "non-imported" object refers to an object that an "imported" object
refers to, that is fine, even though we don't have that object.

> This segmentation is what is driving the need for the object loader to 
> build a new local pack file for every command that has to fetch a 
> missing object.  For example, we can't just write a tree object from a 
> "partial" clone into the loose object store as we have no way for fsck 
> to treat them differently and ignore any missing objects referenced by 
> that tree object.
> 
> My concern with this proposal is the combination of 1) writing a new 
> pack file for every git command that ends up bringing down a missing 
> object and 2) gc not compressing those pack files into a single pack file.
> 
> We all know that git doesn't scale well with a lot of pack files as it 
> has to do a linear search through all the pack files when attempting to 
> find an object.  I can see that very quickly, there would be a lot of 
> pack files generated and with gc ignoring "partial" pack files, this 
> would never get corrected.
> 
> In our usage scenarios, _all_ of the objects come from "partial" clones 
> so all of our objects would end up in a series of "partial" pack files 
> and would have pretty poor performance as a result.

One possible solution...would support for annotating loose objects with
".remote" be sufficient? (That is, for each loose object file created,
create another of the same name but with ".remote" appended.) This means
that a loose-object-creating lazy loader would need to create 2 files
per object instead of one.

The lazy loader protocol will thus be updated to something resembling a
prior version with the loader writing objects directly to the object
database, but now the loader is also responsible for creating the
".remote" files.  (In the Android use case, we probably won't need the
writing-to-partial-packfile mechanism anymore since only comparatively
few and large blobs will go in there.)
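
For concreteness, creating the marker could be as simple as this sketch
(mark_loose_object_remote() is a hypothetical helper; sha1_file_name()
is Git's existing path-computing function):

	static void mark_loose_object_remote(const unsigned char *sha1)
	{
		struct strbuf path = STRBUF_INIT;
		int fd;

		strbuf_addf(&path, "%s.remote", sha1_file_name(sha1));
		fd = open(path.buf, O_WRONLY | O_CREAT, 0444);
		if (fd >= 0)
			close(fd);
		strbuf_release(&path);
	}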


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:21       ` Jonathan Nieder
@ 2017-08-08 14:18         ` Ben Peart
  0 siblings, 0 replies; 18+ messages in thread
From: Ben Peart @ 2017-08-08 14:18 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, Junio C Hamano, git, christian.couder



On 8/7/2017 3:21 PM, Jonathan Nieder wrote:
> Hi,
> 
> Ben Peart wrote:
>>> On Fri, 04 Aug 2017 15:51:08 -0700
>>> Junio C Hamano <gitster@pobox.com> wrote:
>>>> Jonathan Tan <jonathantanmy@google.com> writes:
> 
>>>>> "Imported" objects must be in a packfile that has a "<pack name>.remote"
>>>>> file with arbitrary text (similar to the ".keep" file). They come from
>>>>> clones, fetches, and the object loader (see below).
>>>>> ...
>>>>>
>>>>> A "homegrown" object is valid if each object it references:
>>>>>   1. is a "homegrown" object,
>>>>>   2. is an "imported" object, or
>>>>>   3. is referenced by an "imported" object.
>>>>
>>>> Overall it captures what was discussed, and I think it is a good
>>>> start.
>>
>> I missed the offline discussion and so am trying to piece together
>> what this latest design is trying to do.  Please let me know if I'm
>> not understanding something correctly.
> 
> I believe
> https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
> and the surrounding thread (especially
> https://public-inbox.org/git/xmqqefsudjqk.fsf@gitster.mtv.corp.google.com/)
> is the discussion Junio is referring to.
> 
> [...]
>> This segmentation is what is driving the need for the object loader
>> to build a new local pack file for every command that has to fetch a
>> missing object.  For example, we can't just write a tree object from
>> a "partial" clone into the loose object store as we have no way for
>> fsck to treat them differently and ignore any missing objects
>> referenced by that tree object.
> 
> That's related and how it got lumped into this proposal, but it's not
> the only motivation.
> 
> Other aspects:
> 
>   1. using pack files instead of loose objects means we can use deltas.
>      This is the primary motivation.
> 
>   2. pack files can use reachability bitmaps (I realize there are
>      obstacles to getting benefit out of this because git's bitmap
>      format currently requires a pack to be self-contained, but I
>      thought it was worth mentioning for completeness).
> 
>   3. existing git servers are oriented around pack files; they can
>      more cheaply serve objects from pack files in pack format,
>      including reusing deltas from them.
> 
>   4. file systems cope better with a few large files than with many
>      small files.
> 
> [...]
>> We all know that git doesn't scale well with a lot of pack files as
>> it has to do a linear search through all the pack files when
>> attempting to find an object.  I can see that very quickly, there
>> would be a lot of pack files generated and with gc ignoring
>> "partial" pack files, this would never get corrected.
> 
> Yes, that's an important point.  Regardless of this proposal, we need
> to get more aggressive about concatenating pack files (e.g. by
> implementing exponential rollup in "git gc --auto").
> 
>> In our usage scenarios, _all_ of the objects come from "partial"
>> clones so all of our objects would end up in a series of "partial"
>> pack files and would have pretty poor performance as a result.
> 
> Can you say more about this?  Why would the pack files (or loose
> objects, for that matter) never end up being consolidated into few
> pack files?
> 

Our initial clone is very sparse - we only pull down the commit we are 
about to checkout and none of the blobs. All missing objects are then 
downloaded on demand (and in this proposal, would end up in a "partial" 
pack file).  For performance reasons, we also (by default) download a 
server-computed pack file of commits and trees to pre-populate the local
cache.

Without modification, fsck, repack, prune, and gc would trigger the
download of every missing object in the repo.  We punted for now and
just block those commands, but eventually they need to be aware of
missing objects so that they do not cause them to be downloaded.
Jonathan is already working on this for fsck in another patch series.

> [...]
>> That thinking did lead me back to wondering again if we could live
>> with a repo specific flag.  If any clone/fetch was "partial" the
>> flag is set and fsck ignore missing objects whether they came from a
>> "partial" remote or not.
>>
>> I'll admit it isn't as robust if someone is mixing and matching
>> remotes from different servers some of which are partial and some of
>> which are not.  I'm not sure how often that would actually happen
>> but I _am_ certain a single repo specific flag is a _much_ simpler
>> model than anything else we've come up with so far.
> 
> The primary motivation in this thread is locally-created objects, not
> objects obtained from other remotes.  Objects obtained from other
> remotes are more of an edge case.
> 

Thank you - that helps me to better understand the requirements of the 
problem we're trying to solve.  In short, that means what we really need 
is a way to identify locally created objects so that fsck can do a 
complete connectivity check on them.  I'll have to think about a good 
way to do that - we've talked about a few but each has a different set 
of trade-offs and none of them are great (yet :)).

> Thanks for your thoughtful comments.
> 
> Jonathan
> 


* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-07 19:41       ` Junio C Hamano
@ 2017-08-08 16:45         ` Ben Peart
  2017-08-08 17:03           ` Jonathan Nieder
  0 siblings, 1 reply; 18+ messages in thread
From: Ben Peart @ 2017-08-08 16:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git, Jonathan Nieder, christian.couder



On 8/7/2017 3:41 PM, Junio C Hamano wrote:
> Ben Peart <peartben@gmail.com> writes:
> 
>> My concern with this proposal is the combination of 1) writing a new
>> pack file for every git command that ends up bringing down a missing
>> object and 2) gc not compressing those pack files into a single pack
>> file.
> 
> Your noticing these is a sign that you read the outline of the
> design correctly, I think.
> 
> The basic idea is that the local fsck should tolerate missing
> objects when they are known to be obtainable from that external
> service, but should still be able to diagnose missing objects that
> the external service is not known to have, especially the ones that
> have been newly created locally and not yet made available to the
> service by pushing them back.
> 

This helps me a lot as now I think I understand the primary requirement 
we're trying to solve for.  Let me rephrase it and see if this makes sense:

We need to be able to identify whether an object was created locally 
(and should pass more strict fsck/connectivity tests) or whether it came 
from a remote (and so any missing objects could presumably be fetched 
from the server).

I agree it would be nice to solve this (and not just punt on fsck - even
if it is an opt-in behavior).

We've discussed a couple of different possible solutions, each of which 
has different tradeoffs.  Let me try to summarize here and perhaps
suggest some other possibilities:


Promised list
-------------
This provides an external data structure that allows us to flag objects
that came from a remote server (vs. created locally).

The biggest drawback is that this data structure can get very large and 
become difficult/expensive to generate/transfer/maintain.

It also (at least in one proposal) required protocol and server side 
changes to support it.


Annotated via filename
----------------------
This idea is to annotate the file names of objects that came from a 
remote server (pack files and loose objects) with a unique file 
extension (.remote) that indicates whether they are locally created or not.

To make this work, git must understand about both types of loose objects 
and pack files and search in both locations when looking for objects.

Another drawback of this is that commands (repack, gc) that optimize 
loose objects and pack files must now be aware of the different 
extensions and handle both while not merging remote and non-remote objects.

In short, we're creating separate object stores - one for locally 
created objects and one for everything else.


Now a couple of different ideas:

Annotated via flags
-------------------
The fundamental idea here is that we add the ability to flag locally 
created objects on the object itself.

Given that, at its core, "Git is a simple key-value data store", can we
take advantage of that fact and include a "locally created" bit as a
property on every object?

I could not think of a good way to accomplish this as it is ultimately 
changing the object format which creates rapidly expanding ripples of 
change.

For example, the object header currently includes the type, a space, the
length, and a NUL.  Even if we could add a "local" property (either by
adding a fifth item, taking over the space, creating new object types,
etc.), the fact that the header is included in the sha1 means that push
would become problematic, as flipping the bit would change the sha1 of
the object and of the trees and commits that reference it.
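
For reference, here is a sketch of the hashing I'm describing, modeled
on Git's hash_sha1_file(); it shows why the header cannot carry a
mutable "local" bit:

	static void object_name_sketch(enum object_type type,
				       const void *buf, unsigned long len,
				       unsigned char *sha1)
	{
		git_SHA_CTX c;
		char hdr[32];
		int hdrlen;

		/* header: "<type> <size>" plus the trailing NUL */
		hdrlen = xsnprintf(hdr, sizeof(hdr), "%s %lu",
				   typename(type), len) + 1;

		git_SHA1_Init(&c);
		git_SHA1_Update(&c, hdr, hdrlen); /* header is hashed */
		git_SHA1_Update(&c, buf, len);    /* then the payload */
		git_SHA1_Final(sha1, &c);
	}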


Local list
----------
Given the number of locally created objects is usually very small in 
comparison to the total number of objects (even just due to history), it 
makes more sense to track locally created objects instead of 
promised/remote objects.

The biggest advantage of this over the "promised list" is that the 
"local list" being maintained is _significantly_ smaller (often orders 
of magnitude smaller).

Another advantage over the "promised list" solution is that it doesn't 
require any server side or protocol changes.

On the client when objects are created (write_loose_object?) the new 
objects are added to the "local list" and in the connectivity check 
(fsck) if the object is not in the "local list," the connectivity check 
can be skipped as any missing object can presumably be retrieved from 
the server.

A simple file format could be used (header + list of SHA1 values) and 
write_loose_object could do a trivial append. In fsck, the file could be 
loaded into a hashmap to make for fast existence checks.

Entries could be removed from the "local list" for objects later fetched 
from a server (though I had a hard time contriving a scenario where this 
would happen so I consider this optional).

On the surface, this seems like the simplest solution that meets the 
stated requirements.
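
A sketch of the two halves (the file name and one-hex-name-per-line
format are hypothetical):

	/* at object creation time, e.g. from write_loose_object() */
	static void local_list_append(const struct object_id *oid)
	{
		FILE *fp = fopen(git_path("local-list"), "a");

		if (fp) {
			fprintf(fp, "%s\n", oid_to_hex(oid));
			fclose(fp);
		}
	}

	/* in fsck, load once into an oidset for fast existence checks */
	static void local_list_load(struct oidset *set)
	{
		struct strbuf line = STRBUF_INIT;
		struct object_id oid;
		FILE *fp = fopen(git_path("local-list"), "r");

		if (!fp)
			return;
		while (strbuf_getline(&line, fp) != EOF)
			if (!get_oid_hex(line.buf, &oid))
				oidset_insert(set, &oid);
		fclose(fp);
		strbuf_release(&line);
	}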


Object DB
---------
This is a different way of providing separate object stores than the 
"Annotated via filename" proposal. It should be a cleaner/more elegant 
solution that enables several other capabilities but it is also more 
work to implement (isn't that always the case?).

We create an object store abstraction layer that enables multiple object 
store providers to exist. The order that they are called should be 
configurable based on the command (esp have/read vs create/write). This 
enables features like tiered storage: in memory, pack, loose, alternate, 
large, remote.

The connectivity check in fsck would then only traverse and validate 
objects that existed via the local object store providers.

While I like the flexibility of this design and hope we can obtain it in 
the long term for its other benefits, it's a bit overkill for this
specific problem. The big drawback of this model is the cost to design 
and implement it.



* Re: Partial clone design (with connectivity check for locally-created objects)
  2017-08-08 16:45         ` Ben Peart
@ 2017-08-08 17:03           ` Jonathan Nieder
  0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Nieder @ 2017-08-08 17:03 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, Jonathan Tan, git, christian.couder

Hi,

Ben Peart wrote:

> We've discussed a couple of different possible solutions, each of
> which has different tradeoffs.  Let me try to summarize here and
> perhaps suggest some other possibilities:

Thanks for this.  Some comments below.

> Promised list
> -------------
> This provides an external data structure that allows us to flag
> objects that came from a remote server (vs. created locally).
>
> The biggest drawback is that this data structure can get very large
> and become difficult/expensive to generate/transfer/maintain.

Agreed.  Using a single immutable file to maintain this data with
lock-and-rename update means that I/O when updating it can be a
bottleneck and that contention can be a problem.

> It also (at least in one proposal) required protocol and server side
> changes to support it.

I don't think that's a very big problem.  This is the Git project: we
control the protocol and the server.  Partial clone requires changing
the protocol and server already.

> Annotated via filename
> ----------------------
> This idea is to annotate the file names of objects that came from a
> remote server (pack files and loose objects) with a unique file
> extension (.remote) that indicates whether they are locally created
> or not.
>
> To make this work, git must understand about both types of loose
> objects and pack files and search in both locations when looking for
> objects.

I don't understand the drawback you're describing here.  To avoid a
number of serious problems, Git already needs to be aware of partial
clone.  I don't think anyone has been proposing adding partial clone
to upstream Git without a repository format extension (see
Documentation/technical/repository-version.txt) to prevent older
versions of Git from being confused about such repositories.

If you don't do this, some problems include
- confusing messages due to missing objects
- errors over the wire protocol from trying to serve fetches and
  getting confused
- "git gc" running and not knowing which objects are safe to be
  deleted

So the relevant issue couldn't be that Git has to be changed at all:
it would be that a change is excessively invasive.

But it's not clear to me that the change you are describing is very
invasive.

> Another drawback of this is that commands (repack, gc) that optimize
> loose objects and pack files must now be aware of the different
> extensions and handle both while not merging remote and non-remote
> objects.
>
> In short, we're creating separate object stores - one for locally
> created objects and one for everything else.

These also seem like non-issues.

Some examples of problems I could imagine:

- is writing multiple files when writing a loose object a problem for
  your setup?
- is one of the operations described (repack, prune, fsck) too slow?

Do you forsee either of those being an issue?

> Now a couple of different ideas:
>
> Annotated via flags
> -------------------
> The fundamental idea here is that we add the ability to flag locally
> created objects on the object itself.

Do you mean changing the underlying object format that produces an
object's object id?  Or do you mean changing the container format?

Changing the container format is exactly what was described in the
previous example ("Annotated via filename").  There are other ways to
change the container format: e.g. if writing multiple files when
writing a loose object is a problem, we could add a field that does
not affect the object id to the loose object format.

[...]
> Local list
> ----------
> Given the number of locally created objects is usually very small in
> comparison to the total number of objects (even just due to
> history), it makes more sense to track locally created objects
> instead of promised/remote objects.
>
> The biggest advantage of this over the "promised list" is that the
> "local list" being maintained is _significantly_ smaller (often
> orders of magnitude smaller).
[...]
> On the surface, this seems like the simplest solution that meets the
> stated requirements.

This has the same problems as the list of promised objects: excessive
I/O and contention when updating the list.

Moreover, it doesn't bring one of the main benefits of the list of
promised objects.  Promised objects are not present in the local
repository, so the list of promises provided a way to maintain some
information about them (e.g., object size).  Locally created objects
are present in the local repository so they don't need such metadata.

> Object DB
> ---------

If I understand correctly, this is pushing the issues described in the
other cases into a hook and making them not upstream Git's problem.

But it is still someone's problem.  It just means upstream Git doesn't
benefit from their solution to it.

I don't see a need to give up in that way just yet.

I'm also available on #git-devel on freenode.net for real-time
conversation.  Logs are at http://bit.ly/aLzrmv.  You can prepend a
message with "[off]" to prevent it from showing up in logs.  I'm also
happy to try to summarize any conversation that happens there here.

Thanks,
Jonathan


* [RFC PATCH] Updated "imported object" design
  2017-08-04 21:51 Partial clone design (with connectivity check for locally-created objects) Jonathan Tan
  2017-08-04 22:51 ` Junio C Hamano
@ 2017-08-16  0:32 ` Jonathan Tan
  2017-08-16 20:32   ` Junio C Hamano
  2017-08-17 20:07   ` Ben Peart
  1 sibling, 2 replies; 18+ messages in thread
From: Jonathan Tan @ 2017-08-16  0:32 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, jrnieder, peartben, gitster

This patch is based on an updated version of my refactoring of
pack-related functions [1].

This corresponds to patch 1 of my previous series [2]. I'm sending this
patch out (before I update the rest) for 2 reasons:
 * to provide a demonstration of how the feature could be implemented,
   in the hope of restarting the discussion
 * to obtain comments about this patch to see if I'm heading in the
   right direction

In an earlier e-mail [3], I suggested that loose objects can also be
marked as ".imported" (formerly ".remote" - after looking at the code, I
decided to use "imported" throughout, since "remote" can be easily
confused as the opposite of "local", used to represent objects in the
local store as opposed to an alternate store).

However, I have only implemented the imported packed objects part -
imported loose objects can be added later.

It still remains to be discussed whether we should mark the imported
objects or the non-imported objects as the source of promises, but I
still think that we should mark the imported objects. In this way, the
new properties (the provision of promises and the mark) coincide on the
same object, and the same things (locally created objects, fetches from
non-lazy-object-serving remotes) behave in the same way regardless of
whether extensions.lazyObject is set (allowing, for example, a repo to
be converted into a promise-enabled one solely through modifying the
configuration).

Also, let me know if there's a better way to send out these patches for
review. Some of the code here has been reviewed before, for example.

[1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@google.com/

[2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@google.com/

[3] https://public-inbox.org/git/20170807161031.7c4eae50@twelve2.svl.corp.google.com/
-- 8< --
environment, fsck: introduce lazyobject extension

Currently, Git does not cope well with repos that have very large
numbers of objects, or with repos that wish to minimize manipulation of
certain blobs (for example, because they are very large), even if the
user operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage. In such an arrangement, the full set of objects is usually
available in remote storage, ready to be lazily downloaded.

Introduce the concept of promises, objects believed by the local repo to
be downloadable from remote storage. An object is a promise if it is
referred to by an object in a specially marked packfile. Any such
promise can be validly referred to, even if the object itself is not in
the local repo.

This functionality is guarded behind a new repository extension option
`extensions.lazyObject`. The value of `extensions.lazyObject` must be a
string. The meaning of this string will be defined in a subsequent
commit.

Teach fsck about the new state of affairs. In this commit, teach fsck
that promises referenced from the reflog are not an error case; in
future commits, fsck will be taught about other cases.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/repository-version.txt |  8 +++
 builtin/fsck.c                                 | 64 +++++++++++++++++++-
 cache.h                                        |  5 +-
 environment.c                                  |  1 +
 pack.h                                         |  1 +
 packfile.c                                     | 13 +++-
 setup.c                                        |  7 ++-
 t/t0410-lazy-object.sh                         | 84 ++++++++++++++++++++++++++
 8 files changed, 177 insertions(+), 6 deletions(-)
 create mode 100755 t/t0410-lazy-object.sh

diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986..71cb3bfee 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,11 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`lazyObject`
+~~~~~~~~~~~~
+
+When the config key `extensions.lazyObject` is set, Git does not treat
+missing objects as errors. The value of `extensions.lazyObject` must be
+a string. NEEDSWORK: define what this string contains when the
+appropriate functionality is implemented.
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 99dea7adf..25265b1fe 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -15,6 +15,7 @@
 #include "progress.h"
 #include "streaming.h"
 #include "decorate.h"
+#include "oidset.h"
 
 #define REACHABLE 0x0001
 #define SEEN      0x0002
@@ -43,6 +44,67 @@ static int name_objects;
 #define ERROR_PACK 04
 #define ERROR_REFS 010
 
+/*
+ * Objects that are believed to be loadable by the lazy loader, because
+ * they are referred to by an imported object. If an object that we have
+ * refers to such an object even though we don't have that object, it is
+ * not an error.
+ */
+static struct oidset promises;
+static int promises_prepared;
+
+static int add_promise(const struct object_id *oid, struct packed_git *pack,
+		       uint32_t pos, void *data)
+{
+	struct object *obj = parse_object(oid);
+	if (!obj)
+		/*
+		 * Error messages are given when packs are verified, so
+		 * do not print any here.
+		 */
+		return 0;
+
+	/*
+	 * If this is a tree, commit, or tag, the objects it refers
+	 * to are promises. (Blobs refer to no objects.)
+	 */
+	if (obj->type == OBJ_TREE) {
+		struct tree *tree = (struct tree *) obj;
+		struct tree_desc desc;
+		struct name_entry entry;
+		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
+			/*
+			 * Error messages are given when packs are
+			 * verified, so do not print any here.
+			 */
+			return 0;
+		while (tree_entry_gently(&desc, &entry))
+			oidset_insert(&promises, entry.oid);
+	} else if (obj->type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit *) obj;
+		struct commit_list *parents = commit->parents;
+
+		oidset_insert(&promises, &commit->tree->object.oid);
+		for (; parents; parents = parents->next)
+			oidset_insert(&promises, &parents->item->object.oid);
+	} else if (obj->type == OBJ_TAG) {
+		struct tag *tag = (struct tag *) obj;
+		oidset_insert(&promises, &tag->tagged->oid);
+	}
+	return 0;
+}
+
+static int is_promise(const struct object_id *oid)
+{
+	if (!promises_prepared) {
+		if (repository_format_lazy_object)
+			for_each_packed_object(add_promise, NULL,
+					       FOR_EACH_OBJECT_IMPORTED_ONLY);
+		promises_prepared = 1;
+	}
+	return oidset_contains(&promises, oid);
+}
+
 static const char *describe_object(struct object *obj)
 {
 	static struct strbuf buf = STRBUF_INIT;
@@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
 					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
 			obj->used = 1;
 			mark_object_reachable(obj);
-		} else {
+		} else if (!is_promise(oid)) {
 			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
 			errors_found |= ERROR_REACHABLE;
 		}
diff --git a/cache.h b/cache.h
index b15645672..f529096c8 100644
--- a/cache.h
+++ b/cache.h
@@ -853,10 +853,12 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_lazy_object;
 
 struct repository_format {
 	int version;
 	int precious_objects;
+	char *lazy_object;
 	int is_bare;
 	char *work_tree;
 	struct string_list unknown_extensions;
@@ -1584,7 +1586,8 @@ extern struct packed_git {
 	unsigned pack_local:1,
 		 pack_keep:1,
 		 freshened:1,
-		 do_not_close:1;
+		 do_not_close:1,
+		 pack_imported:1;
 	unsigned char sha1[20];
 	struct revindex_entry *revindex;
 	/* something like ".git/objects/pack/xxxxx.pack" */
diff --git a/environment.c b/environment.c
index 3fd4b1084..cd8ef2897 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_lazy_object;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/pack.h b/pack.h
index 6aae1a7c3..7c196c6cd 100644
--- a/pack.h
+++ b/pack.h
@@ -229,6 +229,7 @@ extern int has_pack_index(const unsigned char *sha1);
  * repository and any alternates repositories (unless the
  * FOR_EACH_OBJECT_LOCAL_ONLY flag, defined in cache.h, is set).
  */
+#define FOR_EACH_OBJECT_IMPORTED_ONLY 2
 typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
 				  uint32_t pos,
diff --git a/packfile.c b/packfile.c
index 2f008ede7..90028f9af 100644
--- a/packfile.c
+++ b/packfile.c
@@ -637,10 +637,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 		return NULL;
 
 	/*
-	 * ".pack" is long enough to hold any suffix we're adding (and
+	 * ".imported" is long enough to hold any suffix we're adding (and
 	 * the use xsnprintf double-checks that)
 	 */
-	alloc = st_add3(path_len, strlen(".pack"), 1);
+	alloc = st_add3(path_len, strlen(".imported"), 1);
 	p = alloc_packed_git(alloc);
 	memcpy(p->pack_name, path, path_len);
 
@@ -648,6 +648,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_keep = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".imported");
+	if (!access(p->pack_name, F_OK))
+		p->pack_imported = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -775,7 +779,8 @@ static void prepare_packed_git_one(char *objdir, int local)
 		if (ends_with(de->d_name, ".idx") ||
 		    ends_with(de->d_name, ".pack") ||
 		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep"))
+		    ends_with(de->d_name, ".keep") ||
+		    ends_with(de->d_name, ".imported"))
 			string_list_append(&garbage, path.buf);
 		else
 			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
@@ -1893,6 +1898,8 @@ int for_each_packed_object(each_packed_object_fn cb, void *data, unsigned flags)
 	for (p = packed_git; p; p = p->next) {
 		if ((flags & FOR_EACH_OBJECT_LOCAL_ONLY) && !p->pack_local)
 			continue;
+		if ((flags & FOR_EACH_OBJECT_IMPORTED_ONLY) && !p->pack_imported)
+			continue;
 		if (open_pack_index(p)) {
 			pack_errors = 1;
 			continue;
diff --git a/setup.c b/setup.c
index 860507e1f..94cfde3cc 100644
--- a/setup.c
+++ b/setup.c
@@ -425,7 +425,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 			;
 		else if (!strcmp(ext, "preciousobjects"))
 			data->precious_objects = git_config_bool(var, value);
-		else
+		else if (!strcmp(ext, "lazyobject")) {
+			if (!value)
+				return config_error_nonbool(var);
+			data->lazy_object = xstrdup(value);
+		} else
 			string_list_append(&data->unknown_extensions, ext);
 	} else if (strcmp(var, "core.bare") == 0) {
 		data->is_bare = git_config_bool(var, value);
@@ -468,6 +472,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 	}
 
 	repository_format_precious_objects = candidate.precious_objects;
+	repository_format_lazy_object = candidate.lazy_object;
 	string_list_clear(&candidate.unknown_extensions, 0);
 	if (!has_common) {
 		if (candidate.is_bare != -1) {
diff --git a/t/t0410-lazy-object.sh b/t/t0410-lazy-object.sh
new file mode 100755
index 000000000..2368150d4
--- /dev/null
+++ b/t/t0410-lazy-object.sh
@@ -0,0 +1,84 @@
+#!/bin/sh
+
+test_description='lazy object'
+
+. ./test-lib.sh
+
+delete_object () {
+	rm $1/.git/objects/$(echo $2 | cut -c1-2)/$(echo $2 | cut -c3-40)
+}
+
+pack_imported_object () {
+	printf "%s\n" "$1" | git -C repo pack-objects .git/objects/pack/pack &&
+	(
+		cd repo/.git/objects/pack &&
+		>"$(basename *.pack .pack).imported"
+	)
+}
+
+test_expect_success 'missing reflog object referred to by imported commit passes fsck' '
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	C=$(git -C repo commit-tree -m c -p $A HEAD^{tree}) &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	# Designate $C, which refers to $A, as an imported object
+	pack_imported_object "$C" &&
+
+	# Normally, it fails
+	test_must_fail git -C repo fsck &&
+
+	# But with the extension, it succeeds
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_expect_success 'missing reflog object referred to by imported tag passes fsck' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	git -C repo tag -a -m d my_tag_name $A &&
+	T=$(git -C repo rev-parse my_tag_name) &&
+	git -C repo tag -d my_tag_name &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	# Designate $T, which refers to $A, as an imported object
+	pack_imported_object "$T" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_expect_success 'missing reflog object alone fails fsck, even with extension set' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.lazyobject "arbitrary string" &&
+	test_must_fail git -C repo fsck
+'
+
+test_done
-- 
2.14.1.480.gb18f417b89-goog



* Re: [RFC PATCH] Updated "imported object" design
  2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
@ 2017-08-16 20:32   ` Junio C Hamano
  2017-08-16 21:35     ` Jonathan Tan
  2017-08-17 20:07   ` Ben Peart
  1 sibling, 1 reply; 18+ messages in thread
From: Junio C Hamano @ 2017-08-16 20:32 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, jrnieder, peartben

Jonathan Tan <jonathantanmy@google.com> writes:

> Also, let me know if there's a better way to send out these patches for
> review. Some of the code here has been reviewed before, for example.
>
> [1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@google.com/
>
> [2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@google.com/
>
> [3] https://public-inbox.org/git/20170807161031.7c4eae50@twelve2.svl.corp.google.com/

... and some of the code exists only in the list archive, so we
don't know which other topic if any we may want to eject tentatively
if we wanted to give precedence to move this topic forward over
others.  I'll worry about it later but help from others is also
appreciated.

As to the contents of this patch, overall, everything makes sense,
except for one thing that makes me wonder.  It's not that I see
something specifically incorrect--it is just that I do not yet quite
fathom its implications.

> +/*
> + * Objects that are believed to be loadable by the lazy loader, because
> + * they are referred to by an imported object. If an object that we have
> + * refers to such an object even though we don't have that object, it is
> + * not an error.
> + */
> +static struct oidset promises;
> +static int promises_prepared;
> +
> +static int add_promise(const struct object_id *oid, struct packed_git *pack,
> +		       uint32_t pos, void *data)
> +{
> +	struct object *obj = parse_object(oid);
> +	if (!obj)
> +		/*
> +		 * Error messages are given when packs are verified, so
> +		 * do not print any here.
> +		 */
> +		return 0;
> +	
> +	/*
> +	 * If this is a tree, commit, or tag, the objects it refers
> +	 * to are promises. (Blobs refer to no objects.)
> +	 */
> +	if (obj->type == OBJ_TREE) {
> +		struct tree *tree = (struct tree *) obj;
> +		struct tree_desc desc;
> +		struct name_entry entry;
> +		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
> +			/*
> +			 * Error messages are given when packs are
> +			 * verified, so do not print any here.
> +			 */
> +			return 0;
> +		while (tree_entry_gently(&desc, &entry))
> +			oidset_insert(&promises, entry.oid);
> +	} else if (obj->type == OBJ_COMMIT) {
> +		struct commit *commit = (struct commit *) obj;
> +		struct commit_list *parents = commit->parents;
> +
> +		oidset_insert(&promises, &commit->tree->object.oid);
> +		for (; parents; parents = parents->next)
> +			oidset_insert(&promises, &parents->item->object.oid);
> +	} else if (obj->type == OBJ_TAG) {
> +		struct tag *tag = (struct tag *) obj;
> +		oidset_insert(&promises, &tag->tagged->oid);
> +	}
> +	return 0;
> +}

This collects names of the objects that are _directly_ referred to
by imported objects.  An imported pack may have a commit, whose
top-level tree may or may not appear in the same pack, or the tree
may exist locally but not in the same pack.  Or the tree may not be
locally available at all.  In any of these four cases, the top-level
tree is listed in the "promises" set.  Same for trees and tags.

I wonder if all of the calls to oidset_insert() in this function
want to be guarded by "mark it as promised only when the referent
is *not* locally available" to keep the promises set minimally
populated.  The only change needed to fsck in order to make it
refrain from treating a missing but promised object as an error
would be:

        -       if (object is missing)
        +       if (object is missing && object is not promised)
                        error("that object must be there but missing");

so there is no point in throwing something that we know we locally
have in this oidset, right?

On the other hand, cost of such additional checks in this function
may outweigh the savings of both memory pressure and look-up cost,
so I do not know how the tradeoff would turn out.
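
For illustration, the guard would just be an existence check in
front of each insertion, along these lines (an untested sketch; I
am assuming has_object_file() is the right check to use here):

        while (tree_entry_gently(&desc, &entry))
                if (!has_object_file(entry.oid))
                        oidset_insert(&promises, entry.oid);

and the same for the commit and tag cases.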

> +static int is_promise(const struct object_id *oid)
> +{
> +	if (!promises_prepared) {
> +		if (repository_format_lazy_object)
> +			for_each_packed_object(add_promise, NULL,
> +					       FOR_EACH_OBJECT_IMPORTED_ONLY);
> +		promises_prepared = 1;
> +	}
> +	return oidset_contains(&promises, oid);
> +}

Somehow I'm tempted to call this function "is_promised()" but that
is a minor naming issue.

>  static const char *describe_object(struct object *obj)
>  {
>  	static struct strbuf buf = STRBUF_INIT;
> @@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
>  					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
>  			obj->used = 1;
>  			mark_object_reachable(obj);
> -		} else {
> +		} else if (!is_promise(oid)) {
>  			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
>  			errors_found |= ERROR_REACHABLE;
>  		}

This is almost certainly one place we want to check if the missing
object is OK, but I would be a bit surprised if this were the only place.

Don't we need "while trying to follow all the outgoing links from
this tree object, and we found this object is not available locally;
normally we would mark it as an error but it turns out that the
missing one is in the promised set of objects, so it is OK" for the
normal connectivity traversal codepaths, for example?
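
For instance, wherever the traversal discovers that an outgoing
link points at an object we do not have, something of this shape
(illustrative only; I did not check which function this would
actually land in):

        obj = parse_object(oid);
        if (!obj) {
                if (is_promise(oid))
                        return 0; /* missing but promised, so OK */
                return error("broken link to %s", oid_to_hex(oid));
        }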



* Re: [RFC PATCH] Updated "imported object" design
  2017-08-16 20:32   ` Junio C Hamano
@ 2017-08-16 21:35     ` Jonathan Tan
  2017-08-17 20:50       ` Ben Peart
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Tan @ 2017-08-16 21:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, jrnieder, peartben

On Wed, 16 Aug 2017 13:32:23 -0700
Junio C Hamano <gitster@pobox.com> wrote:

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > Also, let me know if there's a better way to send out these patches for
> > review. Some of the code here has been reviewed before, for example.
> >
> > [1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@google.com/
> >
> > [2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@google.com/
> >
> > [3] https://public-inbox.org/git/20170807161031.7c4eae50@twelve2.svl.corp.google.com/
> 
> ... and some of the code exists only in the list archive, so we
> don't know which other topic if any we may want to eject tentatively
> if we wanted to give precedence to move this topic forward over
> others.  I'll worry about it later but help from others is also
> appreciated.

Thanks - I can help take a look when it is time to move the code in.

I think the issue here is whether we want to move this topic forward or
not, that is, if this (special ".imported" objects) is the best way to
solve (at least partially) the connectivity check part of tolerating
missing objects. I hope that we can continue to talk about it.

> This collects names of the objects that are _directly_ referred to
> by imported objects.  An imported pack may have a commit, whose
> top-level tree may or may not appear in the same pack, or the tree
> may exist locally but not in the same pack.  Or the tree may not be
> locally available at all.  In any of these four cases, the top-level
> tree is listed in the "promises" set.  Same for trees and tags.
> 
> I wonder if all of the calls to oidset_insert() in this function
> want to be guarded by "mark it as promised only when the referent
> is *not* locally available" to keep the promises set minimally
> populated.  The only change needed to fsck in order to make it
> refrain from treating a missing but promised object as an error
> would be:
> 
>         -       if (object is missing)
>         +       if (object is missing && object is not promised)
>                         error("that object must be there but missing");
> 
> so there is no point in throwing something that we know we locally
> have in this oidset, right?
> 
> On the other hand, cost of such additional checks in this function
> may outweigh the savings of both memory pressure and look-up cost,
> so I do not know how the tradeoff would turn out.

I also don't know how the tradeoff would turn out, so I leaned towards
the slightly simpler solution of not doing the check. In the future,
maybe a t/perf test can be done to decide between the two.

> > +static int is_promise(const struct object_id *oid)
> > +{
> > +	if (!promises_prepared) {
> > +		if (repository_format_lazy_object)
> > +			for_each_packed_object(add_promise, NULL,
> > +					       FOR_EACH_OBJECT_IMPORTED_ONLY);
> > +		promises_prepared = 1;
> > +	}
> > +	return oidset_contains(&promises, oid);
> > +}
> 
> Somehow I'm tempted to call this function "is_promised()" but that
> is a minor naming issue.

I was trying to be consistent in using the name "promise" instead of
"promised object/tag/commit/tree/blob" everywhere, but we can switch if
need be (for example, if we don't want to limit the generic name
"promise" to merely objects).

> >  static const char *describe_object(struct object *obj)
> >  {
> >  	static struct strbuf buf = STRBUF_INIT;
> > @@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
> >  					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
> >  			obj->used = 1;
> >  			mark_object_reachable(obj);
> > -		} else {
> > +		} else if (!is_promise(oid)) {
> >  			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
> >  			errors_found |= ERROR_REACHABLE;
> >  		}
> 
> This is almost certainly one place we want to check if the missing
> object is OK, but I would be a bit surprised if this were the only place.
> 
> Don't we need "while trying to follow all the outgoing links from
> this tree object, and we found this object is not available locally;
> normally we would mark it as an error but it turns out that the
> missing one is in the promised set of objects, so it is OK" for the
> normal connectivity traversal codepaths, for example?

That's right. The places to make this change are the same as those in
some earlier patches I sent (patches 2-4 in [1]).

[1] https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/


* Re: [RFC PATCH] Updated "imported object" design
  2017-08-16  0:32 ` [RFC PATCH] Updated "imported object" design Jonathan Tan
  2017-08-16 20:32   ` Junio C Hamano
@ 2017-08-17 20:07   ` Ben Peart
  1 sibling, 0 replies; 18+ messages in thread
From: Ben Peart @ 2017-08-17 20:07 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: jrnieder, gitster



On 8/15/2017 8:32 PM, Jonathan Tan wrote:
> This patch is based on an updated version of my refactoring of
> pack-related functions [1].
> 
> This corresponds to patch 1 of my previous series [2]. I'm sending this
> patch out (before I update the rest) for 2 reasons:
>   * to provide a demonstration of how the feature could be implemented,
>     in the hope of restarting the discussion
>   * to obtain comments about this patch to see if I'm heading in the
>     right direction
> 
> In an earlier e-mail [3], I suggested that loose objects can also be
> marked as ".imported" (formerly ".remote" - after looking at the code, I
> decided to use "imported" throughout, since "remote" can be easily
> confused as the opposite of "local", used to represent objects in the
> local store as opposed to an alternate store).
> 
> However, I have only implemented the imported packed objects part -
> imported loose objects can be added later.
> 
> It still remains to be discussed whether we should mark the imported
> objects or the non-imported objects as the source of promises, but I
> still think that we should mark the imported objects. In this way, the
> new properties (the provision of promises and the mark) coincide on the
> same object, and the same things (locally created objects, fetches from
> non-lazy-object-serving remotes) behave in the same way regardless of
> whether extensions.lazyObject is set (allowing, for example, a repo to
> be converted into a promise-enabled one solely through modifying the
> configuration).
> 

This illustrates another place we need to resolve the naming/vocabulary. 
  We should at least be consistent to make it easier to discuss/explain. 
  We obviously went with "virtual" when building GVFS but I'm OK with 
"lazy" as long as we're consistent.  Some examples of how the naming can 
clarify or confuse:

'Promise-enable your repo by setting the "extensions.lazyObject" flag'

'Enable your repo to lazily fetch objects by setting the 
"extensions.lazyObject"'

'Virtualize your repo by setting the "extensions.virtualize" flag'

We may want to carry the same name into the filename we use to mark the 
(virtualized/lazy/promised/imported) objects.

(This reminds me that there are only 2 hard problems in computer 
science...) ;)


It is true that converting a normal repo to a 
virtualized/lazy/promise-enabled repo wouldn't have to munge the 
object marking, but since this only happens once, I wouldn't make that 
the primary decision maker (and switching back would require munging 
the objects anyway).  I think we can make either work (ie tagging 
local vs non-local/remote/imported/promised/lazy/virtual).

I'm fine leaving "how" the objects are marked as an implementation 
detail - the design requires them to be marked; the person writing the 
code can figure out the fastest way to accomplish that.

In the current patch/design, the access time to check for the existence 
of an "imported" file is going to be dwarfed by the cost of opening and 
parsing every object (loose and packed) to get its oid, so that isn't 
really an issue.

> +/*
> + * Objects that are believed to be loadable by the lazy loader, because
> + * they are referred to by an imported object. If an object that we have
> + * refers to such an object even though we don't have that object, it is
> + * not an error.
> + */
> +static struct oidset promises;
> +static int promises_prepared;
> +
> +static int add_promise(const struct object_id *oid, struct packed_git *pack,
> +		       uint32_t pos, void *data)
> +{
> +	struct object *obj = parse_object(oid);
> +	if (!obj)
> +		/*
> +		 * Error messages are given when packs are verified, so
> +		 * do not print any here.
> +		 */
> +		return 0;
> +	
> +	/*
> +	 * If this is a tree, commit, or tag, the objects it refers
> +	 * to are promises. (Blobs refer to no objects.)
> +	 */
> +	if (obj->type == OBJ_TREE) {
> +		struct tree *tree = (struct tree *) obj;
> +		struct tree_desc desc;
> +		struct name_entry entry;
> +		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
> +			/*
> +			 * Error messages are given when packs are
> +			 * verified, so do not print any here.
> +			 */
> +			return 0;
> +		while (tree_entry_gently(&desc, &entry))
> +			oidset_insert(&promises, entry.oid);
> +	} else if (obj->type == OBJ_COMMIT) {
> +		struct commit *commit = (struct commit *) obj;
> +		struct commit_list *parents = commit->parents;
> +
> +		oidset_insert(&promises, &commit->tree->object.oid);
> +		for (; parents; parents = parents->next)
> +			oidset_insert(&promises, &parents->item->object.oid);
> +	} else if (obj->type == OBJ_TAG) {
> +		struct tag *tag = (struct tag *) obj;
> +		oidset_insert(&promises, &tag->tagged->oid);
> +	}
> +	return 0;
> +}
> +
> +static int is_promise(const struct object_id *oid)
> +{
> +	if (!promises_prepared) {
> +		if (repository_format_lazy_object)
> +			for_each_packed_object(add_promise, NULL,
> +					       FOR_EACH_OBJECT_IMPORTED_ONLY);
> +		promises_prepared = 1;
> +	}
> +	return oidset_contains(&promises, oid);
> +}
> +

I think this all works and would meet the requirements we've been 
discussing.  The big trade off here vs what we first discussed with 
promises is that we are generating the list of promises on the fly when 
they are needed rather than downloading and maintaining a list locally.

My biggest concern with this model is the cost of opening and parsing 
every imported object (loose and pack for local and alternates) to build 
the oidset of promises.

In fsck this probably won't be an issue as it already focuses on 
correctness at the expense of speed.  I'm more worried about when we add 
the same/similar logic into check_connected.  That impacts fetch, clone, 
and receive_pack.

I guess the only way we can know for sure is to do a perf test and 
measure the impact.


* Re: [RFC PATCH] Updated "imported object" design
  2017-08-16 21:35     ` Jonathan Tan
@ 2017-08-17 20:50       ` Ben Peart
  2017-08-17 21:39         ` Jonathan Tan
  0 siblings, 1 reply; 18+ messages in thread
From: Ben Peart @ 2017-08-17 20:50 UTC (permalink / raw)
  To: Jonathan Tan, Junio C Hamano; +Cc: git, jrnieder



On 8/16/2017 5:35 PM, Jonathan Tan wrote:
> On Wed, 16 Aug 2017 13:32:23 -0700
> Junio C Hamano <gitster@pobox.com> wrote:
> 
>> Jonathan Tan <jonathantanmy@google.com> writes:
>>
>>> Also, let me know if there's a better way to send out these patches for
>>> review. Some of the code here has been reviewed before, for example.
>>>
>>> [1] https://public-inbox.org/git/cover.1502241234.git.jonathantanmy@google.com/
>>>
>>> [2] https://public-inbox.org/git/ffb734d277132802bcc25baa13e8ede3490af62a.1501532294.git.jonathantanmy@google.com/
>>>
>>> [3] https://public-inbox.org/git/20170807161031.7c4eae50@twelve2.svl.corp.google.com/
>>
>> ... and some of the code exists only in the list archive, so we
>> don't know which other topic if any we may want to eject tentatively
>> if we wanted to give precedence to move this topic forward over
>> others.  I'll worry about it later but help from others is also
>> appreciated.
> 
> Thanks - I can help take a look when it is time to move the code in.
> 

I agree that having this depend on patches elsewhere in the list archive 
makes it more difficult to review.  I know I like to see things in 
context to get a better picture.

> I think the issue here is whether we want to move this topic forward or
> not, that is, if this (special ".imported" objects) is the best way to
> solve (at least partially) the connectivity check part of tolerating
> missing objects. I hope that we can continue to talk about it.
> 

I think this topic should continue to move forward so that we can 
provide reasonable connectivity tests for fsck and check_connected in 
the face of partial clones.  I'm not sure the prototype implementation 
of reading/parsing all imported objects to build the promised oidset is 
the most performant model but we can continue to investigate the best 
options.

>> This collects names of the objects that are _directly_ referred to
>> by imported objects.  An imported pack may have a commit, whose
>> top-level tree may or may not appear in the same pack, or the tree
>> may exist locally but not in the same pack.  Or the tree may not be
>> locally available at all.  In any of these four cases, the top-level
>> tree is listed in the "promises" set.  Same for trees and tags.
>>
>> I wonder if all of the calls to oidset_insert() in this function
>> want to be guarded by "mark it as promised only when the referent
>> is *not* locally available" to keep the promises set minimally
>> populated.  The only change needed to fsck in order to make it
>> refrain from treating a missing but promised object as an error
>> would be:
>>
>>          -       if (object is missing)
>>          +       if (object is missing && object is not promised)
>>                          error("that object must be there but missing");
>>
>> so there is no point in throwing something that we know we locally
>> have in this oidset, right?
>>
>> On the other hand, cost of such additional checks in this function
>> may outweigh the savings of both memory pressure and look-up cost,
>> so I do not know how the tradeoff would turn out.
> 
> I also don't know how the tradeoff would turn out, so I leaned towards
> the slightly simpler solution of not doing the check. In the future,
> maybe a t/perf test can be done to decide between the two.
> 
>>> +static int is_promise(const struct object_id *oid)
>>> +{
>>> +	if (!promises_prepared) {
>>> +		if (repository_format_lazy_object)
>>> +			for_each_packed_object(add_promise, NULL,
>>> +					       FOR_EACH_OBJECT_IMPORTED_ONLY);
>>> +		promises_prepared = 1;
>>> +	}
>>> +	return oidset_contains(&promises, oid);
>>> +}
>>
>> Somehow I'm tempted to call this function "is_promised()" but that
>> is a minor naming issue.
> 

Given all we need is an existence check for a given oid, I wonder if it 
would be faster overall to do a binary search through the list of 
imported idx files + an existence test for an imported loose object.

Especially in the check_connected case which isn't verifying every 
object, that should be a lot less IO than loading all the imported 
commits, trees and blobs and pre-computing an oidset of all possible 
objects.  The lookup for each object would be slower than a simple call 
to oidset_contains but we avoid the up front cost.

With some caching of idx files and threading, I suspect this could be 
made pretty fast.
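
For the packed half, the lookup could be as simple as this rough
sketch (pack_imported comes from this patch; find_pack_entry_one()
and prepare_packed_git() already exist; the loose-object test and
any caching are left out):

        static int is_in_imported_pack(const struct object_id *oid)
        {
                struct packed_git *p;

                prepare_packed_git();
                for (p = packed_git; p; p = p->next) {
                        if (!p->pack_imported)
                                continue;
                        if (find_pack_entry_one(oid->hash, p))
                                return 1;
                }
                return 0;
        }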

> I was trying to be consistent in using the name "promise" instead of
> "promised object/tag/commit/tree/blob" everywhere, but we can switch if
> need be (for example, if we don't want to limit the generic name
> "promise" to merely objects).
> 
>>>   static const char *describe_object(struct object *obj)
>>>   {
>>>   	static struct strbuf buf = STRBUF_INIT;
>>> @@ -410,7 +472,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
>>>   					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
>>>   			obj->used = 1;
>>>   			mark_object_reachable(obj);
>>> -		} else {
>>> +		} else if (!is_promise(oid)) {
>>>   			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
>>>   			errors_found |= ERROR_REACHABLE;
>>>   		}
>>
>> This is almost certainly one place we want to check if the missing
>> object is OK, but I would be a bit surprised if this were the only place.
>>
>> Don't we need "while trying to follow all the outgoing links from
>> this tree object, and we found this object is not available locally;
>> normally we would mark it as an error but it turns out that the
>> missing one is in the promised set of objects, so it is OK" for the
>> normal connectivity traversal codepaths, for example?
> 
> That's right. The places to make this change are the same as those in
> some earlier patches I sent (patches 2-4 in [1]).
> 
> [1] https://public-inbox.org/git/cover.1501532294.git.jonathantanmy@google.com/
> 


* Re: [RFC PATCH] Updated "imported object" design
  2017-08-17 20:50       ` Ben Peart
@ 2017-08-17 21:39         ` Jonathan Tan
  2017-08-18 14:18           ` Ben Peart
  0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Tan @ 2017-08-17 21:39 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, git, jrnieder

Thanks for your comments. I'll reply to both your e-mails in this one
e-mail.

> This illustrates another place we need to resolve the
> naming/vocabulary.  We should at least be consistent to make it easier
> to discuss/explain.  We obviously went with "virtual" when building
> GVFS but I'm OK with "lazy" as long as we're consistent.  Some
> examples of how the naming can clarify or confuse:
> 
> 'Promise-enable your repo by setting the "extensions.lazyObject" flag'
> 
> 'Enable your repo to lazily fetch objects by setting the
> "extensions.lazyObject"'
> 
> 'Virtualize your repo by setting the "extensions.virtualize" flag'
> 
> We may want to carry the same name into the filename we use to mark
> the (virtualized/lazy/promised/imported) objects.
> 
> (This reminds me that there are only 2 hard problems in computer
> science...) ;)

Good point about the name. Maybe the 2nd one is the best? (Mainly
because I would expect a "virtualized" repo to have virtual refs too.)

But if there was a good way to refer to the "anti-projection" in a
virtualized system (that is, the "real" thing or "object" behind the
"virtual" thing or "image"), then maybe the "virtualized" language is
the best. (And I would gladly change - I'm having a hard time coming up
with a name for the "anti-projection" in the "lazy" language.)

Also, I should probably standardize on "lazily fetch" instead of "lazily
load". I didn't want to overlap with the existing fetching, but after
some thought, it's probably better to do that. The explanation would
thus be that you can either use the built-in Git fetcher (to be built,
although I have an old version here [1]) or supply a custom fetcher.

[1] https://github.com/jonathantanmy/git/commits/partialclone
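
In configuration terms, with the extension in this patch, turning
the feature on looks like:

        [core]
                repositoryformatversion = 1
        [extensions]
                lazyObject = <some string>

where the meaning of the value (currently an arbitrary string;
presumably, later, something identifying the fetcher) is still to
be settled.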

> I think this all works and would meet the requirements we've been
> discussing.  The big trade off here vs what we first discussed with
> promises is that we are generating the list of promises on the fly
> when they are needed rather than downloading and maintaining a list
> locally.
> 
> My biggest concern with this model is the cost of opening and parsing
> every imported object (loose and pack for local and alternates) to
> build the oidset of promises.
> 
> In fsck this probably won't be an issue as it already focuses on
> correctness at the expense of speed.  I'm more worried about when we
> add the same/similar logic into check_connected.  That impacts fetch,
> clone, and receive_pack.
> 
> I guess the only way we can know for sure is to do a perf test and
> measure the impact.

As for fetching from the main repo, the connectivity check does not need
to be performed at all because all objects are "imported", so the
performance of the connectivity check does not matter. Same for cloning.

This is not true if you're fetching from another repo or if you're using
receive-pack, but (1) I think these are not used as much in such a
situation, and (2) if you do use them, the slowness only "kicks in" if
you do not have the objects referred to (whether non-"imported" or
"imported") and thus have to check the references in all "imported"
objects.
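
(Mechanically, I would expect the fetch-from-the-main-repo case to
reduce to an early return somewhere in the connectivity check,
something like:

        if (repository_format_lazy_object)
                return 0; /* all objects just fetched are "imported" */

though that is not part of this patch.)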

> I think this topic should continue to move forward so that we can 
> provide reasonable connectivity tests for fsck and check_connected in 
> the face of partial clones.  I'm not sure the prototype implementation 
> of reading/parsing all imported objects to build the promised oidset is 
> the most performant model but we can continue to investigate the best 
> options.

Agreed - I think the most important thing here is settling on the API
(name of extension and the nature of the object mark).

> Given all we need is an existence check for a given oid,

This is true...

> I wonder if it 
> would be faster overall to do a binary search through the list of 
> imported idx files + an existence test for an imported loose object.

...but what we're checking is the existence of a reference, not the
existence of an object. For a concrete example, consider what happens if
we both have an "imported" tree and a non-"imported" tree that
references a blob that we do not have. When checking the non-"imported"
tree for connectivity, we have to iterate through all "imported" trees
to see if any can vouch for the existence of such a blob. We cannot
merely binary-search the .idx file.


* Re: [RFC PATCH] Updated "imported object" design
  2017-08-17 21:39         ` Jonathan Tan
@ 2017-08-18 14:18           ` Ben Peart
  2017-08-18 23:33             ` Jonathan Tan
  0 siblings, 1 reply; 18+ messages in thread
From: Ben Peart @ 2017-08-18 14:18 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, git, jrnieder



On 8/17/2017 5:39 PM, Jonathan Tan wrote:
> Thanks for your comments. I'll reply to both your e-mails in this one
> e-mail.
> 
>> This illustrates another place we need to resolve the
>> naming/vocabulary.  We should at least be consistent to make it easier
>> to discuss/explain.  We obviously went with "virtual" when building
>> GVFS but I'm OK with "lazy" as long as we're consistent.  Some
>> examples of how the naming can clarify or confuse:
>>
>> 'Promise-enable your repo by setting the "extensions.lazyObject" flag'
>>
>> 'Enable your repo to lazily fetch objects by setting the
>> "extensions.lazyObject"'
>>
>> 'Virtualize your repo by setting the "extensions.virtualize" flag'
>>
>> We may want to carry the same name into the filename we use to mark
>> the (virtualized/lazy/promised/imported) objects.
>>
>> (This reminds me that there are only 2 hard problems in computer
>> science...) ;)
> 
> Good point about the name. Maybe the 2nd one is the best? (Mainly
> because I would expect a "virtualized" repo to have virtual refs too.)
> 
> But if there was a good way to refer to the "anti-projection" in a
> virtualized system (that is, the "real" thing or "object" behind the
> "virtual" thing or "image"), then maybe the "virtualized" language is
> the best. (And I would gladly change - I'm having a hard time coming up
> with a name for the "anti-projection" in the "lazy" language.)
> 

The most common "anti-virtual" language I'm familiar with is "physical." 
  Virtual machine <-> physical machine. Virtual world <-> physical 
world. Virtual repo, commit, tree, blob - physical repo, commit, tree, 
blob. I'm not thrilled but I think it works...

> Also, I should probably standardize on "lazily fetch" instead of "lazily
> load". I didn't want to overlap with the existing fetching, but after
> some thought, it's probably better to do that. The explanation would
> thus be that you can either use the built-in Git fetcher (to be built,
> although I have an old version here [1]) or supply a custom fetcher.
> 
> [1] https://github.com/jonathantanmy/git/commits/partialclone
> 
>> I think this all works and would meet the requirements we've been
>> discussing.  The big trade off here vs what we first discussed with
>> promises is that we are generating the list of promises on the fly
>> when they are needed rather than downloading and maintaining a list
>> locally.
>>
>> My biggest concern with this model is the cost of opening and parsing
>> every imported object (loose and pack for local and alternates) to
>> build the oidset of promises.
>>
>> In fsck this probably won't be an issue as it already focuses on
>> correctness at the expense of speed.  I'm more worried about when we
>> add the same/similar logic into check_connected.  That impacts fetch,
>> clone, and receive_pack.
>>
>> I guess the only way we can know for sure is to do a perf test and
>> measure the impact.
> 
> As for fetching from the main repo, the connectivity check does not need
> to be performed at all because all objects are "imported", so the
> performance of the connectivity check does not matter. Same for cloning.
> 

Very good point! I got stuck on the connectivity check in general, 
forgetting that we really only need to prevent sharing a corrupt repo.

> This is not true if you're fetching from another repo 

This isn't a case we've explicitly dealt with (multiple remotes into a 
virtualized repo).  Our behavior today would be that once you set the 
"virtual repo" flag on the repo (this happens at clone for us), all 
remotes are treated as virtual as well (ie we don't differentiate 
behavior based on which remote was used).  Our "custom fetcher" always 
uses "origin" and some custom settings for a cache-server saved in the 
.git/config file when asked to fetch missing objects.

This is probably a good model to stick with, at least initially, as 
trying to solve multiple possible "virtual" remotes, as well as 
mingling virtualized and non-virtualized remotes and all the mixed 
cases that can come up, makes my head hurt.  We should probably 
address that in a different thread. :)

> or if you're using
> receive-pack, but (1) I think these are not used as much in such a
> situation, and (2) if you do use them, the slowness only "kicks in" if
> you do not have the objects referred to (whether non-"imported" or
> "imported") and thus have to check the references in all "imported"
> objects.
> 

Is there any case where receive-pack is used on the client side?  I'm 
only aware of it being used on the server side to receive packs pushed 
from the client.  If it is not used in a virtualized client, then we 
would not need to do anything different for receive-pack.

>> I think this topic should continue to move forward so that we can
>> provide reasonable connectivity tests for fsck and check_connected in
>> the face of partial clones.  I'm not sure the prototype implementation
>> of reading/parsing all imported objects to build the promised oidset is
>> the most performant model but we can continue to investigate the best
>> options.
> 
> Agreed - I think the most important thing here is settling on the API
> (name of extension and the nature of the object mark).
> 
>> Given all we need is an existence check for a given oid,
> 
> This is true...
> 
>> I wonder if it
>> would be faster overall to do a binary search through the list of
>> imported idx files + an existence test for an imported loose object.
> 
> ...but what we're checking is the existence of a reference, not the
> existence of an object. For a concrete example, consider what happens if
> we have both an "imported" tree and a non-"imported" tree that
> references a blob that we do not have. When checking the non-"imported"
> tree for connectivity, we have to iterate through all "imported" trees
> to see if any can vouch for the existence of such a blob. We cannot
> merely binary-search the .idx file.
> 

That is another good point.  Given the discussion above about not 
needing to do the connectivity test for fetch/clone, the potential 
perf hit of loading/parsing all the various objects to build up the 
oidset is much less of an issue.



* Re: [RFC PATCH] Updated "imported object" design
  2017-08-18 14:18           ` Ben Peart
@ 2017-08-18 23:33             ` Jonathan Tan
  0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Tan @ 2017-08-18 23:33 UTC (permalink / raw)
  To: Ben Peart; +Cc: Junio C Hamano, git, jrnieder

On Fri, 18 Aug 2017 10:18:37 -0400
Ben Peart <peartben@gmail.com> wrote:

> > But if there was a good way to refer to the "anti-projection" in a
> > virtualized system (that is, the "real" thing or "object" behind the
> > "virtual" thing or "image"), then maybe the "virtualized" language is
> > the best. (And I would gladly change - I'm having a hard time coming up
> > with a name for the "anti-projection" in the "lazy" language.)
> > 
> 
> The most common "anti-virtual" language I'm familiar with is "physical." 
>   Virtual machine <-> physical machine. Virtual world <-> physical 
> world. Virtual repo, commit, tree, blob - physical repo, commit, tree, 
> blob. I'm not thrilled but I think it works...

I was thinking more along the lines of the "entity that projects the
virtualization", not the opposite of a "virtualization" - "physical"
might work for the latter but probably not the former.

After some in-office discussion, if we stick to the "promise" concept,
maybe we have something like this:

  In a partial clone, the origin acts as a promisor of objects. Every
  object obtained from the promisor also acts as a promise that any
  object directly or indirectly referenced from that object is fetchable
  from the promisor.

> > This is not true if you're fetching from another repo 
> 
> This isn't a case we've explicitly dealt with (multiple remotes into a 
> virtualized repo).  Our behavior today would be that once you set the 
> "virtual repo" flag on the repo (this happens at clone for us), all 
> remotes are treated as virtual as well (ie we don't differentiate 
> behavior based on which remote was used).  Our "custom fetcher" always 
> uses "origin" and some custom settings for a cache-server saved in the 
> .git/config file when asked to fetch missing objects.
> 
> This is probably a good model to stick with, at least initially, as 
> trying to solve multiple possible "virtual" remotes, as well as 
> mingling virtualized and non-virtualized remotes and all the mixed 
> cases that can come up, makes my head hurt.  We should probably 
> address that in a different thread. :)

OK, let's stick to the current model first then, whether our opinion on
other remotes is (1) "we won't have any other remotes so we don't care",
(2) "we have other remotes but it's fine to make sure that they don't
introduce any new missing objects", or (3) "we need other remotes to
introduce missing objects, but we can build that after this foundation
is laid".

> > or if you're using
> > receive-pack, but (1) I think these are not used as much in such a
> > situation, and (2) if you do use them, the slowness only "kicks in" if
> > you do not have the objects referred to (whether non-"imported" or
> > "imported") and thus have to check the references in all "imported"
> > objects.
> > 
> 
> Is there any case where receive-pack is used on the client side?  I'm 
> only aware of it being used on the server side to receive packs pushed 
> from the client.  If it is not used in a virtualized client, then we 
> would not need to do anything different for receive-pack.

This happens if another repo decides to push to the virtualized client,
which (as I wrote) I don't expect to happen often. My intention is to
ensure that receive-pack will still work.

> That is another good point.  Given the discussion above about not 
> needing to do the connectivity test for fetch/clone, the potential 
> perf hit of loading/parsing all the various objects to build up the 
> oidset is much less of an issue.

Agreed.
