git.vger.kernel.org archive mirror
* [PATCH] doc: describe Git bundle format
@ 2020-01-30 22:58 Masaya Suzuki
  2020-01-31 13:56 ` Johannes Schindelin
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Masaya Suzuki @ 2020-01-30 22:58 UTC (permalink / raw)
  To: git; +Cc: Masaya Suzuki

The bundle format was not documented. Describe the format with ABNF and
explain the meaning of each part.

Signed-off-by: Masaya Suzuki <masayasuzuki@google.com>
---
 Documentation/technical/bundle-format.txt | 40 +++++++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 Documentation/technical/bundle-format.txt

diff --git a/Documentation/technical/bundle-format.txt b/Documentation/technical/bundle-format.txt
new file mode 100644
index 0000000000..dbb80225b5
--- /dev/null
+++ b/Documentation/technical/bundle-format.txt
@@ -0,0 +1,40 @@
+= Git bundle v2 format
+
+The Git bundle format is a format that represents both refs and Git objects.
+
+== Format
+
+We will use ABNF notation to define the Git bundle format. See
+protocol-common.txt for the details.
+
+----
+bundle    = signature references pack
+signature = "# v2 git bundle" LF
+
+references   = *(prerequisite / ref) LF
+prerequisite = "-" obj-id SP comment LF
+comment      = *CHAR
+ref          = obj-id SP refname LF
+
+pack         = ... ; packfile
+----
+
+== Semantics
+
+A Git bundle consists of three parts.
+
+*   Prerequisites: Optional list of objects that are not included in the bundle
+    file. A bundle can reference these prerequisite objects (or it can reference
+    the objects reachable from the prerequisite objects). The bundle itself
+    might not contain those objects.
+*   References: Mapping of ref names to objects.
+*   Git objects: Commit, tree, blob, and tags. These are included in the pack
+    format.
+
+If a bundle contains prerequisites, it means the bundle has a thin pack and the
+bundle alone is not enough for resolving all objects. When you read such
+bundles, you should have those missing objects beforehand.
+
+In the bundle format, there can be a comment following a prerequisite obj-id.
+This is a comment and it has no specific meaning. When you write a bundle, you
+can put any string here. When you read a bundle, you can ignore this part.
-- 
2.25.0.341.g760bfbb309-goog



* Re: [PATCH] doc: describe Git bundle format
  2020-01-30 22:58 [PATCH] doc: describe Git bundle format Masaya Suzuki
@ 2020-01-31 13:56 ` Johannes Schindelin
  2020-01-31 20:38 ` Junio C Hamano
  2020-01-31 22:18 ` [PATCH v2] " Masaya Suzuki
  2 siblings, 0 replies; 16+ messages in thread
From: Johannes Schindelin @ 2020-01-31 13:56 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: git

Hi,

On Thu, 30 Jan 2020, Masaya Suzuki wrote:

> The bundle format was not documented. Describe the format with ABNF and
> explain the meaning of each part.

LGTM,
Dscho



* Re: [PATCH] doc: describe Git bundle format
  2020-01-30 22:58 [PATCH] doc: describe Git bundle format Masaya Suzuki
  2020-01-31 13:56 ` Johannes Schindelin
@ 2020-01-31 20:38 ` Junio C Hamano
  2020-01-31 21:49   ` Masaya Suzuki
  2020-01-31 22:18 ` [PATCH v2] " Masaya Suzuki
  2 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2020-01-31 20:38 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: git

Masaya Suzuki <masayasuzuki@google.com> writes:

> The bundle format was not documented. Describe the format with ABNF and
> explain the meaning of each part.

Thanks.

>
> Signed-off-by: Masaya Suzuki <masayasuzuki@google.com>
> ---
>  Documentation/technical/bundle-format.txt | 40 +++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
>  create mode 100644 Documentation/technical/bundle-format.txt
>
> diff --git a/Documentation/technical/bundle-format.txt b/Documentation/technical/bundle-format.txt
> new file mode 100644
> index 0000000000..dbb80225b5
> --- /dev/null
> +++ b/Documentation/technical/bundle-format.txt
> @@ -0,0 +1,40 @@
> += Git bundle v2 format
> +
> +The Git bundle format is a format that represents both refs and Git objects.
> +
> +== Format
> +
> +We will use ABNF notation to define the Git bundle format. See
> +protocol-common.txt for the details.
> +
> +----
> +bundle    = signature references pack
> +signature = "# v2 git bundle" LF

Good.  "signature" is the name used by bundle.c::create_bundle() to
call this part.

> +references   = *(prerequisite / ref) LF

This allows prereqs and refs to come intermixed, but I think we show
all prerequisites first before refs.

> +prerequisite = "-" obj-id SP comment LF
> +comment      = *CHAR

Do readers know what CHAR consists of?  Anything other than NUL and
LF?

> +ref          = obj-id SP refname LF

OK.

"prerequisite" and "ref" are both used in bundle.c::create_bundle(),
so calling these parts with these names is consistent with the code.
"head" is also a good name for the latter as "git bundle list-heads"
is the way the end-users access them from outside.

> +
> +pack         = ... ; packfile
> +----
> +
> +== Semantics
> +
> +A Git bundle consists of three parts.
> +
> +*   Prerequisites: Optional list of objects that are not included in the bundle
> +    file. A bundle can reference these prerequisite objects (or it can reference
> +    the objects reachable from the prerequisite objects). The bundle itself
> +    might not contain those objects.

While not incorrect per-se, the above misses the more important
points (and defers the description to a later paragraph).  It is
better to describe what it means to have prereqs upfront.  

> +*   References: Mapping of ref names to objects.
> +*   Git objects: Commit, tree, blob, and tags. These are included in the pack
> +    format.
> +

Match the name you used to describe the parts in the earlier ABNF
description, so that the correspondence is clear to the readers.
You somehow used "references" to mean both prereqs and heads, but in
the above you are describing only "heads" under the label of
"references".

Perhaps something like this?

    * "Prerequisites" lists the objects that are NOT included in the
      bundle and the receiver of the bundle MUST already have, in
      order to use the data in the bundle.  The objects stored in
      the bundle may refer to prerequisite objects and anything
      reachable from them and/or expressed as a delta against
      prerequisite objects.

    * "Heads" record the tips of the history graph, iow, what the
      receiver of the bundle CAN "git fetch" from it.

    * "Pack" is the pack data stream "git fetch" would send, if you
      fetch from a repository that has the references recorded in
      the "Heads" above into a repository that has references
      pointing at the objects listed in "Prerequisites" above.

> +If a bundle contains prerequisites, it means the bundle has a thin pack and the
> +bundle alone is not enough for resolving all objects. When you read such
> +bundles, you should have those missing objects beforehand.

With the above rewrite, this paragraph is unneeded.

> +In the bundle format, there can be a comment following a prerequisite obj-id.
> +This is a comment and it has no specific meaning. When you write a bundle, you
> +can put any string here. When you read a bundle, you can ignore this part.

Is it "you can"?  At least the last one should be "readers of a
bundle MUST ignore the comment", I think.


* Re: [PATCH] doc: describe Git bundle format
  2020-01-31 20:38 ` Junio C Hamano
@ 2020-01-31 21:49   ` Masaya Suzuki
  2020-01-31 23:01     ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Masaya Suzuki @ 2020-01-31 21:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Fri, Jan 31, 2020 at 12:38 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Masaya Suzuki <masayasuzuki@google.com> writes:
>
> > The bundle format was not documented. Describe the format with ABNF and
> > explain the meaning of each part.
>
> Thanks.
>
> >
> > Signed-off-by: Masaya Suzuki <masayasuzuki@google.com>
> > ---
> >  Documentation/technical/bundle-format.txt | 40 +++++++++++++++++++++++
> >  1 file changed, 40 insertions(+)
> >  create mode 100644 Documentation/technical/bundle-format.txt
> >
> > diff --git a/Documentation/technical/bundle-format.txt b/Documentation/technical/bundle-format.txt
> > new file mode 100644
> > index 0000000000..dbb80225b5
> > --- /dev/null
> > +++ b/Documentation/technical/bundle-format.txt
> > @@ -0,0 +1,40 @@
> > += Git bundle v2 format
> > +
> > +The Git bundle format is a format that represents both refs and Git objects.
> > +
> > +== Format
> > +
> > +We will use ABNF notation to define the Git bundle format. See
> > +protocol-common.txt for the details.
> > +
> > +----
> > +bundle    = signature references pack
> > +signature = "# v2 git bundle" LF
>
> Good.  "signature" is the name used by bundle.c::create_bundle() to
> call this part.
>
> > +references   = *(prerequisite / ref) LF
>
> This allows prereqs and refs to come intermixed, but I think we show
> all prerequisites first before refs.

Based on bundle.c::parse_bundle_header(), I infer that this can be
mixed. If that's not intended, this can be changed to have
prerequisites first.

>
> > +prerequisite = "-" obj-id SP comment LF
> > +comment      = *CHAR
>
> Do readers know what CHAR consists of?  Anything other than NUL and
> LF?

RFC 5234 defines core rules
(https://tools.ietf.org/html/rfc5234#appendix-B.1), and these CHAR etc
are defined there. It should be OK to use these rules.
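
For illustration only, the CHAR core rule from RFC 5234 appendix B.1
(CHAR = %x01-7F) can be written as a byte check. This is a sketch, not
code from Git; `is_abnf_char` and `is_valid_comment` are hypothetical
helper names, and excluding LF from the comment is my reading of the
grammar, since LF terminates the prerequisite line.

```python
def is_abnf_char(byte: int) -> bool:
    """Return True if the byte value matches RFC 5234 CHAR (%x01-7F)."""
    return 0x01 <= byte <= 0x7F


def is_valid_comment(comment: bytes) -> bool:
    """Check a prerequisite comment against *CHAR (hypothetical helper).

    LF (0x0A) is formally inside CHAR, but it cannot appear in a comment
    here because it ends the prerequisite line, so it is rejected.
    """
    return all(is_abnf_char(b) and b != 0x0A for b in comment)
```

Under this reading, NUL and any byte above 0x7F fall outside CHAR.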

>
> > +ref          = obj-id SP refname LF
>
> OK.
>
> "prerequisite" and "ref" are both used in bundle.c::create_bundle(),
> so calling these parts with these names is consistent with the code.
> "head" is also a good name for the latter as "git bundle list-heads"
> is the way the end-users access them from outside.
>
> > +
> > +pack         = ... ; packfile
> > +----
> > +
> > +== Semantics
> > +
> > +A Git bundle consists of three parts.
> > +
> > +*   Prerequisites: Optional list of objects that are not included in the bundle
> > +    file. A bundle can reference these prerequisite objects (or it can reference
> > +    the objects reachable from the prerequisite objects). The bundle itself
> > +    might not contain those objects.
>
> While not incorrect per-se, the above misses the more important
> points (and defers the description to a later paragraph).  It is
> better to describe what it means to have prereqs upfront.
>
> > +*   References: Mapping of ref names to objects.
> > +*   Git objects: Commit, tree, blob, and tags. These are included in the pack
> > +    format.
> > +
>
> Match the name you used to describe the parts in the earlier ABNF
> description, so that the correspondence is clear to the readers.
> You somehow used "references" to mean both prereqs and heads, but in
> the above you are describing only "heads" under the label of
> "references".

Yes. It should match with the ABNF definition above.

I usually use "heads" to mean "references under refs/heads/*" (not
sure if this is true for other people). Since a bundle can contain
tags etc., using "heads" here seems confusing. With the prerequisites and
references split as you mentioned above, I think I can make the ABNF and
this semantics section consistent in terms of wording.

bundle = signature *prerequisite *reference LF pack
prerequisite = "-" obj-id SP comment LF
comment = *CHAR
reference = obj-id SP refname LF
pack = ... ; packfile

The terms ("prerequisite" and "reference") are consistent with
bundle.h::ref_list.
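
To make the grammar above concrete, here is a minimal sketch of a
reader for it. It is not bundle.c; `read_bundle_header` is a
hypothetical name, and I am assuming obj-id is 40 lowercase hex digits
(SHA-1). It enforces the "prerequisites first" ordering and drops the
comment.

```python
import re
from typing import List, Tuple

SIGNATURE = b"# v2 git bundle\n"
# Assumption: obj-id is 40 lowercase hex digits (SHA-1).
OBJ_ID = rb"[0-9a-f]{40}"
PREREQ_RE = re.compile(rb"-(" + OBJ_ID + rb") (.*)")
REF_RE = re.compile(rb"(" + OBJ_ID + rb") (.+)")


def read_bundle_header(data: bytes) -> Tuple[List[str], List[Tuple[str, str]], bytes]:
    """Split bundle bytes into (prerequisites, references, pack bytes)."""
    if not data.startswith(SIGNATURE):
        raise ValueError("missing bundle signature")
    pos = len(SIGNATURE)
    prerequisites: List[str] = []
    references: List[Tuple[str, str]] = []
    while True:
        end = data.find(b"\n", pos)
        if end == -1:
            raise ValueError("header is not terminated by an empty line")
        line = data[pos:end]
        pos = end + 1
        if not line:
            break  # the lone LF separates the header from the pack
        match = PREREQ_RE.fullmatch(line)
        if match:
            if references:
                raise ValueError("prerequisite found after a reference")
            # The comment (group 2) is deliberately ignored.
            prerequisites.append(match.group(1).decode("ascii"))
            continue
        match = REF_RE.fullmatch(line)
        if not match:
            raise ValueError("malformed header line: %r" % line)
        references.append(
            (match.group(1).decode("ascii"), match.group(2).decode("utf-8", "replace"))
        )
    return prerequisites, references, data[pos:]
```

Everything after the empty line is returned untouched as the pack part;
parsing the packfile itself is out of scope for this sketch.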

>
> Perhaps something like this?
>
>     * "Prerequisites" lists the objects that are NOT included in the
>       bundle and the receiver of the bundle MUST already have, in
>       order to use the data in the bundle.  The objects stored in
>       the bundle may refer to prerequisite objects and anything
>       reachable from them and/or expressed as a delta against
>       prerequisite objects.

I want to make sure I understand the meaning of prerequisites.

1. Are they meant for a delta base? Or are they meant to represent a
partial/shallow state?

If these prerequisites are used as a delta base, the receiver of the
bundle MUST have them. If these prerequisites are the indicators of
the shallowness or the partialness of the repository, the pack data
would have complete data in terms of deltification (e.g. all objects
in the pack file can be undeltified with just the pack file), and the
bundle can be treated as a shallow-cloned/partially-cloned repository
snapshot.

From what I can see from bundle.c, I think it's an indicator of a
delta base, not an indicator of a shallow/partial state, but I want to
make sure.

2. Do they need to be commits? Or can they be any object type?

From what I can see, it seems that they should always be commits.

3. Does the receiver have to have all reachable objects from prerequisites?

My understanding is "Yes, the receiver must have all reachable objects
from prerequisites." This means that if a receiver has a
shallow-cloned repository, they might not be able to process a bundle
with prerequisites. The bundle's pack part can deltify against the
objects that exist beyond the shallow depth.

>
>     * "Heads" record the tips of the history graph, iow, what the
>       receiver of the bundle CAN "git fetch" from it.
>
>     * "Pack" is the pack data stream "git fetch" would send, if you
>       fetch from a repository that has the references recorded in
>       the "Heads" above into a repository that has references
>       pointing at the objects listed in "Prerequisites" above.

I'll adopt this in the next patch.

>
> > +If a bundle contains prerequisites, it means the bundle has a thin pack and the
> > +bundle alone is not enough for resolving all objects. When you read such
> > +bundles, you should have those missing objects beforehand.
>
> With the above rewrite, this paragraph is unneeded.
>
> > +In the bundle format, there can be a comment following a prerequisite obj-id.
> > +This is a comment and it has no specific meaning. When you write a bundle, you
> > +can put any string here. When you read a bundle, you can ignore this part.
>
> Is it "you can"?  At least the last one should be "readers of a
> bundle MUST ignore the comment", I think.

I'll change this to MUST.


* [PATCH v2] doc: describe Git bundle format
  2020-01-30 22:58 [PATCH] doc: describe Git bundle format Masaya Suzuki
  2020-01-31 13:56 ` Johannes Schindelin
  2020-01-31 20:38 ` Junio C Hamano
@ 2020-01-31 22:18 ` Masaya Suzuki
  2020-01-31 23:06   ` Junio C Hamano
  2020-02-07 20:42   ` [PATCH v3] " Masaya Suzuki
  2 siblings, 2 replies; 16+ messages in thread
From: Masaya Suzuki @ 2020-01-31 22:18 UTC (permalink / raw)
  To: git; +Cc: Masaya Suzuki

The bundle format was not documented. Describe the format with ABNF and
explain the meaning of each part.

Signed-off-by: Masaya Suzuki <masayasuzuki@google.com>
---
Changes from v1:

* Update the ABNF definition so that prerequisites come before references.
* Adopt Junio's suggestion on the semantics section.
* State that the receiver MUST ignore the comments in the prereqs.
* Change "you" to "the receiver" and "the sender" (I wonder if this should be
  "writer" and "reader").

 Documentation/technical/bundle-format.txt | 41 +++++++++++++++++++++++
 1 file changed, 41 insertions(+)
 create mode 100644 Documentation/technical/bundle-format.txt

diff --git a/Documentation/technical/bundle-format.txt b/Documentation/technical/bundle-format.txt
new file mode 100644
index 0000000000..f568fcd7d1
--- /dev/null
+++ b/Documentation/technical/bundle-format.txt
@@ -0,0 +1,41 @@
+= Git bundle v2 format
+
+The Git bundle format is a format that represents both refs and Git objects.
+
+== Format
+
+We will use ABNF notation to define the Git bundle format. See
+protocol-common.txt for the details.
+
+----
+bundle    = signature *prerequisite *reference LF pack
+signature = "# v2 git bundle" LF
+
+prerequisite = "-" obj-id SP comment LF
+comment      = *CHAR
+reference    = obj-id SP refname LF
+
+pack         = ... ; packfile
+----
+
+== Semantics
+
+A Git bundle consists of three parts.
+
+* "Prerequisites" lists the objects that are NOT included in the bundle and the
+  receiver of the bundle MUST already have, in order to use the data in the
+  bundle. The objects stored in the bundle may refer to prerequisite objects and
+  anything reachable from them and/or expressed as a delta against prerequisite
+  objects.
+
+* "References" record the tips of the history graph, iow, what the receiver of
+  the bundle CAN "git fetch" from it.
+
+* "Pack" is the pack data stream "git fetch" would send, if you fetch from a
+  repository that has the references recorded in the "References" above into a
+  repository that has references pointing at the objects listed in
+  "Prerequisites" above.
+
+In the bundle format, there can be a comment following a prerequisite obj-id.
+This is a comment and it has no specific meaning. The sender of the bundle MAY
+put any string here. The receiver of the bundle MUST ignore the comment.
-- 
2.25.0.341.g760bfbb309-goog



* Re: [PATCH] doc: describe Git bundle format
  2020-01-31 21:49   ` Masaya Suzuki
@ 2020-01-31 23:01     ` Junio C Hamano
  2020-01-31 23:57       ` Masaya Suzuki
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2020-01-31 23:01 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: Git Mailing List

Masaya Suzuki <masayasuzuki@google.com> writes:

>> > +prerequisite = "-" obj-id SP comment LF
>> > +comment      = *CHAR
>>
>> Do readers know what CHAR consists of?  Anything other than NUL and
>> LF?
>
> RFC 5234 defines core rules
> (https://tools.ietf.org/html/rfc5234#appendix-B.1), and these CHAR etc
> are defined there. It should be OK to use these rules.

That's not what I asked.  Do readers know that?  Did you tell them
that we expect they are familiar with the RFC convention?

It might be easier to make the above simple ABNF understandable to
those without knowledge of RFC 5234 by spelling out what CHAR in the
context of the above description means.  Or to tell them "go over
there and learn CHAR then come back".  We need to do one of them.

> I want to make sure I understand the meaning of prerequisites.
>
> 1. Are they meant for a delta base? Or are they meant to represent a
> partial/shallow state?

They are meant as the "bottom boundary" of the range of the pack
data stored in the bundle.

Think of "git rev-list --objects $heads --not $prerequisites".  If
we limit ourselves to commits, in the simplest case, "git log
maint..master".  Imagine your repository has everything up to
'maint' (and nothing else) and then you are "git fetch"-ing from
another repository that advanced the tip that now points at
'master'.  Imagine the data transferred over the network.  Imagine
that data is frozen on disk somehow.  That is what a bundle is.

So, 'maint' is the prerequisite---for the person who builds the
bundle, it can safely be assumed that the bundle will be used only
by those who already have 'maint'.

There is nothing about 'partial' or 'shallow'.  And even though a
bundle typically has deltified objects in the packfile, it does not
have to.  Some objects are deltified against prerequisites, and the
logic to generate thin packs may even prefer to use the
prerequisites as the delta base, but it is merely a side effect that
the prerequisites are at the "bottom boundary" of the range.
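
As a toy illustration of that range, restricted to commits: the sketch
below models history as a plain dict from commit name to parent names
and computes "reachable from heads but not from prerequisites".  It is
not how git computes it; `reachable` and `bundle_range` are
hypothetical names used only here.

```python
from typing import Dict, Iterable, List, Set


def reachable(parents: Dict[str, List[str]], tips: Iterable[str]) -> Set[str]:
    """Return every commit reachable from the given tips, tips included."""
    seen: Set[str] = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def bundle_range(
    parents: Dict[str, List[str]],
    heads: Iterable[str],
    prerequisites: Iterable[str],
) -> Set[str]:
    """Commits a bundle would carry: reachable from heads, minus the prereq side."""
    return reachable(parents, heads) - reachable(parents, prerequisites)


# Linear history A <- B <- C <- D, with 'maint' at B and 'master' at D.
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
carried = bundle_range(history, heads=["D"], prerequisites=["B"])
```

Here `carried` holds C and D only; B and everything below it is what
the reader is expected to have already.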

> 2. Do they need to be commits? Or can they be any object type?
>
> From what I can see, it seems that they should always be commits.
>
> 3. Does the receiver have to have all reachable objects from prerequisites?

I would say that the receiver needs to have everything that is
needed to "complete" prereqs.

Bundle transfer predates shallow or incomplete repositories, but I
think that we can (and we should if needed) update it to adjust to
these situations by using the appropriate definition of what it
means to "complete".  In a lazy clone, it may be sufficient to have
promisor remote that has everything reachable from them.  In a
shallow clone, the repository may have to be deep enough to have
them and objects immediately reachable from them (e.g. trees and
blobs for a commit at the "bottom boundary").

Thanks.


* Re: [PATCH v2] doc: describe Git bundle format
  2020-01-31 22:18 ` [PATCH v2] " Masaya Suzuki
@ 2020-01-31 23:06   ` Junio C Hamano
  2020-02-07 20:42   ` [PATCH v3] " Masaya Suzuki
  1 sibling, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2020-01-31 23:06 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: git

Masaya Suzuki <masayasuzuki@google.com> writes:

> * Change "you" to "the receiver" and "the sender" (I wonder if this should be
>   "writer" and "reader").

I come from "a bundle is a frozen snapshot of 'git fetch' transfer"
school, and consider the act of reading a "fetch" too much, but I
think "writer" and "reader" is more understandable for new readers.

Thanks.


* Re: [PATCH] doc: describe Git bundle format
  2020-01-31 23:01     ` Junio C Hamano
@ 2020-01-31 23:57       ` Masaya Suzuki
  2020-02-04 18:20         ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Masaya Suzuki @ 2020-01-31 23:57 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Fri, Jan 31, 2020 at 3:01 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Masaya Suzuki <masayasuzuki@google.com> writes:
>
> >> > +prerequisite = "-" obj-id SP comment LF
> >> > +comment      = *CHAR
> >>
> >> Do readers know what CHAR consists of?  Anything other than NUL and
> >> LF?
> >
> > RFC 5234 defines core rules
> > (https://tools.ietf.org/html/rfc5234#appendix-B.1), and these CHAR etc
> > are defined there. It should be OK to use these rules.
>
> That's not what I asked.  Do readers know that?  Did you tell them
> that we expect they are familiar with the RFC convention?

The patch says "We will use ABNF notation to define the Git bundle
format. See protocol-common.txt for the details.", and
protocol-common.txt says "ABNF notation as described by RFC 5234 is
used within the protocol documents, except the following replacement
core rules are used:". In order to interpret this ABNF definition,
it's not enough to read RFC 5234, but the reader has to read
protocol-common.txt. Otherwise, they cannot understand what `obj-id`
is and what `refname` is. Those are not defined in RFC 5234. They're
defined in protocol-common.txt.

Based on the fact that (1) this document instructs the reader to see
protocol-common.txt in the beginning and (2) protocol-common.txt is
needed to interpret this definition and protocol-common.txt says RFC
5234 describes ABNF format, the readers should know ABNF is defined in
RFC 5234 and ABNF includes those LF, CHAR, and SP as a part of the
definition after reading the first sentence and referenced documents.

>
> It might be easier to make the above simple ABNF understandable to
> those without knowledge of RFC 5234 by spelling out what CHAR in the
> context of the above description means.  Or to tell them "go over
> there and learn CHAR then come back".  We need to do one of them.

As I said above, the first sentence says "See protocol-common.txt"
which includes the reference to the RFC and other non-terminals. Note
that, not only CHAR, but obj-id and refname are not defined here as
well. The readers need to reference protocol-common.txt to get the
definition of them.

>
> > I want to make sure I understand the meaning of prerequisites.
> >
> > 1. Are they meant for a delta base? Or are they meant to represent a
> > partial/shallow state?
>
> They are meant as the "bottom boundary" of the range of the pack
> data stored in the bundle.
>
> Think of "git rev-list --objects $heads --not $prerequisites".  If
> we limit ourselves to commits, in the simplest case, "git log
> maint..master".  Imagine your repository has everything up to
> 'maint' (and nothing else) and then you are "git fetch"-ing from
> another repository that advanced the tip that now points at
> 'master'.  Imagine the data transferred over the network.  Imagine
> that data is frozen on disk somehow.  That is what a bundle is.
>
> So, 'maint' is the prerequisite---for the person who builds the
> bundle, it can safely be assumed that the bundle will be used only
> by those who already have 'maint'.
>
> There is nothing about 'partial' or 'shallow'.  And even though a
> bundle typically has deltified objects in the packfile, it does not
> have to.  Some objects are deltified against prerequisites, and the
> logic to generate thin packs may even prefer to use the
> prerequisites as the delta base, but it is merely a side effect that
> the prerequisites are at the "bottom boundary" of the range.

OK. Then, it's better to make this clear. If you follow the analogy of
a saved git-fetch response, these prerequisites could be interpreted
the same as the "shallow" lines of a shallow clone response. They are
more like the "have" lines of a git-fetch request.

> > 2. Do they need to be commits? Or can they be any object type?
> >
> > From what I can see, it seems that they should always be commits.
> >
> > 3. Does the receiver have to have all reachable objects from prerequisites?
>
> I would say that the receiver needs to have everything that is
> needed to "complete" prereqs.
>
> Bundle transfer predates shallow or incomplete repositories, but I
> think that we can (and we should if needed) update it to adjust to
> these situations by using the appropriate definition of what it
> means to "complete".  In a lazy clone, it may be sufficient to have
> promisor remote that has everything reachable from them.  In a
> shallow clone, the repository may have to be deep enough to have
> them and objects immediately reachable from them (e.g. trees and
> blobs for a commit at the "bottom boundary").

I think there are two kinds of completeness for a packfile:

* Delta complete: If an object in a packfile is deltified, the delta
base exists in the same packfile.
* Object complete: If an object in a packfile contains a reference to
another object, that object exists in the same packfile.

For example, an initial shallow clone response should contain a
delta-complete, object-incomplete packfile. An incremental fetch response
or a bundle with prereqs would have a delta-incomplete,
object-incomplete packfile. Creating a delta-incomplete, object-complete
packfile is possible (e.g. create a parallel history with all blobs
slightly modified, then build a packfile that holds every object of one
history deltified against the other history), but it's a rare case.
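
The two properties can be sketched with a toy model of a packfile. This
is only an illustration of the definitions above, not Git's pack code;
`PackedObject`, `is_delta_complete` and `is_object_complete` are
hypothetical names, and a pack is modelled as a dict from object name
to its delta base and outgoing links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PackedObject:
    # Name of the delta base, or None if the object is stored whole.
    delta_base: Optional[str] = None
    # Objects this one refers to (parent commits, trees, blobs, tagged object).
    links: List[str] = field(default_factory=list)


def is_delta_complete(pack: Dict[str, PackedObject]) -> bool:
    """True if every delta base used in the pack is itself in the pack."""
    return all(o.delta_base is None or o.delta_base in pack for o in pack.values())


def is_object_complete(pack: Dict[str, PackedObject]) -> bool:
    """True if every object referenced from the pack is itself in the pack."""
    return all(link in pack for o in pack.values() for link in o.links)


# Shallow-clone-like pack: no outside delta bases, but parent commit c1 is absent.
shallow_like = {
    "c2": PackedObject(links=["c1", "t2"]),
    "t2": PackedObject(links=["b2"]),
    "b2": PackedObject(),
}

# Bundle-with-prereq-like pack: b2 is a delta against b1, which is not included.
thin_like = {
    "c2": PackedObject(links=["c1", "t2"]),
    "t2": PackedObject(links=["b2"]),
    "b2": PackedObject(delta_base="b1"),
}
```

In this model `shallow_like` is delta-complete but object-incomplete,
and `thin_like` is incomplete in both senses.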

The reader of a bundle SHOULD have all objects reachable from prereqs.


* Re: [PATCH] doc: describe Git bundle format
  2020-01-31 23:57       ` Masaya Suzuki
@ 2020-02-04 18:20         ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2020-02-04 18:20 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: Git Mailing List

Masaya Suzuki <masayasuzuki@google.com> writes:

> * Delta complete: If an object in a packfile is deltified, the delta
> base exists in the same packfile.

Yes, even though "thin" packs deliberately violate this to save size,
normal packs, and more importantly, on-disk packs, are complete in
this sense.

> * Object complete: If an object in a packfile contains a reference to
> another object, that object exists in the same packfile.

A single packfile that would result from a full clone at some time
in the project's history would be "complete" in this sense.  Such a
packfile may contain all objects that are needed to reproduce the
history up to v1.0.1, or another larger "object complete" packfile
may contain everything needed for the history up to v3.0.0.  So as a
concept, this can be defined sensibly.  In the original packfile
design, however, this concept was not useful (iow, there was nowhere
that cared if a packfile is "object complete" or not), so I do not
think there is any explicit "support" to ensure or validate this
trait in the system.

Obviously, a bundle that stores object incomplete pack must have
been created with the bottom boundary.

> The reader of a bundle SHOULD have all objects reachable from prereqs.

Perhaps.  

It _might_ be possible to teach "git clone" to produce a shallow
clone whose shallow cut-off points match the prerequisites of the
bundle, so it depends on what the reader wants to do with the data,
though.

Thanks.



* [PATCH v3] doc: describe Git bundle format
  2020-01-31 22:18 ` [PATCH v2] " Masaya Suzuki
  2020-01-31 23:06   ` Junio C Hamano
@ 2020-02-07 20:42   ` Masaya Suzuki
  2020-02-07 20:44     ` Masaya Suzuki
  1 sibling, 1 reply; 16+ messages in thread
From: Masaya Suzuki @ 2020-02-07 20:42 UTC (permalink / raw)
  To: git; +Cc: Masaya Suzuki

The bundle format was not documented. Describe the format with ABNF and
explain the meaning of each part.

Signed-off-by: Masaya Suzuki <masayasuzuki@google.com>
---
Changes from v2:

* Change "sender" and "receiver" to "writer" and "reader".
* Add an example of a case where a bundle can reference an object outside of
  the bundle.
* Mention that the prerequisites are different from the shallow
  boundary, and the bundle format cannot represent a shallow clone repository.


 Documentation/technical/bundle-format.txt | 48 +++++++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 Documentation/technical/bundle-format.txt

diff --git a/Documentation/technical/bundle-format.txt b/Documentation/technical/bundle-format.txt
new file mode 100644
index 0000000000..0e828151a5
--- /dev/null
+++ b/Documentation/technical/bundle-format.txt
@@ -0,0 +1,48 @@
+= Git bundle v2 format
+
+The Git bundle format is a format that represents both refs and Git objects.
+
+== Format
+
+We will use ABNF notation to define the Git bundle format. See
+protocol-common.txt for the details.
+
+----
+bundle    = signature *prerequisite *reference LF pack
+signature = "# v2 git bundle" LF
+
+prerequisite = "-" obj-id SP comment LF
+comment      = *CHAR
+reference    = obj-id SP refname LF
+
+pack         = ... ; packfile
+----
+
+== Semantics
+
+A Git bundle consists of three parts.
+
+* "Prerequisites" lists the objects that are NOT included in the bundle and
+  that the reader of the bundle MUST already have in order to use the data in
+  the bundle. The objects stored in the bundle may refer to prerequisite
+  objects and anything reachable from them (e.g. a tree object in the bundle
+  can reference a blob that is reachable from a prerequisite) and/or be
+  expressed as deltas against prerequisite objects.
+
+* "References" record the tips of the history graph, iow, what the reader of the
+  bundle CAN "git fetch" from it.
+
+* "Pack" is the pack data stream "git fetch" would send, if you fetch from a
+  repository that has the references recorded in the "References" above into a
+  repository that has references pointing at the objects listed in
+  "Prerequisites" above.
+
+In the bundle format, there can be a comment following a prerequisite obj-id.
+This is a comment and it has no specific meaning. The writer of the bundle MAY
+put any string here. The reader of the bundle MUST ignore the comment.
+
+=== Note on the shallow clone and a Git bundle
+
+Note that the prerequisites does not represent a shallow-clone boundary. The
+semantics of the prerequisites and the shallow-clone boundaries are different,
+and the Git bundle v2 format cannot represent a shallow clone repository.
-- 
2.25.0.341.g760bfbb309-goog


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] doc: describe Git bundle format
  2020-02-07 20:42   ` [PATCH v3] " Masaya Suzuki
@ 2020-02-07 20:44     ` Masaya Suzuki
  2020-02-07 20:59       ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Masaya Suzuki @ 2020-02-07 20:44 UTC (permalink / raw)
  To: Git Mailing List

On Fri, Feb 7, 2020 at 12:42 PM Masaya Suzuki <masayasuzuki@google.com> wrote:
> +=== Note on the shallow clone and a Git bundle
> +
> +Note that the prerequisites does not represent a shallow-clone boundary. The

the prerequisites do not

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] doc: describe Git bundle format
  2020-02-07 20:44     ` Masaya Suzuki
@ 2020-02-07 20:59       ` Junio C Hamano
  2020-02-07 22:21         ` Masaya Suzuki
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2020-02-07 20:59 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: Git Mailing List

Masaya Suzuki <masayasuzuki@google.com> writes:

> On Fri, Feb 7, 2020 at 12:42 PM Masaya Suzuki <masayasuzuki@google.com> wrote:
>> +=== Note on the shallow clone and a Git bundle
>> +
>> +Note that the prerequisites does not represent a shallow-clone boundary. The
>
> the prerequisites do not

Grammo aside, I am not sure if that particular Note is beneficial to
begin with.  I would imagine that you can get a bundle that holds
all the objects in a shallow repository by specifying the range that
match the shallow-clone boundary when you run "git bundle create"
while disabling thin-pack generation.

The support of shallow-clone by Git may be incomplete and it may not
be easy to form such a range, and "git bundle create" command may
not have a knob to disable thin-pack generation, but that does not
mean that the bundle *format* cannot be used to represent the
shallow boundary.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] doc: describe Git bundle format
  2020-02-07 20:59       ` Junio C Hamano
@ 2020-02-07 22:21         ` Masaya Suzuki
  2020-02-08  1:49           ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Masaya Suzuki @ 2020-02-07 22:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Fri, Feb 7, 2020 at 12:59 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Masaya Suzuki <masayasuzuki@google.com> writes:
>
> > On Fri, Feb 7, 2020 at 12:42 PM Masaya Suzuki <masayasuzuki@google.com> wrote:
> >> +=== Note on the shallow clone and a Git bundle
> >> +
> >> +Note that the prerequisites does not represent a shallow-clone boundary. The
> >
> > the prerequisites do not
>
> Grammo aside, I am not sure if that particular Note is beneficial to
> begin with.  I would imagine that you can get a bundle that holds
> all the objects in a shallow repository by specifying the range that
> match the shallow-clone boundary when you run "git bundle create"
> while disabling thin-pack generation.

Yes. The reason that I've been trying to check the semantics of the
prerequisites is that I DO recognize that this is possible
format-wise. I'm not sure if this Git implementation can create such
bundles, but format-wise such bundles can be created.

When writing a Git bundle parser in other implementations (like JGit),
it's not clear whether, as a library, I should support such use cases.
If such usage is supported in the format, then the semantics of the
prerequisites change. Currently the prerequisites are defined as the
objects that are NOT included in the bundle and that the reader of the
bundle MUST already have in order to use the data in the bundle. If
the format supports shallow-cloned repositories, they would be defined
only as the objects that are NOT included in the bundle. If the reader
wants to read this bundle as if it's a non-shallow clone, the reader of
the bundle MUST have the objects that are reachable from these
prerequisites. If the reader wants to read this bundle as if it's a
shallow clone, the reader MUST treat these as a shallow boundary.

Also, this change will put further restrictions on the pack. "Pack" is
the pack data stream "git fetch" would send. If the writer of a bundle
wants to write a shallow-clone pack, the pack MUST NOT use objects
outside of the shallow boundary as delta bases. The writer MAY
reference commit objects outside of the shallow boundary as
parents.

The readers and the writers of bundles MUST communicate by other
means whether a bundle represents a shallow clone repository. The
bundle file does not have any indicator of whether it's a shallow
clone bundle or not.
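
To make the reader's side concrete, here is a minimal Python sketch of
a header parser that follows the ABNF in the patch. The function name
parse_bundle_header and the sample object ids are made up for
illustration; this is not how Git or JGit implements it. The point is
that the header yields only prerequisites and references, so nothing in
it tells the parser which of the two readings applies.

```python
import io

SIGNATURE = b"# v2 git bundle\n"


def parse_bundle_header(stream):
    """Parse a v2 bundle header and return (prerequisites, references).

    Hypothetical helper for illustration. On return the stream is
    positioned at the first byte of the pack data.
    """
    if stream.readline() != SIGNATURE:
        raise ValueError("not a v2 git bundle")

    prerequisites = []  # (obj-id, comment) pairs
    references = []     # (obj-id, refname) pairs
    while True:
        line = stream.readline()
        if line == b"\n":
            break  # the lone LF that separates the header from the pack
        if not line.endswith(b"\n"):
            raise ValueError("truncated bundle header")
        text = line[:-1].decode("utf-8", errors="replace")
        if text.startswith("-"):
            # prerequisite = "-" obj-id SP comment LF
            obj_id, _, comment = text[1:].partition(" ")
            prerequisites.append((obj_id, comment))
        else:
            # reference = obj-id SP refname LF
            obj_id, sep, refname = text.partition(" ")
            if not sep:
                raise ValueError("malformed reference line")
            references.append((obj_id, refname))
    # Assumption: this sketch is lenient and does not enforce that all
    # prerequisites come before all references, as the ABNF requires.
    return prerequisites, references


demo = io.BytesIO(
    SIGNATURE
    + b"-" + b"a" * 40 + b" old tip\n"
    + b"b" * 40 + b" refs/heads/master\n"
    + b"\n"
    + b"PACK"  # stand-in for the real pack data
)
print(parse_bundle_header(demo))
```

The comment is returned only so that a caller can display it; per the
patch text a reader must not attach any meaning to it.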

> The support of shallow-clone by Git may be incomplete and it may not
> be easy to form such a range, and "git bundle create" command may
> not have a knob to disable thin-pack generation, but that does not
> mean that the bundle *format* cannot be used to represent the
> shallow boundary.

As I wrote above, if this bundle format supports the shallow clone
state, the semantics will change and writers and readers will have
different constraints on the packs. In order to do so, the readers and
the writers have to agree by other means whether it's a shallow clone
or not, since the bundle file doesn't have such an indicator. I think
it's better to prohibit such use cases (or at least mark them as
unintended usage), and then create a different bundle format version
that supports a shallow clone boundary (so that the bundle file can be
closer to a frozen git-fetch response).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] doc: describe Git bundle format
  2020-02-07 22:21         ` Masaya Suzuki
@ 2020-02-08  1:49           ` Junio C Hamano
  2020-02-12 22:13             ` Masaya Suzuki
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2020-02-08  1:49 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: Git Mailing List

Masaya Suzuki <masayasuzuki@google.com> writes:

> Yes. The reason that I've been trying to check the semantics of the
> prerequisites is that I DO recognize that this is possible
> format-wise. I'm not sure if this Git implementation can create such
> bundles, but format-wise such bundles can be created.

Yeah, now I get it.  

The problem is *not* that v2 format "cannot represent a shallow
clone repository", but is that there is nothing that prevents a
bundle in v2 format from depending on objects behind (not just at)
the shallow boundary, making it impossible for a reader to guarantee
that a bundle with prereqs can be used to create an equivalent
shallow repository with shallow boundary at the same place as
prereqs.  IOW, bundle with prereqs in the v2 format allows more
objects to be omitted than an equivalent shallow repository omits,
because prereqs and shallow cutoff points mean different things.

While we are at it, I suspect that with reachability bitmap, a "git
fetch" that updates a history up to commit A to a new history up to
commit B can omit more objects than what is directly reachable from
the commit A.  That is, if A's direct child (call it C) is a commit
that reverts A, a blob in A's tree won't be in the bundle (because A
is a prereq), but the blob at the same path in C is the same blob as
the blob at the same path in A's parent (that is what it means for
that A's direct child to be a revert of A).  In the normal
enumeration based on object-walk to decide which objects to send,
such a blob in C will be included in the pack, but a reachability
bitmap can say "if we assume the reader has A, it must have A^1, so
that blob should exist at the reader, hence can be omitted from the
transfer even though we are sending commit C."
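
A toy Python model of that example may make the difference easier to
see. Everything here is hypothetical: the object names, the
send_by_walk and send_by_bitmap helpers, and the walk itself, which is
a deliberately simplified stand-in for the real enumeration (it only
treats the trees of the "have" commits themselves as uninteresting).

```python
# P is A's parent, A is the prerequisite, C reverts A.
# Because C reverts A, C's tree is identical to P's tree.
commits = {
    "P": {"parents": [], "tree": "tree1"},
    "A": {"parents": ["P"], "tree": "tree2"},
    "C": {"parents": ["A"], "tree": "tree1"},
}
trees = {"tree1": ["blob1"], "tree2": ["blob2"]}


def tree_objects(tree):
    """The tree itself plus the blobs it lists (flat trees only)."""
    return {tree, *trees[tree]}


def ancestors(tips):
    """All commits reachable from tips, tips included."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(commits[commit]["parents"])
    return seen


def full_closure(tips):
    """Every object reachable from tips: what a bitmap would record."""
    objects = set()
    for commit in ancestors(tips):
        objects.add(commit)
        objects |= tree_objects(commits[commit]["tree"])
    return objects


def send_by_walk(have, want):
    """Simplified object walk: only the boundary commits' own trees
    are marked uninteresting, not those of their ancestors."""
    uninteresting = set()
    for commit in have:
        uninteresting |= tree_objects(commits[commit]["tree"])
    out = set()
    for commit in ancestors(want) - ancestors(have):
        out.add(commit)
        out |= tree_objects(commits[commit]["tree"]) - uninteresting
    return out


def send_by_bitmap(have, want):
    """Bitmap-style: omit anything reachable from the 'have' side."""
    return full_closure(want) - full_closure(have)


print(sorted(send_by_walk(["A"], ["C"])))    # ['C', 'blob1', 'tree1']
print(sorted(send_by_bitmap(["A"], ["C"])))  # ['C']
```

In the second case the pack depends on objects that are only reachable
through A^1, i.e. behind the prerequisite and not just at it.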

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] doc: describe Git bundle format
  2020-02-08  1:49           ` Junio C Hamano
@ 2020-02-12 22:13             ` Masaya Suzuki
  2020-02-12 22:43               ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Masaya Suzuki @ 2020-02-12 22:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

On Fri, Feb 7, 2020 at 5:49 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Masaya Suzuki <masayasuzuki@google.com> writes:
>
> > Yes. The reason that I've been trying to check the semantics of the
> > prerequisites is that I DO recognize that this is possible
> > format-wise. I'm not sure if this Git implementation can create such
> > bundles, but format-wise such bundles can be created.
>
> Yeah, now I get it.
>
> The problem is *not* that v2 format "cannot represent a shallow
> clone repository", but is that there is nothing that prevents a
> bundle in v2 format from depending on objects behind (not just at)
> the shallow boundary, making it impossible for a reader to guarantee
> that a bundle with prereqs can be used to create an equivalent
> shallow repository with shallow boundary at the same place as
> prereqs.  IOW, bundle with prereqs in the v2 format allows more
> objects to be omitted than an equivalent shallow repository omits,
> because prereqs and shallow cutoff points mean different things.

Yes. So, I think it's better to say prereqs and shallow boundaries are
different.

> While we are at it, I suspect that with reachability bitmap, a "git
> fetch" that updates a history up to commit A to a new history up to
> commit B can omit more objects than what is directly reachable from
> the commit A.  That is, if A's direct child (call it C) is a commit
> that reverts A, a blob in A's tree won't be in the bundle (because A
> is a prereq), but the blob at the same path in C is the same blob as
> the blob at the same path in A's parent (that is what it means for
> that A's direct child to be a revert of A).  In the normal
> enumeration based on object-walk to decide which objects to send,
> such a blob in C will be included in the pack,

That's interesting. I have never looked at CGit's implementation, but I
think JGit would omit those objects. (At least that was my
understanding. Not confirmed with the code.)

Anyway. Is it OK with adding this small note on "prereq is not a
shallow boundary"? In practice, there are not many Git implementations
that handle Git bundles, so it's not that big a deal as long as those few
implementers recognize this, but this document is meant for those
implementers.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v3] doc: describe Git bundle format
  2020-02-12 22:13             ` Masaya Suzuki
@ 2020-02-12 22:43               ` Junio C Hamano
  0 siblings, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2020-02-12 22:43 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: Git Mailing List

Masaya Suzuki <masayasuzuki@google.com> writes:

> Anyway. Is it OK with adding this small note on "prereq is not a
> shallow boundary"?

I thought the text in the latest round is good as-is.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-02-12 22:43 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-30 22:58 [PATCH] doc: describe Git bundle format Masaya Suzuki
2020-01-31 13:56 ` Johannes Schindelin
2020-01-31 20:38 ` Junio C Hamano
2020-01-31 21:49   ` Masaya Suzuki
2020-01-31 23:01     ` Junio C Hamano
2020-01-31 23:57       ` Masaya Suzuki
2020-02-04 18:20         ` Junio C Hamano
2020-01-31 22:18 ` [PATCH v2] " Masaya Suzuki
2020-01-31 23:06   ` Junio C Hamano
2020-02-07 20:42   ` [PATCH v3] " Masaya Suzuki
2020-02-07 20:44     ` Masaya Suzuki
2020-02-07 20:59       ` Junio C Hamano
2020-02-07 22:21         ` Masaya Suzuki
2020-02-08  1:49           ` Junio C Hamano
2020-02-12 22:13             ` Masaya Suzuki
2020-02-12 22:43               ` Junio C Hamano
