All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stefan Beller <sbeller@google.com>
To: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	git <git@vger.kernel.org>, "Junio C Hamano" <gitster@pobox.com>,
	"Jeff King" <peff@peff.net>,
	"brian m. carlson" <sandals@crustytoothpaste.net>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Eric Sunshine" <sunshine@sunshineco.com>
Subject: Re: [PATCH v2] pack-format.txt: more details on pack file format
Date: Thu, 10 May 2018 10:06:37 -0700	[thread overview]
Message-ID: <CAGZ79kYYFBE+39aeHX22w3ARmujFcgE8YDUQpp6Hps264r0wfQ@mail.gmail.com> (raw)
In-Reply-To: <20180510150939.25399-1-pclouds@gmail.com>

On Thu, May 10, 2018 at 8:09 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> The current document mentions OBJ_* constants without their actual
> values. A git developer would know these are from cache.h but that's
> not very friendly to a person who wants to read this file to implement
> a pack file parser.
>
> Similarly, the deltified representation is not documented at all (the
> "document" is basically patch-delta.c). Translate that C code to
> English with a bit more about what ofs-delta and ref-delta mean.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  This is a much better description than v1. I hope.
>
>  Documentation/technical/pack-format.txt | 78 +++++++++++++++++++++++++
>  cache.h                                 |  5 ++
>  2 files changed, 83 insertions(+)
>
> diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
> index 8e5bf60be3..d20bf592aa 100644
> --- a/Documentation/technical/pack-format.txt
> +++ b/Documentation/technical/pack-format.txt
> @@ -36,6 +36,84 @@ Git pack format
>
>    - The trailer records 20-byte SHA-1 checksum of all of the above.
>
> +=== Object types
> +
> +Valid object types are:
> +
> +- OBJ_COMMIT (1)
> +- OBJ_TREE (2)
> +- OBJ_BLOB (3)
> +- OBJ_TAG (4)
> +- OBJ_OFS_DELTA (6)
> +- OBJ_REF_DELTA (7)
> +
> +Type 5 is reserved for future expansion. Type 0 is invalid.
> +
> +=== Deltified representation
> +
> +Conceptually there are only four object types: commit, tree, tag and
> +blob. However to save space, an object could be stored as a "delta" of
> +another "base" object. These representations are assigned new types
> +ofs-delta and ref-delta, which is only valid in a pack file.

...only valid...

as opposed to loose objects or as opposed to referencing cross-packs?
I would think the former, not the latter.

> +Both ofs-delta and ref-delta store the "delta" against another
> +object. The difference between them is, ref-delta directly encodes
> +20-byte base object name. If the base object is in the same pack,
> +ofs-delta encodes the offset of the base object in the pack instead.

Reading this paragraph clears up the question from before.
The ref delta is a delta to another "reference by hash id (sha1)".
What abbreviation is OFS? OFfSet ?

> +The delta data is a sequence of instructions to reconstruct an object
> +from the base object.

As said before the base object is of type 1..4, we do not "delta-on-delta"
yet, but to construct the object we have to create the base object first,
which itself can be represented as a deltified object leading to a delta
chain.

>     Each instruction appends more and more data to
> +the target object until it's complete. There are two supported
> +instructions so far: one for copy a byte range from the source object
> +and one for inserting new data embedded in the instruction itself.

ok. So there are 2 types of instructions, "copy from (offset, size)" and
"new data follows".

The next paragraphs seem to describe the copy instruction, maybe
add a sub-headline here?

> +Each instruction has variable length. Instruction type is determined
> +by the seventh bit of the first octet. The following diagrams follow
> +the convention in RFC 1951 (Deflate compressed data format).
> +
> +  +----------+---------+---------+---------+---------+-------+-------+-------+
> +  | 1xxxxxxx | offset1 | offset2 | offset3 | offset4 | size1 | size2 | size3 |
> +  +----------+---------+---------+---------+---------+-------+-------+-------+
> +
> +This is the instruction format to copy a byte range from the source
> +object. It encodes the offset to copy from any the number of bytes to
> +copy. Offset and size are in little-endian order.
> +
> +All offset and size bytes are optional. This is to reduce the
> +instruction size when encoding small offsets or sizes. The first seven
> +bits in the first octet determines which of the next seven octets is
> +present. If bit zero is set, offset1 is present. If bit one is set
> +offset2 is present and so on.
> +
> +Note that a more compact instruction does not change offset and size
> +encoding. For example, if only offset2 is omitted like below, offset3
> +still contains bits 16-23. It does not become offset2 and contains
> +bits 8-15 even if it's right next to offset1.
> +
> +  +----------+---------+---------+
> +  | 10000101 | offset1 | offset3 |
> +  +----------+---------+---------+

It reads very fluently to here.

> +In its most compact form, this instruction only takes up one byte
> +(0x80) with both offset and size omitted, which will have default
> +values zero. There is another exception: size zero is automatically
> +converted to 0x10000.

This "another exception" sounds a bit tacked on, but is still understandable.
I would imagine that the size of 0 is used frequently to copy large blocks
and coincidentally it is represented using the lowest number of bytes
for size. Cute!

Before the next diagram we could have a sub-headline, indicating
that the other instruction "new data follows" will now be described.

> +  +----------+============+
> +  | 0xxxxxxx |    data    |
> +  +----------+============+
> +
> +This is the instruction to construct target object without the base
> +object. The following data is appended to the target object. The first
> +seven bits of the first octet determines the size of data in
> +bytes. The size must be non-zero.

This command sounds very easy.
However we can have at most 127 bytes of new data, so if someone
adds a larger part of new code, we'd have many "insert new data"
instructions, all at the max of 127, such that the overhead for instruction
bytes is 1/127 = 0.7 %. Sounds efficient.

> +  +----------+============
> +  | 00000000 |
> +  +----------+============
> +
> +This is the instruction reserved for future expansion.

Thanks for pointing this out.


>
> +/*
> + * Values in this enum (except those outside the 3 bit range) are part
> + * of pack file format. See Documentation/technical/pack-format.txt
> + * for more information.
> + */

Makes sense.

I really like this patch very much. Thanks for writing it.
My annotations are just to add the cherry onto the cake,
the current form is
Reviewed-by: Stefan Beller <sbeller@google.com>

Thanks!

  reply	other threads:[~2018-05-10 17:06 UTC|newest]

Thread overview: 99+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-30 22:07 [PATCH 0/9] get_short_oid UI improvements Ævar Arnfjörð Bjarmason
2018-04-30 22:07 ` [PATCH 1/9] sha1-name.c: remove stray newline Ævar Arnfjörð Bjarmason
2018-04-30 22:07 ` [PATCH 2/9] sha1-array.h: align function arguments Ævar Arnfjörð Bjarmason
2018-04-30 22:07 ` [PATCH 3/9] sha1-name.c: move around the collect_ambiguous() function Ævar Arnfjörð Bjarmason
2018-04-30 22:07 ` [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1 Ævar Arnfjörð Bjarmason
2018-05-01 11:11   ` Derrick Stolee
2018-05-01 11:27     ` Ævar Arnfjörð Bjarmason
2018-05-01 12:26       ` Derrick Stolee
2018-05-01 12:36         ` Ævar Arnfjörð Bjarmason
2018-05-01 13:05           ` Derrick Stolee
2018-04-30 22:07 ` [PATCH 5/9] get_short_oid: learn to disambiguate by ^{tag} Ævar Arnfjörð Bjarmason
2018-04-30 22:07 ` [PATCH 6/9] get_short_oid: learn to disambiguate by ^{blob} Ævar Arnfjörð Bjarmason
2018-04-30 22:07 ` [PATCH 7/9] get_short_oid / peel_onion: ^{tree} should mean tree, not treeish Ævar Arnfjörð Bjarmason
2018-05-01  1:13   ` brian m. carlson
2018-04-30 22:07 ` [PATCH 8/9] get_short_oid / peel_onion: ^{tree} should mean commit, not commitish Ævar Arnfjörð Bjarmason
2018-04-30 23:22   ` Eric Sunshine
2018-04-30 22:07 ` [PATCH 9/9] config doc: document core.disambiguate Ævar Arnfjörð Bjarmason
2018-04-30 22:34 ` [PATCH 0/9] get_short_oid UI improvements Stefan Beller
2018-05-01  1:27 ` brian m. carlson
2018-05-01 11:16 ` Derrick Stolee
2018-05-01 12:06 ` [PATCH v2 00/12] " Ævar Arnfjörð Bjarmason
2018-05-01 13:03   ` [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1 Derrick Stolee
2018-05-01 13:39     ` Ævar Arnfjörð Bjarmason
2018-05-01 13:44       ` Derrick Stolee
2018-05-01 14:10         ` Ævar Arnfjörð Bjarmason
2018-05-01 14:15           ` Derrick Stolee
2018-05-01 18:40   ` [PATCH v3 00/12] get_short_oid UI improvements Ævar Arnfjörð Bjarmason
2018-05-02 12:42     ` Derrick Stolee
2018-05-02 13:45       ` Derrick Stolee
2018-05-03  6:43         ` Jacob Keller
2018-05-01 18:40   ` [PATCH v3 01/12] sha1-name.c: remove stray newline Ævar Arnfjörð Bjarmason
2018-05-01 18:40   ` [PATCH v3 02/12] sha1-array.h: align function arguments Ævar Arnfjörð Bjarmason
2018-05-01 18:40   ` [PATCH v3 03/12] git-p4: change "commitish" typo to "committish" Ævar Arnfjörð Bjarmason
2018-05-01 18:40   ` [PATCH v3 04/12] cache.h: add comment explaining the order in object_type Ævar Arnfjörð Bjarmason
2018-05-03  5:05     ` Junio C Hamano
2018-05-08 15:35     ` Duy Nguyen
2018-05-08 15:56       ` [PATCH] pack-format.txt: more details on pack file format Nguyễn Thái Ngọc Duy
2018-05-08 17:23         ` Stefan Beller
2018-05-08 18:22           ` Duy Nguyen
2018-05-08 18:58             ` Stefan Beller
2018-05-08 18:21         ` Ævar Arnfjörð Bjarmason
2018-05-08 18:24           ` Duy Nguyen
2018-05-10 15:09         ` [PATCH v2] " Nguyễn Thái Ngọc Duy
2018-05-10 17:06           ` Stefan Beller [this message]
2018-05-11  6:41             ` Duy Nguyen
2018-05-11  3:54           ` Junio C Hamano
2018-05-11  6:55           ` [PATCH v3] " Nguyễn Thái Ngọc Duy
2018-05-01 18:40   ` [PATCH v3 05/12] sha1-name.c: move around the collect_ambiguous() function Ævar Arnfjörð Bjarmason
2018-05-01 18:40   ` [PATCH v3 06/12] get_short_oid: sort ambiguous objects by type, then SHA-1 Ævar Arnfjörð Bjarmason
2018-05-03  5:13     ` Junio C Hamano
2018-05-08 14:44     ` Jeff King
2018-05-01 18:40   ` [PATCH v3 07/12] get_short_oid: learn to disambiguate by ^{tag} Ævar Arnfjörð Bjarmason
2018-05-01 18:40   ` [PATCH v3 08/12] get_short_oid: learn to disambiguate by ^{blob} Ævar Arnfjörð Bjarmason
2018-05-01 18:40   ` [PATCH v3 09/12] get_short_oid / peel_onion: ^{tree} should be tree, not treeish Ævar Arnfjörð Bjarmason
2018-05-03  5:28     ` Junio C Hamano
2018-05-03  7:28       ` Ævar Arnfjörð Bjarmason
2018-05-04  2:19         ` Junio C Hamano
2018-05-04  8:42           ` Ævar Arnfjörð Bjarmason
2018-05-07  4:08             ` Junio C Hamano
2018-05-08 14:34               ` Jeff King
2018-05-08 18:53                 ` Ævar Arnfjörð Bjarmason
2018-05-09  7:56                   ` Jeff King
2018-05-09 10:48                     ` Ævar Arnfjörð Bjarmason
2018-05-10  4:21                       ` Junio C Hamano
2018-05-10  6:50                         ` Jeff King
2018-05-10 12:42     ` [PATCH v4 0/6] get_short_oid UI improvements Ævar Arnfjörð Bjarmason
2018-05-10 16:04       ` Jeff King
2018-05-10 12:42     ` [PATCH v4 1/6] sha1-name.c: remove stray newline Ævar Arnfjörð Bjarmason
2018-05-10 12:42     ` [PATCH v4 2/6] sha1-array.h: align function arguments Ævar Arnfjörð Bjarmason
2018-05-10 15:06       ` Jeff King
2018-05-11  3:07         ` Junio C Hamano
2018-05-11  3:09           ` Junio C Hamano
2018-05-10 12:43     ` [PATCH v4 3/6] git-p4: change "commitish" typo to "committish" Ævar Arnfjörð Bjarmason
2018-05-10 15:00       ` Luke Diamand
2018-05-10 12:43     ` [PATCH v4 4/6] sha1-name.c: move around the collect_ambiguous() function Ævar Arnfjörð Bjarmason
2018-05-10 12:43     ` [PATCH v4 5/6] get_short_oid: sort ambiguous objects by type, then SHA-1 Ævar Arnfjörð Bjarmason
2018-05-10 15:22       ` Jeff King
2018-05-11  5:36       ` Junio C Hamano
2018-05-10 12:43     ` [PATCH v4 6/6] get_short_oid: document & warn if we ignore the type selector Ævar Arnfjörð Bjarmason
2018-05-10 13:15       ` Martin Ågren
2018-05-10 16:03       ` Jeff King
2018-05-10 16:10         ` Jeff King
2018-05-10 16:15         ` Jeff King
2018-05-01 18:40   ` [PATCH v3 10/12] get_short_oid / peel_onion: ^{commit} should be commit, not committish Ævar Arnfjörð Bjarmason
2018-05-01 18:40   ` [PATCH v3 11/12] config doc: document core.disambiguate Ævar Arnfjörð Bjarmason
2018-05-08 14:41     ` Jeff King
2018-05-01 18:40   ` [PATCH v3 12/12] get_short_oid: document & warn if we ignore the type selector Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 01/12] sha1-name.c: remove stray newline Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 02/12] sha1-array.h: align function arguments Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 03/12] git-p4: change "commitish" typo to "committish" Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 04/12] cache.h: add comment explaining the order in object_type Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 05/12] sha1-name.c: move around the collect_ambiguous() function Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 06/12] get_short_oid: sort ambiguous objects by type, then SHA-1 Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 07/12] get_short_oid: learn to disambiguate by ^{tag} Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 08/12] get_short_oid: learn to disambiguate by ^{blob} Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 09/12] get_short_oid / peel_onion: ^{tree} should be tree, not treeish Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 10/12] get_short_oid / peel_onion: ^{commit} should be commit, not committish Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 11/12] config doc: document core.disambiguate Ævar Arnfjörð Bjarmason
2018-05-01 12:06 ` [PATCH v2 12/12] get_short_oid: document & warn if we ignore the type selector Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGZ79kYYFBE+39aeHX22w3ARmujFcgE8YDUQpp6Hps264r0wfQ@mail.gmail.com \
    --to=sbeller@google.com \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=sandals@crustytoothpaste.net \
    --cc=stolee@gmail.com \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.