All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: me@ttaylorr.com, gitster@pobox.com, l.s.r@web.de,
	szeder.dev@gmail.com, Chris Torek <chris.torek@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v4 17/17] chunk-format: add technical docs
Date: Thu, 18 Feb 2021 14:07:39 +0000	[thread overview]
Message-ID: <84bf6506dc12163b37f46192b3742c8fb234322f.1613657260.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.848.v4.git.1613657259.gitgitgadget@gmail.com>

From: Derrick Stolee <dstolee@microsoft.com>

The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/chunk-format.txt      | 116 ++++++++++++++++++
 .../technical/commit-graph-format.txt         |   3 +
 Documentation/technical/pack-format.txt       |   3 +
 3 files changed, 122 insertions(+)
 create mode 100644 Documentation/technical/chunk-format.txt

diff --git a/Documentation/technical/chunk-format.txt b/Documentation/technical/chunk-format.txt
new file mode 100644
index 000000000000..593614fcedab
--- /dev/null
+++ b/Documentation/technical/chunk-format.txt
@@ -0,0 +1,116 @@
+Chunk-based file formats
+========================
+
+Some file formats in Git use a common concept of "chunks" to describe
+sections of the file. This allows structured access to a large file by
+scanning a small "table of contents" for the remaining data. This common
+format is used by the `commit-graph` and `multi-pack-index` files. See
+link:technical/pack-format.html[the `multi-pack-index` format] and
+link:technical/commit-graph-format.html[the `commit-graph` format] for
+how they use the chunks to describe structured data.
+
+A chunk-based file format begins with some header information custom to
+that format. That header should include enough information to identify
+the file type, format version, and number of chunks in the file. From this
+information, that file can determine the start of the chunk-based region.
+
+The chunk-based region starts with a table of contents describing where
+each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
+where C is the number of chunks. Consider the following table:
+
+  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
+  |--------------------|------------------------|
+  | ID[0]              | OFFSET[0]              |
+  | ...                | ...                    |
+  | ID[C]              | OFFSET[C]              |
+  | 0x0000             | OFFSET[C+1]            |
+
+Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
+Each integer is stored in network-byte order.
+
+The chunk identifier `ID[i]` is a label for the data stored within this
+fill from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
+size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
+and `OFFSET[i]`. This requires that the chunk data appears contiguously
+in the same order as the table of contents.
+
+The final entry in the table of contents must be four zero bytes. This
+confirms that the table of contents is ending and provides the offset for
+the end of the chunk-based data.
+
+Note: The chunk-based format expects that the file contains _at least_ a
+trailing hash after `OFFSET[C+1]`.
+
+Functions for working with chunk-based file formats are declared in
+`chunk-format.h`. Using these methods provide extra checks that assist
+developers when creating new file formats.
+
+Writing chunk-based file formats
+--------------------------------
+
+To write a chunk-based file format, create a `struct chunkfile` by
+calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
+caller is responsible for opening the `hashfile` and writing header
+information so the file format is identifiable before the chunk-based
+format begins.
+
+Then, call `add_chunk()` for each chunk that is intended for write. This
+populates the `chunkfile` with information about the order and size of
+each chunk to write. Provide a `chunk_write_fn` function pointer to
+perform the write of the chunk data upon request.
+
+Call `write_chunkfile()` to write the table of contents to the `hashfile`
+followed by each of the chunks. This will verify that each chunk wrote
+the expected amount of data so the table of contents is correct.
+
+Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
+caller is responsible for finalizing the `hashfile` by writing the trailing
+hash and closing the file.
+
+Reading chunk-based file formats
+--------------------------------
+
+To read a chunk-based file format, the file must be opened as a
+memory-mapped region. The chunk-format API expects that the entire file
+is mapped as a contiguous memory region.
+
+Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.
+
+After reading the header information from the beginning of the file,
+including the chunk count, call `read_table_of_contents()` to populate
+the `struct chunkfile` with the list of chunks, their offsets, and their
+sizes.
+
+Extract the data information for each chunk using `pair_chunk()` or
+`read_chunk()`:
+
+* `pair_chunk()` assigns a given pointer with the location inside the
+  memory-mapped file corresponding to that chunk's offset. If the chunk
+  does not exist, then the pointer is not modified.
+
+* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
+  with the appropriate initial pointer and size information. The function
+  is not called if the chunk does not exist. Use this method to read chunks
+  if you need to perform immediate parsing or if you need to execute logic
+  based on the size of the chunk.
+
+After calling these methods, call `free_chunkfile()` to clear the
+`struct chunkfile` data. This will not close the memory-mapped region.
+Callers are expected to own that data for the timeframe the pointers into
+the region are needed.
+
+Examples
+--------
+
+These file formats use the chunk-format API, and can be used as examples
+for future formats:
+
+* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
+  in `commit-graph.c` for how the chunk-format API is used to write and
+  parse the commit-graph file format documented in
+  link:technical/commit-graph-format.html[the commit-graph file format].
+
+* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
+  in `midx.c` for how the chunk-format API is used to write and
+  parse the multi-pack-index file format documented in
+  link:technical/pack-format.html[the multi-pack-index file format].
diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index b6658eff1882..87971c27dd73 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -61,6 +61,9 @@ CHUNK LOOKUP:
       the length using the next chunk position if necessary.) Each chunk
       ID appears at most once.
 
+  The CHUNK LOOKUP matches the table of contents from
+  link:technical/chunk-format.html[the chunk-based file format].
+
   The remaining data in the body is described one chunk at a time, and
   these chunks may be given in any order. Chunks are required unless
   otherwise specified.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index f96b2e605f34..2fb1e60d29ec 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -301,6 +301,9 @@ CHUNK LOOKUP:
 	    (Chunks are provided in file-order, so you can infer the length
 	    using the next chunk position if necessary.)
 
+	The CHUNK LOOKUP matches the table of contents from
+	link:technical/chunk-format.html[the chunk-based file format].
+
 	The remaining data in the body is described one chunk at a time, and
 	these chunks may be given in any order. Chunks are required unless
 	otherwise specified.
-- 
gitgitgadget

  parent reply	other threads:[~2021-02-18 16:54 UTC|newest]

Thread overview: 120+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-26 16:01 [PATCH 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-01-27  1:53   ` Chris Torek
2021-01-27  2:36     ` Taylor Blau
2021-01-26 16:01 ` [PATCH 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-01-27  2:42   ` Taylor Blau
2021-01-27 13:49     ` Derrick Stolee
2021-01-26 16:01 ` [PATCH 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-01-27  2:47   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27  2:49   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-01-27  3:02   ` Taylor Blau
2021-01-26 16:01 ` [PATCH 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 14/17] midx: " Derrick Stolee via GitGitGadget
2021-01-27  3:06   ` Taylor Blau
2021-01-27 13:50     ` Derrick Stolee
2021-01-26 16:01 ` [PATCH 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-01-26 16:01 ` [PATCH 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-01-26 22:37 ` [PATCH 00/17] Refactor chunk-format into an API Junio C Hamano
2021-01-27  2:29 ` Taylor Blau
2021-01-27 15:01 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-04 21:24     ` Junio C Hamano
2021-02-04 22:40       ` Junio C Hamano
2021-02-05 11:37       ` Derrick Stolee
2021-02-05 19:25         ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-04 22:59     ` Junio C Hamano
2021-02-05 11:42       ` Derrick Stolee
2021-01-27 15:01   ` [PATCH v2 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-04 23:40     ` Junio C Hamano
2021-02-05 12:19       ` Derrick Stolee
2021-02-05 19:37         ` Junio C Hamano
2021-02-08 22:26           ` Junio C Hamano
2021-02-09  1:33             ` Derrick Stolee
2021-02-09 20:47               ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 14/17] midx: " Derrick Stolee via GitGitGadget
2021-01-27 15:01   ` [PATCH v2 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-05  0:00     ` Junio C Hamano
2021-02-05 10:59       ` Chris Torek
2021-02-05 20:41         ` Junio C Hamano
2021-02-06 20:35           ` Chris Torek
2021-02-05 12:30       ` Derrick Stolee
2021-02-05 19:42         ` Junio C Hamano
2021-02-07 19:50       ` SZEDER Gábor
2021-02-08  5:41         ` Junio C Hamano
2021-01-27 15:01   ` [PATCH v2 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-05  0:05     ` Junio C Hamano
2021-02-05 12:31       ` Derrick Stolee
2021-01-27 15:01   ` [PATCH v2 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-05  0:15     ` Junio C Hamano
2021-01-27 16:03   ` [PATCH v2 00/17] Refactor chunk-format into an API Taylor Blau
2021-02-05  2:08   ` Junio C Hamano
2021-02-05  2:27     ` Derrick Stolee
2021-02-05 14:30   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-07 21:13       ` SZEDER Gábor
2021-02-08 13:44         ` Derrick Stolee
2021-02-11 19:43           ` SZEDER Gábor
2021-02-05 14:30     ` [PATCH v3 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-07 20:20       ` SZEDER Gábor
2021-02-08 13:35         ` Derrick Stolee
2021-02-05 14:30     ` [PATCH v3 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-05 14:30     ` [PATCH v3 17/17] chunk-format: add technical docs Derrick Stolee via GitGitGadget
2021-02-18 14:07     ` [PATCH v4 00/17] Refactor chunk-format into an API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 01/17] commit-graph: anonymize data in chunk_write_fn Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 02/17] chunk-format: create chunk format write API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 03/17] commit-graph: use chunk-format " Derrick Stolee via GitGitGadget
2021-02-24 16:52         ` SZEDER Gábor
2021-02-24 17:12           ` Taylor Blau
2021-02-24 17:52             ` Derrick Stolee
2021-02-24 19:44               ` Junio C Hamano
2021-02-18 14:07       ` [PATCH v4 04/17] midx: rename pack_info to write_midx_context Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 05/17] midx: use context in write_midx_pack_names() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 06/17] midx: add entries to write_midx_context Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 07/17] midx: add pack_perm " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 08/17] midx: add num_large_offsets " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 09/17] midx: return success/failure in chunk write methods Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 10/17] midx: drop chunk progress during write Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 11/17] midx: use chunk-format API in write_midx_internal() Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 12/17] chunk-format: create read chunk API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 13/17] commit-graph: use chunk-format read API Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 14/17] midx: " Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 15/17] midx: use 64-bit multiplication for chunk sizes Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` [PATCH v4 16/17] chunk-format: restore duplicate chunk checks Derrick Stolee via GitGitGadget
2021-02-18 14:07       ` Derrick Stolee via GitGitGadget [this message]
2021-02-18 21:47         ` [PATCH v4 17/17] chunk-format: add technical docs Junio C Hamano
2021-02-19 12:42           ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=84bf6506dc12163b37f46192b3742c8fb234322f.1613657260.git.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=chris.torek@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    --cc=me@ttaylorr.com \
    --cc=stolee@gmail.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.