All of lore.kernel.org
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: git@vger.kernel.org
Cc: dstolee@microsoft.com, stolee@gmail.com, git@jeffhostetler.com,
	peff@peff.net, gitster@pobox.com, Johannes.Shindelin@gmx.de,
	jrnieder@gmail.com
Subject: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes
Date: Sun,  7 Jan 2018 13:14:42 -0500	[thread overview]
Message-ID: <20180107181459.222909-2-dstolee@microsoft.com> (raw)
In-Reply-To: <20180107181459.222909-1-dstolee@microsoft.com>

Commentary: This file format uses the large offsets from the pack-index
version 2 format, but drops the CRC32 hashes from that format.

Also: I included the HASH footer at the end only because it is already in
the pack and pack-index formats, but not because it is particularly useful
here. If possible, I'd like to remove it and speed up MIDX writes.

-- >8 --

Add a technical documentation file describing the design
for the multi-pack index (MIDX). Includes current limitations
and future work.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/multi-pack-index.txt | 149 +++++++++++++++++++++++++++
 1 file changed, 149 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d31b03dec5
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,149 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short, with suffix ".midx")
+stores a list of objects and their offsets into multiple pack-
+files. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+A new config setting 'core.midx' must be enabled before writing
+or reading MIDX files.
+
+The MIDX files are updated by the 'midx' builtin with the
+following common parameter combinations:
+
+- 'git midx' gives the hash of the current MIDX head.
+- 'git midx --write --update-head --delete-expired' writes a new
+  MIDX file, points the MIDX head to that file, and deletes the
+  existing MIDX file if out-of-date.
+- 'git midx --read' lists some basic information about the current
+  MIDX head. Used for basic tests.
+- 'git midx --clear' deletes the current MIDX head.
+
+Design Details
+--------------
+
+- The MIDX file refers only to packfiles in the same directory
+  as the MIDX file.
+
+- A special file, 'midx-head', stores the hash of the latest
+  MIDX file so we can load the file without performing a dirstat.
+  This file is especially important with incremental MIDX files,
+  pointing to the newest file.
+
+- If a packfile exists in the pack directory but is not referenced
+  by the MIDX file, then the packfile is loaded into the packed_git
+  list and Git can access the objects as usual. This behavior is
+  necessary since other tools could add packfiles to the pack
+  directory without notifying Git.
+
+- The MIDX file should be only a supplemental structure. If a
+  user downgrades or disables the `core.midx` config setting,
+  then the existing .idx and .pack files should be sufficient
+  to operate correctly.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+- If an object appears in multiple packfiles, then only one copy
+  is stored in the MIDX. This has a possible performance issue:
+  If an object appears as the delta-base of multiple objects from
+  multiple packs, then cross-pack delta calculations may slow down.
+  This is currently only theoretical and has not been demonstrated
+  to be a measurable issue.
+
+Current Limitations
+-------------------
+
+- MIDX files are managed only by the midx builtin and is not
+  automatically updated on clone or fetch.
+
+- There is no '--verify' option for the midx builtin to verify
+  the contents of the MIDX file against the pack contents.
+
+- Constructing a MIDX file currently requires the single-pack
+  index for every pack being added to the MIDX.
+
+- The fsck builtin does not check MIDX files, but should.
+
+- The repack builtin is not aware of the MIDX files, and may
+  invalidate the MIDX files by deleting existing packfiles. The
+  MIDX may also be extended in the future to store metadata about
+  a packfile that can be used for faster repack commands.
+
+- The naive Git HTTP server advertises lists of packfiles using
+  the file system directly.
+
+Future Work
+-----------
+
+- The current file-format requires between 28 and 36 bytes per
+  object. As the repository grows, the MIDX file can become
+  very large and become a bottleneck when updating the file. To
+  fix this "big write" problem, we can make the MIDX file
+  incremental. Instead of just one MIDX file, we will have a
+  sequence of MIDX files that can be unioned together. Then
+  on write we take the new objects to add and consider how many
+  existing files should be merged into a new file containing
+  the latest objects.
+
+  This list of "base indexes" will be presented as an optional
+  chunk in the MIDX format and contains the OIDs for the base
+  files. Thus, the `midx_head` file only stores the OID for the
+  "tip" MIDX file and then the rest are loaded based on those
+  pointers, such as the following figure:
+
+  [     BIG     ] <- [ MEDIUM ] <- [tiny] <- midx_head
+	 ^___________________________|
+
+  The plan being that every write replaces the "tiny" index,
+  and when that index becomes large enough it merges with the
+  "medium" index and a new tiny index is created in the next
+  write. Very rarely, the "big" index would be updated, causing
+  a slow write.
+
+- After the MIDX feature is sufficiently hardened and widely used,
+  consider making Git more fully depend on the MIDX file. If MIDX
+  is the default, then we can delete the single-pack-indexes from
+  the pack directory. We could also allow thin packs in the pack
+  directory.
+
+- The MIDX could be extended to store a "stable object order" such
+  that adding objects to the order does not change the existing
+  objects. This would enable re-using the reachability bitmaps after
+  repacking and updating the MIDX file.
+
+Related Links
+-------------
+
+[0] https://bugs.chromium.org/p/git/issues/detail?id=6
+    Chromium work item for: Multi-Pack Index (MIDX)
+
+[1] https://public-inbox.org/git/CB5074CF.3AD7A%25joshua.redstone@fb.com/T/#u
+    Subject: Git performance results on a large repository
+    Date: 3 Feb 2012
+
-- 
2.15.0


  reply	other threads:[~2018-01-07 18:15 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-07 18:14 [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
2018-01-07 18:14 ` Derrick Stolee [this message]
2018-01-08 19:32   ` [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes Jonathan Tan
2018-01-08 20:35     ` Derrick Stolee
2018-01-08 22:06       ` Jonathan Tan
2018-01-07 18:14 ` [RFC PATCH 02/18] midx: specify midx file format Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 03/18] midx: create core.midx config setting Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 04/18] midx: write multi-pack indexes for an object list Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 05/18] midx: create midx builtin with --write mode Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 06/18] midx: add t5318-midx.sh test script Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 07/18] midx: teach midx --write to update midx-head Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 08/18] midx: teach git-midx to read midx file details Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 09/18] midx: find details of nth object in midx Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 10/18] midx: use existing midx when writing Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 11/18] midx: teach git-midx to clear midx files Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 12/18] midx: teach git-midx to delete expired files Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 14/18] midx: load midx files when loading packs Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 15/18] midx: use midx for approximate object count Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx() Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 17/18] sha1_name: use midx for abbreviations Derrick Stolee
2018-01-07 18:14 ` [RFC PATCH 18/18] packfile: use midx for object loads Derrick Stolee
2018-01-07 22:42 ` [RFC PATCH 00/18] Multi-pack index (MIDX) Ævar Arnfjörð Bjarmason
2018-01-08  0:08   ` Derrick Stolee
2018-01-08 10:20     ` Jeff King
2018-01-08 10:27       ` Jeff King
2018-01-08 12:28         ` Ævar Arnfjörð Bjarmason
2018-01-08 13:43       ` Johannes Schindelin
2018-01-09  6:50         ` Jeff King
2018-01-09 13:05           ` Johannes Schindelin
2018-01-09 19:51             ` Stefan Beller
2018-01-09 20:12               ` Junio C Hamano
2018-01-09 20:16                 ` Stefan Beller
2018-01-09 21:31                   ` Junio C Hamano
2018-01-10 17:05               ` Johannes Schindelin
2018-01-10 10:57             ` Jeff King
2018-01-08 13:43       ` Derrick Stolee
2018-01-09  7:12         ` Jeff King
2018-01-08 11:43     ` Ævar Arnfjörð Bjarmason
2018-06-06  8:13     ` Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 0/2] unconditional O(1) SHA-1 abbreviation Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 1/2] config.c: use braces on multiple conditional arms Ævar Arnfjörð Bjarmason
2018-06-06 10:27       ` [RFC PATCH 2/2] sha1-name: add core.validateAbbrev & relative core.abbrev Ævar Arnfjörð Bjarmason
2018-06-06 12:04         ` Christian Couder
2018-06-06 11:24       ` [RFC PATCH 00/18] Multi-pack index (MIDX) Derrick Stolee
2018-01-10 18:25 ` Martin Fick
2018-01-10 19:39   ` Derrick Stolee
2018-01-10 21:01     ` Martin Fick

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180107181459.222909-2-dstolee@microsoft.com \
    --to=stolee@gmail.com \
    --cc=Johannes.Shindelin@gmx.de \
    --cc=dstolee@microsoft.com \
    --cc=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.