From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 6/8] index-format: update preamble to cached tree extension
Date: Wed, 30 Dec 2020 12:00:41 -0800
Message-ID: <CABPp-BFdn-CBWMPeAKoPvuSqwkeHnAPMZovAEaiNquz_JKaPkA@mail.gmail.com>
In-Reply-To: <fb9d5468184c4cbb3d80569f685743b9a5b45c8e.1609356414.git.gitgitgadget@gmail.com>

On Wed, Dec 30, 2020 at 11:26 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> I had difficulty in my efforts to learn about the cached tree extension
> based on the documentation and code because I had an incorrect
> assumption about how it behaved. This might be due to some ambiguity in
> the documentation, so this change modifies the beginning of the cached
> tree format by expanding the description of the feature.
>
> My hope is that this documentation clarifies a few things:
>
> 1. There is an in-memory recursive tree structure that is constructed
>    from the extension data. This structure has a few differences, such
>    as where the name is stored.
>
> 2. What does it mean for an entry to be invalid?
>
> 3. When exactly are "new" trees created?
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/index-format.txt | 36 ++++++++++++++++++++----
>  1 file changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
> index 69edf46c031..c614e136e24 100644
> --- a/Documentation/technical/index-format.txt
> +++ b/Documentation/technical/index-format.txt
> @@ -138,12 +138,36 @@ Git index format
>
>  === Cached tree
>
> -  Cached tree extension contains pre-computed hashes for trees that can
> -  be derived from the index. It helps speed up tree object generation
> -  from index for a new commit.
> -
> -  When a path is updated in index, the path must be invalidated and
> -  removed from tree cache.
> +  Since the index does not record entries for directories, the cache
> +  entries cannot describe tree objects that already exist in the object
> +  database for regions of the index that are unchanged from an existing
> +  commit. The cached tree extension stores a recursive tree structure that
> +  describes the trees that already exist and completely match sections of
> +  the cache entries. This speeds up tree object generation from the index
> +  for a new commit by only computing the trees that are "new" to that
> +  commit.
> +
> +  The recursive tree structure uses nodes that store a number of cache
> +  entries, a list of subnodes, and an object ID (OID). The OID references
> +  the existing tree for that node, if it is known to exist. The subnodes
> +  correspond to subdirectories that themselves have cached tree nodes. The
> +  number of cache entries corresponds to the number of cache entries in
> +  the index that describe paths within that tree's directory.
> +
> +  Note that the path for a given tree is part of the parent node in-memory
> +  but is part of the child in the file format. The root tree has an empty
> +  string for its name and its name does not exist in-memory.
> +
> +  When a path is updated in index, Git invalidates all nodes of the
> +  recursive cached tree corresponding to the parent directories of that
> +  path. We store these tree nodes as being "invalid" by using "-1" as the
> +  number of cache entries. To create trees corresponding to the current
> +  index, Git only walks the invalid tree nodes and uses the cached OIDs
> +  for the valid trees to construct new trees. In this way, Git only
> +  constructs trees on the order of the number of changed paths (and their
> +  depth in the working directory). This comes at a cost of tracking the
> +  full directory structure in the cached tree extension, but this is
> +  generally smaller than the full cache entry list in the index.

Ooh, I really like it; this probably would have helped me.  However,
we'll need to get someone else to take a look at this, because I don't
know enough to say whether any part of it is incorrect, misleading, or
incomplete or whether it's all good.  My knowledge in the area is
limited to moving a function from merge-recursive.c to cache-tree.c in
commit 724dd767b2 ("cache-tree: share code between functions writing
an index as a tree", 2019-08-17), but I seem to recall that I had to
rely on Junio's reviews and guidance to make the minor adaptations
found in that commit.
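
For readers following along, the node structure and invalidation rule
described in the quoted patch can be sketched as a small toy model. This
is only an illustration under stated assumptions, not Git's
implementation: the names CacheTreeNode, invalidate_path,
trees_to_rebuild and build_demo, the example paths, and the placeholder
OID strings are all hypothetical. Clearing the OID on invalidation is
also an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CacheTreeNode:
    """Toy in-memory node; field names are illustrative, not Git's."""
    entry_count: int = -1        # -1 marks the node as invalid
    oid: Optional[str] = None    # OID of the existing tree, if known
    # Subdirectory name -> child node. The name is held by the parent,
    # mirroring the in-memory layout the patch describes.
    subtrees: Dict[str, "CacheTreeNode"] = field(default_factory=dict)

    @property
    def valid(self) -> bool:
        return self.entry_count >= 0


def invalidate_path(root: CacheTreeNode, path: str) -> None:
    """Invalidate the root and every node for a parent directory of path."""
    node = root
    node.entry_count, node.oid = -1, None
    for name in path.split("/")[:-1]:      # directory components only
        child = node.subtrees.get(name)
        if child is None:                  # directory has no cached node
            break
        child.entry_count, child.oid = -1, None
        node = child


def trees_to_rebuild(node: CacheTreeNode, prefix: str = "") -> List[str]:
    """List directories whose trees must be recomputed.

    A valid node is not descended into: its cached OID is reused as-is.
    """
    if node.valid:
        return []
    result = [prefix or "<root>"]
    for name, child in sorted(node.subtrees.items()):
        result += trees_to_rebuild(child, prefix + name + "/")
    return result


def build_demo() -> CacheTreeNode:
    """Fully valid cache tree for a hypothetical index with five paths:
    README, docs/index.txt, src/main.c, src/util.c, src/lib/a.c."""
    lib = CacheTreeNode(entry_count=1, oid="oid-lib")
    src = CacheTreeNode(entry_count=3, oid="oid-src", subtrees={"lib": lib})
    docs = CacheTreeNode(entry_count=1, oid="oid-docs")
    return CacheTreeNode(entry_count=5, oid="oid-root",
                         subtrees={"docs": docs, "src": src})


root = build_demo()
invalidate_path(root, "src/lib/a.c")
print(trees_to_rebuild(root))   # ['<root>', 'src/', 'src/lib/']
```

In this model, updating src/lib/a.c leaves the node for docs/ valid, so
only three trees are recomputed; the count follows the depth of the
changed path rather than the size of the index.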
