git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Victoria Dye <vdye@github.com>
To: Glen Choo <chooglen@google.com>,
	Victoria Dye via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Cc: newren@gmail.com, gitster@pobox.com,
	"Derrick Stolee" <derrickstolee@github.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH v3 6/8] read-tree: narrow scope of index expansion for '--prefix'
Date: Thu, 3 Mar 2022 13:19:06 -0800	[thread overview]
Message-ID: <5353fb06-09e4-77bf-554c-1cd750158730@github.com> (raw)
In-Reply-To: <kl6lczj2exbn.fsf@chooglen-macbookpro.roam.corp.google.com>

Glen Choo wrote:
> "Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Victoria Dye <vdye@github.com>
>> +static void update_sparsity_for_prefix(const char *prefix,
>> +				       struct index_state *istate)
>> +{
>> +	int prefix_len = strlen(prefix);
>> +	struct strbuf ce_prefix = STRBUF_INIT;
>> +
>> +	if (!istate->sparse_index)
>> +		return;
>> +
>> +	while (prefix_len > 0 && prefix[prefix_len - 1] == '/')
>> +		prefix_len--;
>> +
>> +	if (prefix_len <= 0)
>> +		BUG("Invalid prefix passed to update_sparsity_for_prefix");
>> +
>> +	strbuf_grow(&ce_prefix, prefix_len + 1);
>> +	strbuf_add(&ce_prefix, prefix, prefix_len);
>> +	strbuf_addch(&ce_prefix, '/');
>> +
>> +	/*
>> +	 * If the prefix points to a sparse directory or a path inside a sparse
>> +	 * directory, the index should be expanded. This is accomplished in one
>> +	 * of two ways:
>> +	 * - if the prefix is inside a sparse directory, it will be expanded by
>> +	 *   the 'ensure_full_index(...)' call in 'index_name_pos(...)'.
>> +	 * - if the prefix matches an existing sparse directory entry,
>> +	 *   'index_name_pos(...)' will return its index position, triggering
>> +	 *   the 'ensure_full_index(...)' below.
>> +	 */
>> +	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate) &&
>> +	    index_name_pos(istate, ce_prefix.buf, ce_prefix.len) >= 0)
>> +		ensure_full_index(istate);
>> +
>> +	strbuf_release(&ce_prefix);
>> +}
> 
> Hm, I don't think I follow the rationale for having two different ways
> of expanding the index:
> 
> - If the prefix is inside a sparse directory, we should expand the
>   index.
> - If the prefix matches a sparse directory entry, we should expand the
>   index.
> 

There's an (admittedly subtle if you aren't familiar with sparse index)
distinction between a "sparse directory" index entry and "a directory
outside the sparse-checkout cone with SKIP_WORKTREE enabled on its entries".
Ultimately, that's what necessitates the two checks but, as in [1], I want
to use this as an opportunity to shed some light on what 'unpack_trees(...)'
does.

Taking a step back, why would index expansion matter when you pass a prefix
to 'read-tree'? The answer lies in the tree traversal at the core of
'unpack_trees(...)'; when a prefix is provided, 'unpack_trees' does the
following:

1. Set the "traversal prefix" to the user-provided prefix.
2. Call 'traverse_trees(...)', iterating through the child files/directories
   of the prefix directory (NOTE: this does *not* use the index - it finds
   the child entries by extracting them from the tree).
3. For each child, call 'unpack_callback' which, among other things, looks
   for the child entry as it exists in the index to merge the input trees
   into it.

The problem with sparse directories arises in step 3. Suppose you have the
following repo:

.
├── bar
│   └── f1
├── baz
│   ├── deep
│   │   └── a
│   └── f2
├── foo
└── foo1


where directory 'baz/' is sparse, leading to an index that looks like this
(per 'git ls-files -t --sparse'):

H bar/f1
S baz/
H foo
H foo1

If you provide 'baz/' as a prefix, 'unpack_callback(...)' tries to find
'baz/deep/a' and 'baz/f2' in the index, but they won't be found because
they're "wrapped" in sparse directory 'baz/'. Ultimately, this leads to a
corrupted index with duplicate 'baz/' contents merged in:

H bar/f1
S baz/
S baz/deep/a
S baz/f2
H foo
H foo1

This explains why you need to expand the index at all, but not why you need
to check that the prefix is not in the sparse cone *and* that it (or its
parent directory) doesn't exist in the index. For that, an important bit of
context is that you can have a sparse index with non-sparse directories
outside the cone. This can happen in a number of ways (for example, running
'git update-index --no-skip-worktree' on a file in a sparse directory), but
the important thing is that it is a completely valid state and not entirely
uncommon. 

Using our example above, suppose 'baz/' is partially expanded in the index,
with the following index contents:

H bar/f1
S baz/deep/
S baz/f2
H foo
H foo1

If we use the prefix 'baz/' here, we actually traverse the trees properly:
'baz/deep/' and 'baz/f2' will be found and merged - no index expansion
needed! But if we only checked '!path_in_cone_mode_sparse_checkout(...)', we
would have expanded the index because 'baz/' is outside the sparse cone. 

This presents a problem because index expansion is *extremely* expensive -
we should avoid it whenever possible. That's where checking
'index_name_pos(...)' comes in: if the directory is in the index as a sparse
directory, the position is '>= 0' and 'ensure_full_index(...)' is called; if
the directory is inside an existing sparse directory, the position will be
'< 0' but the index will be expanded implicitly. In every other case, we
avoid expanding the index and proceed with the merge as normal.

Hope this helps!

[1] https://lore.kernel.org/git/dc47f12b-8724-22ef-ed2c-096badfafd76@github.com/

> So it seems like distinguishing between the two cases with
> index_name_pos(...) isn't necessary. I've attached a diff that does
> exactly this, and it passes t1092-sparse-checkout-compatibility.sh as
> far as I can tell. I've also amended the comment in a way that makes
> more sense to me, but I'm not 100% sure if it's accurate.
> 
> I'm also a little averse to using a side effect of index_name_pos() to
> achieve what we really want, so I'd prefer to get rid of the call if we
> can :)
> 
> ----- >8 --------- >8 --------- >8 --------- >8 --------- >8 ----
> 
> diff --git a/unpack-trees.c b/unpack-trees.c
> index b876caca0d..5b07055605 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1749,17 +1749,11 @@ static void update_sparsity_for_prefix(const char *prefix,
>  	strbuf_addch(&ce_prefix, '/');
>  
>  	/*
> -	 * If the prefix points to a sparse directory or a path inside a sparse
> -	 * directory, the index should be expanded. This is accomplished in one
> -	 * of two ways:
> -	 * - if the prefix is inside a sparse directory, it will be expanded by
> -	 *   the 'ensure_full_index(...)' call in 'index_name_pos(...)'.
> -	 * - if the prefix matches an existing sparse directory entry,
> -	 *   'index_name_pos(...)' will return its index position, triggering
> -	 *   the 'ensure_full_index(...)' below.
> +	 * If the prefix points to a sparse directory or a path inside a
> +	 * sparse directory (not within the sparse patterns), the index
> +	 * should be expanded.
>  	 */
> -	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate) &&
> -	    index_name_pos(istate, ce_prefix.buf, ce_prefix.len) >= 0)
> +	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate))
>  		ensure_full_index(istate);
>  
>  	strbuf_release(&ce_prefix);


  reply	other threads:[~2022-03-03 21:19 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-23 18:25 [PATCH 0/7] Sparse index: integrate with 'read-tree' Victoria Dye via GitGitGadget
2022-02-23 18:25 ` [PATCH 1/7] sparse-index: prevent repo root from becoming sparse Victoria Dye via GitGitGadget
2022-02-24 16:48   ` Derrick Stolee
2022-02-24 21:42     ` Victoria Dye
2022-02-23 18:25 ` [PATCH 2/7] status: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-02-23 18:25 ` [PATCH 3/7] read-tree: expand sparse checkout test coverage Victoria Dye via GitGitGadget
2022-02-23 18:25 ` [PATCH 4/7] read-tree: integrate with sparse index Victoria Dye via GitGitGadget
2022-02-23 18:25 ` [PATCH 5/7] read-tree: narrow scope of index expansion for '--prefix' Victoria Dye via GitGitGadget
2022-02-23 18:25 ` [PATCH 6/7] read-tree: make two-way merge sparse-aware Victoria Dye via GitGitGadget
2022-02-26  8:05   ` Elijah Newren
2022-02-28 18:04     ` Victoria Dye
2022-03-01  2:56       ` Elijah Newren
2022-02-23 18:25 ` [PATCH 7/7] read-tree: make three-way " Victoria Dye via GitGitGadget
2022-02-24 16:59 ` [PATCH 0/7] Sparse index: integrate with 'read-tree' Derrick Stolee
2022-02-24 22:34 ` [PATCH v2 " Victoria Dye via GitGitGadget
2022-02-24 22:34   ` [PATCH v2 1/7] sparse-index: prevent repo root from becoming sparse Victoria Dye via GitGitGadget
2022-02-24 22:34   ` [PATCH v2 2/7] status: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-02-25  7:45     ` Elijah Newren
2022-02-28 23:17       ` Victoria Dye
2022-02-24 22:34   ` [PATCH v2 3/7] read-tree: expand sparse checkout test coverage Victoria Dye via GitGitGadget
2022-02-26  8:41     ` Elijah Newren
2022-02-28 18:14       ` Victoria Dye
2022-02-28 23:09     ` Ævar Arnfjörð Bjarmason
2022-02-28 23:27       ` Victoria Dye
2022-02-28 23:46         ` Ævar Arnfjörð Bjarmason
2022-02-24 22:34   ` [PATCH v2 4/7] read-tree: integrate with sparse index Victoria Dye via GitGitGadget
2022-02-24 22:34   ` [PATCH v2 5/7] read-tree: narrow scope of index expansion for '--prefix' Victoria Dye via GitGitGadget
2022-02-25  8:38     ` Elijah Newren
2022-02-25 20:25       ` Victoria Dye
2022-02-26  7:52         ` Elijah Newren
2022-02-28 18:44           ` Victoria Dye
2022-02-24 22:34   ` [PATCH v2 6/7] read-tree: make two-way merge sparse-aware Victoria Dye via GitGitGadget
2022-02-24 22:34   ` [PATCH v2 7/7] read-tree: make three-way " Victoria Dye via GitGitGadget
2022-02-26  8:46   ` [PATCH v2 0/7] Sparse index: integrate with 'read-tree' Elijah Newren
2022-03-01 20:24   ` [PATCH v3 0/8] " Victoria Dye via GitGitGadget
2022-03-01 20:24     ` [PATCH v3 1/8] sparse-index: prevent repo root from becoming sparse Victoria Dye via GitGitGadget
2022-03-01 20:24     ` [PATCH v3 2/8] status: fix nested sparse directory diff in sparse index Victoria Dye via GitGitGadget
2022-03-01 20:24     ` [PATCH v3 3/8] read-tree: explicitly disallow prefixes with a leading '/' Victoria Dye via GitGitGadget
2022-03-01 20:24     ` [PATCH v3 4/8] read-tree: expand sparse checkout test coverage Victoria Dye via GitGitGadget
2022-03-01 20:24     ` [PATCH v3 5/8] read-tree: integrate with sparse index Victoria Dye via GitGitGadget
2022-03-01 20:24     ` [PATCH v3 6/8] read-tree: narrow scope of index expansion for '--prefix' Victoria Dye via GitGitGadget
2022-03-03 17:54       ` Glen Choo
2022-03-03 21:19         ` Victoria Dye [this message]
2022-03-04 18:47           ` Glen Choo
2022-03-01 20:24     ` [PATCH v3 7/8] read-tree: make two-way merge sparse-aware Victoria Dye via GitGitGadget
2022-03-01 20:24     ` [PATCH v3 8/8] read-tree: make three-way " Victoria Dye via GitGitGadget
2022-03-02  7:22     ` [PATCH v3 0/8] Sparse index: integrate with 'read-tree' Elijah Newren
2022-03-02 13:40       ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5353fb06-09e4-77bf-554c-1cd750158730@github.com \
    --to=vdye@github.com \
    --cc=avarab@gmail.com \
    --cc=chooglen@google.com \
    --cc=derrickstolee@github.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).