All of lore.kernel.org
 help / color / mirror / Atom feed
* Untracked Cache and --untracked-files=all
@ 2021-06-22  9:25 Tao Klerks
  2021-06-22 16:06 ` Elijah Newren
  0 siblings, 1 reply; 3+ messages in thread
From: Tao Klerks @ 2021-06-22  9:25 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

Hi folks,

I'm hoping for a little help understanding *intent* around a
particular code comment in dir.c, and reaching out to the whole list
because someone (Junio?) said they consider any mail that *doesn't*
copy the list to be spam anyway, and the original author of the
comment in question (Duy Nguyen) is no longer active on git.

Context:
I am trying to explore how to get "--untracked-files=all" to play nice
with the untracked cache, so that windows users using tooling that
sets "--untracked-files=all" can benefit from the same
orders-of-magnitude git status performance improvements as commandline
users.

There is a "naive" approach to this (store the untracked cache in the
index file with whatever dir flags were specified/used in the
recursive walk, and ignore/rewrite that cached data every time the
flags change), and there is presumably a more "comprehensive" approach
(store all the information required in the untracked cache to be able
to satisfy requests with either set of flags - even if this is a
little more expensive on first run).

The main disadvantage of the "naive" approach is that is every time
you flip-flop between "git status" and "git status -u", the untracked
cache is ignored, a recursive directory walk ensues, and the untracked
cache is rewritten to the index file for the next time you rerun,
hopefully with the same flags. However, I would think in most
situations flip-flopping will be less common - more commonly you're
using a tool or workflow that ends up running the same command(s)
repeatedly... At least, that's my thesis. I would put this "store the
untracked cache every time dir flags change" behavior behind a config
switch, anyway.

This "naive" approach *is* rather easy to achieve - you just need to
recreate a new "untracked" structure inside
dir.c#validate_untracked_cache() if you find a mismatch of flags (and
make other small fixes to store the correct flags in that new
"untracked" structure).

The one thing that sticks out after making these changes is a code
comment first introduced in 2015 by Duy Nguyen in ccad261f, explaining
*why* we refuse to use the untracked cache with "-uall":
> * See treat_directory(), case index_nonexistent. Without
> * this flag, we may need to also cache .git file content
> * for the resolve_gitlink_ref() call, which we don't.

I've seen this comment many times over the past few months, and I've
previously always interpreted it as a "correctness" concern.

Looking at the original code in question, however (as of ccad261f,
dir.c#treat_directory(), "resolve_gitlink_ref" call), I don't
understand how correctness could be impacted.

Fast-forward 6 years and all this code has been substantially
overhauled by several folks over the years (most recently and majorly
Elijah Newren), and the "resolve_gitlink_ref()" call is long-gone.

Does this look familiar to anyone? Is there any remaining obvious
reason to be wary of storing the untracked cache structure produced
with '-uall'?

Thanks,
Tao

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Untracked Cache and --untracked-files=all
  2021-06-22  9:25 Untracked Cache and --untracked-files=all Tao Klerks
@ 2021-06-22 16:06 ` Elijah Newren
  2021-06-23  6:42   ` Tao Klerks
  0 siblings, 1 reply; 3+ messages in thread
From: Elijah Newren @ 2021-06-22 16:06 UTC (permalink / raw)
  To: Tao Klerks; +Cc: git

Hi,

On Tue, Jun 22, 2021 at 2:25 AM Tao Klerks <tao@klerks.biz> wrote:
>
> Hi folks,
>
> I'm hoping for a little help understanding *intent* around a
> particular code comment in dir.c, and reaching out to the whole list

Oh, boy, dir.c.  We should probably put a simple warning at the top of
the file, something like "Abandon all hope, all ye who enter here".

> because someone (Junio?) said they consider any mail that *doesn't*
> copy the list to be spam anyway, and the original author of the
> comment in question (Duy Nguyen) is no longer active on git.
>
> Context:
> I am trying to explore how to get "--untracked-files=all" to play nice
> with the untracked cache, so that windows users using tooling that
> sets "--untracked-files=all" can benefit from the same
> orders-of-magnitude git status performance improvements as commandline
> users.
>
> There is a "naive" approach to this (store the untracked cache in the
> index file with whatever dir flags were specified/used in the
> recursive walk, and ignore/rewrite that cached data every time the
> flags change), and there is presumably a more "comprehensive" approach
> (store all the information required in the untracked cache to be able
> to satisfy requests with either set of flags - even if this is a
> little more expensive on first run).
>
> The main disadvantage of the "naive" approach is that is every time
> you flip-flop between "git status" and "git status -u", the untracked
> cache is ignored, a recursive directory walk ensues, and the untracked
> cache is rewritten to the index file for the next time you rerun,
> hopefully with the same flags. However, I would think in most
> situations flip-flopping will be less common - more commonly you're
> using a tool or workflow that ends up running the same command(s)
> repeatedly... At least, that's my thesis. I would put this "store the
> untracked cache every time dir flags change" behavior behind a config
> switch, anyway.
>
> This "naive" approach *is* rather easy to achieve - you just need to
> recreate a new "untracked" structure inside
> dir.c#validate_untracked_cache() if you find a mismatch of flags (and
> make other small fixes to store the correct flags in that new
> "untracked" structure).
>
> The one thing that sticks out after making these changes is a code
> comment first introduced in 2015 by Duy Nguyen in ccad261f, explaining
> *why* we refuse to use the untracked cache with "-uall":
> > * See treat_directory(), case index_nonexistent. Without
> > * this flag, we may need to also cache .git file content
> > * for the resolve_gitlink_ref() call, which we don't.
>
> I've seen this comment many times over the past few months, and I've
> previously always interpreted it as a "correctness" concern.
>
> Looking at the original code in question, however (as of ccad261f,
> dir.c#treat_directory(), "resolve_gitlink_ref" call), I don't
> understand how correctness could be impacted.
>
> Fast-forward 6 years and all this code has been substantially
> overhauled by several folks over the years (most recently and majorly
> Elijah Newren), and the "resolve_gitlink_ref()" call is long-gone.

I didn't remember removing any resolve_gitlink_ref() calls, but a
quick search (`git log -Sresolve_gitlink_ref -p -- dir.c`) suggests it
was removed by b22827045e0b (dir: do not traverse repositories with no
commits, 2019-04-09).  A brief look at that commit message certainly
suggests things have changed in a way that _might_ make the original
comment irrelevant.

> Does this look familiar to anyone? Is there any remaining obvious
> reason to be wary of storing the untracked cache structure produced
> with '-uall'?

Untracked cache is its own beast.  I fought with it a bunch, but it's
been a while.  (See e.g.
https://lore.kernel.org/git/CABPp-BEWhavmtDi-ahT+QMWtD6Fe-Ey7cn_82nbetWEQJL=hRA@mail.gmail.com/)

I can say that untracked traversal had all sorts of weirdness and
breakage, which could have made untracked cache harder to make right
and consistent particularly with -uall.  See for example 8d92fb292706
(dir: replace exponential algorithm with a linear one, 2020-04-01) and
perhaps aa6e1b21e5de (dir: avoid unnecessary traversal into ignored
directory, 2021-05-12).  In particular, the exponential traversal
(paths at depth N might be traversed 2^N times) from the former commit
combined with the fact that later traversals often returned a
different status for a path than the first traversal certainly could
have caused some additional weirdness for the untracked cache.

It may be that some combination of Kyle's removal of the
resolve_gitlink_ref() and my removal of the exponential traversal of
untracked paths make your idea safe...but this is dir.c and
untracked-cache, so who really knows?


Not sure if that helps, but I hope one or two pointers I provided are
somehow meaningful to you.

Elijah

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Untracked Cache and --untracked-files=all
  2021-06-22 16:06 ` Elijah Newren
@ 2021-06-23  6:42   ` Tao Klerks
  0 siblings, 0 replies; 3+ messages in thread
From: Tao Klerks @ 2021-06-23  6:42 UTC (permalink / raw)
  To: Elijah Newren; +Cc: git

Hi Elijah,

On Tue, Jun 22, 2021 at 6:06 PM Elijah Newren <newren@gmail.com> wrote:
> It may be that some combination of Kyle's removal of the
> resolve_gitlink_ref() and my removal of the exponential traversal of
> untracked paths make your idea safe...but this is dir.c and
> untracked-cache, so who really knows?
>
> Not sure if that helps, but I hope one or two pointers I provided are
> somehow meaningful to you.

It is, thank you! I had already looked at those commits you mention,
and my impression is indeed that any -uall incompatibilities have been
addressed as a result of the major restructurings and fixes.

After this exchange I prepared a patch that seems to be "right" from
my perspective - it aligns the untracked cache flags with the config,
so if your command-line "-uXXX" choice matches your config you get
untracked cache, if it does not you get it disabled for that run (ie,
in windows under some circumstances, surprisingly bad performance).

This means people who want -uall to work well/performantly (because
they prefer it, or because their tooling requires it) can now have it
do so, simply by setting "git config status.showuntrackedfiles all".

This seems much better than a new temporary config, it is not subject
to "flip-flopping" issues like my previously described "naive"
approach was, and it seems much more achievable than substantially
changing the untracked cache structure & correspondingly adjusting the
untracked directory iteration process to support both -unormal and
-uall simultaneously within a single cache structure.

The patch passes all (working) tests, including the t7603 untracked
cache series I know you worked on, and seems to do exactly what I
intend it to - but I'm new here, so there are probably many problems;
hopefully this is interesting enough for people to take the time to
point them out.

The pull request in GitGitGadget is still requesting some form of
approval before I can submit the patch to the mailing list:
https://github.com/gitgitgadget/git/pull/985

Thanks,
Tao

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-06-23  6:42 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-22  9:25 Untracked Cache and --untracked-files=all Tao Klerks
2021-06-22 16:06 ` Elijah Newren
2021-06-23  6:42   ` Tao Klerks

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.