From: 胡哲宁 <adlternative@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Git List <git@vger.kernel.org>, Eric Sunshine <sunshine@sunshineco.com>
Subject: Re: [PATCH v3] ls-files.c: add --dedup option
Date: Sun, 17 Jan 2021 11:45:25 +0800
Message-ID: <CAOLTT8RQWm-tpkj1aO1rPTCJApP5i+niQtNm_zMycSo5YT0B_w@mail.gmail.com>
In-Reply-To: <xmqqczy7vwub.fsf@gitster.c.googlers.com>
Junio, thank you for patiently reviewing my patch
and guiding me on how to improve it.
On Fri, Jan 15, 2021 at 8:59 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "阿德烈 via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > In order to provide users a better experience
> > when viewing information about files in the index
> > and the working tree, the `--dedup` option will suppress
> > some duplicate options under some conditions.
> >
> > In a merge conflict, one item of "git ls-files" output may
> > appear multiple times. For example,now the file `a.c` has
> > a conflict,`a.c` will appear three times in the output of
> > "git ls-files".We can use "git ls-files --dedup" to output
> > `a.c` only one time.(unless `--stage` or `--unmerged` is
> > used to view all the detailed information in the index)
>
> Unlike these option names we see in the description, "dedup" is not
> a full word. Perhaps spell it fully "--deduplicate" while letting
> parse-options machinery to accept unique prefix (including
> "--dedup"?
>
OK, I have renamed "--dedup" to "--deduplicate".
> > In addition, if you use both `--delete` and `--modify` in
> > the same time, The `--dedup` option can also suppress modified
>
> "at the same time", I think.
>
My poor English grammar :-)
> > entries output.
>
> [let's call this point "point A"]
>
> > `--dedup` option relevant descriptions in
> > `Documentation/git-ls-files.txt`,
>
> I am not sure what this means.
>
> > the test script in `t/t3012-ls-files-dedup.sh`
> > prove the correctness of the `--dedup` option.
>
> No amount of tests "proves" any correctness, but that is OK. I
> think you meant to say "a few tests have been added to t3012 to
> protect the new feature from future breakage" or something like
> that.
>
Alright, I understand!
> In any case, I think everything after "point A" and before your sign
> off does not belong to the log message. The diffstat shows that
> documentation and tests have been added already.
>
> > +--dedup::
> > + Suppress duplicate entries when conflict happen
>
> "conflict happen" -> "there are unmerged paths", as the term
> "unmerged" is already shown to readers of "ls-files --help".
>
Well, I am probably not familiar enough with these terms yet.
> > + or `--deleted` and `--modified` are combined.
>
> I somehow thought that you refrained from deduping when you are
> showing the stages with "ls-files -u" and "ls-files -s", or you are
> showing status with "ls-files -t", because you will otherwise lose
> information. In other words, showing only one cache entry out of
> many that share the same name makes sense only when we are showing
> name and nothing else.
>
You are right, "--deduplicate" should only suppress duplicates when nothing
but file names is shown, so the "ls-files -t" case also needs to be corrected.
Well, this is indeed a bug that I had not noticed.
> Having said all that, I suspect that we may be much better off if we
> can somehow merge the two loops into one. You may be dedup adjacent
> entries in each loop separately with the approach taken by this
> patch, but I do not think the patch would work to deduplicate across
> two loops. For example, what happens if you do this?
>
> $ git reset --hard
> $ echo >>builtin/ls-files.c
> $ git ls-files -c -m builtin/ls-files.c
> $ git ls-files -t -c -m builtin/ls-files.c
>
> I think you see the path twice in the output, with or without your
> --dedup option (remember what I said about proving, by the way? ;-)).
>
Yeah, this is because I did not consider the -c option being combined
with other options at the same time.
Here I may have to disagree with one point of your suggestion:
if (errno != ENOENT)
error_errno("cannot lstat '%s'", fullname.buf);
With this check included, the patch fails the test
t/t3010-ls-files-killed-modified.sh.
errno may be ENOTDIR rather than ENOENT when `r` is a regular file and
you call `lstat("r/f", &st);`.
So I have removed the errno check for now.
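
A minimal sketch of the situation, assuming a POSIX system. The helper `classify_path()` and the enum are hypothetical and not part of Git. It treats both ENOENT and ENOTDIR as "the path is gone", which is one possible way to keep an errno check without tripping over the `r/f` case; it is not what the patch currently does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical result type, for illustration only. */
enum path_state { PATH_PRESENT, PATH_GONE, PATH_ERROR };

/*
 * lstat() the path and classify the result.  ENOENT means the last
 * component does not exist; ENOTDIR means a leading component (such
 * as "r" in "r/f") exists but is not a directory.  Either way the
 * tracked file is no longer there.
 */
static enum path_state classify_path(const char *path, struct stat *st)
{
	assert(path != NULL && st != NULL);
	if (!lstat(path, st))
		return PATH_PRESENT;
	if (errno == ENOENT || errno == ENOTDIR)
		return PATH_GONE;
	/* Anything else is a real error worth reporting. */
	fprintf(stderr, "cannot lstat '%s': %s\n", path, strerror(errno));
	return PATH_ERROR;
}
```

If `r` is a regular file, `classify_path("r/f", &st)` returns PATH_GONE with errno set to ENOTDIR, so no error message is printed for it.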
> #2: consolidate two for loops into one.
>
> The two loops have slightly different condition to skip a ce,
> and different logic on what tag each path is shown with. When
> --cached and --modified or --deleted are asked for at the same
> time, we'd show them multiple times (this is done inside the
> loop for each ce)
>
> if (show_cached || show_stage)
> show_ce(... ce_stage(ce) ? tag_unmerged : ...);
> err = lstat(fullname.buf, &st);
> if (err) {
> /* deleted? */
> ... code that corresponds to the
> ... illustration in #1 above come here.
> } else if (...)
> show_ce(..., tag_modified);
>
> This changes the semantics. The original iterates the index
> twice, so you may see the same entry from --cached once and
> then again from --modified. The updated one still will show
> the same entry twice but next to each other.
>
Well, this does change the semantics. I think people who relied on the
two separate loops may have wanted the different kinds of output kept apart.
Now, without "--deduplicate", you may see six consecutive
entries for the same path when multiple options are combined.
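
The ordering difference can be sketched as follows. This is only a toy model: `struct entry`, `append_line()`, `list_two_loops()` and `list_one_loop()` are hypothetical names, and the 'H' and 'C' tags are borrowed from what "ls-files -t" prints for cached and changed entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an index entry. */
struct entry {
	const char *name;
	int modified; /* nonzero if the working tree file differs */
};

/* Append one "<tag> <name>" line to a NUL-terminated buffer. */
static void append_line(char *out, size_t outlen, char tag, const char *name)
{
	size_t used = strlen(out);

	assert(used < outlen);
	snprintf(out + used, outlen - used, "%c %s\n", tag, name);
}

/* Old shape: one pass for --cached, then a second pass for --modified. */
static void list_two_loops(const struct entry *e, size_t nr,
			   char *out, size_t outlen)
{
	size_t i;

	assert(outlen > 0);
	out[0] = '\0';
	for (i = 0; i < nr; i++)
		append_line(out, outlen, 'H', e[i].name);
	for (i = 0; i < nr; i++)
		if (e[i].modified)
			append_line(out, outlen, 'C', e[i].name);
}

/* New shape: a single pass, so all lines for one path are adjacent. */
static void list_one_loop(const struct entry *e, size_t nr,
			  char *out, size_t outlen)
{
	size_t i;

	assert(outlen > 0);
	out[0] = '\0';
	for (i = 0; i < nr; i++) {
		append_line(out, outlen, 'H', e[i].name);
		if (e[i].modified)
			append_line(out, outlen, 'C', e[i].name);
	}
}
```

For a modified `a.c` and an unmodified `b.c`, the first function lists `a.c`, `b.c`, then `a.c` again, while the second lists `a.c` twice in a row and then `b.c`.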
> #3: optionally deduplicate.
>
> Once we have a single loop, deduplicationg based on names is
> trivial, as we seen before.
>
>
Indeed so.
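
As a rough sketch of why it becomes trivial: because the index is sorted by name, entries that share a name are adjacent, so remembering the last shown entry is enough. `struct entry` and `unique_names()` below are hypothetical stand-ins, not the real `struct cache_entry` or `show_files()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct cache_entry; only the name matters. */
struct entry {
	const char *name;
	int stage;
};

/*
 * Copy each distinct name from a name-sorted array into out[] and
 * return how many were copied.  out[] must have room for nr pointers.
 */
static size_t unique_names(const struct entry *e, size_t nr, const char **out)
{
	const struct entry *last_shown = NULL;
	size_t i, n = 0;

	assert(e != NULL || nr == 0);
	for (i = 0; i < nr; i++) {
		/* Same name as the entry shown last: skip it. */
		if (last_shown && !strcmp(e[i].name, last_shown->name))
			continue;
		last_shown = &e[i];
		out[n++] = e[i].name;
	}
	return n;
}
```

A conflicted `a.c` at stages 1, 2 and 3 followed by `b.c` at stage 0 yields just two names.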
> Hmm?
THANKS.
On Fri, Jan 15, 2021 at 8:59 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "阿德烈 via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: ZheNing Hu <adlternative@gmail.com>
> >
> > In order to provide users a better experience
> > when viewing information about files in the index
> > and the working tree, the `--dedup` option will suppress
> > some duplicate options under some conditions.
> >
> > In a merge conflict, one item of "git ls-files" output may
> > appear multiple times. For example,now the file `a.c` has
> > a conflict,`a.c` will appear three times in the output of
> > "git ls-files".We can use "git ls-files --dedup" to output
> > `a.c` only one time.(unless `--stage` or `--unmerged` is
> > used to view all the detailed information in the index)
>
> Unlike these option names we see in the description, "dedup" is not
> a full word. Perhaps spell it fully "--deduplicate" while letting
> parse-options machinery to accept unique prefix (including
> "--dedup"?
>
> > In addition, if you use both `--delete` and `--modify` in
> > the same time, The `--dedup` option can also suppress modified
>
> "at the same time", I think.
>
> > entries output.
>
> [let's call this point "point A"]
>
> > `--dedup` option relevant descriptions in
> > `Documentation/git-ls-files.txt`,
>
> I am not sure what this means.
>
> > the test script in `t/t3012-ls-files-dedup.sh`
> > prove the correctness of the `--dedup` option.
>
> No amount of tests "proves" any correctness, but that is OK. I
> think you meant to say "a few tests have been added to t3012 to
> protect the new feature from future breakage" or something like
> that.
>
> In any case, I think everything after "point A" and before your sign
> off does not belong to the log message. The diffstat shows that
> documentation and tests have been added already.
>
> > +--dedup::
> > + Suppress duplicate entries when conflict happen
>
> "conflict happen" -> "there are unmerged paths", as the term
> "unmerged" is already shown to readers of "ls-files --help".
>
> > + or `--deleted` and `--modified` are combined.
>
> I somehow thought that you refrained from deduping when you are
> showing the stages with "ls-files -u" and "ls-files -s", or you are
> showing status with "ls-files -t", because you will otherwise lose
> information. In other words, showing only one cache entry out of
> many that share the same name makes sense only when we are showing
> name and nothing else.
>
> Has that been changed from the previous rounds?
>
> > diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> > index c8eae899b82..bc4eded19ab 100644
> > --- a/builtin/ls-files.c
> > +++ b/builtin/ls-files.c
> > @@ -316,6 +318,20 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
> > for (i = 0; i < repo->index->cache_nr; i++) {
> > const struct cache_entry *ce = repo->index->cache[i];
> >
> > + if (show_cached && delete_dup) {
> > + switch (ce_stage(ce)) {
> > + case 0:
> > + default:
> > + break;
>
> This part looks somewhat strange for two reasons:
>
> - The code enumerates ALL the possible stage numbers from 0 to 3;
> if we were to have "default", I'd expect it would be a separate
> switch arm from the possible values that calls out an programming
> error, e.g. BUG("at stage #%d???", ce_stage(ce)). Simply removing
> the "default" arm would be another way out of this strangeness.
>
> - When we see a stage #0 entry, we know we will not have higher
> stage entries with the same name. Not clearing last_stage here
> feels wrong, as the primary reason why last_stage variable is
> used is to remember the last ce that was shown, so that other
> entries with the same name can be skipped.
>
> By the way, "last_shown_ce" may be a much better name for the
> variable, as you do not really care what stage number the ce you
> showed last was at (you care about its name).
>
> Also, I do not see a good reason why the last_shown_ce variable
> should have lifetime longer than the block that contains this for()
> loop (and the other for loop for deleted/modified codepath we see
> later). Especially since you initialize the variable that you made
> visible to the entire function to NULL before entering the first for
> loop, but you do not set it back to NULL before entering the second
> for loop, it is inviting a subtle bug. You may have been given
> show_cached and show_modified at the same time, so you enter the
> first loop and have shown the first stage of the last conflicted
> path, whose cache entry is left in the last_stage variable. Since
> the variable has longer lifespan than necessary, when the second
> loop is entered, it still points at the cache entry for the highest
> stage of the last conflicted path. That is because the code forgets
> to clear it to NULL before entering the second for loop.
>
> Having said all that, I suspect that we may be much better off if we
> can somehow merge the two loops into one. You may be dedup adjacent
> entries in each loop separately with the approach taken by this
> patch, but I do not think the patch would work to deduplicate across
> two loops. For example, what happens if you do this?
>
> $ git reset --hard
> $ echo >>builtin/ls-files.c
> $ git ls-files -c -m builtin/ls-files.c
> $ git ls-files -t -c -m builtin/ls-files.c
>
> I think you see the path twice in the output, with or without your
> --dedup option (remember what I said about proving, by the way? ;-)).
>
> Once we successfully merged two loops into one, the part that shows
> tracked paths in the function would have only one loop, and it would
> become a lot cleaner to add the logic to "skip showing the ce if it
> has the same name as the previously shown one, only when doing so
> won't lose information", by doing something like this:
>
> static void show_files(....)
> {
> /* show_others || show_killed done here */
> ...
>
> /* leave early if not showing anything */
> if (! (show_cached || show_stage || show_deleted || show_modified))
> return;
>
> last_shown_ce = NULL;
> for (i = 0; i < repo->index->cache_nr; i++) {
> const struct cache_entry *ce = repo->index->cache[i];
>
> if (skipping_duplicates && last_shown_ce)
> if (!strcmp(ce->name, last_shown_ce->name))
> continue;
>
> construct_fullname();
>
> /* various reasons to skip the entry tested */
> if (showing ignored directory and ce is excluded)
> continue;
> if (show_unmerged && !ce_stage(ce))
> continue;
> if (ce->ce_flags & CE_UPDATE)
> continue;
> ... other reasons may appear here ...
>
> /* now we are committed to show it */
> last_shown_ce = ce;
>
> ... various different ways to show ce come here ...
> show_ce(...);
> }
> }
>
> where "skipping_duplicates" would be set when "--deduplicate" is
> asked and we are not showing information other than the pathname
> via various options e.g. the tags (-t) or stages (-s/-u).
>
> > + if (delete_dup && show_deleted && show_modified && err)
> > show_ce(repo, dir, ce, fullname.buf, tag_removed);
>
> I actually think the original code that is still shown here ...
>
> > + else {
> > + if (show_deleted && err)
> > + show_ce(repo, dir, ce, fullname.buf, tag_removed);
> > + if (show_modified && ie_modified(repo->index, ce, &st, 0))
>
> ... about modified file is buggy. If lstat() failed, then &st has
> no useful information, so it is wrong to feed it to ie_modified().
>
> Perhaps a three-patch series that is structured like this may be in
> order?
>
> #1: bugfix for --deleted and --modified.
>
> err = lstat(fullname.buf, &st);
> if (err) {
> /* deleted? */
> if (errno != ENOENT)
> error_errno("cannot lstat '%s'", fullname.buf);
> else {
> if (show_deleted)
> show_ce(..., tag_removed);
> if (show_modified)
> show_ce(..., tag_modified);
> }
> } else if (show_modified && ie_modified(...))
> show_ce(..., tag_modified);
>
> This hopefully should not change the semantics. If you ask
> --deleted and --modified, a deleted path would be listed twice.
>
> #2: consolidate two for loops into one.
>
> The two loops have slightly different condition to skip a ce,
> and different logic on what tag each path is shown with. When
> --cached and --modified or --deleted are asked for at the same
> time, we'd show them multiple times (this is done inside the
> loop for each ce)
>
> if (show_cached || show_stage)
> show_ce(... ce_stage(ce) ? tag_unmerged : ...);
> err = lstat(fullname.buf, &st);
> if (err) {
> /* deleted? */
> ... code that corresponds to the
> ... illustration in #1 above come here.
> } else if (...)
> show_ce(..., tag_modified);
>
> This changes the semantics. The original iterates the index
> twice, so you may see the same entry from --cached once and
> then again from --modified. The updated one still will show
> the same entry twice but next to each other.
>
> #3: optionally deduplicate.
>
> Once we have a single loop, deduplicationg based on names is
> trivial, as we seen before.
>
>
> Hmm?